Down the semantic rabbit hole
Thoughts on semantic layers, data transformations, data apps and the future of the data stack.
What is a semantic layer? How would it ideally work? Is it a definition store or an interface? What is it for? Who will own it? What are its main properties? How would a successful semantic layer change the way we think about the modern data stack?
I've been thinking about these questions a lot. And I have answers. All kinds of answers. Contradicting ones, even. This post started with a personal attempt to put order into the chaos of my thoughts on an ideal semantic layer. Down this rabbit hole I went and I invite you in. Proceed at your own risk.
"Do cats eat bats?" and sometimes "Do bats eat cats?" for, you see, as she couldn't answer either question, it didn't much matter which way she put it.
Alice in Wonderland - Down the rabbit hole
But first, a quick note: this post is about semantic layers, not metrics layers. Metrics are just one concept within the semantic layer. The idea that a “metrics layer is all you need”, which originally sparked this movement, would in my opinion ultimately lead to a bad design, one not grounded in the DRY principle (Don’t Repeat Yourself).
Why bother?
Why do we even need a semantic layer?
The purpose of a semantic layer is to close the gap between the “business language” and the “data language” and offer a unified and consistent view of the business domain as represented in the data. The world of data is raw and influenced by decisions related to costs, storage vs. compute, performance considerations, the tools we use to query the data and their limitations, the way data is represented in a database system, etc. These considerations are not necessarily aligned with how a business domain is best modeled, although classic data warehouse people would say data warehouses should already model the business domain faithfully. But even they would agree that at least the kind of metadata in the data layer (technical column names, primitive data types) is not the same as the one relevant for a business domain (business-friendly names and descriptions, semantic types). Furthermore, there are concepts that usually live completely outside of the data layer, e.g. metrics or certain business logic. This last point (consistent metrics definition across tools) is what drove the resurgence in interest in the semantic layer among modern data stack enthusiasts.
To close the gap between the “business language” and the “data language” we need to assign meaning to our data: name it, describe it, expose its dependencies and relationships to other concepts, put it in context in its domain.
The idea of a semantic layer has evolved over time. Certain properties have remained the same, while others have grown in importance. I would like to go over what I believe are the main properties of semantic layers, some widely discussed and others not discussed enough:
Makes data understandable for business users
Provides a DRY framework to define higher-level concepts
Works across tools (and beyond them)
Built for change
Provides a shared semantic type system
Allows for iterative modelling with short feedback cycles
1. Makes data understandable for business users
One of the original purposes of a semantic layer was to make data understandable for business users. This meant, according to a Kimball Group article from 2013, organizing “the data elements in a way that’s intuitive to business people”, renaming them “so they make sense” and providing “an interface to hold business-oriented descriptions”.
Also, whenever we denormalize our data into “big wide tables”, we lose semantic information about the dependencies between columns. If I have a big table with columns like ‘airport’, ‘city’ and ‘state’, how do I know that ‘city’ is the airport’s city, and that each airport is in one city, which in turn is in one state? Sure, you could go with ‘airport_city’ and ‘airport_city_state’, but then you are using (bad) naming to represent domain knowledge.
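A semantic layer could make those dependencies explicit instead. A minimal sketch, in a toy syntax like the ones used later in this post (all names illustrative):

entities:
  - name: Airport
    relations:
      - to: City
        cardinality: many_to_one   # each airport is in exactly one city
  - name: City
    relations:
      - to: State
        cardinality: many_to_one   # each city is in exactly one state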
Semantic layers offer us the opportunity to enrich our data with domain knowledge that will help our business users understand it.
2. Provides a DRY framework to define higher-level concepts
A core motivation for a semantic layer is to provide a place to define concepts, including (but not limited to) metrics, in a way that does not require you to repeat yourself in the definitions. Modelling semantically involves finding the right building blocks from which higher-level concepts can be defined.
This may sound like I’m mixing two different things here: the engineering convenience of DRY (Don’t Repeat Yourself) definitions and the semantic modelling perspective. But in the end it is hard to separate these two ideas. If you find yourself not being DRY in your definitions, it means you are not capturing the underlying dependency between concepts. In other words, you are not fully capturing their semantics.
Let’s say we have a table called customers, which includes both current and past customers, identified by an is_active flag. The concept of “Customers” for a business person (and for most applications) would, unless otherwise specified, definitely mean only those with the active flag set, so this should be represented in the semantic layer. This also means that if you want to know the “number of customers”, it should count only rows with the same flag set. Should you explicitly define both the metric “number of customers” and the concept “customer” to have the flag set? If you do, then you are not fully capturing the semantics: one is just counting the other.
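A DRY version, sketched in toy syntax (names illustrative), would capture that dependency directly, with the metric inheriting the filter from the concept it counts:

entities:
  - name: Customer
    table: customers
    filter: is_active = true   # “customer” means active customer, defined once

metrics:
  - name: number_of_customers
    type: count
    entity: Customer           # counts Customers, so the filter comes along for free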
The principle of defining things once and reusing them in other places is such an important idea that you could think about extending it not only within a tool, but also across tools. This brings us to the next point…
3. Works across tools
Historically this was not a requirement for a semantic layer. With the growth in the importance of data for businesses, companies have moved away from the idealized scenario of one BI/Analytics tool for the whole organization. There are tools for dashboarding, ad-hoc analysis, data notebooks, SQL/Python interfaces, spreadsheets, business observability, Reverse ETL, not to mention “data apps”, whatever that means. Suddenly, we found ourselves copying the same definitions over and over. Because the semantic layers we had available were not built with interoperability in mind (e.g. LookML), we looked elsewhere to keep our definitions: the data warehouse. And that’s how dbt became the center of the modern data stack. It provided us with a framework to define dimensions and models in a centralized manner for all our tools. And the trick was that it did not rely on these tools explicitly supporting dbt: it was pure SQL stored in the data warehouse.
But our dbt models did not provide us with the answer to consistent (non-pre-aggregated) metrics definitions, which is what dbt Labs is going for now with the introduction of the dbt server.
Common definition store
Interoperability can be achieved in different ways. One way is for tools to agree on a definition store and language, read the definitions, and implement their own functionality on top of them, e.g. query generation or displaying business-friendly descriptions. In the case of dbt, we have seen this happen to varying degrees with tools like Lightdash, Metriql, Cube, Transform, FlexIt and Veezoo. These tools read the definition of a metric and represent it in their own internal way. This requires different tools to explicitly adopt a standard, while still relying on each of them to implement it correctly. That may be a bit too much to ask sometimes. And as the scope of what we want to model with the semantic layer increases, the complexity of the query generation implementation may increase as well. Finally, it doesn’t solve the case where users access the data programmatically, e.g. from SQL or Python notebooks, which leads us to the idea of a query interface.
Query interface
As mentioned, the success of dbt was possible (or at least more likely) because it did not rely on its explicit adoption as a standard by any other tool in the stack. So a universal query interface is more likely to be established if it is based on other existing standards, the most common of them being SQL (and JDBC interfaces). This is the direction we see tools like Cube, dbt server and Transform going in, apart from providing a REST API interface.
However, not all SQL interfaces are created equal. There could be restrictions on what the SQL query can do, or it may require a different syntax. Some of them may look like a JDBC wrapper around an API (e.g. SELECT * FROM metric_abc(group_by=[’something’], time=’2020-02-01’)), which means they cannot necessarily be supported by any BI tool off the shelf. Making them work with certain tools could require investment from the semantic layer vendor, from the BI vendor itself, or from the end user.
Yet, focusing on supporting SQL may turn out to be a trap. Benn Stancil already talked about how SQL may not be the best solution to model business concepts:
On the edges, there are surely business concepts that can’t be expressed in SQL at all, no matter how many clever window functions and self-joins we use. Or, there are processes that can be defined in this way, but can’t be queried performantly.
Though he talks about the definition side, one could question the usage of SQL on the querying side as well. By using SQL as the lingua franca to query the semantic layer, we may be adding unnecessary constraints and not allowing ourselves to operate on the right abstraction level: the business domain. There are certain expectations about what SQL should support, and these expectations may not be on a “semantic level”. For example: should a semantic layer support things like coalesce(…) or sum(case when …) in its querying interface, or is needing them a “smell” that something is not properly modelled semantically?
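To make the contrast concrete, here is the same question asked at both levels (a toy sketch; table, column and type names are illustrative):

SELECT sum(CASE WHEN channel = 'web' THEN revenue END) AS web_revenue
FROM orders;

versus

query:
  metric: Sales.Revenue            # the sum(case when …) logic lives in the layer
  filter: Sales.Channel = 'web'

In the second form there is nothing left for coalesce(…) or case expressions to do in the query itself; if you need them there, something in the model is probably missing.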
In my opinion, the ideal semantic layer will encompass both the definition store and the query interface. And if not SQL, the question arises of what this language would look like and how such a semantic layer would get adopted. For the time being, that’s a hole for another rabbit.
4. Built for change
Semantic layers should be built for change. One of the key advantages of providing a layer of indirection is to handle changes more easily. Why? Because the concepts that the semantic layer exposes should change less often than the underlying data.
The move to data warehouses as a “platform” means more tools accessing tables, writing SQL queries and saving these queries in their own silos. Especially if you are building one-off raw SQL queries inside your BI tool for visualizations (please don’t), these queries are completely outside your data lineage graph and you may only find out something is broken when users tell you. Certain data teams then try to go to the other extreme: every SQL query used by whatever application on top of the DWH should be in a layer like dbt and part of its data lineage graph. With the advent of tools like Datafold, some of the issues when changing the underlying data are caught earlier and can be manually resolved one dbt model at a time. Yet, creating a dependency on dbt skills for whatever answer one needs from the data seems like a sure path to unsatisfied business users. I don’t think “democratizing SQL knowledge” or “hiring more data people” is the answer here.
Nick Handel from Transform argued that DAGs are going to become shallower with a proper semantic layer, because denormalized models would be created programmatically by the semantic layer based on core models. A shallower DAG definitely means less maintenance work. But what happens when the concepts exposed by the semantic layer change?
Sometimes the answer I see is that, if, say, a column got renamed, the semantic layer should still expose the old name, maybe as an alternative “alias”. Or that, when a table gets completely refactored, the semantic layer should adapt its mapping to reconstruct the original view of the table. But if this is the intended behavior, why should it happen at the semantic-layer level and not at the data-layer level? This already points us to the big questions we will face in our current understanding of the Modern Data Stack. More on that later.
Still, when reasonable changes are propagated to the semantic layer, I believe it should encode these changes in a way that is consumable downstream. My analogy here would be the idea behind ‘evolutions’ for operational databases. The changes that were made to the semantic layer are expressed in a way that lets consumers (other applications) evolve their queries, especially in the case of a common definition store.
Example (toy syntax to illustrate):
evolution:
  id: 1
  changes:
    - type: rename
      from: Orders.Category
      to: Orders.Super_Category
In the case where the semantic layer acts as an interface, maybe it would make sense to allow queries from consumers to be registered with and managed by the semantic layer.
Let’s say a BI tool uses a semantic layer as an interface to query the data and create a widget for a dashboard. This BI tool could register the resulting query with the semantic layer, instead of simply saving it in its own silo. For instance, the BI tool calls an API endpoint of the semantic layer to save it and gets a URL back that will always return the most up-to-date SQL query for it with some metadata:
GET https://semantic-layer.com/project/0/queries/asdf-1234-qwerty-5678
{
  "sql": "SELECT ... FROM ... WHERE ...",
  "metasql": {
    "columns": [
      {
        "id": "customer_id",
        "datatype": "varchar",
        "semantic_type": "Customer.ID"
      },
      {
        "id": "total_revenue",
        "datatype": "decimal",
        "semantic_type": "Sales.Revenue"
      }
    ],
    "version": ...,
    "last_updated_at": ...
  }
}
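The registration step itself might look something like this (again purely illustrative; the endpoint and payload are hypothetical):

POST https://semantic-layer.com/project/0/queries
{
  "query": {
    "metrics": ["Sales.Revenue"],
    "group_by": ["Customer.ID"]
  }
}

which would return something like:

{
  "id": "asdf-1234-qwerty-5678",
  "url": "https://semantic-layer.com/project/0/queries/asdf-1234-qwerty-5678"
}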
So whenever the dashboard gets shown, the BI tool fetches the latest version of the SQL query from the semantic layer and executes it. No need to keep old modelling decisions around just to avoid breaking dashboards. Perhaps the response could even contain a “change_log” note to be displayed to the widget owner, in case the change is more semantic and needs manual re-evaluation of the widget’s content. Or even a “review_required” flag to make sure the query is not run without explicit manual approval.
Sure, tools in this case would still need to explicitly support this kind of API, but it would allow for better lineage by design, and it would keep queries managed and evolvable without requiring explicit new manual “models”. For tools that don’t have a great interface for generating the SQL query, queries could be generated in other tools, registered in the semantic layer and shared across them. The metadata in the response would also provide context about what the query is returning. This could open doors for real interoperability across tools: a BI tool generates query A and lets the user open it in a separate data notebook for further analysis, or send it to a data activation tool for a “reverse ETL” job.
But for this contextual information to be really possible, we need semantic types.
5. Provides a shared semantic type system
What does “semantic” mean? Assigning meaning to something involves naming it, describing it, explaining what the concept is and how it relates to other concepts. Part of this is accomplished through types. A rich, shared semantic type system is, in my opinion, very important if semantic layers are supposed to enable the creation of data apps built by external parties.
A “modern data” customer analytics software should be able to quickly find out customer information in the data. Unless you want to manually define this every time a new data app wants to access your data, it makes sense to operate on a shared semantic type system. The semantic layer would annotate your data with types like “Customer” or “Sales.Channel” and data apps would require these annotations to work. Of course, they could also help detect unannotated data for you.
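As a sketch, such annotations could live right next to the physical mapping (toy syntax; the semantic type names are illustrative, mirroring the metasql example above):

columns:
  - name: cust_id
    datatype: varchar
    semantic_type: Customer.ID     # from the shared type system
  - name: channel
    datatype: varchar
    semantic_type: Sales.Channel   # a data app can now find “channels” on its own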
The idea is to establish a shared language across data tools. This would not only help run new data apps and analyses created by people you’ve never met and who’ve never seen your schema before, but would also allow your existing tools to speak to one another, as hinted at in the previous section.
This shared semantic type system (or set of ontologies) could be lightweight and open source. I’m not suggesting here that we necessarily use them for fancy reasoning systems with rules written in OWL syntax and whatnot. Just a shared annotation language to help tools find out what is what. This could enable off-the-shelf analyses, metrics libraries, an interoperability protocol and a real data app ecosystem.
6. Allows for iterative modelling with short feedback cycles
Benn Stancil talked about how, for it to succeed, the semantic layer should be optional. Agreed, but it needs to go one step further. For semantic modelling to succeed, it cannot follow a purely “waterfall” approach with significant upfront investment. It should, as much as possible, offer a way to develop the semantic layer gradually, iterating over it with short feedback cycles.
Yes, it’s true: the current state of the modern data stack usually errs on the side of too little upfront modelling. But, on our path to fixing that, we should not throw the baby out with the bathwater.
The short feedback cycle is the tricky part. The modern data stack’s approach to data has unfortunately been one of increasing technical dependence, rather than of business empowerment, which is why people ask where the ‘modern data experience’ is. There is an often unspoken aversion to (or misconception about) the idea of self-service analytics, combined with the belief that “data analysts” should be the ones finding the business insights and more people should just learn SQL. So what you will hear next is probably not a very popular take.
Who owns the definitions in the semantic layer? Who owns the metrics? Is it really the analytics engineer? The data engineer? The data analyst? Or is it Sally from Sales? Mark from Marketing?
Ideally, the business should own the definitions. For some reason we don’t think this is likely: defining metrics is a very precise endeavour, after all. But the real issue is when we mix technical complexity with business complexity. Semantic layers should allow us to focus on business complexity only, and they should enable business people to define their own metrics, their own concepts. (Call me naive, go ahead.)
My fear is that, if that is not the goal, we will end up creating tools that don’t have the short feedback cycle needed for a semantic layer to evolve and capture the business domain faithfully (keeping it up-to-date). It will always require more humanware. And — here is where it gets even more unpopular — a semantic layer far away from the consumption layer suffers the risk of lengthening this feedback cycle.
What happens to the modern data stack?
Let’s say miraculously we get all of this done and adopted. How would this impact our current understanding of the (modern) data stack?
Assuming that we now have a place to define our concepts semantically, one which also serves as a querying interface used by other tools, it becomes questionable to define dimensions anywhere else. Isn’t the transformation of cancellation_date is not null into is_cancelled purely semantic? We consider it best practice to push definitions down to the data warehouse, because it is our current “semantic layer”: centralized and accessible to other tools and consumers. Yet, our metrics are not accessible there. Once the querying-layer part is solved, I find it unlikely that the best practice will still be to have one place to define metrics and another to define dimensions. In fact, lineage between concepts is part of “semantics”: it helps us understand how one concept relates to another. Which makes you really question whether ‘column-level lineage’ is the right level of abstraction to talk about: columns are too raw.
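That is_cancelled dimension, sketched in toy syntax, would then be a small semantic definition rather than a materialized column (the Order entity is hypothetical):

dimensions:
  - name: is_cancelled
    entity: Order                               # hypothetical entity
    expression: cancellation_date is not null   # the mapping to the raw data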
And if queries from consumers are registered in the semantic layer (with maybe query evolutions in place), we have hopefully now a better answer to the lineage apocalypse of turning the data warehouse into a platform — or better said, the semantic layer into a platform. How will data warehouses see this? Is it a threat to their long-term vision of building out a data apps ecosystem? If semantic layers end up pulling the center of gravity of the data stack towards them, are they likely to enter the space of data warehouses long term? Or are they more likely to evolve into BI tools instead after all?
If we are going to define dimensions and metrics in one place, we will not want to lose all the progress we’ve made on data reliability in the modern data stack. This means we will need, for example, tests on our semantic layer. Now, this doesn’t necessarily mean that all data transformation and data tests happen on the semantic layer. Certain lines will have to be drawn.
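What those tests would look like is itself an open question; a toy sketch might be:

tests:
  - on: metric.number_of_customers
    assert: value >= 0             # sanity-check the definition, not just the table
  - on: dimension.is_cancelled
    assert: not_null               # the expression should never yield NULL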
Another interesting part relates to data catalogs. Isn’t adding semantics/context to your data part of the value that data catalogs aim to provide? I can see how part of the work being put into catalogs on top of data will shift towards catalogs on top of semantic layers. There is a lot of context regarding ownership, incidents and conversations to be had around metrics and entities in the semantic layer, and it will need a place to call home¹. It also seems likely that we will still have a place to talk about data in its raw form, the source systems, the owners, etc., separate from the semantic layer. Will they be the same place? They could be, but catalogs will need to keep in mind that these are separate levels of abstraction, with personas with different needs and skills. Treating all of it as “data” may overwhelm business users.
Will all of our data be accessible over the semantic layer? Probably not. This poses a question about how universal this interface will be if it aims to provide abstractions for the business, while still being used by technical users. If all data should be accessible over the same interface irrespective of it being semantically modelled or not, will the interface gravitate towards the technical side?
Is all of this realistic?
'Oh, I've had such a curious dream!' said Alice.
So is all (or any) of this realistic? There are certainly many design decisions to be made and the further you go down the rabbit hole, the more it feels like a bottomless pit.
I still have many question marks in my head: is it going to involve a querying interface or just a definition store? If a query interface, will it be SQL or some other higher-level language? Or will BI tools perhaps prefer to use it as a definition store, while other tools and custom applications rely on the querying interface? Are existing semantic layers actually going to talk more about semantic types, or is there little incentive for them to work on that if they are separate from the consumption layer (which interprets these types)? Will a semantic layer really become successful as a standalone product, or is it too closely related to analytics to survive “headless”? Will we have universal querying interfaces that are indeed “universally” used and that integrate well with existing BI tools? Who will be the owner of the semantic layer: business teams or data teams? And, finally, will we have one semantic layer, or a semantic onion with different tools adding different layers to it, e.g. business descriptions, semantic types, natural language, reasoning rules, authorization rules, caching? Like I once overheard someone say at a contemporary art gallery: “it leaves you with more questions than answers”.
“I am under no obligation to make sense to you”, said the Mad Hatter.
¹ Transform’s Metrics Catalog shows a glimpse of what this could look like.