On Self-Service, Data Democratization and Language

It's all about questions and answers.

After my last (and first) post here1 was praised as "high-falutin", "speculative" and "as practical as blockchain" (I loved this last one; thanks, Benn), I decided to focus this post on a less controversial topic: self-service.

Is it a mirage? A feeling? A business state? Is it a dashboard in portrait mode? Some put it as: you know you don't have "self-service" when everyone in the data team is an English-to-SQL translator. Others as: it's that Nirvana state when our business users can create dashboards and reports by themselves. Or: when business users can select from a set of well-defined metrics and see them for a specific time interval or broken down by a dimension. SQL, dashboards, reports, metrics. For one reason or another, we try to avoid the somewhat obvious definition of self-service as "when people don't need other people to answer their own business-relevant questions". (Too general? Not practical enough? It should not come as a surprise by now.)

To be honest with you, I don't even like the term "self-service" to start with. It focuses too much on "doing things by yourself". Which things? To achieve what? "To do analytics", you may say. I’ve always felt that the term "analytics" carries an opinion on how data should be used: "to be analyzed". Often, operationally, people may want data for more ordinary things, like getting a list of users to reach out to or seeing the last transactions of a specific customer.

"Data Democratization" as a term, on the other hand, may not be perfect, but it's the best we got. It manages at least to put "Data" in the title (even though a bit raw for my taste) and "Democratization", which is about accessibility, about empowering all kinds of people, about culture, processes, about decisions, and also about a system that is never finished and needs to be taken proper care of, otherwise it can turn into a centralized authoritarian system or decentralized anarchy. That's why this substack is called 'Modern Data Democracy' instead of 'Self-Service Blog'.

Now we have some overly general definitions. What's next?

Let's talk about abstractions.

Data Democratization is also about the right abstractions

Civilization advances by extending the number of important operations which we can perform without thinking about them.

A. N. Whitehead

The progress of technology (especially in the Information Age) is based on our ability to introduce abstractions that allow people from different domains of expertise to build upon each other's work. We went from hardware and microchips to operating systems to software, like your browser, which somehow led to you reading this article… All without you needing to know how to solder a circuit.

And the path to data democratization is no different. It is the process of abstracting away "physical" data (in data warehouses, tables, columns, and SQL queries) and bringing things down to a conceptual level of what data represents semantically and how our questions map to the data. Whenever we fail to simplify this abstraction and make it more accessible to others in different domains, we fail to make our organization less dependent on technical know-how, where technical know-how is not needed.

Too abstract? Ok, let's talk about dbt. The T in ELT means Transformation. We all know that. But what does Transformation really mean here? We are not just transforming data as a hobby. dbt is fun, but not that fun. When we normalize data or define new dimensions with some complicated business logic, we do this to bring our data closer to its real business meaning. The success of dbt is in part explained by this need to improve abstractions: bringing the data closer to its real meaning in a business sense and, as a consequence, getting more consistent answers downstream.

When it comes to analytics on the other hand, I'm of the opinion that we are not going far enough with our abstractions. We focus too much on tables, dimensions, measures, cubes and... dashboards.

If all you have is a hammer, everything looks like a dashboard

(This is where I join the party late, kicking the dashboard while it's down)

Gartner's BI leaders sold us the vision that we were artists painting beautiful data portraits, writers telling exciting data stories. So here we are, working on dashboards. Dashboards that are created to answer a one-off question, made out of one-off raw SQL queries. Dashboards that rot once the assumptions used to create them silently become outdated. Dashboards that proliferate like rats2, since Mark from Marketing doesn't trust the old ones anymore or Sally from Sales can't find them anymore or "this time I want something slightly different..." Until at last, in a well-meaning attempt to build on top of existing dashboards, we pack them with filters upon filters. And we think "now we have it"... never realizing nobody is using them.

When did the data analyst become a glorified dashboard builder doing English-to-SQL translation on the side?

(Don't worry. This is not going to be one of those morbid articles about the "death of X", "Y is dead, long live Y". Dashboards are as dead as spreadsheets. Meaning: not dead.)

The Curiosity Tax

Absorbed in our routine, we may have forgotten this, but the job of data teams isn’t about creating the "perfect" dashboard or spoon-feeding data stories to the business. It's about empowering our colleagues to answer the questions they need to do their job better.

Unfortunately, this is easier said than done. We often inadvertently impose a Curiosity Tax. When someone has a question, they will need to either (A) find the right report that magically answers that exact question, (B) find their way through a next-gen pivot table (a.k.a. "self-service" analytics tool), (C) ask the data team over Slack, or worse, (D) fill out a long form justifying why they need this information in the first place.

This means that only questions that support a "one-way door" decision will be asked and many potentially impactful questions never get asked (or answered) at all. This is problematic because many business insights happen through spontaneous serendipity - one question leading to another and another, until something unexpected emerges. (Or maybe it doesn’t.)

Data Democratization is about creating an organization that doesn't tax curiosity, but encourages it. An organization that sees "asking questions" not as a burden, but as its modus operandi.

(Ok, go ahead. I know what you're thinking. "There goes JP again, talking about his ideal impractical world". Guilty as charged.)

Head of Questions & Answers

If you are a Head of BI & Data, in most cases you should really see yourself as a Head of Questions & Answers. You want to have a good grasp on which questions your organization is asking, you want to find ways for your organization to answer questions at scale, to enable people to interpret the answers and create a culture that encourages them to ask more and better questions.

You are not Head of Dashboarding & Reporting. Dashboards and reports are possible products, artifacts of your team's job, and data your raw material. But the essence of your job is answering questions.

Data Democratization is about bridging the gap between our language and the language of data

What I love about Natural Language Querying (NLQ)3, or rather Question Answering interfaces, as a general paradigm for BI (and why I decided to build one at Veezoo) is that, in contrast to dashboarding or next-gen pivot tables, it reminds us of this truth: BI is all about answering questions. When we seek information, we first formulate it as a question. Language is the medium of thought4.

In my opinion, "to build the right abstractions" for Data Democratization is ultimately about bridging the gap between our language (the language of business) and the language of data. That's why we love purple people so much (they are polyglots) or why we are excited about our new data catalog (we add words to data).

To build this abstraction means that when we say "customers", we mean those users with an active subscription. And when we say "active subscription", we mean one that was not cancelled, which in turn means that "cancelled_at" is NULL. But if we’re being honest, most people usually just say "subscription" instead of "active subscription", so we need to account for that as well. Notice here that you're probably already doing parts of this in dbt (parts of this in your BI tool, other parts probably nowhere), and, picking up on a question Tristan Handy raised, I believe this trend may indeed bring us closer to a future where our data products literally speak our language.
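To make the chain "customers" → "active subscription" → "cancelled_at is NULL" a bit more concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: the table names (users, subscriptions), the id and user_id columns, and the helper names are all hypothetical. Only "cancelled_at" comes from the example above.

```python
# Sketch only: every table and column name except cancelled_at is hypothetical.

# How people actually talk -> the canonical business term.
SYNONYMS = {
    "subscription": "active subscription",
    "subscriptions": "active subscription",
}


def active_subscription_sql() -> str:
    # "active subscription" = a subscription that was not cancelled.
    return "SELECT * FROM subscriptions WHERE cancelled_at IS NULL"


def customers_sql() -> str:
    # "customers" = users with an active subscription.
    # Reuses the definition above instead of repeating the NULL check.
    return (
        "SELECT * FROM users WHERE id IN "
        f"(SELECT user_id FROM ({active_subscription_sql()}) AS s)"
    )


# Canonical business term -> function that produces its meaning in the data.
DEFINITIONS = {
    "active subscription": active_subscription_sql,
    "customers": customers_sql,
}


def to_sql(term: str) -> str:
    """Translate a business term (or a common synonym) into SQL.

    Raises KeyError for terms nobody has defined yet.
    """
    key = term.strip().lower()
    canonical = SYNONYMS.get(key, key)
    return DEFINITIONS[canonical]()
```

With this in place, to_sql("subscription") and to_sql("active subscription") return the same query, and the definition of "customers" changes automatically if the definition of "active subscription" ever does. A real semantic layer does far more than this, but the principle is the same: the words live in one place, next to what they mean.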

The Question Answering BI paradigm has other interesting consequences.

The focus on questions and language as an interface leads to an organization that is more aware of its internal (often inconsistent) language. This organization is therefore better prepared to promote a shared language with consistent definitions and metrics, which is essential to a data-driven organization.

A Question Answering BI paradigm also provides us with complete transparency on what the business is looking for. Not clicks or clueless hovering, but actual questions. Data teams are able to investigate exactly how users interact with their data products - like which questions business users are asking, or which data attribute most people ask for that is not yet available.
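As a rough illustration of what that investigation could look like, here is a small Python sketch over a log of questions. The log format, its field names, and the sample questions are all made up for this example and are not taken from any particular tool.

```python
from collections import Counter

# Hypothetical question log: one entry per question a business user asked.
QUESTION_LOG = [
    {"question": "How many customers do we have?",
     "answered": True, "missing_attribute": None},
    {"question": "Revenue by sales region last quarter",
     "answered": False, "missing_attribute": "sales region"},
    {"question": "Churn by sales region",
     "answered": False, "missing_attribute": "sales region"},
    {"question": "Customers by referral source",
     "answered": False, "missing_attribute": "referral source"},
]


def answer_rate(log):
    """Share of questions that could be answered with the data we have."""
    if not log:
        return 0.0
    return sum(1 for entry in log if entry["answered"]) / len(log)


def most_requested_missing_attributes(log):
    """Attributes people asked for that are not available, most requested first."""
    counts = Counter(
        entry["missing_attribute"]
        for entry in log
        if not entry["answered"] and entry["missing_attribute"]
    )
    return counts.most_common()
```

On this toy log, answer_rate(QUESTION_LOG) is 0.25 and the most requested missing attribute is "sales region", asked for twice. That is already a small, data-backed argument for what to model next.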

I have read countless great articles about how data teams try to work around their blindness to how their data products are (not) being used. Interviews, workshops, video tutorials, trying to analyze the metadata from Metabase etc. I believe this transparency needs to be a first-class citizen in a BI solution, so that you as a Head of Questions & Answers can really have a good understanding of which questions are being asked, which are being answered and which we need more data for. A solution that focuses on the interface between business users, data teams and data itself. Of course, it won't substitute for a good interview with business users. But isn't it ironic that as data teams, gatekeepers of all the data on how the products we sell are being used, we are usually blind to how our own data products are being used?

This blindness leads us to overbuild reports with data we don't need, to answer questions we don’t have and to prioritize requests based on rank. Transparency allows us to start small and build based on real needs: the unbiased questions from our organization. It may assist us in the transition from a JIRA ticket-based, reactive, service team to a more proactive, missionary, data product team.

And if analytics is as much about asking questions as it is about answering them, then how can we democratize the "question asking" skill within our organization? Maybe a paradigm focused on questions could help us distribute this knowledge better. Maybe it could assist us in asking questions based on what was already asked on the same dataset or inside the organization, taking business users by the hand instead of just enabling them to "do things by themselves". Maybe it could encode best practices on what and how to ask. How would this impact our organization's "question asking" culture? Would it free our curiosity? Change our processes?

We may be at the start of the journey of Question Answering interfaces as a BI paradigm and there are still many challenges to solve, but when I look past them I see a paradigm that is not about "doing things by yourself" or Self-Service. It's about Data Democratization.

1

The semantic difference between "last and first post" vs "first and last post" is one of those textbook cases that bag-of-words NLP models would just not get.

2

Not very fair to rats, I actually like them. But “multiply like bunnies” just sounds too positive and… fluffy?

3

Btw, NLQ is IMO a horrible, technical, academic term. I prefer calling them just "Question Answering interfaces".

4

That’s why we are so impressed with AI models like GPT-3 - we struggle to distinguish NLP from general AI. “And is there a difference?”, Turing would ask.