Codd's Dream
A tribute to the father of relational databases and his quest to bring access to information to the casual user
I started watching the Netflix series “The Billion Dollar Code” on the Google Earth vs. Terravision case and it got me thinking about what a Netflix series on the “data revolution” would look like.
If I had to pick a starting date for the data revolution, I would probably go with the 1970s. Picking our hero would probably be easy too: Edgar F. Codd.
A Royal Air Force (RAF) volunteer in World War II, the man behind "A Relational Model of Data for Large Shared Data Banks" introduced the world to a theoretical model on how to think about data and queries. His belief as expressed in his seminal paper was that applications and database users should not need to know how the data was actually represented internally. If we wanted to fulfill the promise of databases, we needed better abstractions.
And so, using the concept of relations in its mathematical sense, the relational model was born, bringing with it a fresh new scientific rigor to the field. The introduction of this abstraction led us to SEQUEL from Ray Boyce and Donald Chamberlin, which stood for "Structured English Query Language" and was also meant to be the “sequel” to their previous data language SQUARE. The "English" was dropped soon enough and we got our SQL. And the rest is history.
Yet, below the surface of all these theoretical and technological breakthroughs, when you actually open and read those 1970s papers, like “SEQUEL” (1974) and Codd's "Seven Steps to Rendezvous with the Casual User" (1974), you notice an inspiring sense of purpose [1]: the dream of bringing access to information to the non-"trained computer specialist", the "casual user".
Boyce and Chamberlin wrote in "SEQUEL" (1974):
There is an increasing need to bring the non-professional user into effective communication with a formatted data base. Much of the success of the computer industry depends on developing a class of users other than trained computer specialists. [...] There are some users whose interaction with a computer is so infrequent or unstructured that the user is unwilling to learn a query language. For these users, natural language or menu selection seem to be the most viable alternatives.
And Edgar F. Codd wrote in "Seven Steps to Rendezvous with the Casual User" (1974):
If we are to satisfy the needs of casual users of data bases, we must break through the barriers that presently prevent these users from freely employing their native languages (e.g., English) to specify what they want.
Our database pioneers were imagining a world where access to information would be as easy as just asking for it. One where we would master complexity through the right abstractions and allow casual, untrained users to benefit equally from these technologies. Codd goes so far as to say that we should not "assume that [the user] remembers anything he may have learned from his previous interactions with the system". Sounds pretty accurate to me...
Even when Codd introduced the idea of normalizing the data and his famous 3NF, you could see his motivation again in making things easier to understand for the "casual user":
In this paper, second and third normal forms are defined with the objective of making the collection of relations easier to understand and [...] more informative to the casual user.
Fifty years later, did we fulfill Codd's dreams? Did we free the "casual user" from the shackles brought by lack of technical skills or did we get sidetracked?
The dream of the "casual data analyst"
The data community loves to discuss the merits of SQL and much has been written about it from either side of the ring. On one side, we have the technologists who question its lack of expressiveness and its idiosyncrasies, among other shortcomings. On the other side, we have the pragmatists (?) who praise SQL for how intuitive it is to get started with, its "lingua franca" appeal and its battle-tested history.
Chamberlin and Boyce didn't see SQL as a tool for the casual user, but rather for people like "accountants, engineers, architects and urban planners" as well as the "professional programmer". It is questionable whether we ever got architects and urban planners to learn SQL and even more questionable whether they really should. On the other hand, we created this catch-all position called "the analyst". Our data analyst today is far from being a "casual user". On the contrary, if there is one person we expect to be trained and to know how to deal with SQL and databases, that's the data analyst. The fact that SQL is easier to pick up than, say, Python makes the pool of potential future data analysts bigger, which is very important.
But what about those who neither want to be nor should be data analysts? The people in the sales and marketing trenches, those supporting customers, the mid- and top-level managers, etc. A lot of their needs have been fulfilled with function-specific software like CRMs. And for their "analytics" needs, we created the whole area of Business Intelligence with dashboards, standard reports and self-serve functionalities. Yet 50 years after Codd's, Boyce's and Chamberlin's papers, we still get requests from our "casual users" asking us, the data team, to write SQL queries to answer their questions. Or to explain to them, once again, how to get that information from the report we built a month ago.
The Doc, the Slide, the Spreadsheet and... where does SQL fit in?
The other day I came across an interesting thread from one of the founders of Hyperquery. There, Joseph Moon drew a parallel between data notebooks, dashboards and data apps and docs, slides and spreadsheets, respectively. And just like that, we got ourselves a way to justify the co-existence of many different interfaces to data, in a "distributed BI" scenario if you like. But that's not what caught my attention.
What got me wondering was that these were all mostly about presenting information. I say "mostly" because the spreadsheet part was framed more as a pre-made, spreadsheet-like report that you can interact with, which is why it's the data app counterpart. I didn't get the impression that it was meant to be prepared by the same user who consumes it. But that's not the most important point here. I felt there was something missing. How do you actually find the information in the first place? If we are talking about interfaces here, shouldn’t we also talk about SQL or a search analogy?
Maybe the author of the thread saw this as more of an orthogonal discussion. Still, I believe it is very central. If the data team is always the one preparing the end result for the user, then it doesn't really matter if it is in a doc, a slide or a spreadsheet. The "casual user" is still in shackles. We are still the ones writing the SQL queries for them, even if they are just slightly templated ones.
The beauty of SQL is the freedom we have to find the answers we need, combining different data together. Codd's dream is about giving that kind of freedom to the "casual user" through the one skill that brings us all together: our natural language. And that's my dream too.
Sure. Maybe it wouldn’t be such a great Netflix series after all.
[1] Not to be confused with a "sense of terrible purpose".