Imperative to relational
Databases occupy a weird position in the growth of a software engineer.
The young engineer spends a lot of effort getting good at expressing their intent in terms of loops and conditionals and organizing that into functions. They finally develop some fluency, and then they’re faced with a database. Perhaps for a while they dabble around the edges, with just enough knowledge to do a simple query and insert single rows. Then at some point they have to take the dive.
They learn SQL, or whatever the query language is for the database in question. Querying something like MongoDB provides an illusion of familiarity—it’s just JSON!—but it really is an illusion. At some point they try to do something more complicated and the painstakingly acquired skills of structured programming fail them.
There’s a deep reason for this. Loops and conditionals are a mathematically complete set of building blocks for programs, but they are not the only possible set of building blocks. There are three such sets:
- The familiar conditionals and loops of structured programming.
- Parameterizing expressions so they can be treated as functions and providing ways to compose those functions (mathematically, the λ-calculus). This is the basis of languages like Haskell and OCaml.
- Expressing facts as logical propositions and using automated inference to combine them. This is the basis of languages like Prolog…and modern databases.
The building blocks of each of these are simple. They can be learned in half an hour. But no one learns to effectively use loops and conditionals in half an hour, nor does knowing how to express your intent in one of them immediately translate to the others. Many beginning programmers are told, “It doesn’t matter what language you learn first, just learn one. The skills transfer.” That’s a lie. It’s a useful, important lie, because the most important thing a beginning programmer can do is pick a language and learn how to use loops and conditionals, but it’s a lie. And in the careers of most programmers, the database is the first place where that shows up.
And that means that many young engineers approach the database expecting it to be a few more bits to add to their toolbox, and instead they’re asked to rewire how they think. They weren’t expecting that, so it feels like a nightmare. Many people flee at this point, enough that being really fluent with databases is an incredibly valuable skill, not just for what it enables you to do, but because you will probably always have work.
It also changes how you design software and what kinds of software are in your reach.
Assignment: Reflect on learning to program. List out the basic building blocks you learned, but then think about the process of learning to use them. Use that to set your expectations around what it means to get really good with a database.
When all you have is a hammer…
My father is, among other things, a master carpenter. Probably the single largest category of mass and volume when my parents move house is his woodshop. He taught me a lot, and I have a decent home woodshop in my garage, but there’s a lot of stuff that I just don’t have the tools to do efficiently. It would take me hours to hand plane an 18” wide board to have a nice, flat surface. It would take less time to strap the board to the roof racks of my car, drive it to my father’s house, and spend less than a minute running it through his planer/jointer.
Tools change what is possible, both what you could do and what you could do in the time you have.
For software engineers, databases are one of the biggest power tools available to us, up there with compilers and full-featured operating systems. Once you get comfortable with them, it’s less work to write software for almost anything involving storing and querying data. There’s a reason every data scientist on the planet knows SQL.
It’s often more efficient, too. You could hand tune your own code to load exactly the right subset of data from disk or precompute summaries to reduce the amount of data sent over the network, but it’s going to take a lot of time. Just as decent compilers meant that no one hand tuned anything but the most crucial bits of assembly code, databases mean that no one hand tunes anything but the most demanding data loading and processing code. And even when they do, they typically end up reimplementing a relational database for their special case, such as the movement towards data-oriented design in computer games. (The book that link goes to is, despite never mentioning databases, one of the best books on working in a relational system I’ve ever seen.)
Once you’ve sunk the time into the mental shift of thinking in the way a database demands, it becomes a tool you reach for almost continuously when you’re designing software. In your head you confidently slot a database into a particular place in your design and shape things around it…even if you don’t actually use a database like MySQL or Cassandra in that slot. Carpenters still have lots of hand tools around alongside their big power tools. There have been plenty of cases where I ended up saying, “Nope, I just need files on a filesystem and a policy on how they’re named and accessed.” But that choice was about the implementation of a mental slot labelled “database.”
Assignment: Reflect on what other power tools you use in your programming without even thinking about them. File systems. Network stacks. HTTP servers. Isolated processes and virtual memory. Compilers. What was it like learning that they existed and adapting to use them?
They who can keep the database alive will never starve
This is a joking statement that I tell young engineers. There’s a related, more cynical corollary: “They who can keep the database alive will never escape it.” The underlying idea is true though:
Building basic database fluency is a disproportionately valuable way to direct your attention.
When I was working at Facebook, a friend of mine messaged me, saying, “This team I’m working with is having a terrible time with their queries. They’ve launched the product, but the performance has gotten unacceptably slow.” So I spent a couple hours in a conference room with a couple members of the team.
This is not one of my war stories that begins, “So this team was using their database as a work queue…” I have lots of those, because getting the locking right in that case is not obvious.
After about ten minutes of making sure I understood what was going on, it turned out that what they needed was an index on one column. Mind you, these were skilled programmers. They had finessed their way around much thornier performance problems in imperative settings. Their mapping from loops and conditionals to likely hunks of work done by a computer was solid, but they were missing the mapping from the building blocks of their database to work done by the computer.
So we spent forty minutes talking about that mapping and working through a few examples. Around that time one of them exclaimed, “Oh, so we just need to index this column and it will be O(1)!” Another muttered, “Oh, of course.” At that point what they wanted most was to get out of that conference room and go add that index.
A couple weeks later, after no further contact, I messaged one of them and said, “Hey, how did that work out?” The response I got was basically thanks, life is awesome, we resolved another slow query last week. I never heard from them again.
Tomorrow we’ll start diving into the biggest reshapings of thought that using the building blocks of a database effectively requires.
Assignment: What was the last piece of infrastructure or technology where you went from having to ask for help regularly to making decisions confidently? What knowledge and knowhow constituted that change?
Let go of your data
Today’s annoying Zen koan is:
Let go of your data
And now for a less obscure explanation.
The first big shift from imperative to relational programming is where you keep your data. When you write a function in Python, C, Java, etc., the piece of data you’re working on is right there. You have given it a name and you can work incantations on that name. It has a disturbing resemblance to demon summoning, if you think about it (or write light-hearted fantasy about it).
def nth_word(s, n):
    words = s.split(' ')
    return words[n]
Or, if we have a collection of things, we work with it by assigning a name to each element of the collection, one at a time, and dealing with it.
for x in xs:
    ...
Either way, we deal with data by having a handle attached to it.
There is another assumption here: the data our variable is attached to has its own identity. If I have the number 5 pointed to by two different variables, they are two different instances of the number 5. This is why Java programmers have to distinguish == and .equals() for strings. The first checks whether two variables point to the same identity, the second whether they point to identical sequences of characters. Scheme gets even further into the weeds with eq?, eqv?, equal?, and =.
To summarize, we program imperatively with three key assumptions:
- We work with one piece of data at a time. Even if it is a collection, we treat the collection as one value and use loops to get single values out of it.
- Data has both its value and an independent identity apart from its value.
- We access data by having variables that give us handles to it.
When we come to the database, we invert all these assumptions.
- We work only with collections of data defined by shared properties. A table’s rows share the properties of having the same schema. Filtering the table is a collection of data with the additional property of satisfying the filter.
- Data is identified only by its value.
- Data is accessible only by its value.
If we want to reach a single value, we need to give it some aspect we can filter on to make it the only element of a collection. To check if a given piece of data has a certain property, we filter for the identifying aspect of the data plus the property it should satisfy and get zero or one rows.
Consider how you let go of data in the following tasks:
Verify a username and password
Imperative: We are passed variables with the username, the password, and a map of usernames to password hashes. We select the password hash corresponding to this username into a new variable, hash the provided password, and compare them.
Relational: We filter a table of usernames and password hashes for rows that match the provided username and where the password hash matches the hashed value of the provided password. We either get a matching row (the login is valid) or not (the login is invalid).
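Here’s a minimal SQL sketch of that filter. The password_auth table and its username and password_hash columns are hypothetical names for this example, the :username and :hashed_password placeholders stand for values the application supplies, and the hashing is assumed to happen application-side before the query runs.
-- One row back means the login is valid; zero rows means it is not.
SELECT 1
FROM password_auth
WHERE username = :username              -- the identifying aspect
  AND password_hash = :hashed_password; -- the property it should satisfy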
Find the name of a user that has logged in
Imperative: We are passed a variable referring to a structure describing the user. We select the field in that structure corresponding to the user’s name.
Relational: We are provided with an identifying aspect of the user (usually we assign a meaningless integer to each user or other entity for this exact purpose), and we filter the users table for the one row matching that identifier and the column representing the name.
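As a sketch, assuming a hypothetical users table with id and name columns and a :user_id placeholder supplied by the application:
-- Filter to the one row with this identifier and keep only the name column.
SELECT name
FROM users
WHERE id = :user_id;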
Find the largest two numbers in a list
Imperative: Create two variables that refer to numbers, then for each number in the list temporarily give it a name. Compare it with the numbers in our two variables. If it is larger than either of them, discard the smaller of the two and replace it with this one.
Relational: Sort the table by size, filter it to have only the first two rows, and trust the database to do this in a sensible, performant way.
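A sketch of that, reusing the xs table of numbers with a column n that appears again later in this piece; note that the LIMIT syntax varies slightly between databases.
SELECT n
FROM xs
ORDER BY n DESC  -- sort by size, largest first
LIMIT 2;         -- keep only the first two rows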
It can be hard to let go of having a reference to a particular piece of data, but it’s liberating once you get it. Any piece of data you care about is just floating there, accessible via whatever aspects of it you know.
Assignment: Take a few programming tasks you’ve done recently. Identify where you’re depending on single, named identities. Try to rethink them as filtering a table on aspects of the data.
You probably don’t want a JSON column
One of the biggest signs that you’re fighting to impose imperative techniques on the database is defining a column that contains an array or JSON or some other composite format that in imperative programming you would select values out of. This isn’t a foolproof sign, and we’ll discuss some caveats later, but let’s get the primary issue out of the way:
Putting a structured format like JSON or an array in a column usually means that you’re trying to provide a handle that you can manipulate as if it were a piece of an imperative program.
As an example, consider users with privileges assigned to them. For people new to relational programming, it might be tempting to have a table of users with columns for id, username, password, and an array of granted privileges. This hits all three of the assumptions we identified about data access in imperative programming:
- We work with one piece of data at a time.
- Data has both its value and an independent identity apart from its value.
- We access data by having variables that give us handles to it.
How would you use this configuration? You would query for the row to get a variable pointing to a piece of data, select the privileges out of that data, then iterate over the privileges looking for the desired one.
How would you do this relationally? Let’s recall the replacement assumptions:
- We work only with collections of data defined by shared properties.
- Data is identified only by its value.
- Data is accessible only by its value.
The collection we want is a table with two columns, user id and privilege. The presence of a row in that table indicates that user has that privilege. You filter the table for a particular user id and privilege and either have a row or not.
Some databases like PostgreSQL provide convenience functions to check for elements in an array or JSON object. This is tempting if you haven’t gotten completely comfortable with databases yet, but it is probably not what you want. It works, but it limits how efficient the database can make the check. With the array column, you always have to select a row and iterate over the array. With the separate table, the two columns form a compound primary key, and you check for presence or absence of a row by a direct index lookup, which is one of the most optimized operations in most databases.
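Here’s a sketch of that separate table and the presence check, using hypothetical names; the compound primary key is what makes the check a direct index lookup.
CREATE TABLE user_privileges (
    user_id   INTEGER NOT NULL,
    privilege TEXT    NOT NULL,
    PRIMARY KEY (user_id, privilege)  -- one row per grant, found directly via the index
);

-- Does this user have this privilege? One row means yes, zero rows means no.
SELECT 1
FROM user_privileges
WHERE user_id = :user_id
  AND privilege = :privilege;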
I said there were caveats and times when you do want to shove a blob of JSON into your database:
“I’m running MongoDB. It’s all JSON.”
This isn’t actually one of the caveats. You interact with MongoDB using a JSON-like format, but that doesn’t mean you can’t treat the fields in a JSON-like thing as columns. You can still think of it as filtered collections.
“My database doesn’t have joins and I need to represent multiple values on this entity.”
In NoSQL databases like Cassandra that don’t support joins, you often have no alternative but to stick an array or other structured data in a row. In these cases, design as if you were in a relational setting with joins, then add structured fields as necessary to replace joins with structured column queries.
“I have JSON documents but I just need to store them as blobs, not select from them.”
Sometimes shoving opaque data into a database is a convenient thing to do, such as if you are storing a blob that has semantics in some other system but which you have no understanding of. In this case you don’t expect to be doing any selection of pieces of the structure, since you don’t know what the structure is anyway.
Assignment: Look at the data structures of a piece of code you worked with recently. How would you recast those data structures into relations so that the collections you would select from or iterate over become tables you filter?
Don’t bucket your data
We already talked about not carrying data organized for selecting single values over into the database. There’s a more extreme case you sometimes see, where someone creates entirely separate tables for the same kind of data.
For example, you have users with different collections of books. The sane way to do this is to have a single table for users and a single table for books. If you were thinking in terms of handles to particular pieces of data, though, you might want a handle to a particular user’s books, and so create a different table of books for each user.
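The sane version looks something like this sketch, with hypothetical table and column names; each user’s books are just a filtered collection, not their own table.
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE books (
    id       INTEGER PRIMARY KEY,
    owner_id INTEGER NOT NULL,  -- which user this book belongs to
    title    TEXT    NOT NULL
);

-- "This user's books" is a filter, not a separate table.
SELECT title FROM books WHERE owner_id = :user_id;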
Usually someone stops this before it happens, but I dealt with a small company once that captured data from collections of sensors and provided searching of the historical data. They created a separate table for each sensor. Whenever a sensor was moved or changed it was given a new id, the old table would be deleted, and a new one created. I pointed out that this meant that whenever a customer reconfigured a sensor, they deleted all historical data from that sensor. The response I got: “Yeah, it really reduces disk usage.”
I really didn’t know how to respond to that.
The same company also created a separate database for each customer. This isn’t always a bad idea. If you run a completely parallel, independent copy of your software, database and all, for each customer, it makes isolating customers from each other fairly easy. In this case they had only split the databases.
Even splitting only the databases is not always a bad idea. Splitting data across different machines is a typical way to scale a database past what fits on one machine. This is called “sharding.” But rarely is “one customer per shard” the right answer. At this company, about half of the AWS bill was idle compute and empty disk for these databases that each served a single customer.
This was all very distressing at the time, but it illustrates the point very nicely: don’t try to force a handle onto a piece of data in your database. Filter your way to the answer you want.
Assignment: Think about how you might shard data from all these sensors in an efficient way. What key might you use to decide where data goes? Can you still reason in terms of filtering collections when you shard data?
Foreign keys aren’t pointers
Let’s talk about another way that carrying over imperative ideas of holding named handles to specific pieces of data leads people astray when they try to approach databases.
Many newcomers to databases see foreign keys, notated as
this_column REFERENCES other_table(other_column)
and think, “Ah! This is like having a reference or pointer in an imperative language.” If your whole experience of programming is imperative, this is a sensible guess.
But foreign keys have nothing to do with querying. They are a means of constraining the changes we can make. In the declaration above, the only values that can be used in this_column are ones that appear in other_column in other_table. We might have a table of privileges a user can have, and a table of assignments of privileges to users that has a foreign key referencing the privileges table. That prevents us from assigning a nonexistent privilege to a user.
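As a sketch with hypothetical names (the same user_privileges shape from earlier, now with the privileges table it references), the foreign key constrains what can be inserted but changes nothing about how you query.
CREATE TABLE privileges (
    name TEXT PRIMARY KEY
);

CREATE TABLE user_privileges (
    user_id   INTEGER NOT NULL,
    -- Only values present in privileges.name may be used here.
    privilege TEXT    NOT NULL REFERENCES privileges(name),
    PRIMARY KEY (user_id, privilege)
);

-- Rejected unless 'teleport' exists as a row in privileges.
INSERT INTO user_privileges (user_id, privilege) VALUES (42, 'teleport');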
This is a little thing that trips people up, but there’s a general heuristic that can prevent it and many similar confusions. When we write imperative code, we have three categories of “stuff” around values:
- The values themselves, in memory in some format.
- Types, which are statements of the form “x is an integer” or “x is an instance of Frobnicator” which tell how to interpret the memory.
- Code that operates on values, both to check constraints on them and to manipulate them.
The categories in databases are quite different. Start with the fact that everything you work with has the type ‘table.’ When you filter rows, calculate derived columns, or join tables, the result of those operations is another table. Given some table, you can (in theory) trace it back through the operations that produced it. For example, the result of the query
SELECT id, name, organization
FROM password_auth
LEFT JOIN users ON users.id = password_auth.id
WHERE password_auth.hash = '5gS32jaksk'
AND users.email = 'boris@madbaboon.com'
is a table, but we can trace it back to the password_auth and users tables. If one of those was a view, we could further trace it back to the tables the view reads from. At some point as we descend through tables, we hit the ground and reach tables that have their data inserted by users and aren’t defined in terms of other tables.
The two categories in databases, corresponding to the three categories above for imperative languages, are ground tables and non-ground tables. When thinking about non-ground tables, the “stuff” is all about manipulating tables to produce other tables. When thinking about ground tables, the “stuff” is about constraining and checking the correctness and integrity of the data. SQL commands generally fit into one of these two categories. So when we see foreign keys defined in a CREATE TABLE command, think:
- FOREIGN KEY appears in a CREATE TABLE command.
- CREATE TABLE applies to ground tables.
- Commands generally apply to ground or non-ground tables.
- FOREIGN KEY must be about constraints on the behavior and values in ground tables, not about the structure of any particular value.
At which point you can correct your imperative intuition and know that you need to go look up what foreign keys do.
As an aside, you won’t find the term ‘ground table’ in the database literature. It’s stolen from Prolog, which calls its pieces terms instead of tables and Prolog programmers speak of ground terms and non-ground terms. The terms you usually see in the database world are “data definition” for ground and “data manipulation” for non-ground.
Why would I borrow terms from another language instead of using the common terms? Because it makes it easier to learn. In particular, is an index part of data definition or data manipulation? Indexes don’t change the schema or allowed values in our tables, so they don’t feel like data definition. But they are not part of manipulation either, since they’re a thing our queries use. In contrast, indexes are clearly part of the ground category, since they’re defined entirely on and in terms of ground tables. In my head, I translate “data definition” to “ground” and “data manipulation” to “non-ground” to avoid confusing myself.
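For example, an index is declared entirely in terms of a ground table and its columns, which is what puts it in the ground category (again with hypothetical names):
-- Defined on a ground table; queries merely benefit from it.
CREATE INDEX user_privileges_by_privilege ON user_privileges (privilege);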
Assignment: Consider transactions and views. Are they ground or non-ground? For views, it may help to consider that they’re a global version of common table expressions.
Work backwards from your ideal table
When we write an imperative program, we do so by layering and sequencing mental templates. For example, take the task of summing all non-negative numbers in a list.
We’re doing an aggregation of some kind over a list. I know I’ll need a loop, so I scaffold that in
for x in xs:
    ...
and since I’m aggregating, I’ll need some variable to hold the aggregation across iterations, so add that.
agg = ...
for x in xs:
    ...
The aggregate is a sum, so let’s rename it to that.
sum = ...
for x in xs:
    ...
If the list we pass in is empty, the sum should be zero, so that sets the initial value.
sum = 0
for x in xs:
    ...
We want only the non-negative numbers, so we’re going to skip the negative elements. We scaffold that in as if ...: continue, and the condition is that x be less than zero.
sum = 0
for x in xs:
    if x < 0:
        continue
    ...
And there’s the step where we update the aggregate, which in this case is pretty simple.
sum = 0
for x in xs:
    if x < 0:
        continue
    sum += x
This is pretty typical of how we program imperatively. We have a starting point, a desired endpoint, and various possible moves that we have accumulated in learning programming. Like a chess player we look ahead and choose moves accordingly. Unlike a chess player, we get to go back and try again.
When writing queries on a database, we go the other direction. We start with the final table of values we want and fill in gaps backwards until we reach ground. Let’s consider the same task.
Our final table that we want is a single row with one column containing the sum of all non-negative numbers in a column of a table. We get that by selecting the sum of a table.
SELECT sum(n) FROM ...
Where do the rows we want come from? A table containing only the non-negative values.
SELECT sum(n) FROM (
SELECT n FROM ...
WHERE n >= 0
)
And there is probably some table xs that we are drawing our numbers from.
SELECT sum(n) FROM (
SELECT n FROM xs
WHERE n >= 0
)
In practice we would collapse this as we went, ending up with
SELECT sum(n)
FROM xs
WHERE n >= 0
But the underlying mental process is one of building up our directed acyclic graph, working downwards towards ground as we look for sources to fill in our desired table’s structure.
Assignment: Take some programming tasks you have done recently and reflect on how you did them. What is your list of moves or templates in imperative programming? What is each template’s goal in improving your position? How many of those goals make sense in a relational setting, and for the ones that do, how would you express it?
Summarizing it all
Let’s recap all the stuff we’ve gone over.
First: relational programming is a big switch from imperative programming. You spent years learning the mental tools for imperative programming. Some of them transfer, so learning relational programming won’t take as long. Many of them don’t transfer, so it’s still going to be a learning curve.
As we make the switch, we have to replace three key assumptions about how we reach for data:
- Instead of working with one piece of data at a time with its own identity, we work with collections defined by the shared properties of their contents.
- Instead of data having an identity that we can refer to independent of its value, there is only the value.
- Instead of having variables that act as handles to get particular data, we access data only by asking for a particular value.
The biggest troubles that people learning to work with databases face come from holding on to imperative assumptions. These tend to show up as trying to hold onto individual pieces of data, either by trying to shove arrays or JSON into a column, or by splitting tables up in order to create a sense of having a handle onto particular pieces of data.
The categories of “stuff” we work with change as well. Instead of manipulating data by having types, values that fit into typed slots, and code that checks and manipulates values, we have ground tables and non-ground tables. In ground tables we care about data integrity and correctness. We construct non-ground tables from other tables to get the results we actually care about.
Those different categories show up in how we approach programming, too. Instead of making moves like a chess player, working from our initial state toward our desired one, we start with our desired table and work backwards, filling in sources until we reach ground.
What’s next?
There are three big pieces to learning databases. The order that you usually learn them is:
- Getting your head around relational thinking.
- Learning how to reason about performance.
- Learning to do data modeling and schema design.
Most people who bounce off databases do so at the first one. At this point you should be in a position where you’ll need some practice, but you know what rewiring of your brain you’re going to have to do.
Congratulations. The biggest hurdle is down.
(Also, now that it’s down, the transition to other programming language families is going to be less painful. You’ve done one. You know what it’s supposed to feel like. Consider learning something really out there like Haskell or Forth on your next holiday.)
The second part, performance, is usually made unnecessarily complicated. 80% to 90% of it is common to all databases and straightforward when you get the right mental models in place. If you can filter and sum a list and take and release a lock in an imperative language, then you can learn how to reason about database performance.
The third part, data modeling, gets much more philosophical. There isn’t a single symbolic model for any part of our world, or even a more natural or more correct model. If you reflect you’ll find that you already know this. You have habitual models of your world, but sometimes they break down and you find yourself at a loss. In those situations you look for what is happening, cues of what you should be paying attention to, and your mind whispers discomfort where the model you were planning to use is not matching reality. Learning data modeling is taking this natural reaction and using tools to direct it towards a domain that isn’t right in front of you.