Designing Data-Intensive Applications by Martin Kleppmann, a researcher at the University of Cambridge, is a great book for those who want to understand how different databases work and how to choose between them. But there is more to it than technical details explained in a systematic way. The book also paints a high-level picture of the current state of tools and techniques for managing data and of the emerging trends shaping the future.
In this article we will analyze some of the interesting thoughts presented there:
- how the distinction between non-relational and relational databases (and even between messaging systems and databases) is becoming less sharp
- how modern databases approach transactions and consistency in general
- which ways of handling Big Data are no longer recommended and what to do instead
- how to use polyglot persistence (different databases for different purposes) without running into problems
- how to organize client-server communication so that client applications are responsive and don’t present outdated information to the user
Converging data models
Originally the term “NoSQL” was understood simply as “no SQL” or “non-relational”, and you could think there was hardly any room for compromise. Nowadays it is rather advertised as “not only SQL”, and for good reason from the marketing point of view: there is growing interest in databases that support more than one way of accessing data.
For example, RethinkDB, a document database, introduced a feature from the world of relational databases: table joins. The good old SQL standard, in turn, gained JSON support, which you can use both in popular commercial engines and in open-source ones such as MySQL and PostgreSQL.
As a result, the distinction between relational and non-relational databases is no longer as sharp as before, which gives you more flexibility in modelling your data, provided you have learned how those new capabilities can be applied. For a nice overview of practical case studies of storing JSON in PostgreSQL, head over to a comprehensive
article by Leigh Halliday.
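To make this more concrete, here is a minimal sketch (in TypeScript, using the node-postgres client) of what mixing relational and document-style data in PostgreSQL can look like; the table and column names are made up for illustration.

```typescript
// A minimal sketch of PostgreSQL's JSON support using the node-postgres ("pg") client.
// The table and column names (events, payload) are made up for illustration.
import { Client } from "pg";

async function main() {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();

  // A relational table with a JSONB column holding document-like data.
  await client.query(`
    CREATE TABLE IF NOT EXISTS events (
      id      serial PRIMARY KEY,
      payload jsonb NOT NULL
    )
  `);

  await client.query(
    "INSERT INTO events (payload) VALUES ($1)",
    [{ type: "order_created", customer: { name: "Alice" }, total: 99.5 }]
  );

  // The ->> operator extracts a JSON field as text, so we can filter on it
  // just like on an ordinary column.
  const { rows } = await client.query(
    "SELECT id, payload FROM events WHERE payload ->> 'type' = $1",
    ["order_created"]
  );
  console.log(rows);

  await client.end();
}

main().catch(console.error);
```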
Convergence can also be seen when we look more broadly at data-processing systems, not just databases. There are messaging systems that offer durability guarantees just as databases do (Kafka) and databases that can be treated as message queues (Redis).
In the world of messaging systems, Apache Pulsar (not mentioned in the book) claims to provide both high-performance event streaming (in the Kafka style) and traditional queueing.
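As a small illustration of the database-as-queue idea, here is a sketch of treating Redis as a simple work queue using the ioredis client, with a Redis list as the queue. The queue name and message shape are made up, and a production setup would also need acknowledgements and error handling.

```typescript
// A minimal sketch of using Redis as a simple message queue with the ioredis client:
// producers LPUSH jobs onto a list, a worker BRPOPs them off. The queue name
// ("jobs") and message shape are made up for illustration.
import Redis from "ioredis";

const producer = new Redis();
const worker = new Redis(); // blocking reads should use their own connection

async function produce() {
  await producer.lpush("jobs", JSON.stringify({ id: 1, task: "resize-image" }));
}

async function consume() {
  // BRPOP blocks until an element is available; 0 means "wait forever".
  // It resolves to [listName, element].
  const result = await worker.brpop("jobs", 0);
  if (result) {
    const [, element] = result;
    console.log("processing", JSON.parse(element));
  }
}

produce().then(consume).catch(console.error);
```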
Consistency is taken more seriously
Not long ago, when the hype around the term “NoSQL” was growing, you could easily find people blogging or tweeting about how “obsolete” relational databases were. If you had asked where to find transactions, you might have heard that you didn’t understand “modern” databases. And if you had had doubts about whether your data would stay correct under high traffic, the popular answer would have been that you needed to learn to live with “eventual consistency”.
As Martin points out, times have changed. It turned out that with a database offering few consistency guarantees you end up solving difficult distributed programming problems on your own – and it’s too easy to make mistakes. The expectation now is that the database will take some of those problems off your hands so that you can focus on your business model and leave distributed programming to the people who specialize in it (i.e., database creators).
As the “eventual consistency” slogan ceased to be an acceptable excuse, database creators finally started to be clearer about their approach to consistency. Some, like Datomic and FaunaDB, highlight “transactions” and “consistency” when advertising their products. The first thing you could see in the announcement of
MongoDB version 4 was “Multi-Document ACID Transactions”.
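For a feel of what this looks like in practice, here is a hedged sketch of a multi-document transaction with the official MongoDB Node.js driver; it assumes MongoDB 4.0+ running as a replica set, and the database, collection and account ids are invented for the example.

```typescript
// A minimal sketch of a multi-document ACID transaction with the official MongoDB
// Node.js driver. Requires MongoDB 4.0+ running as a replica set; the database,
// collection, and account ids are made up for illustration.
import { MongoClient } from "mongodb";

async function transferFunds(client: MongoClient, from: string, to: string, amount: number) {
  const accounts = client.db("bank").collection<{ _id: string; balance: number }>("accounts");
  const session = client.startSession();
  try {
    // Either both updates are applied, or neither is.
    await session.withTransaction(async () => {
      await accounts.updateOne({ _id: from }, { $inc: { balance: -amount } }, { session });
      await accounts.updateOne({ _id: to }, { $inc: { balance: amount } }, { session });
    });
  } finally {
    await session.endSession();
  }
}

async function main() {
  const client = new MongoClient("mongodb://localhost:27017");
  await client.connect();
  await transferFunds(client, "alice", "bob", 25);
  await client.close();
}

main().catch(console.error);
```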
New ways to think about consistency
New databases are emerging that attempt to provide ACID guarantees without the heavy coordination that slows down distributed systems. They take different approaches than traditional database engines; VoltDB, for example, keeps transactions very short and deterministic and executes them in a single thread.
There are even changes in the way we discuss consistency. Researchers have deemed the transaction isolation levels defined by the SQL standard flawed. The once-popular “CAP theorem” is also heavily criticized by Martin as confusing (there is a post on his blog
Please stop calling databases CP or AP). Things are beginning to improve and perhaps soon database creators will adopt more precise vocabulary when describing their products, and we will be able to talk about “reading your own writes” consistency, causal consistency etc.
Trust but verify
Until recently the problem was not only that database producers weren’t clearly stating what consistency guarantees they provided; their claims also weren’t being verified by anyone. It may seem shocking that this situation changed only recently, given that we are talking about systems that can cost lots of money.
The open-source tool Jepsen is used by creators of various distributed systems like Cassandra and Elasticsearch to hunt bugs in their implementations.
The reports from these analyses are publicly available.
As mentioned before, databases are meant to take some of the coordination work off our hands and be a proven solution to the problems of sharing data. The reality is a bit horrifying: you don’t need to scatter your database across a network to run into bugs – some systems behave incorrectly even when running on a single machine.
Martin Kleppmann tested popular relational databases for transaction isolation (see results in
github.com/ept/hermitage) and found multiple issues: for example, the “repeatable read” isolation level means different things in different databases, and some anomalies appear even though by definition they shouldn’t.
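The following sketch, written in the spirit of those tests (not taken from the Hermitage repository), shows how such a check can be expressed with two concurrent node-postgres connections: both transactions read the same counter under “repeatable read” and try to write back an incremented value. In PostgreSQL the second writer is aborted, which prevents a lost update; other engines may behave differently. The table name and values are made up.

```typescript
// A Hermitage-style lost-update check: two concurrent transactions at REPEATABLE READ
// both read a counter and write back an incremented value.
import { Client } from "pg";

async function run() {
  const setup = new Client();
  await setup.connect();
  await setup.query("DROP TABLE IF EXISTS counters");
  await setup.query("CREATE TABLE counters (id int PRIMARY KEY, value int)");
  await setup.query("INSERT INTO counters VALUES (1, 0)");
  await setup.end();

  const t1 = new Client();
  const t2 = new Client();
  await t1.connect();
  await t2.connect();

  await t1.query("BEGIN ISOLATION LEVEL REPEATABLE READ");
  await t2.query("BEGIN ISOLATION LEVEL REPEATABLE READ");

  // Both transactions read the same snapshot of the counter...
  const v1 = (await t1.query("SELECT value FROM counters WHERE id = 1")).rows[0].value;
  const v2 = (await t2.query("SELECT value FROM counters WHERE id = 1")).rows[0].value;

  // ...and both try to write an incremented value back.
  await t1.query("UPDATE counters SET value = $1 WHERE id = 1", [v1 + 1]);
  await t1.query("COMMIT");

  try {
    // In PostgreSQL this update fails with a serialization error,
    // preventing the lost update; other engines behave differently.
    await t2.query("UPDATE counters SET value = $1 WHERE id = 1", [v2 + 1]);
    await t2.query("COMMIT");
  } catch (err) {
    await t2.query("ROLLBACK");
    console.log("second transaction aborted:", (err as Error).message);
  }

  await t1.end();
  await t2.end();
}

run().catch(console.error);
```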
You might think that the relational model is so mature that everything has already been figured out, and that these days long-established SQL databases are just polishing details and adding extra features like JSON support. The book shows that research on the topic is still active and has practical consequences. Not only can we learn about mistakes in the transaction implementations of some products, but sometimes new ways of solving old problems are found.
In 2008 a publication introduced an improvement to the well-known snapshot isolation that fixed some anomalies without the performance penalty of the locks used in typical serializable isolation. This “serializable snapshot isolation” was implemented in PostgreSQL three years later (and there are no other popular implementations as of now). The database has long advertised itself as “the World’s Most Advanced Open Source Relational Database” – but perhaps the claim would still be true even if you removed “open source” from it.
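Since serialization failures are a normal outcome under this isolation level, application code has to be prepared to retry. A minimal sketch of such a retry loop with node-postgres might look like the following; the work() callback and the retry count are illustrative assumptions.

```typescript
// A sketch of retrying a PostgreSQL SERIALIZABLE transaction (backed by SSI since 9.1).
// Serialization failures (SQLSTATE 40001) are expected and simply trigger a retry.
import { Client } from "pg";

async function withSerializableRetry(client: Client, work: () => Promise<void>, retries = 3) {
  for (let attempt = 0; attempt < retries; attempt++) {
    await client.query("BEGIN ISOLATION LEVEL SERIALIZABLE");
    try {
      await work();
      await client.query("COMMIT");
      return;
    } catch (err: any) {
      await client.query("ROLLBACK");
      // 40001 = serialization_failure: the database detected a potential anomaly.
      if (err.code !== "40001") throw err;
    }
  }
  throw new Error("transaction kept failing after retries");
}
```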
Modern alternatives to MapReduce
Let’s leave “typical” databases for a moment to focus on processing big amounts of data. The book highlights the limitations of the MapReduce paradigm represented by tools like Hadoop. The whole idea originated from Google, which
admitted in 2014 that it no longer uses MapReduce. This way of processing forces each intermediate result to be stored on disk, which adds a lot of overhead. The odd thing is that MapReduce, still an immensely popular solution, has such a suboptimal design because it was created as an open-source implementation of an approach described in a Google paper, without correcting for Google’s very specific environment (where MapReduce-like jobs were frequently killed to give way to more important tasks, so dumping data to disk made a lot of sense).
Modern Big Data processing should therefore choose more high-level and better-optimized solutions like
Spark and
Flink.
Handling derived data
When building a big system with various ways of accessing data, it’s hard to achieve high performance with just one kind of database. It’s better to use several data systems, each optimized for a particular access pattern, for example a “normal” database to store all the information and Elasticsearch for efficient full-text search over parts of it.
The book strongly discourages using application code to update the various data stores. The risk is too high that, with changes happening fast, one database will apply them differently than another, creating two views of the data that contradict each other.
The suggested solution is to use Kafka and Change Data Capture to distribute updates. Kafka’s durability and ordering guarantees ensure that the derived data is kept consistent. And since there are connectors for many popular databases, streaming changes is possible without writing any custom code.
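A sketch of the consuming side of such a setup is shown below: a small service reads change events from a Kafka topic with kafkajs and applies them, in log order, to a derived full-text index in Elasticsearch. The topic name, event shape and index name are assumptions made for the example; in a real system the events would typically be produced by a CDC connector such as Debezium.

```typescript
// A minimal sketch of keeping a derived store in sync from a change-data-capture topic.
// The topic name, event shape ({ op, id, doc }) and index name are made up for illustration.
import { Kafka } from "kafkajs";
import { Client as Elasticsearch } from "@elastic/elasticsearch";

const kafka = new Kafka({ clientId: "search-indexer", brokers: ["localhost:9092"] });
const es = new Elasticsearch({ node: "http://localhost:9200" });

async function run() {
  const consumer = kafka.consumer({ groupId: "search-indexer" });
  await consumer.connect();
  await consumer.subscribe({ topic: "products.changes", fromBeginning: true });

  await consumer.run({
    eachMessage: async ({ message }) => {
      if (!message.value) return;
      const change = JSON.parse(message.value.toString());
      // Apply each change to the derived full-text index, in log order.
      if (change.op === "delete") {
        await es.delete({ index: "products", id: change.id });
      } else {
        await es.index({ index: "products", id: change.id, document: change.doc });
      }
    },
  });
}

run().catch(console.error);
```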
Constructing derived data this way lets us build less coupled and more performant systems in a microservices architecture by making the communication purely event-based and asynchronous. A service can then own a local database kept up to date by reading updates from Kafka. As a result, the service can quickly query this local database and get data in the format ideally suited to it, instead of sending synchronous requests to other services. A nice example of this approach can be found in the presentation
The Database Unbundled: Commit Logs in an Age of Microservices, which features a “refactoring” of a synchronous microservices architecture.
Another benefit Martin mentions is easier schema evolution. Traditionally, if a data format no longer fit its purpose, migrating to a new one was painful: client code had to be changed, and running the migration job sometimes required an outage. To avoid this, we can create a new derived view by reprocessing the Kafka log and experiment with it to see if it works better than the old one. Then we can gradually migrate clients, finally removing the old view once it is no longer used.
Pushing the state to the client
When your app gets data from a server, displays it and lets the user modify it, it acts similarly to a database replica – and network connectivity issues create problems similar to those known from the database world (welcome, conflict resolution). Another trend Martin mentions is applying the previous idea not just to “derived” databases, but also to client applications.
Instead of asking the server to do the heavy lifting of preparing a big chunk of data computed for the client’s needs, an alternative is to keep an open communication channel and send fine-grained data changes to the client. Just as in Change Data Capture, but with something like
WebSocket or Server-Sent Events instead of Kafka, and a Redux-like engine (see
our post on ngrx) instead of a database with derived data. The advantage: the client does not work on an outdated view of the system, but is updated continuously as the state changes on the server. At Nexocode we use the Change Streams feature of
MongoDB to send changes (added, modified or removed documents) to our Angular-based web application, and it gives a great user experience.
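As a simplified sketch of the idea (not our production code), the server side can watch a collection with Change Streams and broadcast every change over WebSockets, leaving it to a Redux/ngrx-style reducer in the browser to fold the change into local state. The collection name, port and message shape are made up, and change streams require MongoDB to run as a replica set.

```typescript
// Watch a MongoDB collection with Change Streams and push every change to
// connected browsers over WebSockets.
import { MongoClient } from "mongodb";
import { WebSocketServer, WebSocket } from "ws";

async function run() {
  const mongo = new MongoClient("mongodb://localhost:27017");
  await mongo.connect();
  const orders = mongo.db("shop").collection("orders");

  const wss = new WebSocketServer({ port: 8080 });

  // Every insert/update/delete on the collection arrives here as a change event.
  const changeStream = orders.watch();
  changeStream.on("change", (change) => {
    const message = JSON.stringify({ type: "ORDER_CHANGED", change });
    for (const client of wss.clients) {
      if (client.readyState === WebSocket.OPEN) client.send(message);
    }
  });
}

run().catch(console.error);
```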
Conclusion
We might assume that because databases have been developed for more than 50 years, we know perfectly well how to use them. But we are still learning, and we still make mistakes. Fortunately, academic research is helping to improve existing products, we are better informed about what different solutions are capable of, and old assumptions are being revised so that systems work better in a distributed environment.
For those storing data in different kinds of databases, distributed log systems like
Apache Kafka emerge as a proper way of propagating changes. Similarly, Kafka-like platforms make it possible to cut synchronous calls between microservices and build more responsive systems. Relational databases continue to be an important building block. NoSQL databases may adopt the results of ongoing research on offering meaningful consistency guarantees without sacrificing performance – this way existing engines could find wider adoption, or completely new ones will gain popularity. If you want, you can check the comparison of Kafka with other big data architectures like Hadoop and Spark
HERE.
As always, there is no silver bullet. New tools and new features appear, but a solid understanding of data systems principles is needed in order to choose solutions appropriate for a concrete project.