Future According to Designing Data-Intensive Applications

Piotr Kubowicz - June 17, 2020

Designing Data-Intensive Applications by University of Cambridge researcher Martin Kleppmann is a great book for those who want to understand how different databases function and how to choose between them. But there is more to it than technical details explained in a systematic way. The book also paints a high-level overview of the current state of tools and techniques for managing data and the emerging trends shaping the future. In this article we will analyze some of the interesting thoughts presented there:

  • how the distinction between non-relational and relational databases (and even between messaging systems and databases) is becoming less sharp
  • how modern databases approach transactions and consistency in general
  • which ways of handling Big Data are no longer recommended and what to do instead
  • how to use polyglot persistence (different databases for different purposes) without running into problems
  • how to organize client-server communication so that client applications are responsive and don’t present outdated information to the user

Converging data models

Originally the term “NoSQL” was understood simply as “no SQL” or “non-relational”, so you could think there was hardly any room for compromise. Nowadays it is rather advertised as “not only SQL”, and for a good reason from the marketing point of view: there is a growing interest in databases that support more than one way of accessing the data.

For example, RethinkDB, a document database, introduced a feature from the world of relational databases – table joins. The good old SQL standard gained JSON support and you can use it in both popular commercial and open-source (MySQL, PostgreSQL) database engines.

As a result, the distinction between relational and non-relational databases is no longer as sharp as before, which gives you more flexibility in modelling your data – if you have learned how those new capabilities can be applied. For a nice overview of practical case studies of storing JSON in PostgreSQL, head over to a comprehensive article by Leigh Halliday.
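
To give a flavour of how this looks from application code, below is a minimal sketch of storing and querying JSON in PostgreSQL from Java over plain JDBC. The `events` table, its `payload` jsonb column and the connection details are made up for illustration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Minimal sketch: storing and querying JSON in PostgreSQL from Java.
// Assumes a hypothetical table: CREATE TABLE events (id serial PRIMARY KEY, payload jsonb);
public class JsonInPostgres {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "secret")) {

            // Insert a JSON document; the text parameter is cast to jsonb on the server side.
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO events (payload) VALUES (?::jsonb)")) {
                insert.setString(1, "{\"type\": \"order_created\", \"amount\": 42}");
                insert.executeUpdate();
            }

            // Filter and project on fields inside the document using the ->> operator,
            // much like querying a regular column.
            try (PreparedStatement query = conn.prepareStatement(
                    "SELECT payload ->> 'amount' AS amount FROM events WHERE payload ->> 'type' = ?")) {
                query.setString(1, "order_created");
                try (ResultSet rs = query.executeQuery()) {
                    while (rs.next()) {
                        System.out.println("amount = " + rs.getString("amount"));
                    }
                }
            }
        }
    }
}
```

JSON fields stored this way can also be indexed in PostgreSQL (for example with GIN indexes), so the extra flexibility does not have to come at the cost of query performance.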

Convergence can also be seen when we look more broadly at data-processing systems, not just at databases. There are messaging systems that offer durability guarantees just as databases do (Kafka), and databases that can be treated as message queues (Redis).

In the world of messaging systems, Apache Pulsar (not mentioned in the book) claims to provide both high-performance event streaming (in a Kafka way) and traditional queueing.

Consistency is taken more seriously

Not long ago, when the hype around the term “NoSQL” was growing, you could easily find people blogging or tweeting about how “obsolete” relational databases were. If you had asked where to find transactions, you might have heard that you didn’t understand “modern” databases. And if you had had doubts about whether the data would stay correct under high traffic, the popular answer would have been that you needed to learn to live with “eventual consistency”.

As Martin points out, times have changed. It turned out that with a database offering few consistency guarantees you end up solving difficult distributed programming problems on your own – and it’s too easy to make mistakes. The expectation now is that a database will take some of these problems away from you, so that you can focus on your business model and leave distributed programming to people who specialize in it (i.e., database creators).

As the “eventual consistency” slogan ceased to be an acceptable excuse, database creators finally started to be clearer about their approach to consistency. Some, like Datomic and FaunaDB, highlight “transactions” and “consistency” when advertising their products. The first thing you could see in the announcement of MongoDB version 4 was “Multi-Document ACID Transactions”.

New ways to think about consistency

There is a rise of new databases attempting to provide ACID guarantees without requiring too much of the coordination that slows down distributed systems. They try approaches different from those of traditional database engines – for example, VoltDB ensures that transactions are very short and deterministic, and executes them in a single thread.
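
The principle behind that design is simple enough to sketch in plain Java. The snippet below is not VoltDB’s actual API – it is only an illustration, with a made-up in-memory “database”, of how running short, deterministic transactions one after another on a single thread removes the need for locking:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Illustration only (not VoltDB's API): transactions run one at a time on a single thread,
// so they never interleave and no locks are required.
public class SerialTransactionEngine {
    private final Map<String, Long> accounts = new HashMap<>();
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    // Submit a short, deterministic transaction; it sees a consistent state
    // because nothing else touches the data concurrently.
    public <R> Future<R> submit(Function<Map<String, Long>, R> transaction) {
        return executor.submit(() -> transaction.apply(accounts));
    }

    public static void main(String[] args) throws Exception {
        SerialTransactionEngine engine = new SerialTransactionEngine();
        engine.submit(state -> state.put("alice", 100L));
        engine.submit(state -> state.put("bob", 0L));
        // Transfer 30 from alice to bob as one atomic step.
        engine.submit(state -> {
            state.merge("alice", -30L, Long::sum);
            state.merge("bob", 30L, Long::sum);
            return null;
        });
        System.out.println(engine.submit(Map::toString).get());
        engine.executor.shutdown();
    }
}
```

Because nothing ever executes concurrently, every transaction sees a consistent state; the trade-off is that each transaction must be short, otherwise it would stall all the others.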

There are even changes in the way we discuss consistency. Researchers have deemed the transaction isolation levels defined by the SQL standard flawed. The once-popular “CAP theorem” is also heavily criticized by Martin as confusing (there is a post on his blog, Please stop calling databases CP or AP). Things are beginning to improve, and perhaps soon database creators will adopt more precise vocabulary when describing their products, so we will be able to talk about “reading your own writes” consistency, causal consistency, and so on.

Trust but verify

Until recently the problem was not only that database producers weren’t clearly stating what consistency guarantees they provided, but also that their claims weren’t being verified. It may seem shocking that this situation changed only recently, given that we are talking about systems that can cost a lot of money.

The open-source tool Jepsen is used by the creators of various distributed systems, like Cassandra and Elasticsearch, to hunt for bugs in their implementations. Reports from these analyses are open to the public.

As mentioned before, databases are meant to take some coordination work off our hands and be a proven solution to the problems of sharing data. The reality is a bit horrifying: you don’t need to scatter your database over a network to run into bugs – some systems behave incorrectly even when running on a single machine.

Martin Kleppmann tested popular relational databases for transaction isolation (see the results in github.com/ept/hermitage) and found multiple issues – for example, that the “repeatable read” isolation level means different things in different databases, or that some anomalies appear although by definition they shouldn’t.
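
Hermitage drives each database engine with plain SQL scripts, but the same kind of experiment can be sketched in Java with two JDBC connections. The `accounts` table and the connection details below are made up; the probe checks whether a transaction running under “repeatable read” can observe a concurrent committed update:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch of an isolation-level experiment in the spirit of Hermitage
// (hypothetical 'accounts' table with columns id, balance).
public class RepeatableReadProbe {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:5432/demo";
        try (Connection t1 = DriverManager.getConnection(url, "demo", "secret");
             Connection t2 = DriverManager.getConnection(url, "demo", "secret")) {

            t1.setAutoCommit(false);
            t1.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);
            t2.setAutoCommit(false);

            long first = readBalance(t1, 1);     // T1 reads the balance...

            // ...meanwhile T2 modifies the same row and commits.
            try (Statement s = t2.createStatement()) {
                s.executeUpdate("UPDATE accounts SET balance = balance + 100 WHERE id = 1");
            }
            t2.commit();

            long second = readBalance(t1, 1);    // T1 reads again within the same transaction.
            t1.commit();

            // Under repeatable read the two values should be identical;
            // if they differ, the engine exhibits a non-repeatable read anomaly.
            System.out.println(first == second ? "no anomaly" : "non-repeatable read!");
        }
    }

    private static long readBalance(Connection conn, int id) throws Exception {
        try (Statement s = conn.createStatement();
             ResultSet rs = s.executeQuery("SELECT balance FROM accounts WHERE id = " + id)) {
            rs.next();
            return rs.getLong(1);
        }
    }
}
```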

You might think that the relational model is so mature that everything has already been figured out, and that these days long-established SQL databases are just polishing details and adding extra features like JSON integration. The book shows that research on the topic is still active and has practical consequences. Not only can we learn about mistakes in transaction implementations in some products, but sometimes new ways of solving old problems are found.

In 2008 a publication introduced an improvement to the well-known snapshot isolation that fixed some anomalies without the performance penalty of the locks used in typical serializable isolation. This “serializable snapshot isolation” was implemented in PostgreSQL three years later (and there are no other popular implementations as of now). The database has long advertised itself as “the World’s Most Advanced Open Source Relational Database” – but maybe the statement would now remain true even if you removed “open source” from it.
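
One practical consequence of serializable snapshot isolation is that instead of blocking, the database may abort a transaction with a serialization failure, so application code has to be prepared to retry. A minimal sketch in Java (the `accounts` table and connection details are again made up):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch: run a transaction under SERIALIZABLE isolation in PostgreSQL and
// retry when the engine aborts it with a serialization failure (SQLSTATE 40001).
public class SerializableRetry {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "secret")) {
            conn.setAutoCommit(false);
            conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);

            for (int attempt = 1; attempt <= 3; attempt++) {
                try (Statement s = conn.createStatement()) {
                    // A hypothetical business operation touching two rows.
                    s.executeUpdate("UPDATE accounts SET balance = balance - 10 WHERE id = 1");
                    s.executeUpdate("UPDATE accounts SET balance = balance + 10 WHERE id = 2");
                    conn.commit();
                    break;                         // success, stop retrying
                } catch (SQLException e) {
                    conn.rollback();
                    if (!"40001".equals(e.getSQLState()) || attempt == 3) {
                        throw e;                   // not a serialization conflict, or out of retries
                    }
                    // Serialization failure: the optimistic check detected a conflict; try again.
                }
            }
        }
    }
}
```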

Modern alternatives to MapReduce

Let’s leave “typical” databases for a moment to focus on processing big amounts of data. The book highlights the limitations of the MapReduce paradigm represented by tools like Hadoop. The whole idea originated at Google, which admitted in 2014 that it no longer uses MapReduce. This way of processing forces each intermediate result to be stored on disk, which adds lots of overhead. The crazy thing is that MapReduce, still an immensely popular solution, has such a suboptimal design because its open-source implementation faithfully followed the approach described in a Google paper without correcting for Google’s very specific environment (where MapReduce-like jobs were frequently killed to give way to more important tasks, so dumping data to disk made a lot of sense).

Modern Big Data processing should therefore use higher-level, better-optimized solutions such as Spark and Flink.
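
To give an impression of the difference, here is the classic word count written against Spark’s Java API: the chained transformations form a single in-memory pipeline, and only the final result is written out (the input and output paths are made up):

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

import java.util.Arrays;

// Minimal Spark word count: the chained transformations form one pipeline,
// with no intermediate results forced onto disk between steps.
public class SparkWordCount {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("word-count")
                .master("local[*]")           // run locally for the sake of the example
                .getOrCreate();

        JavaRDD<String> lines = spark.read()
                .textFile("data/input.txt")   // hypothetical input path
                .javaRDD();

        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.saveAsTextFile("data/output"); // only the final result is written out
        spark.stop();
    }
}
```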

Handling derived data

When building a big system with various ways of accessing data, it’s hard to achieve high performance using just one kind of database. It’s better to use several data systems, each optimized for a particular access pattern – for example, a “normal” database to store all information and Elasticsearch for efficient full-text search on some parts of it.

The book strongly discourages using application code to update multiple data stores. The risk is too high that, with changes happening quickly, one database will apply them differently than another, creating two views that contradict each other.

The suggested solution is to use Kafka and Change Data Capture to distribute updates. Durability and ordering guarantees from Kafka will ensure derived data will be kept consistent. As there are connectors to many popular databases, streaming changes is possible without writing any custom code.

Constructing derived data this way allows us to build less coupled and more performant systems in a microservices architecture by making the communication purely event-based and asynchronous. A service can then own a local database kept up to date by reading updates from Kafka. As a result, the service will be able to quickly query this local database and get data in the format ideally suited to it, instead of sending synchronous requests to other services. A nice example of this approach can be found in the presentation The Database Unbundled: Commit Logs in an Age of Microservices, which features a “refactoring” of a synchronous microservices architecture.
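
A minimal sketch of such a service in Java could look like the snippet below. The topic name, the event format and the in-memory “view” are made up for illustration; in a real system the events would typically be produced by a Change Data Capture connector and the view would live in a proper local database:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: a service maintains its own local, query-optimized view of customer data
// by consuming change events from a (hypothetical) "customers-changes" topic.
public class CustomerViewUpdater {
    private final Map<String, String> localView = new ConcurrentHashMap<>();

    public void run() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "customer-view");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("customers-changes"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    if (record.value() == null) {
                        localView.remove(record.key());              // tombstone: the row was deleted
                    } else {
                        localView.put(record.key(), record.value()); // insert or update
                    }
                }
                // Queries against localView are now fast and need no synchronous calls
                // to the service that owns the source-of-truth database.
            }
        }
    }

    public static void main(String[] args) {
        new CustomerViewUpdater().run();
    }
}
```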

Another benefit Martin mentions is easier schema evolution. Traditionally, if a data format no longer fitted its purpose, migrating to a new one was painful: client code had to be changed, and running the migration job sometimes required outages. To avoid this, we can create a new derived view by reprocessing the Kafka log and experiment with it to see if it works better than the old one. Then we can gradually migrate clients, finally removing the old view once it is no longer used.

Pushing the state to the client

When your app gets data from a server, displays it and allows the user to modify it, it acts similarly to a database replica – and network connectivity issues create problems similar to those known from the database world (welcome, conflict resolution). Another new trend Martin mentions is applying the previous idea not just to “derived” databases, but also to client applications.

Instead of asking the server to do the heavy lifting of preparing a big bunch of data computed for the needs of the client, an alternative is to keep an open communication channel and send very basic data changes to the client – just as in Change Data Capture, but with something like WebSocket or Server-Sent Events instead of Kafka, and a Redux-like engine (see our post on ngrx) instead of a database with derived data. The advantage: the client does not work on an outdated view of the system, but is updated continuously as the state changes on the server. At nexocode we use the Change Streams feature of MongoDB to send changes (added, modified or removed documents) to our Angular-based web application, and it gives a great user experience.
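
A minimal sketch of the server side of such a setup, using the change streams API of the MongoDB Java driver, could look like this (the database and collection names are made up, and the actual push to WebSocket or SSE clients is only indicated by a comment):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.changestream.ChangeStreamDocument;
import org.bson.Document;

// Sketch: listen to MongoDB Change Streams and forward each change to connected clients
// (e.g. over WebSocket or Server-Sent Events; the transport is left out here).
public class OrderChangeListener {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders = client
                    .getDatabase("shop")          // hypothetical database
                    .getCollection("orders");     // hypothetical collection

            // Blocks and emits one event per inserted, updated, replaced or deleted document.
            for (ChangeStreamDocument<Document> change : orders.watch()) {
                System.out.printf("%s: %s%n",
                        change.getOperationType(), change.getDocumentKey());
                // Here the change would be pushed to subscribed clients,
                // which apply it to their local Redux-like store.
            }
        }
    }
}
```

Note that MongoDB change streams are only available when the server runs as a replica set (or a sharded cluster).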

Conclusion

We could assume that because databases have been developed for more than 50 years, we know how to use them well. But we are still learning, and we make mistakes. Fortunately, it seems that academic research helps improve existing products, we are better informed about what different solutions are capable of, and old assumptions are being revised so that systems work better in a distributed environment.

For those storing data in different kinds of databases, distributed log systems like Apache Kafka emerge as a proper way of propagating changes. Similarly, Kafka-like platforms allow us to cut synchronous calls between microservices and build more responsive systems. Relational databases continue to be an important building block. NoSQL databases may adopt the results of ongoing research on offering meaningful consistency guarantees without sacrificing performance – this way existing engines could find wider adoption, or completely new ones will gain popularity. If you want, you can check the comparison of Kafka with other big data architectures like Hadoop and Spark HERE.

As always, there is no silver bullet. New tools and new features appear, but a solid understanding of data systems principles is needed in order to choose solutions appropriate for a concrete project.

About the author

Piotr Kubowicz

Software Engineer

Piotr is a polyglot developer who has been coding in Java for over ten years. He also tried many other languages, from C and Perl to Ruby.
During his time at nexocode, Piotr's primary focus has been on evolving team culture and ongoing projects by developing build automation and systems architecture to ensure delivery is smooth even as project codebases get bigger and more complex. An active member of the developer community, he can be seen speaking at various meetups and conferences.
