For several years, NoSQL has been hyped over Relational Databases as the storage solution better suited to current needs. A glance at the Popularity Chart shows an interesting thing. Oracle, MySQL, and Microsoft SQL Server stay at the same level, probably due to legacy projects and developers’ habits. The databases that gain the most in popularity are PostgreSQL (Relational) and MongoDB (NoSQL).
If NoSQL databases are, as claimed, designed for modern apps and growing amounts of data, let’s think about why developers still reach for both PostgreSQL and MongoDB. Maybe, in their view, the opinion that Relational DBS are obsolete is not a fair one?
In this article we will show when to use the non-relational MongoDB and when to use Relational DBS, and how to design a MongoDB schema that supports efficient queries. We will focus mostly on Distributed Database Systems, which are gaining popularity as the requirements for systems change.
A bit of history for a better understanding
Generally, databases are used to store data. Before the era of web applications, the most common use case was to store well-structured data for various institutions in Relational DBS.
That started to change once the World Wide Web was created. The WWW was invented in March 1989, and three periods in its evolution have been defined so far:
- Web 1.0 - a first version based on read-only web pages and hyperlinks between them
- Web 2.0 - current state, with services capable of read/write operations to enhance human interaction
- Web 3.0 - not yet here, it is expected to bring Semantic Web capabilities that process unstructured data to understand context and user intent. Web 3.0, also called the Web of Data, relies on inter-machine communication and algorithms to provide rich interaction via diverse human-computer interfaces
As we can see, the evolution of the Web changed how we use our data, how we process it and, most importantly, the amount and structure of that data. Structured Query Language existed even before the WWW. The first papers on the relational model of data were published in 1970, and SQL was initially developed at IBM in 1974.
Not only SQL (NoSQL), on the other hand, is much more modern and follows the evolution of the Web, rising at the same time as Web 2.0 technologies. NoSQL’s foundations grew out of the changing needs of applications, but to understand these needs better, let’s get to know a bit of theory.
CAP Theorem, ACID, BASE acronyms
Before we decide which database to use, let’s formulate some requirements for distributed systems based on the CAP theorem. We will define two operating modes for Distributed Database Systems (DDBS), and we will show how Relational DDBS and NoSQL databases fulfill those requirements.
CAP Theorem
The CAP theorem applies to DDBS and states that a distributed computer system cannot simultaneously provide consistency, availability, and partition tolerance guarantees. CAP stands for:
Consistency - ensures that data is the same across the cluster, so you can read from or write to any node and get the same data.
Availability - means that every request receives a response from the cluster even if a node in the cluster goes down, but without a guarantee that the response contains the most recent version of the information.
Partition tolerance - means that the cluster continues to function even if there is a “partition” (communication break) between any two nodes which are up, but can’t communicate.
For a DDBS where the CAP theorem applies, we can select two of the three properties above, which leads to the following combinations of guarantees that we can require from a DDBS:
AP: Highly available and partition tolerant, but not consistent. It means that nodes remain online even if they can’t communicate with each other and will resync data once the partition is resolved (communication is online again), but it is not guaranteed that all nodes will have the same data.
CP: Consistent and partition tolerant, but not highly available. It means that data is consistent between all nodes, and maintains partition tolerance, preventing data desync, by becoming unavailable when a node goes down.
CA: Highly available and consistent, but not partition tolerant. It means that data is consistent between all nodes - as long as all nodes are online - and we can read/write from any node and be sure that the data is the same, but if a partition between nodes happens, the data will be out of sync and won’t re-sync once the partition is resolved.
Since partition tolerance is effectively a mandatory requirement for a Distributed DBS, we need to prioritize between consistency and availability. This is known as the Availability/Consistency tradeoff.
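The tradeoff can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not how any real database is implemented; the Node class and the write function are hypothetical names introduced for illustration. It models two replicas and shows that, during a partition, a CP-style write is refused while an AP-style write is accepted and leaves the replicas out of sync.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy replica holding a key-value store (hypothetical, for illustration)."""
    name: str
    data: dict = field(default_factory=dict)


def write(nodes, key, value, mode, partitioned):
    """Try to write through the first node and return True if accepted.

    mode="CP": refuse the write when replicas cannot be reached.
    mode="AP": accept the write locally even when replicas are unreachable.
    """
    primary, *replicas = nodes
    if partitioned:
        if mode == "CP":
            return False  # unavailable, but no replica diverges
        primary.data[key] = value  # available, but the replicas are now stale
        return True
    for node in nodes:  # no partition: every node receives the write
        node.data[key] = value
    return True


cp_nodes = [Node("a"), Node("b")]
ap_nodes = [Node("a"), Node("b")]

cp_accepted = write(cp_nodes, "x", 1, mode="CP", partitioned=True)
ap_accepted = write(ap_nodes, "x", 1, mode="AP", partitioned=True)

# Each line prints: was the write accepted, and do the replicas still agree?
print(cp_accepted, cp_nodes[0].data == cp_nodes[1].data)
print(ap_accepted, ap_nodes[0].data == ap_nodes[1].data)
```

In the CP case the cluster stays consistent at the cost of rejecting the request; in the AP case the request succeeds but the two nodes disagree until they resync.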
ACID vs BASE
The Availability/Consistency tradeoff requires a choice between two options:
- Fulfill CP - we need a Distributed DBS that guarantees ACID
- Fulfill AP - we need a Distributed DBS that guarantees BASE
Relational Database Management Systems are mostly designed to provide ACID guarantees; NoSQL databases like MongoDB, on the other hand, are designed to provide BASE guarantees.
Having resolved the Availability/Consistency tradeoff, we should already know whether we need an ACID (RDBS) or a BASE (NoSQL) system, but before we make a final decision, let’s see what these guarantees give us and what they mean.
ACID
ACID is intended to guarantee validity even in the event of errors or power failures. ACID describes database transaction properties which are:
Atomicity - a transaction is executed as a whole, all or nothing. Any failure causes the entire transaction to fail, which leaves the database unchanged.
Consistency - ensures that all transactions result in a valid state of the database and that all validation rules and constraints are met.
Isolation - ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially.
Durability - guarantees that once a transaction has been committed, it will remain committed even in the case of a system failure.
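Atomicity and consistency can be observed with Python’s built-in sqlite3 module. The example below is a minimal sketch; the accounts table and the transfer function are hypothetical and exist only for illustration. A transfer that would make a balance negative violates a CHECK constraint, so the whole transaction is rolled back, including the credit that had already been applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between two accounts in a single transaction."""
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected the debit


ok = transfer(conn, "alice", "bob", 500)  # alice cannot afford this
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

The credit to bob is executed first on purpose: it succeeds on its own, yet it is undone when the debit fails, which is the all-or-nothing behavior described above.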
The ACID model for transactions strongly favors consistency over availability and is not without criticism. That led to an alternative model called BASE, a highly scalable transactional model focused on availability.
BASE
The BASE model loosens the requirements for immediate consistency, data freshness, and accuracy in order to gain benefits such as scalability and resilience. BASE describes database transaction properties which are:
Basically available - states that the system guarantees the availability of the data in the sense of the CAP theorem. There will be a response to any request, but that response could still be a ‘failure’ to obtain the requested data, or the data may be in an inconsistent or changing state.
Soft state - indicates that given eventual consistency, the system may be in a changing state until the consistency is reached.
Eventual consistency - means that the system will eventually become consistent once it stops receiving input. The data will propagate to everywhere it should sooner or later, but the system will continue to receive input and is not checking the consistency of every transaction before it moves onto the next one.
Eventual consistency is considered an optimistic replication model, as opposed to ACID, which is considered a pessimistic replication model.
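Soft state and eventual consistency can be sketched in a few lines. The example below is an illustration under stated assumptions, not MongoDB’s replication protocol: replicas are plain dictionaries, conflicts are resolved with a simple last-write-wins rule based on a version number, and the merge and sync helpers are hypothetical.

```python
def merge(a, b):
    """Last-write-wins merge: for each key keep the (version, value) pair
    with the higher version."""
    merged = dict(a)
    for key, (version, value) in b.items():
        if key not in merged or version > merged[key][0]:
            merged[key] = (version, value)
    return merged


def sync(replicas):
    """One synchronization round: every replica ends up with the merged state."""
    state = {}
    for replica in replicas:
        state = merge(state, replica)
    for replica in replicas:
        replica.clear()
        replica.update(state)


# Three replicas, each a dict of key -> (version, value).
replicas = [{}, {}, {}]

# Writes land on different replicas without coordination.
replicas[0]["cart"] = (1, ["book"])
replicas[2]["cart"] = (2, ["book", "pen"])

stale_read = replicas[1].get("cart")  # soft state: replica 1 has seen nothing yet
sync(replicas)                        # input stops and the replicas exchange state
converged = all(r == replicas[0] for r in replicas)
print(stale_read, converged, replicas[1]["cart"])
```

Before the synchronization round a read may return stale or missing data; afterwards every replica returns the newest version.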
These concepts and potential tradeoffs are important to consider when selecting database technologies, as each one addresses and prioritizes requirements in different ways.
When to use MongoDB, in short?
MongoDB was built for high availability from the ground up. Scaling and sharding are the most common patterns in MongoDB use cases. Relational DBS scale vertically by using more powerful servers, but horizontal scaling can be a challenge for them. MongoDB scales horizontally with built-in sharding and uses replica sets to replicate data and offload reads from primary servers, which helps developers store massive data sets more effectively.
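The general idea behind sharding can be shown with a small routing function. This is a deliberately simplified sketch: MongoDB’s own sharding is more involved than a hash modulo, and the names shard_for, insert_user, find_user and the user_id shard key are assumptions made for illustration. Each document is stored on exactly one shard, chosen from a stable hash of its shard key, so a lookup by that key only needs to touch one shard.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a shard key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


# Three in-memory "shards", each mapping user_id -> document.
shards = [{} for _ in range(3)]


def insert_user(doc):
    """Store the document on the shard selected by its user_id."""
    shards[shard_for(doc["user_id"], len(shards))][doc["user_id"]] = doc


def find_user(user_id):
    """Look up a document by visiting only the shard that can hold it."""
    return shards[shard_for(user_id, len(shards))].get(user_id)


for i in range(100):
    insert_user({"user_id": f"user-{i}", "visits": i})

print([len(shard) for shard in shards])  # how the 100 documents were spread
print(find_user("user-42"))
```

Adding capacity then means adding shards rather than buying a bigger server, although a real system also has to move existing data when the number of shards changes.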
MongoDB is a general-purpose database. Thanks to its document-oriented approach, in which attributes are not defined up front and can be modified on the fly, it has a flexible schema design, which is a crucial contrast between MongoDB and relational databases.
Being able to store documents with different properties inside one collection helps both during the development phase and when ingesting data from heterogeneous sources that may or may not share the same properties. The ability to deeply nest attributes in documents and to add arrays of values to attributes, while still being able to search and index these fields, helps application developers exploit the schema-less nature of MongoDB.
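To make the idea concrete without a running server, the sketch below mimics a collection with a plain Python list of dictionaries. The get_path and find helpers are hypothetical stand-ins, not part of any MongoDB driver; they only imitate looking up a nested field through a dotted path such as address.city and matching a value inside an array field.

```python
def get_path(doc, path):
    """Follow a dotted path such as "address.city" into nested dicts.
    Returns None if any step is missing."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(collection, path, value):
    """Return documents whose field at `path` equals `value`,
    or, for array fields, contains `value`."""
    results = []
    for doc in collection:
        found = get_path(doc, path)
        if found == value or (isinstance(found, list) and value in found):
            results.append(doc)
    return results


# Documents in the same collection do not have to share the same properties.
users = [
    {"name": "Ann", "address": {"city": "Oslo"}, "tags": ["admin", "dev"]},
    {"name": "Bob", "email": "bob@example.com"},
    {"name": "Eve", "address": {"city": "Oslo", "zip": "0150"}},
]

in_oslo = [doc["name"] for doc in find(users, "address.city", "Oslo")]
devs = [doc["name"] for doc in find(users, "tags", "dev")]
print(in_oslo, devs)
```

Bob has neither an address nor tags, yet he lives in the same collection and is simply skipped by both queries.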
You can read about MongoDB use cases here.
MongoDB criticism
MongoDB’s schema-less nature is a huge advantage, but it is also a big point of debate and argument. Being schema-less can be beneficial in many use cases, as it allows heterogeneous data to be dumped into the database without complex cleansing and without ending up with lots of empty columns or blocks of text stuffed into a single column. On the other hand, it is a double-edged sword: a developer may end up with many documents in a collection whose fields have loose semantics, and it becomes tough to recover these semantics at the code level. If the schema design is not optimal, what we have in the end is a plain datastore rather than a database.
Summary
I hope that we now have a better understanding of the pros and cons of RDBS and NoSQL databases. In light of the CAP theorem, both solutions have their strengths and weaknesses. The choice between RDBS and NoSQL depends on the system requirements and the structure of the available data.
We should select MongoDB when the data structure is flexible, or when availability and horizontal scaling are the priority.