-->

SQL or NoSQL? NewSQL !!

Fifteen years ago, one database was enough to hold all the application’s data. Over the years, applications started to collect more and more data, from people, devices, events, etc. The single instance could not hold this amount of data and the throughput of usage, it needed to be scaled.

There are two ways to scale a system:

Because the traditional databases (RDBMS) such as MySQL, Oracle or SQL Server could not scale-out, companies needed to scale them up, by using monster computers that cost a lot of money. This was the main reason for using NoSQL databases such as MongoDB, Cassandra and others. They offered a scale-out solution on commodity hardware which lowered the price. But those databases needed to make compromises in order to support scale-out. To understand those compromises I will describe two concepts, the CAP theorem and ACID.

CAP theorem

According to the “CAP theorem”, any database can only support two of the three:

For example, MongoDB is “CP” (Consistency + Partition tolerance), Cassndra is “AP”, and RDBMS are “CA”.

ACID

ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties of database transactions:

The ACID properties allow the developer to be sure of the correctness of the data, in any transaction or calculation, at any given time. There are no surprises, missing data, or problems of data integrity. All RDBMS support ACID and support the “C” in the CAP theorem. That is why they are so reliable. In the other hand, NoSQL databases do not support ACID. Some NoSQL such as MongoDB support different model of consistency, usually eventually consistency. It means that consistency is guaranteed but not in any given time. It is fine for many usages but not for other (think about a finance transaction or even the order of comments in a blog post).

To overcome it companies had the following options:

Both of the solutions may cause problems. Synchronize databases is not an easy job and also to maintain them. Develop a transactional layer on for a software developer that is not skilled in this can cause bugs in a very sensitive area of the application, and also hurt the performance.

This is the background for the invention of NewSQL databases.

NewSQL databases

The NewSQL databases come to solve it. The term was coined by the 451 Group, in their now famous report, “NoSQL, NewSQL and Beyond”. The NewSQL databases support ACID, and they are distributed, can be scale-out, be able to handle large volume of data with great performances.

NewSQL types

There are 3 types of NewSQL:

I will describe two different approaches, VoltDB that is build with new architecture and scaling-out Postgres.

VoltDB

VoltDB is an in-memory database, it is an ACID-compliant RDBMS which uses a shared nothing architecture. One of VoltDB designers is Michael Stonebraker, and many concepts of his academic researches such as shared nothing architecture were implemented in this database. In his article, “OLTP Through the Looking Glass, and What We Found There”, Stonebraker and his colleges explain why concepts about database architecture that were right at the 70’s, no longer valid now, especially because of the in-memory capability. He claims that new databases should use new modern architecture that fits today’s challenges. VoltDB is built with these concepts.

In-memory database

VoltDB is an in-memory database what makes it work faster than on-disk databases. How can it support ACID, or be be more specific, how can it support durabilty (After the transaction is committed, it is ensured that the data will not be lost).

There are few ways that in-memory database can support durability:

VoltDB uses snapshot and transaction logging to ensure durability.

Shared nothing architecture (SN)

VoltDB uses a shared-nothing architecture. shared-nothing architecture (SN) is a distributed computing architecture in which each node is independent and self-sufficient, Instead of central entity that in control. That way VoltDB achieve database parallelism. There is no single point of failure and there is no bottle necks. The data is handled independently by each CPU core. It allows VoltDB scale by increasing cores.

Partitioning and scale out

VoltDB manages the partitions itself. VoltDB partitioning both the data and the transactions that access that data. So multiple transactions can be processed in parallel. Throughput is easily scaled by increasing either the number of partitions per node or the number of nodes in the cluster. If a node needs a data from other partition it can do it but it will hurt the performance.

Single threaded

Because VoltDB doesn’t use the disk in its transactions, it doesn’t need to run multi-threaded (one of the main reasons that RDBMS databases run multi-threading is because it writes the data to the disk, and multi-threading is efficient way to handle the disk wait. Every core runs single thread which eliminating the need for locking and latching.

Stored procedure interface for transactions

VoltDB uses stored procedures for transactions, stored procedures are atomic: they either succeed or fail and rollback as a whole. It boosts the performance because in other RDBMS a lot of time is wasted on network delays because the goes back and forth from the database to the application server several times. Another advantage of stored procedures is that they are pre-compiled. VoltDB supports writing stored procedures in SQL and also in Java for more complex procedures.

Scale-out Postgres

A different approach from building new database with new architecture, is using an existing database with a maturity and proven stability, and adding to it the scale-out feature. This is done in many RDBMS, I will focus on Postgres.

There are many approaches available to scale PostgreSQL. There is no built in solution yet embedded in Postgres yet, but achieved by using postgres features by third parties tools. Greenplum, Postgres-XL and more are some of the tools, I will describe how Citusdata does it.

Citus database

Citus is a distributed database that combines the low-latency requests that power real-time applications with high throughput, interactive analytics on billions of events. Citus does this by extending PostgreSQL to distribute your workload across multiple hosts, leveraging the memory, storage, and processing power of multiple machines. Citus distributes the data across a cluster of commodity servers. Incoming SQL queries are then parallel processed across these servers.

Citus architecture

One of the PostgreSQL instances in the cluster is the Citus master. The master stores metadata about the shards, and the worker PostgreSQL instances (Citus workers). All interaction with the database is done with the master through standard PostgreSQL API. All the data is distributed across the workers. Each table is logical “sharded”, it means that it is spread between shards according to the selected column. It enables the scale-out feature. Each shard is replicated on at least two of the workers, as a result, the loss of a single machine does not impact data availability.

The master receives the query, distribute it to the relevant workers according to the metadata, and then merges their results, and returns to the user. If a worker fails mid-query, Citus completes the query by re-routing the failed portions of the query to other workers which have a copy of the shard.

Conclusion

NewSQL databases are new type of databases that allow both data integrity and scalability in a low cost. They can either be build with new architecture that stored the data on memory, or expand the capabilities on existing databases. What lies ahead? I believe that the traditional RDBMS will adopt the concepts of NewSQL, by implementing it them self or by acquiring technologies and integrate them into their products. In the future all databases will support scaling-out and distribution. This will reduce the market share of the popular NoSQL databases, and the dominance of the major players will remain as it is today. Although the decreasing price of memory and CPU’s, can cause single machine to be strong enough to handle big data, the train has already left the station and the competition between the companies will drive them to support distribution and scale-out capabilities.

References

Originally posted on Spectory's blog