Retention Blog: August 2014

In a previous post I walked through some reasons why NoSQL and SQL each fit different situations better. Here I want to give a more basic introduction to what NoSQL is all about.

Different ways to store data in NoSQL:

key:value (example: Redis)
document, JSON (example: MongoDB)
Difference? These are actually pretty similar. In a key/value store, you can attach metadata to a value so that particular details can be indexed, while in a document store you typically have an "id" field as part of your JSON. In reality, the differences are blurry.
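
To make the comparison concrete, here is a minimal sketch in plain Python that uses dictionaries and lists as stand-ins for both kinds of store. The names kv_store, documents, and find_by are hypothetical and are not the Redis or MongoDB APIs.

```python
import json

# Key/value: the store only understands the key; the value is an opaque blob.
kv_store = {}
kv_store["user:42"] = json.dumps({"name": "Ada", "city": "London"})

# Document: the store understands the structure, and the id lives inside the document.
documents = [
    {"id": 42, "name": "Ada", "city": "London"},
    {"id": 43, "name": "Linus", "city": "Helsinki"},
]


def find_by(field, value):
    """Return every document whose `field` equals `value`."""
    return [doc for doc in documents if doc.get(field) == value]


# Key/value lookup: you must know the key, then decode the blob yourself.
user_from_kv = json.loads(kv_store["user:42"])

# Document lookup: you can query on any field inside the document.
users_in_london = find_by("city", "London")
```

Add an index over the decoded values on the key/value side and the two start to look the same, which is why the line between them is blurry.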

Schema-(less?):
Though NoSQL offers the flexibility of having no schema, that is often misleading. For instance, if you have a database of products and query for the price of a product, you are implicitly applying a schema, because you are assuming that the product you are looking up has a price. When you think about it, a totally schema-less database doesn't make sense: without some set of rules or common fields among your data, your queries would have no value. In practice, you have to manage this schema on your own, and the database won't warn you about things like typos in field names when you query it.
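
A small sketch of that implicit schema, again in plain Python with a list of dictionaries standing in for a collection. The field names and the helper missing_fields are made up for illustration.

```python
products = [
    {"id": 1, "name": "kettle", "price": 25.0},
    {"id": 2, "name": "toaster", "prise": 30.0},  # typo: "prise" instead of "price"
]


def price_of(product):
    # The query assumes every product has a "price" field: an implicit schema.
    return product.get("price")


def missing_fields(product, required=("id", "name", "price")):
    # The application has to do the checking that the store will not do.
    return [field for field in required if field not in product]


prices = [price_of(p) for p in products]
problems = {p["id"]: missing_fields(p) for p in products}
```

Nothing fails when the misspelled record is written or read; the second price simply comes back as None, and only the application-side check reveals that "price" is missing.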

Aggregate Oriented
Martin Fowler uses the phrase "aggregate-oriented" to describe how data is stored in a NoSQL database. In a relational database, an object that contains non-scalar types (for example, a user entity in a language like Java that holds a list of friends of unknown size) gets spread across multiple tables in an RDBMS, which is why Object-Relational Mapping frameworks exist. In a NoSQL database, the user together with its friends (the aggregate) is stored as a single entity, which makes the database aggregate-oriented.
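
The same user can be sketched in both shapes. This is only an illustration in plain Python: the tables are lists of tuples, the aggregate store is a dictionary, and all of the names are hypothetical.

```python
import json

# The object as the application sees it: one user with a list of friends.
user = {"id": 1, "name": "Ada", "friends": ["Grace", "Linus", "Margaret"]}

# Relational shape: the list cannot live in one column, so it becomes rows in a second table.
users_table = [(user["id"], user["name"])]
friends_table = [(user["id"], friend) for friend in user["friends"]]


def load_user_relational(user_id):
    """Rebuild the object by joining the two tables (the job an ORM does for you)."""
    name = next(n for uid, n in users_table if uid == user_id)
    friends = [f for uid, f in friends_table if uid == user_id]
    return {"id": user_id, "name": name, "friends": friends}


# Aggregate shape: the whole object is stored and fetched as one unit.
aggregate_store = {user["id"]: json.dumps(user)}


def load_user_aggregate(user_id):
    return json.loads(aggregate_store[user_id])
```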

The advantage of storing an aggregate or object as a single unit shows up when you distribute the data. If the object is stored as a single entity, then the complete entity can live on a single machine, and retrieving it means going to only one node. In a distributed RDBMS, accessing that same entity could require visiting several nodes to collect and assemble the data from the different tables.

The restriction of an aggregate-oriented structure is that it works well only as long as the representation you push back and forth stays the same, for example an order that contains n line items. If you want a different view of the same data, for example a product and its relationship to n line items, you have to do some extra work to produce that representation. An RDBMS keeps these data models separate, so by editing the query you can easily build whichever aggregate you want to see.
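
Here is a rough sketch of that extra work, with orders stored as aggregates in a plain Python dictionary. The data and the helper lines_by_product are invented for the example.

```python
from collections import defaultdict

# Each order is one aggregate that carries its own line items.
orders = {
    "order-1": {"customer": "Ada", "lines": [
        {"product": "kettle", "qty": 1},
        {"product": "toaster", "qty": 2},
    ]},
    "order-2": {"customer": "Linus", "lines": [
        {"product": "kettle", "qty": 3},
    ]},
}

# The natural access path: one lookup returns the whole order.
first_order = orders["order-1"]


def lines_by_product(all_orders):
    """Scan every order and regroup the line items by product."""
    view = defaultdict(list)
    for order_id, current in all_orders.items():
        for line in current["lines"]:
            view[line["product"]].append({"order": order_id, "qty": line["qty"]})
    return dict(view)


product_view = lines_by_product(orders)
kettle_total = sum(line["qty"] for line in product_view["kettle"])
```

Reading one order is a single lookup, but the per-product view has to walk every order in the store. In a relational model the same question is a different query over the same tables.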

Consistency and Atomicity
In a NoSQL database, consistency and atomic transactions hold within a single aggregate, which is typically the case where you need transactions anyway.

Sharding vs Replication
Sharding is when data is broken into chunks and spread across multiple machines, which can also be done with an RDBMS. Replication is when the same set of data is copied to multiple machines. Replication introduces consistency problems beyond the logical ones seen on a single machine or with sharding, but in return it improves performance and resilience: requests for the same data can be handled by multiple nodes, and if one node goes down, another node with the same data can take its place. Sharding is complicated to implement, but it also improves performance, since each server holds a smaller data set with smaller indexes that are faster to search.
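
The two ideas can be contrasted with a toy in-memory sketch. Everything here is an assumption made for illustration: each node is a Python dictionary, shards are picked with a CRC32 hash of the key, and replication writes synchronously to every node. Real systems handle routing, rebalancing, and failures in far more elaborate ways.

```python
import zlib


class ShardedStore:
    """Each key lives on exactly one node, chosen by hashing the key."""

    def __init__(self, node_count):
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key):
        index = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return self.nodes[index]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key):
        return self._node_for(key).get(key)


class ReplicatedStore:
    """Every node holds a full copy, so any node can answer a read."""

    def __init__(self, node_count):
        self.nodes = [dict() for _ in range(node_count)]

    def put(self, key, value):
        for node in self.nodes:
            node[key] = value

    def get(self, key, node_index=0):
        return self.nodes[node_index].get(key)


sharded = ShardedStore(3)
replicated = ReplicatedStore(3)
for i in range(100):
    sharded.put(f"user:{i}", i)
    replicated.put(f"user:{i}", i)

sharded_sizes = [len(node) for node in sharded.nodes]        # sizes add up to 100
replicated_sizes = [len(node) for node in replicated.nodes]  # every node holds all 100
```

In the sharded store each node searches only its own slice of the keys, while in the replicated store any node can serve any read, at the cost of every write touching every node.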

CAP Theorem (Consistency, Availability, Partition tolerance):
You can guarantee any two of these three. In a distributed NoSQL database, we assume the data is spread across nodes that can be cut off from each other, so partition tolerance is a given. In that case, you can only guarantee either consistency or availability. Which one you guarantee is a business choice, and it comes down to what you do when a communication line goes down between nodes. Either you declare the system "down" and don't accept any requests to the database (consistency over availability), or you leave the system "up" and hope there are no conflicts while the communication line is down (availability over consistency). With an RDBMS on a single server, you don't have partitions, so you can guarantee both consistency and availability.
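
That choice can be sketched with a toy two-node store. The class TwoNodeStore and its prefer flag are hypothetical, and the network partition is simulated with a simple boolean; this only illustrates the trade-off and is not how any particular database implements it.

```python
class TwoNodeStore:
    """Two replicas, 'a' and 'b', that normally receive every write together."""

    def __init__(self, prefer):
        if prefer not in ("consistency", "availability"):
            raise ValueError("prefer must be 'consistency' or 'availability'")
        self.prefer = prefer
        self.partitioned = False  # True while the two nodes cannot communicate
        self.a = {}
        self.b = {}

    def write(self, node, key, value):
        if not self.partitioned:
            self.a[key] = value
            self.b[key] = value
            return
        if self.prefer == "consistency":
            # Refuse the request rather than let the replicas diverge.
            raise RuntimeError("unavailable: nodes cannot communicate")
        # Stay available: accept the write on the reachable node only.
        target = self.a if node == "a" else self.b
        target[key] = value

    def conflicts(self):
        keys = set(self.a) | set(self.b)
        return sorted(k for k in keys if self.a.get(k) != self.b.get(k))


cp = TwoNodeStore("consistency")
ap = TwoNodeStore("availability")
for store in (cp, ap):
    store.write("a", "seat-12", "free")
    store.partitioned = True  # the communication line goes down

try:
    cp.write("a", "seat-12", "Ada")
    cp_accepted = True
except RuntimeError:
    cp_accepted = False

ap.write("a", "seat-12", "Ada")
ap.write("b", "seat-12", "Linus")
```

The consistency-first store rejects the booking and both copies still agree. The availability-first store accepts both bookings and ends up with a conflict on "seat-12" that has to be resolved once the nodes can talk again.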

Even though the CAP theorem makes sense in theory, in practice a more common tradeoff shows up: consistency vs response time. A consistent request needs to wait for all the nodes involved to be updated before it receives a response. If a large number of nodes are involved, that response time may not be acceptable to the user, which leads to "eventually consistent" database writes.
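
A back-of-the-envelope sketch of that tradeoff follows. The latency figures are made up, and the model assumes the replicas are contacted in parallel, so a fully consistent write is as slow as the slowest node while an eventually consistent write can reply after the first confirmation.

```python
# Hypothetical round-trip times in milliseconds for five replicas.
replica_latencies_ms = [4, 6, 9, 15, 120]


def consistent_write_latency(latencies):
    # Wait for every replica to confirm: as slow as the slowest node.
    return max(latencies)


def eventual_write_latency(latencies):
    # Reply after the first confirmation; the other replicas catch up afterwards.
    return min(latencies)


consistent_ms = consistent_write_latency(replica_latencies_ms)
eventual_ms = eventual_write_latency(replica_latencies_ms)
```

With these numbers a single slow replica stretches the consistent write to 120 ms, while the eventually consistent write answers in 4 ms.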

Reasons for NoSQL:
If you are not Google or Amazon, you probably have no immediate need for a distributed NoSQL database. Still, even though you may not have that much data now, choosing an implementation that is able to scale is often a good idea for the future.

But if you don't care about scaling, another reason for choosing NoSQL is that it's easier to use in development. Aggregate types are natural in programming languages that support non-scalar types, and it's easier to store those types as a single entity in a NoSQL database than to deal with Object-Relational Mapping into an RDBMS.

Friday, August 1, 2014

An Introduction to NoSQL