Friday, August 1, 2014

An Introduction to NoSQL

In a previous post I walked through several topics and listed reasons why NoSQL or SQL is the better choice in different situations. Here I want to give a more basic introduction to what NoSQL is all about.

Different ways to store data in NoSQL:

key:value (example: Redis)
document, JSON (example: MongoDB)
Difference? These are actually pretty similar. In a key/value store, you can attach metadata to a value so that particular details can be indexed, while in a document store you typically have an "id" field as part of your JSON. In reality, the differences are blurry.
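To make the comparison concrete, here is a minimal sketch in Python. Plain dicts and lists stand in for the two kinds of store; nothing here is the real Redis or MongoDB API, and the names kv_store, doc_store, kv_get and doc_find are made up for illustration.

```python
import json

# Key/value style: the store only sees an opaque value under a key.
# A plain dict is a stand-in for a store such as Redis (assumption).
kv_store = {}
kv_store["user:42"] = json.dumps({"name": "Ada", "city": "London"})

# Document style: the store understands the structure, and the id
# lives inside the document. A list of dicts stands in for a collection.
doc_store = [{"_id": 42, "name": "Ada", "city": "London"}]


def kv_get(key):
    """Look up by key only; the caller has to decode the value."""
    return json.loads(kv_store[key])


def doc_find(field, value):
    """Query on any field inside the stored documents."""
    return [doc for doc in doc_store if doc.get(field) == value]


by_key = kv_get("user:42")
by_field = doc_find("city", "London")
```

Both lookups end up with the same user, which is the "blurry" part: the main difference is whether the store itself can see inside the value.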

Schema-(less?):
Though NoSQL offers the flexibility of having no schema, that claim is often misleading. For instance, assuming you have a database of products and want to query for the price of a product, you are implicitly applying a schema, because you are assuming that the product you are looking up has a price. When you think about it, a totally schema-less database doesn't make sense, because without some set of rules or common fields among your data, your queries would have no value. So you have to manage this schema on your own, and the database won't warn you about things like typos in a field name when you query it.
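A small Python sketch of the implicit schema, using a list of dicts as a stand-in for a product collection. The field names and the REQUIRED_FIELDS set are assumptions for illustration; the point is only that a misspelled or missing field fails silently unless the application checks for it.

```python
# Stand-in for a schema-less product collection (hypothetical data).
products = [
    {"name": "kettle", "price": 25.0},
    {"name": "toaster", "prize": 30.0},  # typo: "prize" instead of "price"
    {"name": "mug"},                     # no price at all
]


def price_of(product):
    # dict.get returns None for a missing field instead of raising an error,
    # much like a query on a field that some documents do not have.
    return product.get("price")


# The schema the application is implicitly assuming.
REQUIRED_FIELDS = {"name", "price"}


def missing_fields(product):
    """Return the assumed fields that this document does not have."""
    return sorted(REQUIRED_FIELDS - set(product))


prices = [price_of(p) for p in products]
problems = {p["name"]: missing_fields(p) for p in products if missing_fields(p)}
```

Here the check lives in application code, which is exactly the "manage this schema on your own" cost.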

Aggregate Oriented
Martin Fowler uses the term "aggregate-oriented" to describe how data is stored in a NoSQL database. In a relational database, an object that contains non-scalar types (for example, a user entity in a language like Java that holds a list of friends of unknown size) gets spread across multiple tables in an RDBMS, which is why Object-Relational Mapping frameworks exist. In a NoSQL database, that same object (the aggregate) is stored as a single entity, which makes the database aggregate-oriented.

The advantage of storing an aggregate or object as a single unit inside the database shows up when you try to distribute the data. If the object is stored as a single entity, then that complete entity can be stored on a single machine. When you go to retrieve that entity, you only have to go to one node to access it. In an RDBMS, accessing that same entity could require visiting several nodes to fetch and assemble the data from the different tables.
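The two shapes can be sketched in Python as follows. The table and field names are invented for illustration, and lists and dicts stand in for real storage; the relational version needs a lookup per table, while the aggregate version is a single lookup.

```python
# Relational shape: the user and the friends list live in separate tables.
users_table = [{"id": 1, "name": "Ada"}]
friends_table = [
    {"user_id": 1, "friend_name": "Grace"},
    {"user_id": 1, "friend_name": "Linus"},
]


def load_user_relational(user_id):
    """Rebuild the object by visiting both tables (what an ORM does for you)."""
    user = next(u for u in users_table if u["id"] == user_id)
    friends = [f["friend_name"] for f in friends_table if f["user_id"] == user_id]
    return {"id": user["id"], "name": user["name"], "friends": friends}


# Aggregate shape: the whole object is stored under one key.
aggregates = {1: {"id": 1, "name": "Ada", "friends": ["Grace", "Linus"]}}


def load_user_aggregate(user_id):
    """One lookup returns the complete aggregate."""
    return aggregates[user_id]
```

In a distributed setting, each of those table lookups could land on a different node, while the aggregate lookup only ever touches the node that holds the key.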

The restriction of an aggregate-oriented structure is that it works well only as long as the representation you are pushing back and forth remains the same, for example an order that contains n order lines. However, if you want a different view of the same data, for example a product with its relationship to n order lines, you have to do some extra work to produce that representation. An RDBMS keeps these data models separate, and by editing the query you can easily build whichever aggregate you want to see.
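The "extra work" looks roughly like this in Python: the data is stored as order aggregates, so a per-product view has to be built by walking every order. The sample orders and the lines_by_product helper are hypothetical.

```python
from collections import defaultdict

# Data stored as order aggregates (hypothetical sample).
orders = [
    {"order_id": "A1", "lines": [{"product": "kettle", "qty": 1},
                                 {"product": "mug", "qty": 4}]},
    {"order_id": "B2", "lines": [{"product": "mug", "qty": 2}]},
]


def lines_by_product(orders):
    """Invert order aggregates into a product -> order lines view."""
    view = defaultdict(list)
    for order in orders:
        for line in order["lines"]:
            view[line["product"]].append(
                {"order_id": order["order_id"], "qty": line["qty"]}
            )
    return dict(view)


product_view = lines_by_product(orders)
```

In a relational database the same view would be a change to the query (grouping the order lines by product) rather than a separate pass over the data.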

Consistency and Atomicity
In a NoSQL database, consistency and atomic transactions are generally guaranteed only within a single aggregate, which is typically the case where you need transactions anyway.

Sharding vs Replication
Sharding is when data is broken into chunks and spread across multiple machines, which can also be done with an RDBMS. Replication is when the same set of data is copied to multiple machines. Replication introduces consistency problems beyond the logical ones seen on a single machine or with sharding, but in return it is more performant and resilient: requests for the same set of data can be handled by multiple nodes, and if one of the nodes goes down, another node with the same data can take its place. Sharding is complicated to implement, but it also improves performance, since each server holds a smaller data set with smaller indexes that are faster to search.
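As a rough Python sketch of the difference: with sharding each key lives on exactly one node, chosen here by hashing the key; with replication every node holds every key. The node names and helper functions are assumptions, and real systems use more careful placement schemes than a simple modulo.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster


def shard_for(key, nodes=NODES):
    """Pick one node for a key using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def sharded_write(cluster, key, value):
    # Sharding: the key is stored on exactly one node.
    cluster[shard_for(key)][key] = value


def replicated_write(cluster, key, value):
    # Replication: the same key is stored on every node.
    for store in cluster.values():
        store[key] = value


sharded = {node: {} for node in NODES}
replicated = {node: {} for node in NODES}
for key in ["user:1", "user:2", "user:3", "user:4"]:
    sharded_write(sharded, key, key.upper())
    replicated_write(replicated, key, key.upper())
```

After these writes the sharded cluster holds four entries in total, while each replicated node holds all four, which is where both the resilience and the consistency problem come from.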

CAP Theorem (Consistency, Availability, Partition tolerance):
You can guarantee at most two of these three. In a distributed NoSQL database we assume that the data is partitioned across machines, so the network between them can fail. In that case you can only guarantee either consistency or availability, and which one you guarantee is a business choice. It comes down to what the system does when a communication line between partitions goes down. Either you consider the system "down" and stop accepting requests to the database (consistency over availability), or you leave the system "up" and hope there are no conflicts while the communication line is down (availability over consistency). In an RDBMS on a single server there are no partitions, so you can guarantee both consistency and availability.
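The choice can be sketched with a toy two-replica store in Python. Everything here (Replica, TwoNodeStore, the prefer_consistency flag) is hypothetical and far simpler than a real database; it only shows the two behaviours when the link between replicas is down.

```python
class Replica:
    def __init__(self):
        self.data = {}


class TwoNodeStore:
    """Toy store with two replicas and a link that can go down."""

    def __init__(self, prefer_consistency):
        self.left = Replica()
        self.right = Replica()
        self.link_up = True
        self.prefer_consistency = prefer_consistency

    def write(self, replica, key, value):
        if self.link_up:
            # Normal case: both replicas are updated together.
            self.left.data[key] = value
            self.right.data[key] = value
            return True
        if self.prefer_consistency:
            # Consistency over availability: refuse the request.
            return False
        # Availability over consistency: accept locally, risk a conflict.
        replica.data[key] = value
        return True


cp = TwoNodeStore(prefer_consistency=True)
cp.link_up = False
cp_ok = cp.write(cp.left, "stock", 9)

ap = TwoNodeStore(prefer_consistency=False)
ap.link_up = False
ap_ok = ap.write(ap.left, "stock", 9)
ap.write(ap.right, "stock", 7)
diverged = ap.left.data != ap.right.data
```

The consistent store rejects the write, while the available store accepts writes on both sides and ends up with two different values for the same key that have to be reconciled later.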

Even though the CAP theorem makes sense in theory, in practice a more common tradeoff shows up: consistency vs response time. A consistent write needs to wait for all the nodes involved to be updated before it returns a response. If a large number of nodes are involved, this response time may not be acceptable to the user, which leads to "eventually consistent" database writes.
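A back-of-the-envelope Python sketch of that tradeoff, assuming the write is sent to all replicas in parallel and the latencies below are made-up numbers: the response time is set by the slowest replica you choose to wait for.

```python
# Hypothetical per-replica acknowledgement times in milliseconds.
replica_latency_ms = [12, 15, 240]


def write_latency(latencies, acks_required):
    """Time until `acks_required` replicas have acknowledged,
    assuming the write is sent to all replicas at the same time."""
    return sorted(latencies)[acks_required - 1]


# Fully consistent: wait for every replica, so the slowest one decides.
wait_for_all = write_latency(replica_latency_ms, acks_required=3)

# Eventually consistent: answer after the first acknowledgement and
# let the remaining replicas catch up in the background.
wait_for_one = write_latency(replica_latency_ms, acks_required=1)
```

With these numbers the fully consistent write takes 240 ms and the eventually consistent one 12 ms, and one slow node is enough to cause that gap.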

Reasons for NoSQL:
If you are not Google or Amazon, you probably do not need a distributed NoSQL database today. Still, even though you may not have that much data now, choosing an implementation that is able to scale is often a good idea for the future.

But if you don't care about scaling, another reason for choosing NoSQL is that it's easier to use in development. Aggregate types are natural in programming languages that support non-scalar types, and it's easier to store those types as a single entity in a NoSQL database than to deal with Object-Relational Mapping into an RDBMS.


Thursday, July 31, 2014

NoSQL vs SQL

https://www.youtube.com/watch?v=rRoy6I4gKWU

NoSQL databases came about because of the need to store very large amounts of data and to spread both that data and the traffic accessing it across multiple servers. The RDBMS was designed to work well on a single machine, but a single machine can only be scaled vertically so far, and you quickly hit the limits when dealing with the amount of data handled by companies like Amazon or Google. NoSQL relaxes some of the ACID (Atomicity, Consistency, Isolation, Durability) restrictions of an RDBMS, which allows it to distribute across multiple data centers, but of course there are some drawbacks to this. Let's go through a few topics and look at the pros and cons of each type of data storage.

Queries:
In relational databases, there is practically no limit on the queries you can write. Complex queries can include multiple joins across different tables in the same database, because the data is located and scaled vertically in one place. NoSQL databases, on the other hand, need to be denormalized ahead of time because of their distributed nature, either by combining data models into one view or by writing mapreduce functions whose outputs are maintained and updated as new writes enter the database. If you know ahead of time which queries you want to run, you might find NoSQL to be fast, because mapreduce functions can compute in parallel.
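The mapreduce idea can be sketched in plain Python. This is not any particular database's mapreduce API: the shards, map_shard and reduce_partials are invented for illustration, and a thread pool only stands in for separate machines working at the same time.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Each inner list stands in for the order lines stored on one shard.
shards = [
    [{"product": "mug", "qty": 4}, {"product": "kettle", "qty": 1}],
    [{"product": "mug", "qty": 2}],
]


def map_shard(rows):
    """Map step: each shard computes its own partial totals."""
    partial = defaultdict(int)
    for row in rows:
        partial[row["product"]] += row["qty"]
    return dict(partial)


def reduce_partials(partials):
    """Reduce step: merge the per-shard results into one answer."""
    totals = defaultdict(int)
    for partial in partials:
        for product, qty in partial.items():
            totals[product] += qty
    return dict(totals)


# The map step runs once per shard and the shards do not depend on each other.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(map_shard, shards))

totals = reduce_partials(partials)
```

Because the query (total quantity per product) is fixed in advance, each shard can do its part independently, which is why knowing your queries ahead of time matters.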

Transactions:
Traditional relational databases pride themselves on supporting transactions in which you can edit anything in the database within a single transaction. The distributed nature of NoSQL is again a factor here: it can't easily guarantee atomic operations, since the data involved may be stored on different machines. Google App Engine datastore gets around this issue by introducing entity groups and supporting transactions within an entity group. It actually supports cross-group transactions as well, but those are limited to 5 entity groups. So if you can structure your data in entity groups, or live with the limit of 5 per cross-group transaction, then NoSQL and Google App Engine datastore will work fine; otherwise you want the ACID qualities of an RDBMS.

Consistency:
NoSQL databases typically guarantee only "eventually consistent" data.

Scalability:
This is obviously the reason for NoSQL in the first place: data can be stored across multiple data centers with no master, with mapreduce functions to get the data that you need. RDBMS databases have gotten better at vertical scaling, however.

Management:
Because there are fewer restrictions, developers may find that they reach their first MVP faster, since there is less overhead than in setting up a SQL database.

Schema:
Part of the setup for a SQL RDBMS is defining a schema, and managing changes to that schema can be complex. NoSQL databases don't have that limitation, so changes to the shape of your data have a smaller effect.