When listening to someone describe non-relational databases, it's hard not to think of this particular adage from English poet John Lydgate:
"You can please some of the people all of the time and all of the people some of the time, but you can’t please all of the people all of the time."
That's because the one big commonality between all non-relational databases seems to be some sort of sacrifice for the sake of a perceived greater benefit--usually speed.
That was certainly the impression after listening to a presentation from Geeknet Senior Web Developer Mark Ramm at this year's Penguicon in Troy, MI. Ramm's talk, "Learning Not to Relate: an Introduction to 'Non-Relational' Database Technologies," provided just that to attendees: a sharp introduction to the class of databases usually referred to as "NoSQL."
The first thing that must be understood about NoSQL databases is that there isn't a complete absence of Structured Query Language in this ecosystem. Ramm explained that while relational databases all use SQL as the domain-specific language for ad-hoc queries, non-relational databases have no such standard query language, so they can use whatever they want--including SQL. According to the NoSQL Databases website, community members have been moving towards using the NoSQL term to mean "Not Only SQL," a bit of a deviation from the protest-flavored, more literal meaning of NoSQL that was coined by early users of this database class.
Other differentiators of NoSQL databases, Ramm continued, are that they all have their own APIs and, somewhat alarmingly to anyone unfamiliar with NoSQL concepts, "they are not the most scalable, simple, or flexible databases around."
If you're wondering if Ramm is the worst pitchman for non-relational databases at this point, join the club. None of these features initially make for compelling selling points: varied query languages and APIs, and an outright admission that non-relational databases may not be the best at all of the qualities typically associated with a "good" database. But there's something else going on here that quickly makes you realize where the benefits of non-relational databases truly lie.
The key is the notion of being the best at everything. That's not what non-relational databases are about.
"There are non-relational databases that beat some of these aspects, but not all of these," Ramm explained.
To understand this statement, one must step back and see the broader set of guarantees that shapes the design of relational databases: ACID. Ramm told the audience the acronym stands for Atomic, Consistent, Isolated, and Durable--core properties that every transaction in a relational database must satisfy. A transaction is atomic in that it either completes entirely or has no effect at all, consistent in that it leaves the data in a valid state across the database, isolated from other transactions until the current transaction is finished, and durable in the sense that committed data should never be lost.
The infrastructure of a relational database is well-suited to meet these criteria: data is held in tables that are related to one another and queried using relational algebra.
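To make the "atomic" part of ACID concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not from Ramm's talk; the accounts table, the names, and the transfer function are all hypothetical and chosen only for illustration. The point is that a transaction which fails halfway leaves no trace behind.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` between two (hypothetical) accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                # The debit above has already been applied inside the transaction;
                # raising here causes it to be undone.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False

def balances(conn):
    """Return the current balances as a plain dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds and is committed
failed = transfer(conn, "alice", "bob", 500)  # overdraws, so the whole transfer is rolled back
print(ok, failed, balances(conn))
```

After the failed transfer, the balances are exactly what they were after the successful one: the partial debit never becomes visible.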
Here's the eye-opener:
"Most non-relational databases drop at least one or more of these [criteria]," Ramm said.
Likely one of the biggest objections organizations might have against NoSQL databases hinges on this approach. They aren't willing to make a move to NoSQL because they can't give up ACID. Particularly the "C," because not having data consistency is a terrifying prospect for any company dealing with financial transactions. Which is just about everyone.
Yet non-relational databases are being used by firms like Amazon and Google every day, with great success. Amazon, in particular, needs to track millions of transactions on any given day--so how do they get away with inconsistent data?
The simple truth is, they almost have to. The alternative would be a relational database that could never keep up with the speed and scaling necessary to make a company like Amazon work as it does now. Recall that non-relational databases are structured to sacrifice some aspect of ACID to gain something in return. In the case of Amazon, their proprietary non-relational Dynamo database applies an "eventually consistent" approach to their data in order to gain speed and keep the system up when a database server somewhere goes down.
When I queried Ramm about this, he gave me and the rest of the audience a good practical example of how this would work for Amazon and similar web services. Books, for example, have a variety of different datapoints within the system for which Amazon tolerates inconsistency, Ramm explained: price, ratings, location. If a server is down or a change has been made that hasn't yet propagated to all the data containers in the Amazon system, there is a chance that one of these datapoints will be inconsistent when a customer comes in to buy that book.
Amazon's goal is that such inconsistencies will be eventually resolved, with "eventually" being right before the customer is formally billed for the book. Sometimes customers will see the inconsistency and can still continue with the sale: subtle changes in shipping dates or ratings for the book. Sometimes, the inconsistency isn't resolvable in time: a book sold out just as the "Buy Now" link was clicked by another customer, for instance. In this case, customers may see a "sorry, there's a problem with this order" screen and be directed to reserve the book instead. Or, if the inconsistency lasts a really long time, an e-mail is sent to the customer apologizing for the error and offering alternatives for rectifying the situation.
"Saying 'we're sorry' is way cheaper than no sale at all," Ramm said.
Such inconsistencies, particularly on important data like price and availability, don't happen as often as one might suspect. This data is very read-intensive, and not written as often, because product prices don't jump around that much, and availability is predictable. Ratings and location are more fluid, but these values are not necessarily as critical to the sale of the book--it probably isn't a deal-breaker if a book has 3.5 stars instead of 4, or may take one extra day to arrive.
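The stale-read scenario can be sketched in a few lines of Python. This is a toy simulation under loud assumptions, not a description of how Dynamo actually works: the Replica and Cluster classes, the single global version counter, and the "book:price" key are all invented for illustration. A write lands on one replica right away and reaches the others only when propagate() is called, so a read in between can return the old value.

```python
class Replica:
    """One copy of the data; it only learns about a write when it is delivered."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def apply(self, key, version, value):
        # Keep only the newest version, so replayed or late updates are harmless.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


class Cluster:
    """Hypothetical cluster: writes are accepted locally and propagated later."""

    def __init__(self, n):
        self.replicas = [Replica() for _ in range(n)]
        self.pending = []  # (replica_index, key, version, value) not yet delivered
        self.version = 0   # simplification: one global counter orders all writes

    def write(self, key, value, via=0):
        """Apply the write on one replica immediately; queue it for the others."""
        self.version += 1
        self.replicas[via].apply(key, self.version, value)
        for i in range(len(self.replicas)):
            if i != via:
                self.pending.append((i, key, self.version, value))

    def propagate(self):
        """Deliver every queued update (a real system would do this in the background)."""
        for i, key, version, value in self.pending:
            self.replicas[i].apply(key, version, value)
        self.pending.clear()

    def read(self, key, via):
        return self.replicas[via].read(key)


cluster = Cluster(3)
cluster.write("book:price", 20)
cluster.propagate()                        # all three replicas agree on 20

cluster.write("book:price", 25, via=0)     # price change accepted by replica 0 only
stale = cluster.read("book:price", via=2)  # replica 2 has not seen the change yet
fresh = cluster.read("book:price", via=0)

cluster.propagate()                        # "eventually": the update reaches everyone
converged = [cluster.read("book:price", via=i) for i in range(3)]
print(stale, fresh, converged)
```

The write never waits for the other replicas, which is where the speed and availability come from; the cost is the window in which one replica still answers with the old price.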
"Guaranteeing consistency [in data] is expensive," Ramm remarked. By allowing some flexibility in the system, Amazon and companies like it gain a huge advantage in speed, scalability, and availability.
That's the big advantage of non-relational databases, and it is attracting more corporate interest with each passing day.
There are multiple approaches to putting non-relational databases together: a whole ecosystem of databases that provide specific advantages to users, not just speed over consistency as the Amazon example demonstrated. Part 2 will examine the topology of the NoSQL ecosystem and explore the pros and cons of each type of database.


