NoSQL Databases: What, When and Why (PHPUK2011)
Abstract: Some considerations about NoSQL databases, and slides of my talk given at PHPUK 2011: "NoSQL Databases: What, When and Why".
I'm back from what IMHO was the best PHPUK edition so far. Thanks a lot to all the organisers for such a well-run conference! All the talks I attended were superb (from Marco Tabini's keynote, to Ian Barber's ZeroMQ talk, to Morgan Tocker's InnoDB insights, and Andrei Zmievski's introduction to ElasticSearch).
I also gave a talk about NoSQL databases, a topic I've become increasingly interested in over the last few years. People who know me know that I have a thing for relational databases (and I often have some fun and do crazy stuff with them), and by extension they must think that I dislike NoSQL databases. Truth is, I like the latter very much as well. What irks me is the blurry and shallow picture that emerges from the large majority of discussions on the subject. Let me explain what I mean.
The first impression of NoSQL databases that most people have is that they are "better" than the old RDBMS generation (and they're "better" at everything), they don't use SQL because SQL is oh-so-hard, and, uhm, they scale (take that, RDBMS!). And worst of all, since all these new storage solutions are waving the "NoSQL banner" together, and united-they-fight against the RDBMS enemy, a distracted observer gets the impression that they are all completely interchangeable. Argh.
So now you know what motivated me to talk about the subject. Yes, it was a bit naïve of me to think that I could cover it all in 50 minutes, and surely I could have done a better job with more time to prepare the talk, but I hope I was able to convey a few important points nonetheless:
- Whatever motivation draws you to the NoSQL world, it's fundamental to understand why and how these databases depart from relational ones: what shortcomings they're trying to circumvent, what architectural decisions support them, what trade-offs they accept in order to achieve their goals. Ignoring these things when switching to a non-relational data store can only lead to unpleasant surprises.
- Not all NoSQL databases are created equal. RDBMSs are often blamed for being a one-size-fits-all solution to all problems. NoSQL databases are the exact opposite: their extremely rich variety, and the diversity in the problems they're trying to solve and in the approaches they take, is what makes them so successful in today's complex world. So you can have products like Cassandra that are appropriate for full-text indexing, HBase that's perfect for low-latency high-write-throughput and mostly offline processing, CouchDB that's great for a small number of pre-defined queries (thanks to its "materialised resultset" views), MongoDB that's good for more dynamic queries while retaining some data structure, and then you have graph databases that are thousands of times more efficient than all the others at expressing and traversing complex relationships. But put any of them out of their "comfort zone", try asking questions they're not optimised for, and see what happens. Hint: it's going to hurt, badly. And if you care about your data, have a look at their persistence model too. That's why it's so important to understand their architecture, and what each of them can and can't do.
- While almost all NoSQL databases focus on scaling, they do so in very different ways. Some of them scale well with the growth in transaction volume (e.g. number of concurrent requests), some scale well with the growth in data set size, some are good at both (albeit sacrificing either the data model or the type of questions you can ask them), and some are good at neither and only focus on raw speed.
- Related to the previous point, a fact that's often overlooked is that not all NoSQL databases are distributed. The power of K-V stores implementing consistent hashing is that they scale beautifully with data size growth (at the cost of a very poor data model), and the same is true for BigTable derivatives that sit on top of a distributed filesystem like GFS or HDFS. Others can't be considered distributed at all: for instance, CouchDB, and I would say the entire graph database category, focus on vertical scalability in the number of transactions (either via replication or single-node efficiency) but do not distribute the data itself across many nodes. As a side note, if you're interested in distributed systems, I can suggest reading Jeff Darcy's blog. No, I won't point you to any specific post, just read the entire blog archive. And then read it again, and again. It's that good.
- Finally, most people seem to make the jump from "it's so cool" to "it must be easy". Well, that's not always the case, starting from the language itself: I bet many NoSQL products would love to have an SQL-like query syntax, if only they could (a fact confirmed by the rise of interfaces like Pig and Hive on top of map-reduce). There's new terminology, there are new paradigms and completely different data models. Development, deployment and integration are complex and often involve writing in or interacting with new languages (Erlang, Scala, C, Java), and finally, as already mentioned, there are new query models (map-reduce is the emerging paradigm, with many slightly different implementations). There's a steep learning curve here.
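To make the consistent hashing point above a bit more concrete, here is a minimal Python sketch. It is not taken from the talk or from any specific product: the `HashRing` class, the `vnodes` parameter, the node names and the choice of MD5 as the hash function are all assumptions made for illustration. The idea is that each node owns many points on a ring, a key belongs to the first point found clockwise from the key's hash, and adding a node only steals the keys that fall just before its new points.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Stable 64-bit hash (MD5 is used for distribution only, not security)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes (illustration only)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # points per physical node (assumed default)
        self._points = []      # sorted hash points on the ring
        self._owner = {}       # hash point -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        # Hash collisions between points are ignored in this sketch.
        for i in range(self.vnodes):
            point = ring_hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get_node(self, key):
        if not self._points:
            raise LookupError("the ring has no nodes")
        # First point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._points, ring_hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b", "node-c"])
    keys = [f"user:{i}" for i in range(1000)]
    before = {k: ring.get_node(k) for k in keys}
    ring.add_node("node-d")
    moved = [k for k in keys if ring.get_node(k) != before[k]]
    print(f"{len(moved)} of {len(keys)} keys moved, all of them to the new node")
```

With a naive `hash(key) % number_of_nodes` scheme, adding a fourth node would remap most keys; here only the keys claimed by the new node change owner, which is why this family of stores copes so well with data set growth.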