NoSQL Databases: What, When and Why (PHPUK2011)

I'm back from what IMHO was the best PHPUK edition so far, thanks a lot to all the organisers for such a well-organised conference! All the talks I attended were superb (from Marco Tabini's keynote, to Ian Barber's ZeroMQ talk, to Morgan Tocker's InnoDB insights, and Andrei Zmievski's introduction to ElasticSearch).

I also gave a talk about NoSQL databases, a topic I've become increasingly interested in over the last few years. People who know me know that I have a thing for relational databases (and I often have some fun and do crazy stuff with them), and by extension they must think that I dislike NoSQL databases. Truth is, I like the latter very much as well. What irks me is the blurry and shallow picture that emerges from the large majority of discussions on the subject. Let me explain what I mean.

The first impression of NoSQL databases that most people have is that they are "better" than the old RDBMS generation (and they're "better" at everything), they don't use SQL because SQL is oh-so-hard, and, uhm, they scale (take that, RDBMS!). And worst of all, since all these new storage solutions wave the "NoSQL banner" together, and united they fight against the RDBMS enemy, a distracted observer gets the impression that they are all completely interchangeable. Argh.

So now you know what motivated me to talk about the subject. Yes, it was a bit naïve of me to think that I could cover it all in 50 minutes, and surely I could have done a better job with more time to prepare the talk, but I hope I was able to convey a few important points nonetheless:

  1. Whatever motivation draws you to the NoSQL world, it's fundamental to understand why and how these databases depart from relational ones: what shortcomings they're trying to circumvent, what architectural decisions support them, what trade-offs they accept in order to achieve their goals. If you ignore these things when switching to a non-relational data store, you're in for unpleasant surprises.

  2. Not all NoSQL databases are created equal. Relational databases are often blamed for being a one-size-fits-all solution to all problems. NoSQL databases are the exact opposite: their extremely rich variety, and the diversity in the problems they're trying to solve and in the approaches they take, is what makes them so successful in today's complex world. So you can have products like Cassandra that are appropriate for low-latency, high-write-throughput workloads, HBase that's perfect for mostly offline processing, CouchDB that's great for a small number of pre-defined queries (thanks to its "materialised resultset" views), MongoDB that's good for more dynamic queries while retaining some data structure, and then you have graph databases that can be orders of magnitude more efficient than all the others at expressing and traversing complex relationships. But put any of them out of their "comfort zone", try asking questions they're not optimised for, and see what happens. Hint: it's going to hurt, badly. And if you care about your data, have a look at their persistence model too. That's why it's so important to understand their architecture, and what each of them can and can't do.

  3. While almost all NoSQL databases focus on scaling, they do so in very different ways. Some of them scale well with the growth in transaction volume (e.g. number of concurrent requests), some scale well with the growth in data set size, some are good at both (albeit sacrificing either the data model or the type of questions you can ask them), and some are good at neither and only focus on raw speed.

  4. Related to the previous point, and often overlooked: not all NoSQL databases are distributed. The power of K-V stores implementing consistent hashing is that they scale beautifully with data size growth (at the cost of a very poor data model), and the same is true for BigTable derivatives that sit on top of a distributed filesystem like GFS or HDFS. Others can't be considered distributed at all: for instance, CouchDB, and I would say the entire graph databases category, focus on scaling the number of transactions (either horizontally via replication, or through single-node efficiency) but do not distribute the data itself across many nodes. As a side-note, if you're interested in distributed systems, I can suggest reading Jeff Darcy's blog. No, I won't point you to any specific post, just read the entire blog archive. And then read it again, and again. It's that good.

  5. Finally, most people seem to make the jump from "it's so cool" to "it must be easy". Well, that's not always the case, starting from the language itself: I bet many NoSQL products would love to have an SQL-like query syntax, if only they could (a fact confirmed by the rise of interfaces like Pig and Hive on top of map-reduce). There's new terminology, new paradigms, completely different data models. Development, deployment and integration are complex and often involve writing in or interacting with new languages (Erlang, Scala, C, Java), and finally, as already mentioned, there are new query models (and map-reduce is the emerging paradigm, with many slightly different implementations). There's a steep learning curve here.
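
To make point 4 a bit more concrete, here is a minimal sketch of why consistent hashing lets a K-V store grow gracefully. It is not taken from the talk or from any particular product: the `HashRing` class, the node names, the key format and the number of virtual nodes are all assumptions made up for illustration. The idea is that nodes and keys are hashed onto the same ring, each key belongs to the next node clockwise, and adding a node only steals the keys that fall just before its positions. A naive "hash modulo number of nodes" placement is included for comparison.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring (illustrative only, not a real product's API)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes   # virtual positions per node (assumed value)
        self._points = []      # sorted hash positions on the ring
        self._owner = {}       # hash position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value):
        # MD5 is used here only to spread positions evenly, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # Each node gets several positions so the load is spread more evenly.
        for i in range(self.vnodes):
            point = self._hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def node_for(self, key):
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, self._hash(key))
        return self._owner[self._points[idx % len(self._points)]]


def moved_fraction(keys, before, after):
    """Fraction of keys whose owner differs between two placement functions."""
    moved = sum(1 for k in keys if before(k) != after(k))
    return moved / len(keys)


keys = [f"user:{i}" for i in range(10_000)]

# Consistent hashing: grow the cluster from three nodes to four.
ring = HashRing(["node-a", "node-b", "node-c"])
old_owner = {k: ring.node_for(k) for k in keys}
ring.add_node("node-d")
ring_moved = moved_fraction(keys, old_owner.get, ring.node_for)

# Naive placement for comparison: hash(key) modulo the number of nodes.
naive_moved = moved_fraction(
    keys,
    lambda k: HashRing._hash(k) % 3,
    lambda k: HashRing._hash(k) % 4,
)

print(f"consistent hashing: {ring_moved:.0%} of keys moved")
print(f"modulo placement:   {naive_moved:.0%} of keys moved")
```

With the ring, the only keys that change owner are the ones the new node takes over, roughly its fair share of the data; with modulo placement most keys end up on a different node. Note that this sketch says nothing about replication, failure handling or the data model, which is exactly where the trade-offs mentioned above come in.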

Anyway, I hope this adds some context to my slides. Thanks again to all the organisers and the attendees, and please leave your feedback on Joind.in or drop a note on this page!




7 responses to "NoSQL Databases: What, When and Why (PHPUK2011)"

Great presentation. This is one of the most interesting presentations on NoSQL that I have found!
Kind regards, Herbert

Hi Lorenzo, this presentation looks awesome from your slides. I look forward to watching on the conference site. Many thanks!

Hi Lorenzo, Nice presentation, looking forward to the Cassandra one next week. :-)

That's one great presentation!
RavenDB (http://ravendb.net) is one impressive document-oriented database you are missing though. It is a serious, evolving competitor to CouchDB and MongoDB, and definitely worth mentioning.

@Itamar: I did mention it during the talk, although I could not possibly describe every single database in 50 minutes, so I had to make a selection.

Nice Presentation!!
Do any of the systems you mentioned in your presentation support LDAP?
I am studying and collecting information to build a high-performance central authentication system that should handle ~1500 logins/sec!!
Any help would be greatly appreciated!!
Regards, Sagar Sonawane

Really nice presentation! Just a small thing on slide 27: where it says "Range Lock" it should be "Write Lock", and where it says "Write Lock" it should be "Range Lock".

Lorenzo Alberton

Lorenzo Alberton, PHP5 ZCE (Zend Certified Engineer), has been working with large enterprise UK companies for the past years and is now Chief Tech Architect at DataSift. He's an international conference speaker and a long-time contributor to many open source projects.
