By Michael Cohen
Mr. Cohen is most famous here for the discussions we've had before about Application Developers vs. Database Developers. Part I is here, Part II is here. Mr. Cohen is a friend of mine. I have a great deal of respect for him. We obviously disagree in some areas, but I've learned to appreciate his push-back and learn from it.
He had left a comment the yesterday on my most recent post, The "Database is a Bit Bucket" Mentality (Michael O'Neill posted his followup, Everything is a Bit Bucket, as well). I thought the comment would get short shrift, so I asked him to write up a full post. I deleted that comment and here is his post.
Modern RDBMS's are quite powerful today. Pretty much every one of them has full support for SQL, including vendor extensions, all of the features we've come to expect from a relational database, a full fledged programming language built in, and quite often support for extras like full text search or native handling of XML. Most now also now ship with highly feature specific add-ons - PostgreSQL has a geospatial package that makes it the defacto standard in that domain, MySql has hot replication in a master-slave paradigm, Oracle has....well, Oracle has all kinds of things, a full object system and Java inside, a message broker, an HTTP server, a complete UI toolkit, among other things.
So the question arises as to how much of this capability one should use. I think it's becoming apparent that the answer to this is, "not much." Why shouldn't you take advantage of as much of the database's feature set as possible? The answer is performance and scalability. But wait, aren't stored procedures faster than ad hoc queries? Yes (theoretically). Won't it be more performant to execute business logic as close as possible to the data it operates on? Why should we introduce yet another component into the architecture when the database is perfectly capable of handling a particular task?
For one thing, the programming languages and environments offered by relational databases are now relatively long in the tooth, and have been eclipsed by modern OO languages. Developers are much more productive building applications with these new languages, and find it painful and tedious to work within the relational model, with SQL. You can see proof of this now with the overwhelming popularity of ORM frameworks in most of the popular OO languages out there. Java has Hibernate/EJB/JPA and many others. Ruby has ActiveRecord, DataMapper, and Sequel. Python has SqlAlchemy and Djanjo's ORM. And it's not because these developers lack the skills to work with the database directly. Quite the contrary actually, it takes intimate knowledge of the database to work effectively with an ORM. What's more, the ORM is often able to make runtime optimizations that would be difficult or prohibitively time consuming to hand code. Finally, clustered caches offer massive performance and scalability improvements, handling writes back to the database transparently behind the scenes, but for the most part they preclude implementing complex business logic in the database.
The overall trend is clear, across languages and platforms. It's the movement of data out of the database and into the application layer. Less and less reliance on the database, perhaps only for archival purposes. John Davies has a good comment on this. He's operating in a unique environment with extremely rigorous performance requirements, but we're now starting to see similar constraints imposed by the web. There's a whole class of software that has come about due to the inability to scale the relational database beyond a certain point. Facebook developed Cassandra, now used by Twitter, Reddit, and Digg, among others. LinkedIn built Voldemort. My employer doesn't deal with the massive scale of these companies, but we do large scale data processing with Hadoop. HBase, another non-relational persistent data store, is a natural fit, and just about the only option really. We use MySql less and less.
Of course, not everybody is building applications with such high scalability requirements. But even for applications with less intensive scalability requirements I would argue the same tendency to minimize the workload on the database should apply. Cameron Purdy has a good quote, "If you don't pick your bottlenecks, they'll pick you." Design your application to bottleneck, he says. What he means is, your application is going to bottleneck on something, so you need to explicitly decide what it will bottleneck on. Unfortunately, most applications bottleneck on the database, as this is the hardest layer to scale. It's pretty easy to scale the front end, we just throw more instances of Apache out there. It's a little bit harder, but not much, to scale the app server. But it's pretty hard to scale the database tier, particularly for write intensive applications. For well funded organizations, Oracle RAC is the standard. MySql's master-slave setup and hot replication saw it win out over PostgreSQL despite the fact that Postgres is a much better database in just about every other respect. The NoSql projects listed above grew out of the inability even to scale out MySql.
The trend is clear. We're collecting and processing more data than ever before, and this will only increase as we go forward. Unfortunately, the relational database (at least in it's current form) isn't well suited to the scale of data processing an already significant and growing number of organizations deal with on a daily basis. We're now seeing new solutions come forth to address the shortcomings of the traditional RDBMS, and the same forces that have necessitated such developments are at work even in smaller organizations. At all levels, developers would do well to require as little functionality as possible from the database, essentially, to treat it as a bit bucket.