Monday, August 9, 2010

Building a New NoSQL Database from Scratch

Over the course of the last few months, I've been writing a new NoSQL database.  Most readers will laugh and say one of a few different things.  Maybe something like, "NoSQL is a fad", "there are already so many NoSQL options, creating a new one is pointless", or even "unless you bring something new to the table, your adventure is going nowhere".  To sum up my response: yes, it does offer something new.

Every existing SQL/NoSQL solution has tradeoffs.  From MySQL's MyISAM storage backend (no ACID compliance), to the lack of built-in replication tools in Postgres 8.4 and earlier (I know I am looking forward to Postgres 9), to secondary index setup/creation in Cassandra, to MongoDB's use of memory-mapped files without a transaction log, ... each database trades different advantages for other disadvantages.  Among the nontrivial NoSQL solutions that I have examined (Cassandra, MongoDB, CouchDB, S3, Voldemort, SimpleDB, etc.), I have found that many of my required features are simply not available.  In the rare case where my required features were available (Google's AppEngine datastore), the database isn't available where I need it (Amazon AWS, Slicehost, etc.).

What do I require?

  • scales reasonably on small systems (MongoDB fails here)
  • document/row store (S3 and Voldemort fail here)
  • secondary indexes (SimpleDB fails here)
  • the ability to add/remove indexes during runtime (Cassandra fails here)
  • no surprise queries (aka no table scans; MongoDB, Cassandra, and CouchDB fail here)
  • no corruption on system restart (AWS and other cloud hosting providers sometimes lose your instance for a bit; MongoDB fails here)

Additional points for:

  • no schema (MongoDB, CouchDB, SimpleDB, ...)
  • multiple values per column (MongoDB and AppEngine's list columns)
  • sub-documents (like MongoDB's {'sub-doc':{'attr1':'val1', ...},} )
  • the ability to perform queries against a secondary index while it is being built
  • 10k+ rows inserted/second on modest hardware (MongoDB, ...)
  • replication / sharding / clustering (MongoDB Replica Sets would be optimal)

Astute observers will note that MongoDB offers all of these things, except for scaling well on small systems, and 'no surprise queries'.  To clarify what I mean by both of these: when you are using a 32-bit platform (small Amazon AWS hosts, some smaller VPS machines on other hosts), you are limited to 2 gigs of data and indexes with MongoDB.  This is because it uses memory-mapped files as an interface to on-disk files, which limits you to the architecture's address space.  As such, on a 32-bit host MongoDB only makes sense for relatively small data sets; otherwise you really need a 64-bit system.  Further, if you forget to explain your queries in advance to ensure that they use an index (easy to do with ad-hoc queries), you can end up scanning your multi-gig database for a simple count query.  On other databases (PostgreSQL and MySQL in the SQL world being the two I am most familiar with), a table scan will slow down other queries, but it won't destroy overall performance.  With MongoDB, that table scan will destroy write throughput for any other operation.  I have run into this myself.
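The arithmetic behind that 2-gig ceiling is worth spelling out.  A quick back-of-the-envelope sketch (the exact user/kernel split varies by OS, so treat the halving below as an assumption):

```python
# Why memory-mapped storage caps out on 32-bit hosts: the entire data
# file must fit inside the process's virtual address space.  A 32-bit
# pointer can address 4 GiB, and the OS typically reserves roughly half
# of that for the kernel, shared libraries, stacks, and the heap --
# leaving on the order of 2 GiB for mmap'd data and indexes combined.

address_space = 2 ** 32        # bytes addressable by a 32-bit pointer
usable = address_space // 2    # rough share left for user-space mappings

print(usable // 2 ** 30)       # -> 2 (GiB available for data + indexes)
```

A 64-bit address space removes this particular ceiling entirely, which is why the limitation only bites on smaller 32-bit hosts.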

To solve the table scan issue, we need to require an index for every query we perform (this is what Google's AppEngine datastore does).  This somewhat limits ad-hoc queries, but with the ability to add/remove secondary indexes during runtime, and with the ability to make queries against indexes while they are being built (and drop an index during its creation), we can start an index creation and immediately perform our query.  If the index doesn't yet contain enough data, we can wait and perform the query again.  Yes, the index building operation does perform a table scan to create the index, but it can be designed to balance itself against incoming queries so that performance doesn't suffer greatly.
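The behavior described above can be sketched in a few lines.  This is only an illustration of the policy, not the actual implementation; the class and method names (`IndexRegistry`, `plan_query`, `NoIndexError`) are hypothetical:

```python
# Sketch of the "no surprise queries" rule: every query must name a set
# of columns covered by an index, or it is rejected outright instead of
# silently falling back to a table scan.  Indexes are queryable the
# moment creation starts, even while the backfill scan is in progress.

class NoIndexError(Exception):
    pass

class IndexRegistry:
    def __init__(self):
        # maps a tuple of indexed column names -> per-index state
        self.indexes = {}

    def add_index(self, columns):
        # queryable immediately; the backfill happens in the background
        self.indexes[tuple(columns)] = {'building': True}

    def finish_build(self, columns):
        self.indexes[tuple(columns)]['building'] = False

    def plan_query(self, columns):
        key = tuple(columns)
        if key not in self.indexes:
            raise NoIndexError(
                "no index on %r; refusing to table-scan" % (columns,))
        return self.indexes[key]

registry = IndexRegistry()
registry.add_index(['last_name', 'first_name'])

# Queryable right away, even though the index is still being built...
plan = registry.plan_query(['last_name', 'first_name'])
assert plan['building']

# ...but an unindexed query fails fast instead of scanning the table.
try:
    registry.plan_query(['age'])
except NoIndexError:
    pass
```

The caller can inspect the `building` flag to decide whether a sparse result set means "no matches" or merely "index not finished yet, retry later".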

Over the course of the coming weeks, I will be documenting the process of building this new NoSQL database from the ground up.  Yes, I said weeks.  Part of the design and implementation will include the use of a readily-available, fast, durable data store (which bypasses many of the nasty implementation details), wrapped with an interface to offer all of the requirements and the first five pluses I list above.  I do have a design for replication/sharding/clustering, but it's the only part of the system I've not built yet.  We'll see how it goes.  I will be posting source on Github as I discuss the design of the system, offering code for the nitty-gritty details of the higher-level design I describe.
