Friday, January 01, 2016

Writing a simple database in rust, part 1

A Simple and Efficient Implementation for Small Databases is an old paper that I've wanted to implement for a long time. It describes a simple application that embodies a lot of "systems" knowledge at the same time:
  • Using the filesystem as a persistence and synchronization mechanism.
  • Using a write-ahead log as a source of truth.
  • Creating crash tolerant (or crash-only) systems.
While ideally you'd want to write this in C or C++, mucking around with strings and files isn't a very pleasant experience unless you are willing to pull in some third-party libraries. Fortunately, in 2015 there is a language that is great for systems programming, ships with good libraries, and offers several high-level constructs that make writing such programs a pleasure. I've been playing around with Rust and it seems like a good tool for the job.
The paper describes a key-value store with granular locking, write-ahead logs and snapshotting (log compaction). It then describes a Chubby/Zookeeper precursor built using the system. My aim is only to implement the former (for now). I will also try to add some ideas from more modern systems, like immutability, if I have the time and inclination.
The aim is to start from a very naive implementation that works, and then iterate: more idiomatic code, better names, performance improvements.
This first post covers creating the in-memory key-value store and the write-ahead log without compaction. Along the way we'll take a look at the design choices and how Rust helps to write safer programs.

The in-memory key-value store

Code
The in-memory KV store is really easy to implement in Rust. The built-in HashMap and generics make it very similar to C++ or Java. Ignore the extraneous imports and the impl Drop for SimpleDB...; they are remnants of some experiments. The Path and I/O related operations follow in the next section.
In true low-level goodness, our database will operate only on arrays of bytes. There are no restrictions on the length of keys or values. A Vec<u8> captures this.
Our library exposes only one public struct, SimpleDB.

type SimpleCollection = HashMap<Vec<u8>, Vec<u8>>;
type Records = Arc<RwLock<SimpleCollection>>;

pub struct SimpleDB {
  records: Records
}

We do want users to be able to use the same database from multiple threads. Rust's thread-safety is a diamond-in-the-rough for those new to the language. It demands a lot of thinking when designing the program (and grappling with weird errors sometimes), but when everything is shaken out, it makes sense and you get strong guarantees that there aren't data races waiting to happen.
We are going to skimp on per-key locking for now and lock the entire database for any operation. Inefficient, but simple. We will also only use a read-write lock unlike the read-update-exclusive lock used in the paper, simply because it already exists in the standard library.
When I started hacking on this, I wondered why locks and mutexes needed to be wrapped in Arc when they were supposed to be thread-safe on their own. It is because the lock itself is separate from the data it guards. Locks and mutexes provide safe interior mutability for the data they guard. They do so by leveraging Rust's ownership rules so that only one thread can ever acquire the contents for mutation. But the lock instance itself is also shared between threads. The Rust compiler requires every value to have exactly one owner. The language provides various escape hatches that internally dip into unsafe code; the intention is that those hatches are vetted for safety while higher-level code conforms to strict requirements. For an object (in this case, the lock) to safely be 'owned' by multiple threads, it is wrapped in an atomically reference-counted pointer that ensures the lock itself is only deleted once, when all threads have released it. These are the sorts of mistakes that are easy to make in other systems languages and that Rust catches at compile time.
We also make sure that SimpleDB can be used on multiple threads:

unsafe impl Sync for SimpleDB {}
unsafe impl Send for SimpleDB {}

There is actually no need to make this explicit. The compiler can infer that all the members of SimpleDB are themselves thread-safe, so SimpleDB is too.
How Rust captures thread-safety via traits is fascinating, but too big to cover here in its entirety. I recommend reading the concurrency chapter in the book.
We expose a static "constructor" open() that does some other things and then create()s and returns an empty database.

let db = SimpleDB {
  // ...,
  records: Arc::new(RwLock::new(SimpleCollection::new()))
};
Ok(db)

The get() method acquires the underlying HashMap in read-only mode and returns a value if it exists. Rust's Option<> type captures the possibility of emptiness and provides several "Promise-like" combinators to make decisions based on empty or non-empty.

pub fn get<S: Into<Vec<u8>>>(&self, key: S) -> Option<Vec<u8>> {
  self.records.read().ok().and_then(|guard| guard.get(&key.into()).map(|val| val.clone()))
}

The lock's read() getter may fail if the lock is poisoned, so it returns a Result. RwLocks get poisoned if a writer panics while holding the lock. I found this idea fascinating, but we will ignore it for now and use some conversions to get access to the lock's guard. As long as the guard object stays alive, the lock is read-locked. guard also acts like a smart pointer (by implementing Deref), so we can call get() on it directly.
The put and delete operations are similar. We acquire a write lock and insert or remove keys from the hash table.
In all these operations, we alias io::Error to our library's exported Error type. My error message Boo isn't very useful either. We'll leave this as it is for a while, and look at maintaining a Write-Ahead Log next.

Adding some disk operations

This is the revision of code used for the following section.
A write-ahead log maintains a record of mutations on disk. The change is only reflected in memory after it is written to disk. This allows recovery to the last known consistent point from the application's point of view in case of a crash. Production ready databases allow tuning how often the write ahead log is fsynced to save on disk I/O cost vs. possibility of losing data. We do no such thing, every mutation will be fsynced.
Users of our library will pass a directory name to the SimpleDB constructor and all our I/O operations will be performed on files within this directory.
let mut db = try!(SimpleDB::open("data-dir"));
Notice how we use Rust's traits to accept anything that can be converted to a reference to a Path. The Path type is unsized and meant to be used behind a pointer. The PathBuf is sized, owned and mutable. The AsRef<Path> bound allows users to pass a PathBuf, a string or any custom type that can be viewed as a Path. We then try to create the directory. Once that is done, we create an instance of the Log. The Log constructor creates a writable, append-only operations.log.
The lack of a null value in Rust leads to a pattern of creating fully initialized objects whenever possible. In C++ I'd leave several fields as nullptr in the constructor, then have setters. Anybody using those fields would have to check for non-null to avoid crashes. This often meant forgetting to check something, leading to lower-quality software. In Rust you are either forced to have an Option<> member, which the compiler will force you to deal with at all callsites, or to use a builder pattern to set up even a complicated object as fully initialized. While more cumbersome when writing, it leads to safer code and nicer APIs.
Who should own the log? Since the log should also be protected by the mutex and we will operate on the log and hashtable together, it makes sense to put it into the Mutex. This makes it compile-time impossible to acquire the log without acquiring the mutex.
The get operation remains unchanged. put and delete now call the relevant methods on Log before making the change to themselves.
We'll use Rust's rich enums to represent log operations. I realized that this was overkill while writing this post, but changesets are immutable, so I'll simplify the code later.

enum LogOperation {
    Put(Vec<u8>, Vec<u8>),
    Delete(Vec<u8>),
}

The Log operations delegate to the append method which actually writes bytes to the file. The log format is simple. Each list of bytes is prefixed with its length as an 8-byte, big-endian number. We use a p for a put and d for delete. So a sample log would look like (each character represents 1 byte):

00000001p00000005hello00000006world!00000001d00000005hello

This allows fast sequential reading of the file. We will use the byteorder crate to perform the u64 conversion.
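The append code below relies on an encode_vec helper. Here is a stdlib-only sketch of the equivalent behavior; my reconstruction, while the actual code performs the u64 conversion with byteorder's write_u64::<BigEndian>():

```rust
// Length-prefix a byte slice with its length as an 8-byte big-endian integer.
fn encode_vec(data: &[u8]) -> Vec<u8> {
    let len = data.len() as u64;
    let mut out = Vec::with_capacity(8 + data.len());
    for shift in (0..8).rev() {
        out.push((len >> (shift * 8)) as u8); // most significant byte first
    }
    out.extend_from_slice(data);
    out
}

fn main() {
    let encoded = encode_vec(b"hello");
    assert_eq!(&encoded[..8], &[0u8, 0, 0, 0, 0, 0, 0, 5][..]); // length 5, big-endian
    assert_eq!(&encoded[8..], &b"hello"[..]);                   // then the payload
}
```

A reader can reverse this: read 8 bytes, decode the length, then read exactly that many bytes, repeating until end of file.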

fn append(&mut self, op: LogOperation) -> io::Result<()> {
    let mut bytes = Vec::new();
    match op {
        LogOperation::Put(ref key, ref value) => {
            bytes.append(&mut encode_vec(b"p"));
            bytes.append(&mut encode_vec(key));
            bytes.append(&mut encode_vec(value));
        },
        LogOperation::Delete(ref key) => {
            bytes.append(&mut encode_vec(b"d"));
            bytes.append(&mut encode_vec(key));
        }
    }
    self.file.write_all(&bytes).and_then(|_| self.file.sync_data())
}

We write out the bytes, then use sync_data() to force writes to disk on systems which support it. This gives us a key value store that can save data to disk. The problem is, a new run of the program doesn't load the data back from disk. We'll get to that in the next post.