- Using the filesystem as a persistence and synchronization mechanism.
- Using a write-ahead log as a source of truth.
- Creating crash tolerant (or crash-only) systems.
The paper describes a key-value store with granular locking, write-ahead logs and snapshotting (log compaction). It then describes a Chubby/Zookeeper precursor built using the system. My aim is only to implement the former (for now). I will also try to add some ideas from more modern systems, like immutability, if I have the time and inclination.
The aim is to start from a very naive implementation that works and then make it more idiomatic, or have better names, or add performance improvements.
This first post covers creating the in-memory key-value store and the write-ahead log without compaction. Along the way we'll take a look at the design choices and how Rust helps to write safer programs.
The in-memory key-value store
CodeThe in-memory KV store is really easy to implement in Rust. A built-in HashMap and generics makes it very similar to C++ or Java. Ignore the extraneous imports and the
impl Drop for SimpleDB...
, they are remnants of some
experiments. The Path and I/O related operations follow in the next section.In true low-level goodness, our database will operate only on arrays of bytes. There are no length restrictions on the size of either keys or values. A
Vec<u8>
captures this.Our library exposes only one public struct,
SimpleDB
.type SimpleCollection = HashMap<Vec<u8>, Vec<u8>>;
type Records = Arc<RwLock<SimpleCollection>>;
pub struct SimpleDB {
records: Records
}
We do want users to be able to use the same database from multiple threads. Rust's thread-safety is a diamond-in-the-rough for those new to the language. It demands a lot of thinking when designing the program (and grappling with weird errors sometimes), but when everything is shaken out, it makes sense and you get strong guarantees that there aren't data races waiting to happen.
We are going to skimp on per-key locking for now and lock the entire database for any operation. Inefficient, but simple. We will also only use a read-write lock unlike the read-update-exclusive lock used in the paper, simply because it already exists in the standard library.
When I started hacking on this, I wondered why locks and mutexes needed to be wrapped in
Arc
when they were supposed to be thread-safe on their own. It is
because the lock itself is separate from the data it guards. Locks and mutexes
provide safe internal mutability on the data they guard. They do so by
leveraging the Rust ownership rules so that only one thread can ever acquire
the contents for mutation. But the lock instance itself is also shared between
threads. The rust compiler requires all objects to only be owned by one other
object. The language provides various escape hatches that internally dip into
unsafe
code. The intention is that those hatches are vetted for safety while
requiring higher level code to conform to strict requirements. For an object
(in this case, the lock) to safely be 'owned' by multiple threads, it is
wrapped in an atomic ref-counting pointer that will ensure that the lock itself
is only deleted once, when all threads have released it. These are the sort of
mistakes easy to make in other systems languages that Rust catches at compile
time.We also make sure that
SimpleDB
can be used on multiple threads:unsafe impl Sync for SimpleDB {}
unsafe impl Send for SimpleDB {}
There is actually no need to make this explicit. The compiler can infer that all the members of
SimpleDB
are themselves thread-safe, so SimpleDB
is too.How Rust captures thread-safety via traits is fascinating, but too big to cover here in it's entirety. I recommend reading the concurrency chapter in the book.
We expose a static "constructor"
open()
that does some other things and
then create()
s and returns an empty database.let db = SimpleDB {
// ...,
records: Arc::new(RwLock::new(SimpleCollection::new()))
};
Ok(db)
The
get()
method acquires the underlying HashMap in read-only mode and
returns a value if it exists. Rust's Option<>
struct captures the possibility
of emptyness and provides several "Promise-like" combinators to make decisions
based on empty or non-empty.pub fn get<S: Into<Vec<u8>>>(&self, key: S) -> Option<Vec<u8>> {
self.records.read().ok().and_then(|guard| guard.get(&key.into()).map(|val| val.clone()))
}
The lock's
read()
getter may fail if the lock is poisoned, so it returns
a Result
. RwLocks get poisoned if a writer panics while holding the lock.
I found this idea fascinating, but we will ignore it for now and use some
conversions to get access to the lock's guard. As long as the guard object
stays alive, the lock is read-locked. guard
also acts like a smart pointer
(by implementing Deref
) so we can call get()
The
put
and delete
operations are similar. We acquire a write lock and
insert or remove keys from the hash table.In all these operations, we alias
io::Error
to our library's exported Error type. My error message Boo
isn't very useful either. We'll leave this as it is for a while, and look at maintaining a Write-Ahead Log next.Adding some disk operations
This is the revision of code used for the following section.A write-ahead log maintains a record of mutations on disk. The change is only reflected in memory after it is written to disk. This allows recovery to the last known consistent point from the application's point of view in case of a crash. Production ready databases allow tuning how often the write ahead log is fsynced to save on disk I/O cost vs. possibility of losing data. We do no such thing, every mutation will be fsynced.
Users of our library will pass a directory name to the
SimpleDB
constructor
and all our I/O operations will be performed on files within this directory.let mut db = try!(SimpleDB::open("data-dir"))
Notice how we use Rust's traits to accept anything that is willing to convert
to a reference to a Path. The Path
type is unsized and meant to be used as
a pointer. The PathBuf
is sized, owned and mutable. The AsRef<Path>
allows
users to pass a PathBuf, a string or any custom type willing to deref to
a Path. We then try to create the directory. Once that is done, we create an
instance of the Log
. The Log
constructor creates a writable, append-only
operations.log
.The lack of a
null
value in Rust leads to a pattern of creating fully
initialized objects whenever possible. In C++ I'd leave several fields as
nullptr
in the constructor, then have setters. Anybody using those fields
would have to check for non-null to avoid crashes. This often meant forgetting
to check something and leading to lower quality software. In Rust you are
either forced to have a Option<>
member, which the compiler will force you to
deal with at all callsites, or to use a builder pattern to set up
even a complicated object as fully initialized. While more cumbersome when
writing, it leads to safer code and nicer APIs.Who should own the log? Since the log should also be protected by the mutex and we will operate on the log and hashtable together, it makes sense to put it into the
Mutex
. This makes it compile-time impossible to acquire the log without acquiring the mutex.The
get
operation remains unchanged. put
and delete
now call the relevant
methods on Log
before making the change to themselves.We'll use Rust's rich enums to represent log operations. I realized that this was overkill while writing this post, but changesets are immutable, so I'll simplify the code later.
enum LogOperation {
Put(Vec<u8>, Vec<u8>),
Delete(Vec<u8>),
}
The
Log
operations delegate to the append
method which actually writes
bytes to the file. The log format is simple. Each list of bytes is prefixed
with it's length as a 8-byte, big-endian number. We use a p
for a put and d
for delete. So a sample log would look like (each character represents
1 byte):00000001p00000005hello00000006world!00000001d00000005hello
This allows fast sequential reading of the file. We will use the byteorder crate to perform the u64 conversion.
fn append(&mut self, op: LogOperation) -> io::Result<()> {
let mut bytes = Vec::new();
match op {
LogOperation::Put(ref key, ref value) => {
bytes.append(&mut encode_vec(&['p' as u8]));
bytes.append(&mut encode_vec(key));
bytes.append(&mut encode_vec(value));
},
LogOperation::Delete(ref key) => {
bytes.append(&mut encode_vec(&['d' as u8]));
bytes.append(&mut encode_vec(key));
}
}
self.file.write_all(&bytes).and_then(|unused| self.file.sync_data())
}
We write out the bytes, then use
sync_data()
to force writes to disk on
systems which support it. This gives us a key value store that can save data to
disk. The problem is, a new run of the program doesn't load the data back from
disk. We'll get to that in the next post.