Python has a super-simple, reasonably performant, on-disk key-value store, available in the shelve module.
To use it:

```python
import shelve

shelve_dict = shelve.open("/myapplication/db-file.db")
```
After this call, shelve_dict is a Python mutable mapping (think of it as a dictionary).
Please remember to call shelve_dict.close() after use!
I needed to add an intermediate result cache to a long-running process. Each object I needed to cache has a unique identifier, so a key-value store was a perfect fit. I needed to cache about 50,000 items, roughly 1 MB each.
This is a strictly single-threaded process running on a local machine, so using Memcached or Redis would be overkill --- not to mention that using 50 GB of RAM for this would not be cost-effective.
There is a nice shelve module in the Python standard library which implements a key-value store on top of either gdbm or ndbm (two on-disk hash table implementations).
I had a vague recollection of having performance problems with shelve, and I didn't find any recent benchmarks of this module (or benchmarks that fit my use case), so I decided to benchmark it myself. Moreover, while 50 GB of total database size is not much, it is not a trivial amount either, so I wanted to make sure I wouldn't run into performance problems in production.
The aforementioned performance problems happened years ago, in a totally different application (memoization for some high-energy-physics computation), and I might have misused the library.
Long story short: it is totally fast enough, with plenty to spare.
Machine and OS:
- Recent Debian
- Ryzen CPU
- PCIe M.2 SSD
- Encrypted (dm-crypt) btrfs
- 32 GB of RAM
Disabled COW on the test data file (btrfs defaults to copy-on-write, which is not very fast for big files).
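For reference, COW can be disabled per file on btrfs with chattr; the +C attribute only takes effect on an empty file (or on new files created inside a directory carrying the attribute), so it has to be set before any data is written. The path here is illustrative:

```shell
# Create the (empty) database file first, then mark it no-COW.
touch /tmp/test-data
chattr +C /tmp/test-data
```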
Set the dirty-page writeback delay to 1 second (one second after the first dirty page appears, all dirty pages are flushed to disk):
```shell
echo 100 > /proc/sys/vm/dirty_writeback_centisecs
echo 100 > /proc/sys/vm/dirty_expire_centisecs
```
After the write part I dropped the disk caches:
```shell
echo 3 > /proc/sys/vm/drop_caches
```
- Ran on Python 3.7.1
- Using an IPython notebook
```python
import os, sys, base64, datetime, random, pickle
import shelve

test_data = shelve.open("/tmp/test-data")

# Create 100 1 MB items to save (generating 50k of these takes a lot of time
# and would blur the results)
items = [os.urandom(1024 * 1024) for ii in range(100)]
```
Write part of benchmark:
```python
ENTRIES = 50_000
keys = []

with shelve.open(
    "/tmp/test-data",
    protocol=pickle.HIGHEST_PROTOCOL,
    writeback=False
) as test_data:
    for ii in range(ENTRIES):
        if ii % 250 == 0:
            print(ii)
            # Sync data so we don't measure RAM performance.
            test_data.sync()
        key = base64.b64encode(os.urandom(16)).decode('ascii')
        keys.append(key)
        test_data[key] = items[ii % len(items)]
```
Read part of benchmark:
```python
random.shuffle(keys)

with shelve.open("/tmp/test-data") as test_data:
    for key in keys:
        value = test_data[key]
        # decode step is just to make 100% sure that data is actually read
        # from disk.
        value.decode('ascii', errors='ignore')
```
The write part of the benchmark took about 5 minutes, which works out to roughly 165 entries per second (50,000 entries / ~300 s), i.e. roughly 165 MB written per second. That is way faster than I needed.
This is also consistent with iotop, which reported about 200-300 MB/s of disk writes during the test.
The read part took about 15 minutes (roughly 55 reads per second, which is still way faster than I need).
The Linux kernel used about 50% of the available RAM as a page cache, which is more than this process gets on the production system, so these results are not totally representative --- however, since in my case an entry will be added to the cache only every couple of seconds, the cache overhead will nevertheless be negligible.