In this post I'll explain why I like mmap and why it is a useful tool to have in your toolbox. There will be some Python code samples that show its uses.
While all examples use Python (and the scientific packages built around it), most of this post is pretty generic: mmap is a POSIX system call available from C, and thus from pretty much any other language. Parts of what I say here also work on Windows (the Python API is kinda sorta similar, but the C API is totally different). Here is a SO question on what you can do using Windows and memory mapped files.
mmap is magic!
This is super simple, no file.read(), no buffers, no parsing. Just mmap and you have data at your fingertips.
But here is the magic:
It won't load the whole file into memory.
The file might be bigger than available memory!
Your file will be stored in memory only once without any unnecessary copying, even if multiple processes read from it (and depending on options write to it).
When you read a file using the read call, you will generally copy data.
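To make the points above concrete, here is a minimal sketch of mapping a file and reading from it. The scratch file and its contents are made up for this illustration; only the slices you ask for are copied out of the mapping.

```python
import mmap
import os
import tempfile

# Hypothetical scratch file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(path, "wb") as f:
    f.write(b"0123456789" * 1000)

with open(path, "rb") as f:
    # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        size = len(mapped)      # size of the mapping, no data read yet
        first = mapped[:10]     # copies just these 10 bytes out of the mapping
        last = mapped[-10:]     # jumps to the end without reading the middle

print(size, first, last)
```

Note that there is no explicit read call and no buffer management: the mapping behaves like a bytes-like object backed by the file.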
You can do a lot of cool stuff with it:
You can work on files bigger than your RAM, without worrying too much (depending on your access patterns this might not be so fast).
You can load binary data from disk insanely fast (no data copying, no parsing, one syscall, some kernel magic, and everything is ready).
You can do fast inter-process communication.
How does mmap work
Don't worry if you don't get all the details; low-level OS details are often not obvious. I'll show you simple code samples that show how to use mmap in your programs, so feel free to skip this section on first read.
A major task of the OS is to abstract hardware away from processes. Processes should not care about:
- whether your hard drive uses SATA or PCIe;
- the speed of your RAM;
- how much RAM is currently available;
- what kind of graphics card you have.
Obviously processes can care about RAM (if a process happens to need gigabytes of cache) or the graphics card (if a process happens to use 3D or CUDA), but usually a process can just not care.
The same goes for RAM: processes should not really care whether enough memory is available. They request RAM, and it's the OS's job to make sure enough RAM is available.
When you request memory from the OS (e.g. by a malloc call), you generally get a memory pointer.
However, you don't get a pointer to any particular physical address; you get a pointer to virtual memory, which (at some point) is translated to physical memory by the CPU.
Memory is split into blocks of fixed size (4 KiB on my OS) called pages. When a program requests memory from the OS, the OS just allocates new virtual memory pages to this program and (eventually, often on first use) maps these pages to physical memory.
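If you want to check the page size on your own machine, Python exposes it as mmap.PAGESIZE. The helper below is a hypothetical illustration (not part of any library) of how a request of arbitrary size rounds up to whole pages.

```python
import mmap

page_size = mmap.PAGESIZE  # commonly 4096, but it depends on the OS and CPU


def pages_needed(nbytes, page=page_size):
    """Hypothetical helper: number of whole pages needed to hold nbytes."""
    return -(-nbytes // page)  # ceiling division


print(page_size, pages_needed(1), pages_needed(page_size + 1))
```

Even a one-byte request occupies a whole page, and a request one byte larger than a page needs two.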
mmap uses this mechanism: when you map a file using this call, you get a pointer to a virtual memory region. There is no actual disk IO during the mmap call itself; pages are loaded as you read them (any modifications will also be synchronized to disk eventually). Pages read from disk stay in RAM as long as there is enough physical memory; if the amount of free memory decreases, the OS will start writing changes to disk and dropping pages from the cache.
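"Eventually" can be made explicit. The sketch below, using a made-up scratch file, writes through a shared mapping, asks the OS to write the dirty page back with flush(), and then reads the bytes with an ordinary file read.

```python
import mmap
import os
import tempfile

# Hypothetical scratch file: one page of zeroes.
path = os.path.join(tempfile.mkdtemp(), "flush-demo.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * mmap.PAGESIZE)

with open(path, "r+b") as f:
    # On Unix the default mapping is shared and read-write.
    with mmap.mmap(f.fileno(), 0) as mapped:
        mapped[:5] = b"hello"  # modifies the page in memory
        mapped.flush()         # ask the OS to write dirty pages back now

with open(path, "rb") as f:
    on_disk = f.read(5)

print(on_disk)
```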
Two processes can share the same memory mapped file. If the share is read-only, only a single copy of the file contents will exist in physical memory!
Memory can be shared in either:
- read-write mode, where all processes can read and write to the same memory segment;
- copy-on-write mode, where each child process originally sees the same data, but if any of them writes to this memory, a copy is transparently created.
How to use mmap
Python has a very thin wrapper around the mmap POSIX system call, so I'll use Python in the samples; however, translation to C should be straightforward.
```python
import mmap
import multiprocessing
import os

TOY_FILE_LENGTH = 1024 * 1024

# Make sure the playground directory exists
os.makedirs('/tmp/mmap-playground', exist_ok=True)
# Open the file
example = open('/tmp/mmap-playground/toy', 'wb+')
# Ensure file has proper length (fill with zeroes)
os.posix_fallocate(example.fileno(), 0, TOY_FILE_LENGTH)
# Create a mapping
mmaped_file = mmap.mmap(example.fileno(), TOY_FILE_LENGTH, mmap.MAP_SHARED)
# Read the mapping
print(mmaped_file[:16])
```
You can also create memory regions not backed by a file:
```python
# To create an anonymous region pass -1 as the file descriptor.
# mmap.MAP_SHARED means that child processes can write to this region.
mapped_anonymous_memory_region = mmap.mmap(-1, 1024 * 1024, mmap.MAP_SHARED)
```
Memory is automatically shared with child processes:
```python
def run_in_process():
    mapped_anonymous_memory_region[:11] = b'Hello World'

# This relies on the child being created with fork,
# so that it inherits the mapping from the parent.
process = multiprocessing.Process(target=run_in_process)
process.start()
# Wait for process to finish
process.join()
assert mapped_anonymous_memory_region[:11] == b'Hello World'
```
You can also share in copy-on-write (COW) mode: a subprocess can now modify its own copy of the mapping, but other copies won't be updated:
```python
# instead of mmap.MAP_SHARED, mmap.MAP_PRIVATE is used
mapped_anonymous_private = mmap.mmap(-1, 1024 * 1024, mmap.MAP_PRIVATE)
mapped_anonymous_private[:11] = b'Hello World'

def run_in_process():
    # Check data is shared
    assert mapped_anonymous_private[:11] == b'Hello World'
    # Try to modify
    mapped_anonymous_private[:11] = b'Hello Me!!!'
    # Assert data is modified locally
    assert mapped_anonymous_private[:11] == b'Hello Me!!!'

process = multiprocessing.Process(target=run_in_process)
process.start()
process.join()
# Check local copy has not been modified
assert mapped_anonymous_private[:11] == b'Hello World'
```
Digression: loading binary files is super fast
Some time ago I read Joel Spolsky's article about why .doc (not the .docx format, which is basically XML) is super complex.
Apart from obvious reasons (20 years of backward compatibility leads to super complex systems), the reason was basically speed.
Various versions of Word were created when computers had megabytes of RAM (Office 95 required at least 8 MB of RAM!). Loading whole files into memory was not really an option; equally, parsing these files on the fly was not really an option (speed). So they just (more or less) dumped C structures on disk (along with some index structure). Loading these files required just loading the binary index (fast), then copying selected parts of the file into memory and casting them directly to a C structure (coincidentally, we will be doing a very similar thing in Python in a moment).
These are binary formats, so loading a record is usually a matter of just copying (blitting) a range of bytes from disk to memory, where you end up with a C data structure you can use. There’s no lexing or parsing involved in loading a file. Lexing and parsing are orders of magnitude slower than blitting.
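The "no parsing, just fixed-size records" idea can be sketched in Python with the struct module on top of a mapping. The record layout below (a 4-byte id followed by an 8-byte float) is an assumption made up for this example and has nothing to do with the real .doc format.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout: little-endian uint32 id + float64 value.
RECORD = struct.Struct("<Id")

# Dump 1000 fixed-size records to a scratch file.
path = os.path.join(tempfile.mkdtemp(), "records.bin")
with open(path, "wb") as f:
    for i in range(1000):
        f.write(RECORD.pack(i, i * 0.5))

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        # Jump straight to record 742: its offset is index * record size,
        # so nothing before it has to be read or parsed.
        record_id, value = RECORD.unpack_from(mapped, 742 * RECORD.size)

print(record_id, value)
```

Because every record has the same size, finding record N is plain arithmetic, which is what makes this kind of format so fast to load.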
mmap and numpy
Numpy has native support for mmap, so let's start with something simple:
```python
import numpy as np

# Create toy array
GIGA = 1024 * 1024 * 1024
# Create array full of zeros
array = np.memmap(
    '/tmp/mmap-playground/big-test',
    dtype=np.float64, shape=(GIGA, 3), mode='w+'
)
# Yes, you have created a 24 GB array (if you are
# wondering whether it is suspiciously fast,
# you are right, but more on that shortly)
array[:10, :] = np.arange(10).reshape(10, 1)
# Deleting an array flushes it to disk.
# You can (and should!) explicitly call ``array.flush``.
del array

copy = np.memmap(
    '/tmp/mmap-playground/big-test',
    dtype=np.float64, shape=(GIGA, 3), mode='r+'
)
# Now let's check if we got the same data:
assert np.all(copy[:10] == np.arange(10).reshape(10, 1))
assert np.all(copy[10:100] == 0)
```
Microbenchmark! Array reading speed
Reading binary files is way faster than reading text files (parsing overhead).
|Opening a 10 million by 3 float array|
|np.loadtxt|np.fromfile|np.memmap|
|47,000 ms|119 ms|35 ms|
When Pickle protocol 5 is available, redo these benchmarks.
Here are three functions I benchmarked:
```python
def test_memmap():
    array = np.memmap('/tmp/mmap-playground/bench-read', ...)
    return array.sum()

def test_csv():
    array = np.loadtxt('/tmp/mmap-playground/bench-read.csv')
    return array.sum()

def test_fromfile():
    array = np.fromfile('/tmp/mmap-playground/bench-read')
    return array.sum()
```
It's obvious that reading from a text format is way slower than reading a raw binary file (but I was surprised the speed difference was that big).
What was also kind of surprising is that using np.fromfile was slower than np.memmap.
I did some digging around np.fromfile, and the probable culprit was copying data from buffer to buffer (which is absent from the np.memmap version).
A more interesting result is that np.loadtxt took more than a thousand times longer than np.memmap. I expected a huge difference, but this is so big it looks wrong. I think the reason is that np.loadtxt does almost all of its work in plain Python code (and we all know that the Python VM is not fast); reading the same file using csv.reader (which has a C implementation) takes 8 seconds.
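If you want to reproduce this kind of comparison yourself, here is a rough sketch. It is not the harness used for the numbers above: the scratch directory, the random data, and the much smaller array shape are all assumptions chosen so that it runs in a few seconds.

```python
import os
import tempfile
import timeit

import numpy as np

workdir = tempfile.mkdtemp()          # hypothetical scratch directory
binary_path = os.path.join(workdir, "bench-read")
csv_path = binary_path + ".csv"
shape = (100_000, 3)                  # far smaller than the array in the post

data = np.random.default_rng(0).random(shape)
data.tofile(binary_path)              # raw float64 values, no header
np.savetxt(csv_path, data)            # the same values as text


def read_memmap():
    return np.memmap(binary_path, dtype=np.float64, mode="r", shape=shape).sum()


def read_fromfile():
    return np.fromfile(binary_path, dtype=np.float64).sum()


def read_loadtxt():
    return np.loadtxt(csv_path).sum()


for fn in (read_memmap, read_fromfile, read_loadtxt):
    # Best of three single runs, reported in milliseconds.
    seconds = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms")
```

Keep in mind that after the first run the files sit in the OS page cache, so this mostly measures copying and parsing overhead rather than disk speed. Absolute numbers will also differ between numpy versions and machines.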
The last (but also fascinating) thing I like about mmap and friends is that they all work with sparse files. Sparse files are files that don't store empty blocks physically on disk, but instead store information about which blocks are filled. When reading a sparse file, empty blocks return zeroes.
Here we create a 24 GB array on disk. This call takes 23 ms on my laptop, so there must be some magic here!
array = np.memmap('/tmp/mmap-playground/sparse-test', dtype=np.float64, shape=(GIGA, 3), mode='w+')
Let's check the file size! ls says that everything is OK.
```
ls -lah /tmp/mmap-playground/sparse-test
-rw-r--r-- 1 jb jb 24G Dec 30 14:30 /tmp/mmap-playground/sparse-test
```
However, du will tell the truth:
```
du -hs /tmp/mmap-playground/sparse-test
4.0K /tmp/mmap-playground/sparse-test
```
Apart from being cool I didn't find any use for sparse files in my sciencey computing adventures.
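The same ls versus du comparison can be done from Python with os.stat. This is a sketch under the assumption of a Unix system whose file system supports sparse files; the scratch path and the 64 MiB size are made up for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sparse-demo")  # hypothetical file
apparent = 64 * 1024 * 1024  # 64 MiB

with open(path, "wb") as f:
    f.truncate(apparent)     # extends the file without writing any data

info = os.stat(path)
# st_blocks is reported in 512-byte units on Linux (Unix-only attribute).
allocated = info.st_blocks * 512

print("apparent size:", info.st_size)   # what ls reports
print("allocated on disk:", allocated)  # roughly what du reports
```

On a file system with sparse file support the allocated size should be tiny compared to the apparent size.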
Extra reading | references
In POSIX you open a file by calling the open function. This function returns a file descriptor, that is, a small non-negative integer which is then used to identify the opened file when using the read and write functions.
In Python you can get a file descriptor either by using os.open, or by calling file.fileno() on an opened file.
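Both ways of getting a descriptor look like this in practice (the scratch path is made up for the example):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "fd-demo")  # hypothetical file

# Low-level: os.open returns the descriptor itself, a plain int.
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
os.write(fd, b"abc")
os.close(fd)

# High-level: a file object wraps a descriptor, exposed by fileno().
with open(path, "rb") as f:
    wrapped_fd = f.fileno()
    content = os.read(wrapped_fd, 3)  # read through the raw descriptor

print(fd, wrapped_fd, content)
```

Either kind of descriptor can be passed as the first argument to mmap.mmap.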
When sharing data in read-write mode, bear in mind that you need to perform explicit locking.
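A minimal sketch of such locking, assuming a Unix system where child processes are created with fork (so they inherit the anonymous mapping). The shared counter layout and the increment function are made up for this illustration.

```python
import mmap
import multiprocessing
import struct

# Assumption: fork is available, so children inherit the mapping and the lock.
ctx = multiprocessing.get_context("fork")

shared = mmap.mmap(-1, mmap.PAGESIZE, mmap.MAP_SHARED)  # anonymous, shared
lock = ctx.Lock()
COUNTER = struct.Struct("<Q")      # one unsigned 64-bit counter at offset 0
COUNTER.pack_into(shared, 0, 0)


def increment(times):
    for _ in range(times):
        # Read-modify-write is not atomic, so it must happen under the lock.
        with lock:
            (value,) = COUNTER.unpack_from(shared, 0)
            COUNTER.pack_into(shared, 0, value + 1)


workers = [ctx.Process(target=increment, args=(1000,)) for _ in range(4)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

(total,) = COUNTER.unpack_from(shared, 0)
print(total)
```

Without the lock, two processes can read the same old value and both write back the same incremented value, silently losing updates.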
If two processes use the same page of the mmaped region, accesses may result in a page fault that will remap physical address between virtual addresses for these two processes.
Here is a relevant quote from a very nice IBM manual.
Here is the relevant snippet:
```python
def fromfile(fd, ...):
    # A lot of one time setup
    # descr is essentially dtype
    _array = recarray(shape, descr)
    nbytesread = fd.readinto(_array.data)
    # extra checks
    return _array
```
Now let's dig deeper: readinto is a method on a Python file object (really BufferedReader.readinto). Long story short, it issues the read call in a while loop until all data is read into the buffer passed as an argument.
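The idea of that loop can be sketched in pure Python. This is not the CPython source, only an illustration of the pattern; read_exactly is a hypothetical helper and the scratch file is made up.

```python
import os
import tempfile


def read_exactly(raw, buffer):
    """Hypothetical helper: call readinto until buffer is full or EOF."""
    view = memoryview(buffer)
    total = 0
    while total < len(view):
        n = raw.readinto(view[total:])  # may fill less than requested
        if not n:                       # 0 at EOF (None if non-blocking)
            break
        total += n
    return total


payload = bytes(range(256)) * 64
path = os.path.join(tempfile.mkdtemp(), "readinto-demo")
with open(path, "wb") as f:
    f.write(payload)

buf = bytearray(len(payload))
# buffering=0 gives a raw, unbuffered file object.
with open(path, "rb", buffering=0) as raw:
    nread = read_exactly(raw, buf)

print(nread)
```

Every pass through the loop copies bytes from the kernel into the caller's buffer, which is exactly the copying that a memory mapping avoids.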
Since I spent too much time searching for this code in the wrong places, here is a reference to the actual loop.
I use quotes around "problems" because in practice, more often than not, the GIL is really more of a minor nuisance than a problem. In science you can most often get more performance by employing C (or Cython) code instead of naive multiprocessing; also, if you release the GIL in your C code you'll suddenly get good multi-threading performance.
If you really need performance, try using as much numpy as possible, and then try using threading.
When you need multiprocessing, mind the cost of pickling data (and possibly use mmap).
This algorithm is not exact; I devised it off the top of my head in like 30 seconds. However, I know people that use very similar algorithms to estimate the median of datasets that don't fit into memory.