mmap, my favourite system call

In this post I'll explain why I like mmap and why it's a useful tool to have in your toolbox. There will be some Python code samples that show its uses.

Note

While all examples will use Python (and scientific packages used from Python), most of this post is pretty generic (mmap is a POSIX system call available from C and thus from pretty much any other language). Parts of what I say here will also work on Windows (the Python API will be kinda sorta similar, but the C API will be totally different). Here is a SO question on what you can do with memory mapped files on Windows.

mmap is magic!

mmap is a system call that takes a file descriptor [1] and returns a memory pointer that points to the beginning of the file contents (you can of course slice the file at will).

This is super simple, no file.read(), no buffers, no parsing. Just mmap and you have data at your fingertips.

But here is the magic:

  1. It won't load the whole file into memory.

  2. The file might be bigger than available memory!

  3. Your file will be stored in memory only once without any unnecessary copying, even if multiple processes read from it (and depending on options write to it).

    When you read a file using the read call, you will generally copy data.

You can do a lot of cool stuff with it:

  1. You can work on files bigger than your RAM without worrying too much (depending on your access patterns this might not be so fast).

  2. You can load binary data from disk insanely fast (no data copying, no parsing; one syscall, some kernel magic, and everything is ready).

  3. You can do fast inter-process communication:

    Some people also use shared memory to push tens of GB/s of raw packets from one process to another (which is super cool).

How does mmap work

Don't worry if you don't get all the details, low-level OS details are often not obvious. I'll show you simple code samples that show how to use mmap in your programs, so feel free to skip this section on first read.

A major task of the OS is to abstract the hardware away from processes; a process should not have to care about:

  • Whether your hard drive uses SATA or PCIe;
  • How fast your RAM is;
  • How much RAM is currently available;
  • What kind of graphics card you have.

Obviously a process can care about RAM (if it happens to need gigabytes of cache) or about the graphics card (if it uses 3D or CUDA), but usually a process can just not care.

The same goes for RAM: processes should not really care whether enough memory is available; they request memory and it's the OS's job to make sure enough RAM is there.

When you request memory from the OS (e.g. via a malloc call), you generally get back a memory pointer.

However, you don't get a pointer to any particular physical address; you get a pointer to virtual memory, which (at some point) is translated to physical memory by the CPU.

Memory is split into blocks of fixed size (4 KB on my OS) called pages. When a program requests memory from the OS, the OS will just allocate new virtual memory pages to this program and (eventually, often on first use) map these pages to physical memory.

mmap uses this mechanism: when you map a file using this call, you get a pointer to a virtual memory region. There is no actual disk IO during the mmap call itself; pages will be loaded as you read them (any modifications will also be synchronized to disk eventually). Pages read from disk will stay cached in RAM as long as there is enough physical memory; if memory runs low, the OS will start writing changes back to disk and evicting pages from the cache.
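
To get a feel for the lazy loading, here is a small sketch (the path /tmp/mmap-playground/huge is a placeholder for any big file you have lying around): the mmap call itself returns immediately, and the first access to a page is what actually triggers disk IO.

import mmap, os, time

f = open('/tmp/mmap-playground/huge', 'rb')   # any big file will do
size = os.fstat(f.fileno()).st_size

t0 = time.perf_counter()
mapping = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)  # length 0 maps the whole file
print('mmap call took', time.perf_counter() - t0)        # microseconds, no disk IO yet

t0 = time.perf_counter()
mapping[size // 2]                                        # touching a byte faults in a single page
print('first access took', time.perf_counter() - t0)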

Two processes can share the same memory mapped file. If the mapping is read-only, only a single copy of the file contents will exist in physical memory!

You can even use mmap to share read-write memory between processes [2]: remember, mmapped files are mapped to a single region of physical memory, so sharing data is essentially free [3].

Memory can be shared either:

  • in read-write mode, where all processes can read and write the same memory segment;
  • in copy-on-write mode, where each child process initially sees the same data, but if any of them writes to this memory, a copy is transparently created.

How to use mmap

Python has a very thin wrapper around the mmap POSIX system call, so I'll use Python in the samples; translation to C should be straightforward.

import mmap, os, multiprocessing

TOY_FILE_LENGTH = 1024 * 1024

# Make sure the playground directory exists
os.makedirs('/tmp/mmap-playground', exist_ok=True)

# Open the file
example = open('/tmp/mmap-playground/toy', 'wb+')

# Ensure file has proper length (fill with zeroes)
os.posix_fallocate(example.fileno(), 0, TOY_FILE_LENGTH)

# Create a mapping
mmaped_file = mmap.mmap(example.fileno(), TOY_FILE_LENGTH, mmap.MAP_SHARED)

# Read the mapping
print(mmaped_file[:16])
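
Writes work the same way: because the mapping was created with MAP_SHARED, anything you write through it ends up in the underlying file. A small continuation of the example above:

# Write through the mapping and force a sync to disk
mmaped_file[:5] = b'hello'
mmaped_file.flush()

# The bytes are visible through a plain read() as well
with open('/tmp/mmap-playground/toy', 'rb') as check:
    assert check.read(5) == b'hello'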

You can also create memory regions not backed by a file:

# To create anonymous region pass -1 as file descriptor
# mmap.MAP_SHARED means that child processes can write to this array
mapped_anonymous_memory_region = mmap.mmap(-1, 1024 * 1024, mmap.MAP_SHARED)

Memory is automatically shared with child processes:

def run_in_process():
    mapped_anonymous_memory_region[:11] = b'Hello World'

process = multiprocessing.Process(target=run_in_process)
process.start()
# Wait for process to finish
process.join()

assert mapped_anonymous_memory_region[:11] == b'Hello World'

You can also share in copy-on-write (COW) mode; now the subprocess can modify its copy of the mapping, but other copies won't be updated:

# instead of mmap.MAP_SHARED, mmap.MAP_PRIVATE is used
mapped_anonymous_private = mmap.mmap(-1, 1024 * 1024, mmap.MAP_PRIVATE)
mapped_anonymous_private[:11] = b'Hello World'

def run_in_process():
    # Check the child initially sees the parent's data
    assert mapped_anonymous_private[:11] == b'Hello World'
    # Try to modify
    mapped_anonymous_private[:11] = b'Hello Me!!!'
    # Assert data is modified locally
    assert mapped_anonymous_private[:11] == b'Hello Me!!!'


process = multiprocessing.Process(target=run_in_process)
process.start()
process.join()

# Check local copy has not been modified
assert mapped_anonymous_private[:11] == b'Hello World'

Digression: loading binary files is super fast

Some time ago I read Joel Spolsky's article about why the .doc format (not .docx, which is basically XML) is super complex.

Apart from the obvious reasons (20 years of backward compatibility leads to super complex systems), the reason was basically speed.

Various versions of Word were created when computers had megabytes of RAM (Office 95 required at least 8 MB of RAM!). Loading whole files into memory was not really an option; equally, parsing these files on the fly was not really an option (speed). So they just (more or less) dumped C structures to disk (along with some index structure). Loading these files required just loading the binary index (fast), copying selected parts of the file into memory, and casting them directly to a C structure (coincidentally, we will be doing a very similar thing in Python in a moment).

These are binary formats, so loading a record is usually a matter of just copying (blitting) a range of bytes from disk to memory, where you end up with a C data structure you can use. There’s no lexing or parsing involved in loading a file. Lexing and parsing are orders of magnitude slower than blitting.

Quote from the aforementioned Spolsky article.
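
Just to make the idea concrete, here is a toy sketch with a made-up record layout (nothing like the real .doc format): with a fixed binary layout, "loading" a file is just reinterpreting its bytes.

import numpy as np

# A made-up record type: a 4-byte id plus two doubles, little-endian
record_t = np.dtype([('id', '<u4'), ('x', '<f8'), ('y', '<f8')])

records = np.zeros(1000, dtype=record_t)
records['id'] = np.arange(1000)
records.tofile('/tmp/mmap-playground/records.bin')   # raw dump of the "structs"

# "Loading" is just reinterpreting the bytes: no lexing, no parsing
loaded = np.memmap('/tmp/mmap-playground/records.bin', dtype=record_t, mode='r')
assert loaded[42]['id'] == 42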

mmap and numpy

Numpy has native support for mmap, so let's start with something simple:

# Create toy array
import numpy as np

GIGA = 1024 * 1024 * 1024

# Create array full of zeros
array = np.memmap(
    '/tmp/mmap-playground/big-test',
    dtype=np.float64,
    shape=(GIGA, 3),
    mode='w+'
)
# Yes you have created a 24 GB array (if you are
# wondering if it is suspiciously fast ---
# you are right, but more on that shortly)

array[:10, :] = np.arange(10).reshape(10, 1)

# Deleting an array flushes it to disk.
# You can (and should!) explicitly call ``array.flush``.
del array

copy = np.memmap(
    '/tmp/mmap-playground/big-test',
    dtype=np.float64,
    shape=(GIGA, 3),
    mode='r+'
)

# Now let's check if we got the same data:

assert(np.all(copy[:10] == np.arange(10).reshape(10, 1)))
assert(np.all(copy[10:100] == 0))

Microbenchmark! Array reading speed

Reading binary files is way faster than reading text files (parsing overhead).

Benchmark results: opening a 10 million by 3 float array

  np.loadtxt     47 000 ms
  np.fromfile       119 ms
  np.memmap          35 ms

Note

When pickle protocol 5 is available, redo these benchmarks.

Here are three functions I benchmarked:

def test_memmap():
    array = np.memmap('/tmp/mmap-playground/bench-read', ...)
    return array.sum()

def test_csv():
    array = np.loadtxt('/tmp/mmap-playground/bench-read.csv')
    return array.sum()

def test_fromfile():
    array = np.fromfile('/tmp/mmap-playground/bench-read')
    return array.sum()

It's obvious that reading a text format is way slower than reading a raw binary file (but I was surprised the speed difference was that big).

What was also kind of surprising is that np.fromfile was slower than np.memmap.

I did some digging around np.fromfile and the probable culprit is copying data from buffer to buffer [4] (which is absent from the np.memmap version).

A more interesting result is that np.loadtxt took more than a thousand times longer than np.memmap. I expected a huge difference, but this is so big it looks wrong. I think the reason is that np.loadtxt does almost all the work in plain Python code (and we all know the Python VM is not fast); reading the same file using csv.reader (which has a C implementation) takes 8 seconds.
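
A csv.reader-based version looks roughly like this (a sketch, assuming a space-delimited file like the one np.loadtxt reads):

import csv
import numpy as np

def test_csv_reader():
    # csv.reader has a C implementation, but the float() conversion still happens in Python
    with open('/tmp/mmap-playground/bench-read.csv') as f:
        rows = [[float(x) for x in row] for row in csv.reader(f, delimiter=' ')]
    return np.array(rows).sum()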

Using mmap to share memory between processes

Python has some "problems" [5] with multi-threaded performance due to the Global Interpreter Lock (or GIL for short). Here is a short explanation.

So in most cases, if you really need parallelism (and threading won't work), you might be inclined to try the multiprocessing module; however, sharing memory between processes is not that easy. And np.memmap arrays are great for it.

Let's calculate the median of a 16 GB array.

Set up code:

import multiprocessing, multiprocessing.dummy
import numpy as np
GIGA = 1024 * 1024 * 1024
ARR_SIZE = 8 * GIGA

data = np.memmap('/tmp/mmap-playground/random', shape=ARR_SIZE, dtype=np.int16, mode='w+')

# Fill the array with random data, one GIGA-element chunk at a time
for ii in range(0, ARR_SIZE, GIGA):
    data[ii:ii+GIGA] = np.random.randint(
        np.iinfo(data.dtype).min, np.iinfo(data.dtype).max, size=GIGA
    )

The naive (serial) implementation took 107 seconds:

%timeit np.median(data)

Let's speed it up. I propose the following median estimation algorithm [6]: break the array into N chunks, calculate the median of each chunk in parallel, and then calculate the median of medians.

So here is a (naive) parallel implementation:

def estimate_median_naive():
    CHUNK =  128 * 1024 * 1024
    pool = multiprocessing.Pool(4)
    partials = pool.map(np.median, (data[start:start+CHUNK] for start in range(0, ARR_SIZE, CHUNK)))
    return np.median(partials)

This naive parallel implementation took 58 seconds on a 4-core computer, so it is less than two times faster than the serial one.

The reason for this low speedup is that the arguments to pool.map are pickled, and pickle is not super fast with numpy arrays.

Note

Pickle was created as an on-disk serialization format, back when disks had spinning pieces of metal inside them, so extra memory copies were not that important.

Here is PEP 574, which aims to make pickle faster in memory-to-memory serialization scenarios.
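
To see the cost concretely, here is a quick sketch (reusing the data array from the setup above): pickling a single 128M-element int16 chunk materializes roughly 256 MB as a bytes object before it can even be sent to a worker.

import pickle

CHUNK = 128 * 1024 * 1024
blob = pickle.dumps(data[:CHUNK])
print(len(blob) / 1024 ** 2)   # roughly 256 MB copied just to cross a process boundary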

Here is an implementation that uses multiprocessing and memmap:

def calculate_median_process(start, stop=None):
    # Unpack arguments tuple
    if stop is None:
        start, stop = start
    data = np.memmap('/tmp/mmap-playground/random', shape=ARR_SIZE, dtype=np.int16, mode='r+')
    return np.median(data[start:stop])

def estimate_median_pool():
    pool = multiprocessing.Pool(4)
    CHUNK =  128 * 1024 * 1024
    start_indices = list(range(0, ARR_SIZE, CHUNK))
    indices = [(start, start+CHUNK) for start in start_indices]
    partials = pool.map(calculate_median_process, indices)
    return np.median(partials)

This implementation takes 28.5 seconds, which is almost a 4x speedup.

Sparse Files!!

The last (but also fascinating) thing I like about mmap and friends is that they all work with sparse files. Sparse files are files that don't physically store empty blocks on disk; instead they store information about which blocks are filled. When reading a sparse file, empty blocks return zeroes.

Here we create a 24 GB array on disk. This call takes 23 ms on my laptop, so there must be some magic here!

array = np.memmap('/tmp/mmap-playground/sparse-test', dtype=np.float64, shape=(GIGA, 3), mode='w+')

Let's check the file size! ls says that everything is OK.

ls -lah /tmp/mmap-playground/sparse-test
-rw-r--r-- 1 jb jb 24G Dec 30 14:30 /tmp/mmap-playground/sparse-test

However, du will tell the truth:

du -hs /tmp/mmap-playground/sparse-test
4.0K    /tmp/mmap-playground/sparse-test
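
You can check the same thing from Python: st_size is the logical length, while st_blocks counts the 512-byte blocks actually allocated on disk.

import os

st = os.stat('/tmp/mmap-playground/sparse-test')
print(st.st_size)           # the logical size: 24 GB
print(st.st_blocks * 512)   # bytes actually allocated on disk: a few KB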

Apart from being cool, I haven't found any use for sparse files in my sciencey computing adventures.

Footnotes

[1]

In POSIX you open a file by calling the open function. This function returns a file descriptor, that is, a small non-negative integer which is then used to identify the opened file when calling functions like read and write.

In Python you can get a file descriptor either by using os.open, or by calling file.fileno() on an opened file object.
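
For example (using the toy file from the examples above):

import os

# Low level: os.open returns the descriptor (a plain int) directly
fd = os.open('/tmp/mmap-playground/toy', os.O_RDWR)

# High level: a Python file object wraps a descriptor; fileno() exposes it
f = open('/tmp/mmap-playground/toy', 'r+b')
print(f.fileno())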

[2]When sharing data in read-write mode, bear in mind that you need to perform explicit locking.
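
A minimal sketch of such locking (assuming the fork-based multiprocessing start method used on Linux, so both the mapping and the lock are inherited by the children):

import mmap, multiprocessing, struct

# An 8-byte shared counter in anonymous shared memory, guarded by a lock
counter = mmap.mmap(-1, 8, mmap.MAP_SHARED)
lock = multiprocessing.Lock()

def increment(n):
    for _ in range(n):
        with lock:
            value, = struct.unpack('q', counter[:8])
            counter[:8] = struct.pack('q', value + 1)

workers = [multiprocessing.Process(target=increment, args=(10000,)) for _ in range(4)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

# Without the lock some increments would be lost
assert struct.unpack('q', counter[:8])[0] == 40000
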
[3]

If two processes use the same page of the mmapped region, accesses may result in a page fault that remaps the physical address between the virtual addresses of these two processes.

Here is a relevant quote from a very nice IBM manual.

Both the mmap and shmat services provide the capability for multiple processes to map the same region of an object such that they share addressability to that object. However, the mmap subroutine extends this capability beyond that provided by the shmat subroutine by allowing a relatively unlimited number of such mappings to be established. While this capability increases the number of mappings supported per file object or memory segment, it can prove inefficient for applications in which many processes map the same file data into their address space.

The mmap subroutine provides a unique object address for each process that maps to an object. The software accomplishes this by providing each process with a unique virtual address, known as an alias. The shmat subroutine allows processes to share the addresses of the mapped objects.

Because only one of the existing aliases for a given page in an object has a real address translation at any given time, only one of the mmap mappings can make a reference to that page without incurring a page fault. Any reference to the page by a different mapping (and thus a different alias) results in a page fault that causes the existing real-address translation for the page to be invalidated. As a result, a new translation must be established for it under a different alias. Processes share pages by moving them between these different translations.

[4]

Here is relevant snippet:

def fromfile(fd, ...):

    # A lot of one time setup

    # descr is essentially dtype
    _array = recarray(shape, descr)
    nbytesread = fd.readinto(_array.data)
    # extra checks
    return _array

Now let's dig into readinto: it is a method on Python file objects (really BufferedReader.readinto); long story short, it calls read in a while loop until all data has been read into the buffer passed as an argument.

Since I spent way too much time searching for this code in the wrong places, here is a reference to the actual loop.

[5]

I use quotes around "problems" because in practice, more often than not, the GIL is really more of a minor nuisance than a problem. In science you can most often get more performance by employing C (or Cython) code instead of naive multiprocessing; also, if you release the GIL in your C code you'll suddenly get good multi-threading performance.

If you really need performance, first try using as much numpy as possible, and then try using threading.

When you need multiprocessing, mind the cost of pickling data (and possibly use mmap).
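
For instance, a thread-based variant of the chunked median from the main text needs no pickling at all (a sketch reusing the data and ARR_SIZE definitions from above; many numpy routines release the GIL internally, so the threads can actually overlap):

import multiprocessing.dummy   # same API as multiprocessing, but backed by threads
import numpy as np

def estimate_median_threads():
    CHUNK = 128 * 1024 * 1024
    pool = multiprocessing.dummy.Pool(4)
    partials = pool.map(
        np.median,
        (data[start:start + CHUNK] for start in range(0, ARR_SIZE, CHUNK))
    )
    return np.median(partials)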

[6]This algorithm is not exact; I devised it off the top of my head in like 30 seconds. However, I know people who use very similar algorithms to estimate the median of datasets that don't fit into memory.