Reading numpy structured arrays from a text file

Numpy has a very nice feature: structured arrays, that is, arrays in which every row has the same structure and each column can store a different type of data.

For example:

>>> import numpy as np
>>> arr = np.zeros(10, dtype=[('id', np.uint16), ('position', np.dtype('3float32')), ('momentum', np.dtype('3float32'))])

We have defined a structured array in which each row stores the id of a particle (an unsigned integer), its position (three floats) and its momentum (again three floats).

You can easily select from this array:

>>> arr['position'] # positions of all particles
>>> arr[0]['position'] # position of first particle
>>> arr[arr['id'] == 1]['position']  # positions of all particles with id equal to 1

This is a nice format because:

  • Your data has structure. No more off-by-one errors: particle position is labeled.
  • Very easy to load from binary files
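
As a rough illustration of the binary case, here is a minimal sketch. The name particle_dtype and the sample values are made up for this example, and raw stands in for the contents of a binary file that stores packed records in the same layout.

```python
import numpy as np

# Hypothetical dtype mirroring the particle example above.
particle_dtype = np.dtype([
    ("id", np.uint16),
    ("position", np.float32, (3,)),
    ("momentum", np.float32, (3,)),
])

original = np.zeros(3, dtype=particle_dtype)
original["id"] = [7, 8, 9]
original["position"][1] = [1.0, 2.0, 3.0]

# Stand-in for bytes read from a binary file of packed records.
raw = original.tobytes()

# One call interprets the raw bytes as structured records.
loaded = np.frombuffer(raw, dtype=particle_dtype)
print(loaded["id"])
print(loaded["position"][1])
```

For data that really lives on disk, np.fromfile accepts the same dtype argument.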

Loading from text files is an entirely different matter, because writing to such arrays is kind of a pain.

My requirements were:

  • Array structure is the same as source file structure (order of fields is the same)
  • Array structure is defined only in a single place: that is the dtype definition

Solution

The solution is to:

  • Read the file line by line, parsing the contents into an array with a flat dtype (one scalar field per column).
  • Create a structured view of that array.

This should be fast, which means no copying of large arrays.

Actual dtype used:

URQMD_DATA_DTYPE = [
    ("time", np.float32),
    ("position", np.dtype("3float32")),
    ("energy", np.float32),
    ("momentum", np.dtype("3float32")),
    ("mass", np.float32),
    ("particle_type", np.float32),
    ("additional", np.dtype("5int32")),
]

A helper function that takes a structured dtype and turns it into a flat dtype with one scalar field per value (sub-array fields are expanded), so that both dtypes have the same size in memory:

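def serialize_dtype(dt):
    """Flatten a structured dtype into one scalar field per value."""
    dt = np.dtype(dt)
    newdt = []
    for item in dt.descr:
        if len(item) == 2:
            # Scalar field: (name, type)
            name, type_str = item
            count = (1,)
        else:
            # Sub-array field: (name, type, shape)
            name, type_str, count = item
        if len(count) > 1:
            raise ValueError("only one-dimensional sub-arrays are supported")
        # Repeat the scalar type once per element of the sub-array
        newdt.extend([type_str] * count[0])
    return np.dtype(", ".join(newdt))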
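
To see what this kind of flattening relies on, the sketch below prints dtype.descr for a small dtype made up for this example, then builds a matching flat dtype by hand. The exact type string (for example '<f4') depends on the byte order of the machine.

```python
import numpy as np

# Small illustrative dtype: one scalar field and one sub-array field.
dt = np.dtype([("time", np.float32), ("position", np.float32, (3,))])

# Scalar fields give (name, type); sub-array fields give (name, type, shape).
for item in dt.descr:
    print(item)

# A flat dtype with one scalar field per value; numpy names them f0, f1, ...
type_str = dt.descr[0][1]
flat = np.dtype(", ".join([type_str] * 4))
print(flat.names)
print(flat.itemsize, dt.itemsize)
```

Both dtypes describe the same number of bytes per row, which is what makes a view between them possible.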

Here frame is a list of lines from the text file.

parsed = np.zeros(len(frame), dtype=serialize_dtype(URQMD_DATA_DTYPE))  # Create an array with a flat dtype
for ii, line in enumerate(frame):
    # Parse every column as a float, ignoring whether it is a float or an int
    data = [float(x) for x in line.split()]
    parsed[ii] = tuple(data)  # numpy converts each value to the type of its field
parsed = parsed.view(np.dtype(URQMD_DATA_DTYPE))  # Create a structured view (no copy!)
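
To check the whole idea on a small scale, here is a self-contained sketch. The dtype, the field name flags and the two input lines are made up for this example, and the flat dtype is written out by hand rather than generated.

```python
import numpy as np

# Illustrative structured dtype and its hand-written flat counterpart.
structured = np.dtype([
    ("time", np.float32),
    ("position", np.float32, (3,)),
    ("flags", np.int32, (2,)),
])
flat = np.dtype("f4, f4, f4, f4, i4, i4")

# Two made-up lines standing in for the text file.
frame = ["0.5 1.0 2.0 3.0 4 5", "1.5 6.0 7.0 8.0 9 10"]

parsed = np.zeros(len(frame), dtype=flat)
for ii, line in enumerate(frame):
    # Every column is parsed as float; numpy casts per field on assignment.
    parsed[ii] = tuple(float(x) for x in line.split())

# Reinterpret the same memory with the structured dtype.
view = parsed.view(structured)
print(view["position"])
print(view["flags"])
print(np.shares_memory(parsed, view))
```

np.shares_memory reporting True is a quick way to confirm that the view did not copy the data.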

Sounds simple, but it took me some time to get it right.