NumPy has a very nice feature: structured arrays, that is, arrays whose rows have a structure and can store a different type of data in each column.
For example:
>>> import numpy as np
>>> arr = np.zeros(10, dtype=[('id', np.uint16), ('position', np.dtype('3float32')), ('momentum', np.dtype('3float32'))])
We have defined a structured array in which each row stores the id of a particle (an unsigned integer), its position (three floats) and its momentum (again three floats).
You can easily select from this array:
>>> arr['position'] # positions of all particles
>>> arr[0]['position'] # position of the first particle
>>> arr[arr['id'] == 1]['position'] # positions of all particles with id equal to 1
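To try these selections on real values, here is a small sketch. The ids and positions are made-up sample data, and the dtype is written with (name, type, shape) field tuples, which describe the same layout as the '3float32' spelling.

```python
import numpy as np

# Same layout as above, written with (name, type, shape) field tuples
dtype = [("id", np.uint16), ("position", np.float32, (3,)), ("momentum", np.float32, (3,))]
arr = np.zeros(4, dtype=dtype)

# Made-up sample data, for illustration only
arr["id"] = [1, 2, 1, 3]
arr["position"][:, 0] = [10.0, 20.0, 30.0, 40.0]  # fill the x coordinate only

all_positions = arr["position"]              # shape (4, 3): one row per particle
first_position = arr[0]["position"]          # shape (3,)
selected = arr[arr["id"] == 1]["position"]   # boolean mask built from the id column
```

Note that the mask uses `==`; a single `=` inside the brackets is a syntax error.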
This is a nice format because:
- Your data has structure. No more off-by-one errors: particle position is labeled.
- Very easy to load from binary files
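To illustrate the binary case, here is a minimal sketch that writes a few records to a temporary file and reads them back. The record layout and the file name are made up for this example, and it assumes the file holds raw records with no header.

```python
import os
import tempfile

import numpy as np

# Hypothetical record layout, chosen only for this example
record = np.dtype([("id", np.uint16), ("position", np.float32, (3,))])

original = np.zeros(3, dtype=record)
original["id"] = [7, 8, 9]
original["position"] = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "particles.bin")
    original.tofile(path)                      # raw records, no header
    loaded = np.fromfile(path, dtype=record)   # a single call restores the structure
```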
Loading from text files is an entirely different matter, because writing to such arrays is kind of a pain.
My requirements were:
- Array structure is the same as source file structure (order of fields is the same)
- Array structure is defined only in a single place: that is the dtype definition
Solution
The solution is to:
- Read the file line by line, parsing the contents into an "unstructured" array (one plain scalar field per column).
- Create a structured view of that array.
- Keep it fast: the view reuses the same memory, so no large array is copied.
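The steps above can be sketched on a tiny layout before moving to the real one. Both dtypes and the sample lines here are made up for illustration, and the flat dtype is written by hand.

```python
import numpy as np

# Hypothetical layouts for illustration: the same 16 bytes, described two ways
flat = np.dtype("f4, f4, f4, i4")   # one scalar field per column
structured = np.dtype([("position", np.float32, (3,)), ("kind", np.int32)])

lines = ["0.5 1.5 2.5 7", "3.0 4.0 5.0 9"]   # made-up text rows

parsed = np.zeros(len(lines), dtype=flat)
for i, line in enumerate(lines):
    # Everything is parsed as float; numpy casts each column to its field type
    parsed[i] = tuple(float(x) for x in line.split())

view = parsed.view(structured)   # reinterpret the same memory, no copy
```

The view only works because both dtypes have the same item size and the fields line up byte for byte; `np.shares_memory(parsed, view)` is a quick way to confirm that nothing was copied.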
Actual dtype used:
URQMD_DATA_DTYPE = [
    ("time", np.float32),
    ("position", np.dtype("3float32")),
    ("energy", np.float32),
    ("momentum", np.dtype("3float32")),
    ("mass", np.float32),
    ("particle_type", np.float32),
    ("additional", np.dtype("5int32")),
]
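A helper function that takes a structured dtype and turns it into a flat dtype with the same memory layout, in which every subarray field is expanded into separate scalar fields, one per element:
def serialize_dtype(dt):
    dt = np.dtype(dt)
    newdt = []
    for item in dt.descr:
        if len(item) == 2:
            # Scalar field: descr gives (name, type)
            name, type_str = item
            count = (1,)
        else:
            # Subarray field: descr gives (name, type, shape)
            name, type_str, count = item
        if len(count) > 1:
            raise ValueError("only one-dimensional subarray fields are supported")
        count = count[0]
        for ii in range(count):
            newdt.append(type_str)
    return np.dtype(", ".join(newdt))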
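To see why the helper branches on the length of each item, it helps to look at what `descr` returns. Here is a small sketch with a made-up two-field dtype:

```python
import numpy as np

# Hypothetical dtype: one scalar field and one subarray field
dt = np.dtype([("time", np.float32), ("position", np.float32, (3,))])

for item in dt.descr:
    # Scalar fields give (name, type); subarray fields give (name, type, shape)
    print(len(item), item)

scalar_item, subarray_item = dt.descr
```

The shape is always a tuple, which is why the helper wraps the scalar case in a one-element tuple before indexing it.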
Now, assume frame is a list of lines from the text file.
parsed = np.zeros(len(frame), dtype=serialize_dtype(URQMD_DATA_DTYPE)) # Create array without structure
for ii, line in enumerate(frame):
    data = [float(x) for x in line.split()] # Parse the line,
    #-- ignoring whether a value is a float or an int
    parsed[ii] = tuple(data) # Now numpy will convert a single row to the proper types
parsed = parsed.view(URQMD_DATA_DTYPE) # Create a structured view (no copy!)
Sounds simple, but it took me some time to get it right.