Python and HDF5

It is a Saturday, so I get to choose what kind of books I read in my "leisure time". As a continuation of my quest to use Python (well) for my research, I have been investigating the possibility of using HDF5 to store some simulation results. HDF5 is a flexible and fast file format that theoretical astrophysicists prefer nowadays (I also have to mention that the file format was first designed at my beloved college, UIUC!). I like it especially because of:

  • fast read time - the library is written in C, with a nice Python wrapper library, h5py
  • flexible - you can read just a chunk of the data, not the entire dataset
  • allows metadata to be stored - this is important because I don't want to have to remember exactly which parameters I used to create a dataset 3 months after creating it; I want the data to be self-documenting
  • data structure organized like a directory tree - beyond the scope of discussion here since I haven't gotten to that part of the book yet

All in all the book was alright, but it really should include a reference list of the most commonly used commands. Here they go (for my later reference):

>>> import h5py
>>> import numpy as np
>>> f = h5py.File("filepath", "w") # let's just open a file to write
# create two datasets with names "array1" and "array2"
>>> f["array1"] = np.ones((100, 1000)) # initialize a 2D array
>>> f["array2"] = np.zeros((1000, 10000)) # a bigger 2D array (about 80 MB)
# easy writing of metadata as attributes
>>> f["array1"].attrs["info"] = "big array"
>>> f["array2"].attrs["info"] = "bigger array"
>>> f.close()
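
The same write step can also be done with a "with" block, so the file gets closed even if something fails halfway. This is only a minimal sketch: the temporary file path and the attribute names "n_particles" and "timestep" are made up for illustration, and create_dataset is used as the explicit form of the f["array1"] = ... assignment above.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example.h5")

# The "with" block closes the file even if an exception is raised.
with h5py.File(path, "w") as f:
    # create_dataset is the explicit form of f["array1"] = ...
    dset = f.create_dataset("array1", data=np.ones((100, 1000)))
    dset.attrs["info"] = "big array"
    # Store simulation parameters next to the data (made-up example values).
    dset.attrs["n_particles"] = 1000
    dset.attrs["timestep"] = 0.01

# Reopen read-only and pull the metadata back out.
with h5py.File(path, "r") as f:
    shape = f["array1"].shape
    info = f["array1"].attrs["info"]
    n_particles = int(f["array1"].attrs["n_particles"])

print(shape, info, n_particles)
```

Keeping the parameters as attributes on the dataset itself is what makes the file self-documenting three months later.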

The part about reading and examining the data is a bit lacking in the book. Just tell me the ONE command that I need!:

>>> f = h5py.File("filepath", "r") # read only
# f.keys() lists the dataset names, e.g. ["array1", "array2"],
# so you don't have to remember which datasets are actually in the file
# this is the one command to rule them all
>>> list(f.keys())
>>> list(f["array1"].attrs.keys()) # the keys for looking up the attributes
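
For my own reference, here is a small self-contained sketch of that inspection step. It writes a throwaway file first (the temporary path is hypothetical), then collects the dataset names, the attributes, and the shape and dtype into plain Python objects without reading the array data itself.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical throwaway file so the example runs on its own.
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as f:
    f["array1"] = np.ones((100, 1000))
    f["array2"] = np.zeros((10, 10))
    f["array1"].attrs["info"] = "big array"

with h5py.File(path, "r") as f:
    names = list(f.keys())           # names of everything at the top level
    attrs = dict(f["array1"].attrs)  # copy the attributes into a plain dict
    shape = f["array1"].shape        # available without loading the data
    dtype = f["array1"].dtype

print(names, attrs, shape, dtype)
```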

h5py has a weird official syntax for reading data back into Python:

>>> arr = f["array1"][...]

However, the alternative syntax below works on both an HDF5 dataset and a NumPy array pulled out of a dictionary, so I will stick with it instead:

>>> arr = f["array1"][:]

which extends naturally to slicing an array:

>>> arr = f["array1"][:10]  # read the first 10 rows
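
A slightly longer sketch of partial reads, assuming a throwaway file at a hypothetical temporary path. Each slice below returns a NumPy array that holds only the requested part of the dataset, which is the "read a chunk, not everything" feature from the list at the top.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical throwaway file; the values are 0, 1, 2, ... so slices are easy to check.
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as f:
    f["array1"] = np.arange(100 * 1000).reshape(100, 1000)

with h5py.File(path, "r") as f:
    dset = f["array1"]             # just a handle, nothing is read yet
    first_rows = dset[:10]         # rows 0-9, all columns
    block = dset[20:30, ::100]     # rows 20-29, every 100th column
    first_col = dset[:, 0]         # one column across all rows

print(first_rows.shape, block.shape, first_col.shape)
```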

Moving on to the next topic: if you are on Linux or a Mac and just want to check the file structure quickly, at the command line you can do:

$ h5ls -vlr file.h5

It spits out a description of each dataset, along with the keys for accessing it.
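
If h5ls is not installed, a rough equivalent can be sketched in Python with h5py's visititems, which walks every group and dataset in the file. This is my own approximation, not a reproduction of the h5ls output format; the helper name summarize, the group name "run1", and the temporary file path are all made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical throwaway file with one top-level dataset and one group.
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as f:
    f["array1"] = np.ones((100, 1000))
    f["array1"].attrs["info"] = "big array"
    # A group behaves like a sub-directory; "run1" is a made-up name.
    run = f.create_group("run1")
    run["density"] = np.zeros((10, 10))


def summarize(path):
    """Return one line per group or dataset in the file, plus its attributes."""
    lines = []

    def visit(name, obj):
        if isinstance(obj, h5py.Dataset):
            lines.append(f"{name}: dataset, shape={obj.shape}, dtype={obj.dtype}")
        else:
            lines.append(f"{name}: group")
        for key, value in obj.attrs.items():
            lines.append(f"    @{key} = {value!r}")
        # Returning None tells visititems to keep walking.

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return lines


summary = summarize(path)
print("\n".join(summary))
```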

I am not trying to do anything fancy with HDF5 files yet, but I think it would be good to use HDF5 in the long run.

