To HDF5 and beyond
This post contains some notes about three Python libraries for working with numerical data too large to fit into main memory: h5py, Bcolz and Zarr.
2016-05-18: Updated to use the new 1.0.0 release of Zarr.
HDF5 (h5py)
When I first discovered the HDF5 file format a few years ago it was pretty transformative. I’d been struggling to find efficient ways of exploring and analysing data coming from our research on genetic variation in the mosquitoes that carry malaria. The data are not enormous - typically integer arrays with around 20 billion elements - but they are too large to work with in memory on a typical laptop or desktop machine. The file formats traditionally used for these data are very slow to parse, so I went looking for an alternative.
HDF5 files provide a great solution for storing multi-dimensional arrays of numerical data. The arrays are divided up into chunks and each chunk is compressed, enabling data to be stored efficiently in memory or on disk. Depending on how the chunks are configured, a good compromise can usually be found so that data can be read very quickly under different access patterns, e.g., taking horizontal or vertical slices of a matrix. Also, the h5py Python library provides a very convenient API for working with HDF5 files. I found there was a whole range of analyses I could happily get done on my laptop on the train home from work.
Bcolz
A bit later on I discovered the Bcolz library. Bcolz is primarily intended for storing and querying large tables of data, but it does provide a carray class that is roughly analogous to an HDF5 dataset in that it can store numerical arrays in a chunked, compressed form, either in memory or on disk.
Reading and writing data to a bcolz.carray is typically a lot faster than HDF5. For example:
Set up a simple array of integer data to store.
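A minimal sketch of the setup, assuming NumPy:

```python
import numpy as np

# 100 million int32 values, ~400 MB uncompressed
a = np.arange(100000000, dtype='i4')
a
```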
array([ 0, 1, 2, ..., 99999997, 99999998, 99999999], dtype=int32)
Time how long it takes to store in an HDF5 dataset.
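A sketch of the HDF5 timing; the file name is hypothetical and the compression settings (gzip level 1 with chunking enabled) are assumptions, not necessarily what was used originally:

```python
import h5py

h5f = h5py.File('example.h5', mode='w')  # hypothetical file name

def save_hdf5(data):
    # recreate a chunked, gzip-compressed dataset and write the data into it
    if 'arange' in h5f:
        del h5f['arange']
    h5f.create_dataset('arange', data=data, chunks=True,
                       compression='gzip', compression_opts=1)

%timeit save_hdf5(a)
```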
1 loop, best of 3: 1.39 s per loop
Time how long it takes to store in a bcolz.carray.
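And the corresponding bcolz timing; the LZ4 compressor is used as discussed below, with the other compression parameters left at their defaults (an assumption):

```python
import bcolz

cparams = bcolz.cparams(cname='lz4')  # LZ4; level and shuffle left at defaults (assumed)

%timeit bcolz.carray(a, cparams=cparams)
```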
10 loops, best of 3: 81.5 ms per loop
In the example above, Bcolz is more than 10 times faster at storing (compressing) the data than HDF5. As I understand it, this performance gain comes from several factors. Bcolz uses a C library called Blosc to perform compression and decompression operations. Blosc can use multiple threads internally, so some of the work is done in parallel. Blosc also splits data up in a way that is designed to work well with the CPU cache architecture. Finally, Blosc is a meta-compressor and several different compression libraries can be used internally - above I used the LZ4 compressor, which does not achieve quite the same compression ratios as gzip (zlib) but is much faster with numerical data.
Zarr
Speed really makes a difference when working interactively with data, so I started using the bcolz.carray class where possible in my analyses, especially for storing intermediate data. However, it does have some limitations. A bcolz.carray can be multidimensional, but because Bcolz is not really designed for multi-dimensional data, a bcolz.carray can only be chunked along the first dimension. This means taking slices of the first dimension is efficient, but slicing any other dimension will be very inefficient, because the entire array will need to be read and decompressed to access even a single column of a matrix.
To explore better ways of working with large multi-dimensional data, I recently created a new library called Zarr. Zarr, like Bcolz, uses Blosc internally to handle all compression and decompression operations. However, Zarr supports chunking of arrays along multiple dimensions, enabling good performance for multiple data access patterns. For example:
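The examples below were run against Zarr 1.0.0:

```python
import zarr

zarr.__version__
```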
'1.0.0'
Set up a 2-dimensional array of integer data.
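A sketch of the setup, reusing the same 100 million integers reshaped into a square matrix (consistent with the output shown further down):

```python
a = np.arange(100000000, dtype='i4').reshape(10000, 10000)
```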
Store the data in a carray.
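Something like the following, with bcolz's default compression parameters (which match the cparams shown in the output):

```python
cb = bcolz.carray(a)
cb
```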
carray((10000, 10000), int32)
nbytes: 381.47 MB; cbytes: 10.63 MB; ratio: 35.87
cparams := cparams(clevel=5, shuffle=1, cname='blosclz')
[[       0        1        2 ...,     9997     9998     9999]
 [   10000    10001    10002 ...,    19997    19998    19999]
 [   20000    20001    20002 ...,    29997    29998    29999]
 ...,
 [99970000 99970001 99970002 ..., 99979997 99979998 99979999]
 [99980000 99980001 99980002 ..., 99989997 99989998 99989999]
 [99990000 99990001 99990002 ..., 99999997 99999998 99999999]]
Store the data in a zarr array.
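And the Zarr equivalent; the chunk shape is taken from the repr below, with the default Blosc compression assumed:

```python
za = zarr.array(a, chunks=(1000, 1000))
za
```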
zarr.core.Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
compression: blosc; compression_opts: {'cname': 'blosclz', 'clevel': 5, 'shuffle': 1}
nbytes: 381.5M; nbytes_stored: 10.0M; ratio: 38.0; initialized: 100/100
store: builtins.dict
Time how long it takes to access a slice along the first dimension.
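A sketch of the timing cells; the exact slices taken aren't shown in the original, so a 1000-row slice is used here for illustration:

```python
%timeit cb[:1000]   # bcolz.carray, chunked along the first dimension only
%timeit za[:1000]   # Zarr, chunked along both dimensions
```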
100 loops, best of 3: 11.5 ms per loop
10 loops, best of 3: 19 ms per loop
Time how long it takes to access a slice along the second dimension.
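Again a sketch, slicing 1000 columns for illustration:

```python
%timeit cb[:, :1000]   # bcolz has to decompress the entire array
%timeit za[:, :1000]   # Zarr only reads the relevant column of chunks
```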
1 loop, best of 3: 148 ms per loop
100 loops, best of 3: 12.9 ms per loop
By using Zarr and chunking along both dimensions of the array, we have forfeited a small amount of speed when slicing the first dimension to gain a lot of speed when accessing the second dimension.
Like h5py and Bcolz, Zarr can store data either in memory or on disk. Zarr has some other notable features too. For example, multi-dimensional arrays can be resized along any dimension, allowing an array to be grown by appending new data in a flexible way. Also, Zarr arrays can be used in parallel computations, supporting concurrent reads and writes in either a multi-threaded or multi-process context.
Zarr is still in an experimental phase, but if you do try it out, any feedback is very welcome.
Further reading
Post-script: Performance with real genotype data
Here are benchmarks with some real data.
Locate a genotype dataset within an HDF5 file from the Ag1000G project.
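A sketch of opening the dataset; the file name and the path within the file are hypothetical:

```python
import h5py

callset = h5py.File('ag1000g.h5', mode='r')     # hypothetical file name
genotype = callset['3L/calldata/genotype']      # hypothetical path within the file
genotype
```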
<HDF5 dataset "genotype": shape (13167162, 765, 2), type "|i1">
Extract the first million rows of the dataset to use for benchmarking.
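For example (the variable name gt is reused in the sketches that follow):

```python
gt = genotype[:1000000]
```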
Benchmark compression performance.
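First, HDF5 with gzip. This is a sketch: the file name is hypothetical, the chunk shape and compression settings are taken from the output below, and the cname/nbytes summary lines presumably come from a small reporting helper that isn't reproduced here.

```python
import h5py

h5f_gzip = h5py.File('genotype_gzip.h5', mode='w')  # hypothetical file name
opts = dict(chunks=(10000, 100, 2), compression='gzip', compression_opts=1, shuffle=False)
%time h5d_gzip = h5f_gzip.create_dataset('genotype', data=gt, **opts)
h5d_gzip
```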
CPU times: user 6.58 s, sys: 72 ms, total: 6.66 s
Wall time: 6.62 s
<HDF5 dataset "genotype": shape (1000000, 765, 2), type "|i1">
cname=gzip, clevel=1, shuffle=False
nbytes=1.4G, cbytes=51.1M, ratio=28.5, chunks=(10000, 100, 2)
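The same again with the LZF compressor (file name again hypothetical):

```python
h5f_lzf = h5py.File('genotype_lzf.h5', mode='w')  # hypothetical file name
opts = dict(chunks=(10000, 100, 2), compression='lzf', shuffle=False)
%time h5d_lzf = h5f_lzf.create_dataset('genotype', data=gt, **opts)
h5d_lzf
```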
CPU times: user 2.01 s, sys: 100 ms, total: 2.11 s
Wall time: 2.1 s
<HDF5 dataset "genotype": shape (1000000, 765, 2), type "|i1">
cname=lzf, clevel=None, shuffle=False
nbytes=1.4G, cbytes=71.7M, ratio=20.3, chunks=(10000, 100, 2)
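Then bcolz, with the cparams shown in the output below (LZ4 plus Blosc's bitshuffle filter):

```python
import bcolz

%time gtc = bcolz.carray(gt, cparams=bcolz.cparams(cname='lz4', clevel=1, shuffle=2))
gtc
```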
CPU times: user 2.29 s, sys: 0 ns, total: 2.29 s
Wall time: 669 ms
carray((1000000, 765, 2), int8)
nbytes: 1.42 GB; cbytes: 48.70 MB; ratio: 29.96
cparams := cparams(clevel=1, shuffle=2, cname='lz4')
[[[0 0]
  [0 0]
  [0 0]
  ...,
  [0 0]
  [0 0]
  [0 0]]

 [[0 0]
  [0 0]
  [0 0]
  ...,
  [0 0]
  [0 0]
  [0 0]]

 [[0 0]
  [0 0]
  [0 0]
  ...,
  [0 0]
  [0 0]
  [0 0]]

 ...,
 [[0 0]
  [0 0]
  [0 0]
  ...,
  [0 0]
  [0 0]
  [0 0]]

 [[0 0]
  [0 0]
  [0 0]
  ...,
  [0 0]
  [0 0]
  [0 0]]

 [[0 0]
  [0 0]
  [0 0]
  ...,
  [0 0]
  [0 0]
  [0 0]]]
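And Zarr, with the same chunk shape as the HDF5 datasets and the compression options shown in the repr below (Zarr 1.x-style keyword arguments):

```python
import zarr

compression_opts = dict(cname='lz4', clevel=1, shuffle=2)
%time gtz = zarr.array(gt, chunks=(10000, 100, 2), compression='blosc', compression_opts=compression_opts)
gtz
```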
CPU times: user 2.85 s, sys: 56 ms, total: 2.9 s
Wall time: 1.26 s
zarr.core.Array((1000000, 765, 2), int8, chunks=(10000, 100, 2), order=C)
compression: blosc; compression_opts: {'cname': 'lz4', 'clevel': 1, 'shuffle': 2}
nbytes: 1.4G; nbytes_stored: 50.2M; ratio: 29.1; initialized: 800/800
store: builtins.dict
Note that although I’ve used the LZ4 compression library with Bcolz and Zarr, the compression ratio is actually better than when using gzip (zlib) with HDF5. This is due to the bitshuffle filter, which comes bundled with Blosc. I believe the bitshuffle filter can also be used with HDF5, with some extra configuration.
Compression with Zarr is slightly slower than Bcolz above, but this is entirely due to the choice of chunk shape and the correlation structure in the data. If we use the same chunking for both, compression speed is similar…
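A sketch of that re-run; chunking along the first dimension only mimics bcolz, though the exact chunk length used originally isn't shown:

```python
# chunk along the first dimension only, mimicking bcolz (chunk length is a guess)
%time gtz2 = zarr.array(gt, chunks=(10000, 765, 2), compression='blosc', compression_opts=compression_opts)
```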
CPU times: user 2.21 s, sys: 56 ms, total: 2.26 s
Wall time: 675 ms
Benchmark data access via slices along the first and second dimensions.
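A sketch of the timing cells, using the object names from the sketches above; the eight timings below presumably cover the four containers, first slicing along the first dimension and then along the second (the exact slice sizes are assumptions):

```python
# slices along the first dimension
%timeit h5d_gzip[:10000]
%timeit h5d_lzf[:10000]
%timeit gtc[:10000]
%timeit gtz[:10000]

# slices along the second dimension
%timeit h5d_gzip[:, :100]
%timeit h5d_lzf[:, :100]
%timeit gtc[:, :100]
%timeit gtz[:, :100]
```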
10 loops, best of 3: 24.4 ms per loop
10 loops, best of 3: 27.8 ms per loop
100 loops, best of 3: 9.32 ms per loop
100 loops, best of 3: 13 ms per loop
1 loop, best of 3: 279 ms per loop
1 loop, best of 3: 290 ms per loop
1 loop, best of 3: 1.09 s per loop
10 loops, best of 3: 123 ms per loop
Setup
Vendor ID: GenuineIntel
Hardware Raw:
Brand: Intel(R) Core(TM) i7-3667U CPU @ 2.00GHz
Hz Advertised: 2.0000 GHz
Hz Actual: 3.1355 GHz
Hz Advertised Raw: (2000000000, 0)
Hz Actual Raw: (3135546000, 0)
Arch: X86_64
Bits: 64
Count: 4
Raw Arch String: x86_64
L2 Cache Size: 4096 KB
L2 Cache Line Size: 0
L2 Cache Associativity: 0
Stepping: 9
Model: 58
Family: 6
Processor Type: 0
Extended Model: 0
Extended Family: 0
Flags: acpi, aes, aperfmperf, apic, arat, arch_perfmon, avx, bts, clflush, cmov, constant_tsc, cx16, cx8, de, ds_cpl, dtes64, dtherm, dts, eagerfpu, epb, ept, erms, est, f16c, flexpriority, fpu, fsgsbase, fxsr, ht, ida, lahf_lm, lm, mca, mce, mmx, monitor, msr, mtrr, nonstop_tsc, nopl, nx, pae, pat, pbe, pcid, pclmulqdq, pdcm, pebs, pge, pln, pni, popcnt, pse, pse36, pts, rdrand, rdtscp, rep_good, sep, smep, smx, ss, sse, sse2, sse4_1, sse4_2, ssse3, syscall, tm, tm2, tpr_shadow, tsc, tsc_deadline_timer, vme, vmx, vnmi, vpid, x2apic, xsave, xsaveopt, xtopology, xtpr