To HDF5 and beyond
This post contains some notes about three Python libraries for working with numerical data too large to fit into main memory: h5py, Bcolz and Zarr.
2016-05-18: Updated to use the new 1.0.0 release of Zarr.
HDF5 (h5py)
When I first discovered the HDF5 file format a few years ago it was pretty transformative. I’d been struggling to find efficient ways of exploring and analysing data coming from our research on genetic variation in the mosquitoes that carry malaria. The data are not enormous - typically integer arrays with around 20 billion elements - but they are too large to work with in memory on a typical laptop or desktop machine. The file formats traditionally used for these data are very slow to parse, so I went looking for an alternative.
HDF5 files provide a great solution for storing multi-dimensional arrays of numerical data. The arrays are divided up into chunks and each chunk is compressed, enabling data to be stored efficiently in memory or on disk. Depending on how the chunks are configured, usually a good compromise can be achieved, meaning data can be read very quickly even with different access patterns, e.g., taking horizontal or vertical slices of a matrix. Also, the h5py Python library provides a very convenient API for working with HDF5 files. I found there was a whole range of analyses I could happily get done on my laptop on the train home from work.
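To illustrate the kind of usage I mean, here is a minimal sketch (the file name 'example.h5' is just a placeholder): a matrix stored in 2-dimensional chunks, so that both row-wise and column-wise slices only need to touch a subset of the chunks.
import numpy as np
import h5py
# store a 1000 x 1000 matrix in 100 x 100 chunks, gzip-compressed
with h5py.File('example.h5', mode='w') as f:
    d = f.create_dataset('matrix',
                         data=np.arange(1e6, dtype='i4').reshape(1000, 1000),
                         chunks=(100, 100),
                         compression='gzip', compression_opts=1, shuffle=True)
    row_slice = d[:100]     # decompresses one row of chunks
    col_slice = d[:, :100]  # decompresses one column of chunks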
Bcolz
A bit later on I discovered the Bcolz library. Bcolz is primarily intended for storing and querying large tables of data, but it does provide a carray class that is roughly analogous to an HDF5 dataset, in that it can store numerical arrays in a chunked, compressed form, either in memory or on disk.
Reading and writing data to a bcolz.carray is typically a lot faster than with HDF5. For example:
import numpy as np
import h5py
import bcolz
import tempfile

def h5fmem(**kwargs):
    """Convenience function to create an in-memory HDF5 file."""
    # need a file name even though nothing is ever written to disk
    fn = tempfile.mktemp()
    # file creation args
    kwargs['mode'] = 'w'
    kwargs['driver'] = 'core'
    kwargs['backing_store'] = False
    # open HDF5 file
    h5f = h5py.File(fn, **kwargs)
    return h5f
Set up a simple array of integer data to store.
a1 = np.arange(1e8, dtype='i4')
a1
array([ 0, 1, 2, ..., 99999997, 99999998, 99999999], dtype=int32)
Time how long it takes to store in an HDF5 dataset.
%timeit h5fmem().create_dataset('arange', data=a1, chunks=(2**18,), compression='gzip', compression_opts=1, shuffle=True)
1 loop, best of 3: 1.39 s per loop
Time how long it takes to store in a bcolz.carray.
%timeit bcolz.carray(a1, chunklen=2**18, cparams=bcolz.cparams(cname='lz4', clevel=5, shuffle=1))
10 loops, best of 3: 81.5 ms per loop
In the example above, Bcolz is more than 10 times faster at storing (compressing) the data than HDF5. As I understand it, this performance gain comes from several factors. Bcolz uses a C library called Blosc to perform compression and decompression operations. Blosc can use multiple threads internally, so some of the work is done in parallel. Blosc also splits data up in a way that is designed to work well with the CPU cache architecture. Finally, Blosc is a meta-compressor and several different compression libraries can be used internally - above I used the LZ4 compressor, which does not achieve quite the same compression ratios as gzip (zlib) but is much faster with numerical data.
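As a rough illustration of how the choice of internal compressor affects things, here is a sketch comparing compression ratios via bcolz. This assumes the listed compressors are available in your Blosc build, and that bcolz.blosc_set_nthreads is available for controlling the Blosc thread count; exact numbers will vary with the data and hardware.
# compare compression ratios for different Blosc-internal compressors
bcolz.blosc_set_nthreads(4)  # assumed helper for setting the Blosc thread count
for cname in ['blosclz', 'lz4', 'zlib']:
    c = bcolz.carray(a1, chunklen=2**18,
                     cparams=bcolz.cparams(cname=cname, clevel=5, shuffle=1))
    print(cname, 'ratio: %.1f' % (c.nbytes / c.cbytes))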
Zarr
Speed really makes a difference when working interactively with data, so I started using the bcolz.carray class where possible in my analyses, especially for storing intermediate data. However, it does have some limitations. A bcolz.carray can be multidimensional, but because Bcolz is not really designed for multi-dimensional data, a bcolz.carray can only be chunked along the first dimension. This means taking slices of the first dimension is efficient, but slicing any other dimension will be very inefficient, because the entire array will need to be read and decompressed to access even a single column of a matrix.
To explore better ways of working with large multi-dimensional data, I recently created a new library called Zarr. Zarr, like Bcolz, uses Blosc internally to handle all compression and decompression operations. However, Zarr supports chunking of arrays along multiple dimensions, enabling good performance for a variety of data access patterns. For example:
import zarr
zarr.__version__
'1.0.0'
Set up a 2-dimensional array of integer data.
a2 = np.arange(1e8, dtype='i4').reshape(10000, 10000)
Store the data in a carray.
c2 = bcolz.carray(a2, chunklen=100)
c2
carray((10000, 10000), int32)
nbytes: 381.47 MB; cbytes: 10.63 MB; ratio: 35.87
cparams := cparams(clevel=5, shuffle=1, cname='blosclz')
[[ 0 1 2 ..., 9997 9998 9999]
[ 10000 10001 10002 ..., 19997 19998 19999]
[ 20000 20001 20002 ..., 29997 29998 29999]
...,
[99970000 99970001 99970002 ..., 99979997 99979998 99979999]
[99980000 99980001 99980002 ..., 99989997 99989998 99989999]
[99990000 99990001 99990002 ..., 99999997 99999998 99999999]]
Store the data in a zarr array.
z = zarr.array(a2, chunks=(1000, 1000))
z
zarr.core.Array((10000, 10000), int32, chunks=(1000, 1000), order=C)
compression: blosc; compression_opts: {'cname': 'blosclz', 'clevel': 5, 'shuffle': 1}
nbytes: 381.5M; nbytes_stored: 10.0M; ratio: 38.0; initialized: 100/100
store: builtins.dict
Time how long it takes to access a slice along the first dimension.
%timeit c2[:1000]
100 loops, best of 3: 11.5 ms per loop
%timeit z[:1000]
10 loops, best of 3: 19 ms per loop
Time how long it takes to access a slice along the second dimension.
%timeit c2[:, :1000]
1 loop, best of 3: 148 ms per loop
%timeit z[:, :1000]
100 loops, best of 3: 12.9 ms per loop
By using Zarr and chunking along both dimensions of the array, we have forfeited a small amount of speed when slicing the first dimension to gain a lot of speed when accessing the second dimension.
Like h5py and Bcolz, Zarr can store data either in memory or on disk. Zarr has some other notable features too. For example, multi-dimensional arrays can be resized along any dimension, allowing an array to be grown by appending new data in a flexible way. Also, Zarr arrays can be used in parallel computations, supporting concurrent reads and writes in either a multi-threaded or multi-process context.
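For example, here is a rough sketch of growing the array created above by appending rows along the first dimension (assuming the append() method accepts data and an axis argument, as in current Zarr releases).
# append 100 more rows to the 10000 x 10000 array created above;
# new chunks are initialized as needed
extra = np.arange(1e6, dtype='i4').reshape(100, 10000)
z.append(extra, axis=0)
z.shape  # now (10100, 10000)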
Zarr is still in an experimental phase, but if you do try it out, any feedback is very welcome.
Post-script: Performance with real genotype data
Here are benchmarks with some real data.
import operator
from functools import reduce
def human_readable_size(size):
    if size < 2**10:
        return "%s" % size
    elif size < 2**20:
        return "%.1fK" % (size / float(2**10))
    elif size < 2**30:
        return "%.1fM" % (size / float(2**20))
    elif size < 2**40:
        return "%.1fG" % (size / float(2**30))
    else:
        return "%.1fT" % (size / float(2**40))
def h5d_diagnostics(d):
    """Print some diagnostics on an HDF5 dataset."""
    print(d)
    nbytes = reduce(operator.mul, d.shape) * d.dtype.itemsize
    cbytes = d._id.get_storage_size()
    if cbytes > 0:
        ratio = nbytes / cbytes
    else:
        ratio = np.inf
    r = ' cname=%s' % d.compression
    r += ', clevel=%s' % d.compression_opts
    r += ', shuffle=%s' % d.shuffle
    r += '\n nbytes=%s' % human_readable_size(nbytes)
    r += ', cbytes=%s' % human_readable_size(cbytes)
    r += ', ratio=%.1f' % ratio
    r += ', chunks=%s' % str(d.chunks)
    print(r)
Locate a genotype dataset within an HDF5 file from the Ag1000G project.
callset = h5py.File('/data/coluzzi/ag1000g/data/phase1/release/AR3/variation/main/hdf5/ag1000g.phase1.ar3.pass.h5',
mode='r')
genotype = callset['3R/calldata/genotype']
genotype
<HDF5 dataset "genotype": shape (13167162, 765, 2), type "|i1">
Extract the first million rows of the dataset to use for benchmarking.
a3 = genotype[:1000000]
Benchmark compression performance.
%%time
genotype_hdf5_gzip = h5fmem().create_dataset('genotype', data=a3,
compression='gzip', compression_opts=1, shuffle=False,
chunks=(10000, 100, 2))
CPU times: user 6.58 s, sys: 72 ms, total: 6.66 s
Wall time: 6.62 s
h5d_diagnostics(genotype_hdf5_gzip)
<HDF5 dataset "genotype": shape (1000000, 765, 2), type "|i1">
cname=gzip, clevel=1, shuffle=False
nbytes=1.4G, cbytes=51.1M, ratio=28.5, chunks=(10000, 100, 2)
%%time
genotype_hdf5_lzf = h5fmem().create_dataset('genotype', data=a3,
compression='lzf', shuffle=False,
chunks=(10000, 100, 2))
CPU times: user 2.01 s, sys: 100 ms, total: 2.11 s
Wall time: 2.1 s
h5d_diagnostics(genotype_hdf5_lzf)
<HDF5 dataset "genotype": shape (1000000, 765, 2), type "|i1">
cname=lzf, clevel=None, shuffle=False
nbytes=1.4G, cbytes=71.7M, ratio=20.3, chunks=(10000, 100, 2)
%%time
genotype_carray = bcolz.carray(a3, cparams=bcolz.cparams(cname='lz4', clevel=1, shuffle=2))
CPU times: user 2.29 s, sys: 0 ns, total: 2.29 s
Wall time: 669 ms
genotype_carray
carray((1000000, 765, 2), int8)
nbytes: 1.42 GB; cbytes: 48.70 MB; ratio: 29.96
cparams := cparams(clevel=1, shuffle=2, cname='lz4')
[[[0 0]
[0 0]
[0 0]
...,
[0 0]
[0 0]
[0 0]]
[[0 0]
[0 0]
[0 0]
...,
[0 0]
[0 0]
[0 0]]
[[0 0]
[0 0]
[0 0]
...,
[0 0]
[0 0]
[0 0]]
...,
[[0 0]
[0 0]
[0 0]
...,
[0 0]
[0 0]
[0 0]]
[[0 0]
[0 0]
[0 0]
...,
[0 0]
[0 0]
[0 0]]
[[0 0]
[0 0]
[0 0]
...,
[0 0]
[0 0]
[0 0]]]
%%time
genotype_zarr = zarr.array(a3, chunks=(10000, 100, 2), compression='blosc',
compression_opts=dict(cname='lz4', clevel=1, shuffle=2))
CPU times: user 2.85 s, sys: 56 ms, total: 2.9 s
Wall time: 1.26 s
genotype_zarr
zarr.core.Array((1000000, 765, 2), int8, chunks=(10000, 100, 2), order=C)
compression: blosc; compression_opts: {'cname': 'lz4', 'clevel': 1, 'shuffle': 2}
nbytes: 1.4G; nbytes_stored: 50.2M; ratio: 29.1; initialized: 800/800
store: builtins.dict
Note that although I’ve used the LZ4 compression library with Bcolz and Zarr, the compression ratio is actually better than when using gzip (zlib) with HDF5. This is due to the bitshuffle filter, which comes bundled with Blosc. I believe the bitshuffle filter can also be used with HDF5, although it requires some extra configuration.
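To give a feel for the effect of the shuffle setting, here is a sketch comparing no shuffle, byte-shuffle and bit-shuffle on the same genotype data via bcolz (the exact ratios will depend on the data).
# shuffle: 0 = no shuffle, 1 = byte-shuffle, 2 = bit-shuffle
for shuffle in [0, 1, 2]:
    c = bcolz.carray(a3, cparams=bcolz.cparams(cname='lz4', clevel=1, shuffle=shuffle))
    print('shuffle=%s, ratio: %.1f' % (shuffle, c.nbytes / c.cbytes))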
Compression with Zarr is slightly slower than Bcolz above, but this is entirely due to the choice of chunk shape and the correlation structure in the data. If we use the same chunking for both, compression speed is similar…
%%time
_ = zarr.array(a3, chunks=(genotype_carray.chunklen, 765, 2), compression='blosc',
compression_opts=dict(cname='lz4', clevel=1, shuffle=2))
CPU times: user 2.21 s, sys: 56 ms, total: 2.26 s
Wall time: 675 ms
Benchmark data access via slices along first and second dimensions.
%timeit genotype_hdf5_gzip[:10000]
10 loops, best of 3: 24.4 ms per loop
%timeit genotype_hdf5_lzf[:10000]
10 loops, best of 3: 27.8 ms per loop
%timeit genotype_carray[:10000]
100 loops, best of 3: 9.32 ms per loop
%timeit genotype_zarr[:10000]
100 loops, best of 3: 13 ms per loop
%timeit genotype_hdf5_gzip[:, :10]
1 loop, best of 3: 279 ms per loop
%timeit genotype_hdf5_lzf[:, :10]
1 loop, best of 3: 290 ms per loop
%timeit genotype_carray[:, :10]
1 loop, best of 3: 1.09 s per loop
%timeit genotype_zarr[:, :10]
10 loops, best of 3: 123 ms per loop
Setup
# run on my laptop, which doesn't have AVX2
import cpuinfo
cpuinfo.main()
Vendor ID: GenuineIntel
Hardware Raw:
Brand: Intel(R) Core(TM) i7-3667U CPU @ 2.00GHz
Hz Advertised: 2.0000 GHz
Hz Actual: 3.1355 GHz
Hz Advertised Raw: (2000000000, 0)
Hz Actual Raw: (3135546000, 0)
Arch: X86_64
Bits: 64
Count: 4
Raw Arch String: x86_64
L2 Cache Size: 4096 KB
L2 Cache Line Size: 0
L2 Cache Associativity: 0
Stepping: 9
Model: 58
Family: 6
Processor Type: 0
Extended Model: 0
Extended Family: 0
Flags: acpi, aes, aperfmperf, apic, arat, arch_perfmon, avx, bts, clflush, cmov, constant_tsc, cx16, cx8, de, ds_cpl, dtes64, dtherm, dts, eagerfpu, epb, ept, erms, est, f16c, flexpriority, fpu, fsgsbase, fxsr, ht, ida, lahf_lm, lm, mca, mce, mmx, monitor, msr, mtrr, nonstop_tsc, nopl, nx, pae, pat, pbe, pcid, pclmulqdq, pdcm, pebs, pge, pln, pni, popcnt, pse, pse36, pts, rdrand, rdtscp, rep_good, sep, smep, smx, ss, sse, sse2, sse4_1, sse4_2, ssse3, syscall, tm, tm2, tpr_shadow, tsc, tsc_deadline_timer, vme, vmx, vnmi, vpid, x2apic, xsave, xsaveopt, xtopology, xtpr