Understanding PBWT
Last year a colleague pointed me at Richard Durbin’s 2014 paper on the positional Burrows-Wheeler transform (PBWT). I read it with interest but didn’t really get it at the time. Recently there have been papers developing PBWT for variation graphs and chromosome painting. Given the applicability of PBWT to a range of problems, I wanted to try to wrap my head around it. This notebook is an attempt to understand the basic algorithms described in Durbin (2014). I wrote it for my own benefit, but thought I’d share it in case it’s useful to anyone, and/or in case anyone can help me fill in some of the gaps.
2016-06-23: Updated implementation of algorithm 4 to avoid reporting 0 length matches. Also added some discussion of the value of using PBWT to improve compressibility of haplotype data versus simply transposing the haplotypes so the data are organised one variant per row.
Let’s start with some data. The variable X below is a list of 8 haplotypes with 6 variable sites, all biallelic. This is just some dummy data I made up myself to test out the algorithms.
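For reference, here is a definition of X that is consistent with all of the outputs shown below (one row per haplotype, one column per variable site):

```python
X = [
    [0, 1, 0, 1, 0, 1],  # haplotype 0
    [1, 1, 0, 0, 0, 1],  # haplotype 1
    [1, 1, 1, 1, 1, 1],  # haplotype 2
    [0, 1, 1, 1, 1, 0],  # haplotype 3
    [0, 0, 0, 0, 0, 0],  # haplotype 4
    [1, 0, 0, 0, 1, 0],  # haplotype 5
    [1, 1, 0, 0, 0, 1],  # haplotype 6
    [0, 1, 0, 1, 1, 0],  # haplotype 7
]
```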
Algorithm 1. Build prefix array
The basic idea of PBWT is to sort the haplotypes in order of reversed prefixes. This sorting then enables matches between haplotypes to be found efficiently. Algorithm 1 in Durbin (2014) builds something called the “positional prefix array”, which is just the list of haplotype indices that would sort the haplotypes in reversed prefix order at some position k.
Durbin uses the notation a_k to denote the positional prefix array for position k; however, he also uses the variable a for one of the intermediates within algorithm 1, so to avoid any ambiguity here I will use ppa_k to denote the positional prefix array at position k.
The trick of algorithm 1 is to sweep through the data by position, building ppa_{k+1} from ppa_k. The algorithm simply sorts the haplotypes by allele at position k, making sure that the sort order from the previous position is retained for all haplotypes with the same allele value.
Here is a pure Python implementation of algorithm 1, purely for illustration purposes.
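Something along these lines (a minimal sketch; the function name is just illustrative):

```python
def build_prefix_array(X):
    # ppa_0 is simply the haplotype indices in their original order
    ppa = list(range(len(X)))
    # number of variable sites
    N = len(X[0])
    # sweep through the sites, building ppa_{k+1} from ppa_k; stop before the
    # last site, so the result sorts haplotypes by their prefixes up to the
    # final position
    for k in range(N - 1):
        a = []  # indices of haplotypes carrying allele 0 at site k
        b = []  # indices of haplotypes carrying allele 1 at site k
        for i in ppa:
            if X[i][k] == 0:
                a.append(i)
            else:
                b.append(i)
        # haplotypes with allele 0 sort before those with allele 1, retaining
        # the previous order within each group
        ppa = a + b
    return ppa
```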
Let’s run it on the dummy data X.
[4, 1, 6, 0, 5, 7, 3, 2]
The resulting list of integers is the list of haplotype indices that sorts the haplotypes up to the final position (k = N-1). To help visualise this, let’s write a display function.
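A minimal sketch, assuming the function above; calling display_prefix_array(X, build_prefix_array(X)) should produce the output below:

```python
def display_prefix_array(X, ppa):
    N = len(X[0])
    for i in ppa:
        haplotype = ''.join(str(allele) for allele in X[i])
        # show the haplotype index, the sorted prefix, then the next allele
        print('{}|{} {}'.format(i, haplotype[:N-1], haplotype[N-1]))
```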
4|00000 0
1|11000 1
6|11000 1
0|01010 1
5|10001 0
7|01011 0
3|01111 0
2|11111 1
The positional prefix array [4, 1, 6, …] is the column to the left of the '|', showing the haplotype indices. To the right of the '|' are the haplotypes up to position k-1, sorted in reversed prefix order. Then, separated by a space, is the next allele for each haplotype; this final column makes up what Durbin (2014) calls y_k[k].
Algorithm 2. Build prefix and divergence arrays
Sorting haplotypes in this order has a very useful property, which is that each haplotype will be adjacent to the haplotype with which it shares the longest match. If several haplotypes have equally long matches these will all be adjacent. Also, the length of the match between any pair of non-adjacent haplotypes is the minimum of the match lengths between adjacent haplotypes occurring between them in the sorted order.
Algorithm 2 shows how to find and keep track of the start position for matches between adjacent haplotypes. This can be done while sweeping through the data building the positional prefix arrays. The reason this is possible is because once a match begins between a pair of adjacent haplotypes, those two haplotypes will remain adjacent until the match breaks.
Here’s a pure Python implementation of algorithm 2, again purely for illustration. Durbin (2014) uses d_k to denote the “divergence array” at position k, where d_k[i] is the position where a match begins between the i-th haplotype in the sorted order and its predecessor. Because d is also used for one of the intermediates, for clarity I’ll use div_k to denote the divergence array.
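A sketch along the following lines (again, names are just illustrative), extending the algorithm 1 sweep to maintain the divergence values:

```python
def build_prefix_and_divergence_arrays(X):
    M = len(X)     # number of haplotypes
    N = len(X[0])  # number of variable sites
    ppa = list(range(M))
    div = [0] * M
    for k in range(N - 1):
        a, b = [], []  # haplotype indices, split by allele at site k
        d, e = [], []  # corresponding match start positions
        # no match can begin after position k, so start p and q at k + 1
        p = q = k + 1
        for index, match_start in zip(ppa, div):
            # p (resp. q) will become the divergence value for the next
            # haplotype appended to the allele 0 (resp. allele 1) group: the
            # maximum divergence value seen since the last one was appended
            if match_start > p:
                p = match_start
            if match_start > q:
                q = match_start
            if X[index][k] == 0:
                a.append(index)
                d.append(p)
                p = 0
            else:
                b.append(index)
                e.append(q)
                q = 0
        ppa = a + b
        div = d + e
    return ppa, div
```

Calling build_prefix_and_divergence_arrays(X) should give the prefix and divergence arrays shown below.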
Let’s try it out on the dummy haplotype data.
[4, 1, 6, 0, 5, 7, 3, 2]
[5, 2, 0, 4, 5, 4, 3, 1]
Again to help make sense of this let’s write a display function.
4|00000 0
1|11000 1
6|11000 1
0|01010 1
5|10001 0
7|01011 0
3|01111 0
2|11111 1
The divergence array has been used to show, underlined in bold, the maximal matches between each haplotype and its predecessor in the sorted order, as in Durbin (2014) Figure 1.
Let’s try it out on some randomly generated data, just to have another example to look at.
1|11001011100011111101111000000 0
9|11001101101111011001000010100 0
8|11110001100100100111000001100 1
4|10011111110101000110111011100 1
5|11111001101010001010000010110 1
2|00101000111100110011101010001 0
6|11010001100000001011101010001 0
7|10010110111010100000000011001 1
3|00010001111100011011010110101 1
0|11010111110110100011010011111 0
Algorithm 3. Report long matches
Algorithm 3 shows how to find all matches between haplotypes that are at least L long. The trick here is to notice that all haplotypes with matches at least L long will be adjacent in the sorted order, forming a “block”. So algorithm 3 iterates over the haplotypes in sorted order, collecting haplotypes with sufficiently long matches. When it finds a match shorter than L, this must mean a break between blocks, so it reports matches for all haplotypes in the previous block. The algorithm also collects haplotypes with different alleles at position k separately, to ensure that only matches that terminate at k are reported.
One minor note: I don’t think algorithm 3 as presented in Durbin (2014) accounts for the case where a block of matching haplotypes extends up to the final haplotype, so I added an extra couple of lines to account for this case.
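Here is a sketch of how this might look, folding the long-match reporting into the algorithm 2 sweep (names are illustrative). It yields tuples of (terminating position, haplotypes carrying allele 0, haplotypes carrying allele 1), and includes the extra check for a block that runs up to the final haplotype:

```python
def report_long_matches(X, L):
    M = len(X)
    N = len(X[0])
    ppa = list(range(M))
    div = [0] * M
    for k in range(N):
        # report matches of length at least L that terminate at position k
        ma, mb = [], []  # current block, split by allele at position k
        a, b, d, e = [], [], [], []
        p = q = k + 1
        for index, match_start in zip(ppa, div):
            if match_start > k - L:
                # match to the previous haplotype is too short: block break;
                # report any matches within the block just finished
                if ma and mb:
                    yield k, list(ma), list(mb)
                ma, mb = [], []
            if X[index][k] == 0:
                ma.append(index)
            else:
                mb.append(index)
            # standard algorithm 2 updates, building ppa and div for k + 1
            if match_start > p:
                p = match_start
            if match_start > q:
                q = match_start
            if X[index][k] == 0:
                a.append(index)
                d.append(p)
                p = 0
            else:
                b.append(index)
                e.append(q)
                q = 0
        # extra check: a block of matching haplotypes may extend up to the
        # final haplotype, in which case it still needs to be reported
        if ma and mb:
            yield k, list(ma), list(mb)
        ppa = a + b
        div = d + e
```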
Let’s try it out on the dummy data, finding matches 3 or longer.
(4, [4], [5])
(4, [0], [7])
(5, [4], [1, 6])
(5, [3], [2])
So 4 matches have been found. The first match terminates at position 4, and is between haplotypes 4 and 5. Again let’s write a display function to help visualise.
match ending at position 4 between haplotypes 4 and 5:
4|0000
5|0001
match ending at position 4 between haplotypes 0 and 7:
0|1010
7|1011
match ending at position 5 between haplotypes 4 and 1:
4|0000
1|0001
match ending at position 5 between haplotypes 4 and 6:
4|0000
6|0001
match ending at position 5 between haplotypes 3 and 2:
3|1110
2|1111
match ending at position 4 between haplotypes 0 and 7:
0|01010
7|01011
match ending at position 5 between haplotypes 3 and 2:
3|11110
2|11111
You may have noticed that the dummy data X contains two haplotypes (1 and 6) that are completely identical over all 6 positions, so why aren’t these found? That’s because the algorithm requires matches to terminate.
Algorithm 4. Report set maximal matches
Algorithm 4 shows how to find “set maximal matches” for each haplotype. These are the longest matches for each haplotype terminating at each position k. Again this makes use of the fact that haplotypes with maximal matches will be adjacent in the sorted order. At each position k, the algorithm iterates through the haplotypes in sorted order, looking for a maximal match for each haplotype. The first thing it does is to try to find any neighbours where the match can be extended, i.e., where the alleles at position k are the same. If so, it continues on to the next haplotype without reporting anything. Otherwise, it reports a match with the longest matching neighbour (taking into account the fact that several neighbours may have equally long matches).
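As a point of reference, here is a naive brute-force sketch that works directly from the definition of a set maximal match (this is not Durbin’s efficient sweep, and the names are just illustrative). It yields tuples of (haplotype, matching haplotype, match start, match end) and, like the sweep, only reports matches that terminate before the end of the data:

```python
def set_maximal_matches_naive(X):
    M = len(X)
    N = len(X[0])
    for i in range(M):
        # find all maximal match intervals [start, end) from haplotype i to
        # every other haplotype, i.e. runs of agreement that cannot be
        # extended in either direction
        intervals = []
        for j in range(M):
            if j == i:
                continue
            start = None
            for k in range(N + 1):
                if k < N and X[i][k] == X[j][k]:
                    if start is None:
                        start = k
                elif start is not None:
                    intervals.append((start, k, j))
                    start = None
        for start, end, j in intervals:
            # a match is set maximal if no longer match from haplotype i
            # contains it
            contained = any(s <= start and end <= e and e - s > end - start
                            for s, e, _ in intervals)
            # only report matches that terminate before the end of the data
            if not contained and end < N:
                yield i, j, start, end
```

Note that the sweep-based algorithm reports matches in order of the position at which they terminate, as in the output below, whereas this naive version groups them by haplotype; the set of matches found should be the same.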
(4, 0, 0, 1)
(4, 3, 0, 1)
(4, 7, 0, 1)
(5, 1, 0, 1)
(5, 2, 0, 1)
(5, 6, 0, 1)
(3, 0, 0, 2)
(3, 7, 0, 2)
(2, 1, 0, 2)
(2, 6, 0, 2)
(4, 5, 1, 4)
(5, 4, 1, 4)
(0, 7, 0, 4)
(7, 0, 0, 4)
(4, 1, 2, 5)
(4, 6, 2, 5)
(3, 2, 1, 5)
(2, 3, 1, 5)
set maximal matches for haplotype 0:
0|010101
7|010110
set maximal matches for haplotype 2:
2|111111
1|110001
6|110001
3|011110
set maximal matches for haplotype 3:
3|011110
0|010101
7|010110
2|111111
set maximal matches for haplotype 4:
4|000000
0|010101
3|011110
7|010110
5|100010
1|110001
6|110001
set maximal matches for haplotype 5:
5|100010
1|110001
2|111111
6|110001
4|000000
set maximal matches for haplotype 7:
7|010110
0|010101
Here the bold underline indicates the regions in the haplotype representing set maximal matches to the haplotype at the top.
Again you may notice that there are no matches reported for the two haplotypes that are identical across all 6 positions. Again this is because matches are required to terminate.
Let’s have a look with some random data for further illustration.
set maximal matches for haplotype 0:
0|100001010111100110111010100101
2|011000110001000111111011110011
1|011011110111001111100110000010
1|011011110111001111100110000010
2|011000110001000111111011110011
2|011000110001000111111011110011
1|011011110111001111100110000010
2|011000110001000111111011110011
1|011011110111001111100110000010
set maximal matches for haplotype 1:
1|011011110111001111100110000010
2|011000110001000111111011110011
0|100001010111100110111010100101
2|011000110001000111111011110011
0|100001010111100110111010100101
2|011000110001000111111011110011
2|011000110001000111111011110011
0|100001010111100110111010100101
0|100001010111100110111010100101
2|011000110001000111111011110011
set maximal matches for haplotype 2:
2|011000110001000111111011110011
1|011011110111001111100110000010
0|100001010111100110111010100101
1|011011110111001111100110000010
1|011011110111001111100110000010
0|100001010111100110111010100101
1|011011110111001111100110000010
0|100001010111100110111010100101
0|100001010111100110111010100101
1|011011110111001111100110000010
Algorithm 5. Set maximal matches from a new sequence z to X
I haven’t got my head around this yet. Any help with the basic intuition would be very welcome.
Compact representation of X
I think I partly get this, at least why PBWT should be more compressible than the original X. But I don’t get how you recover X from PBWT. I need to go and read up about the FM-index.
Let’s at least check the assertion that the PBWT is more compressible than X, using some real haplotype data from Ag1000G. First, I’m going to implement algorithm 2 in a slightly different way, using NumPy and Cython to speed things up, and also constructing the PBWT as output.
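Leaving aside the Cython speed-ups and the divergence arrays, the core of the transform can be sketched in a few lines of NumPy (the function name is just illustrative):

```python
import numpy as np

def pbwt_transform(H):
    # H is a (n_variants, n_haplotypes) array of 0/1 alleles, one variant per row
    n_variants, n_haplotypes = H.shape
    Y = np.empty_like(H)           # the PBWT
    ppa = np.arange(n_haplotypes)  # positional prefix array for the current variant
    for k in range(n_variants):
        row = H[k, ppa]            # alleles at variant k, in sorted haplotype order
        Y[k] = row
        # haplotypes carrying 0 sort before those carrying 1, retaining the
        # existing order within each group
        ppa = np.concatenate([ppa[row == 0], ppa[row == 1]])
    return Y
```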
Import some extra libraries.
'1.0.0'
Set up some real haplotype data.
<HDF5 file "ag1000g.phase1.ar3.haplotypes.3R.h5" (mode r)>
| | 0 | 1 | 2 | 3 | 4 | ... | 768 | 769 | 770 | 771 | 772 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| 1 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| 2 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| 3 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| 4 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| ... | | | | | | | | | | | |
For comparison with results from Durbin (2014) let’s take 1000 haplotypes and use the same number of segregating sites.
| | 0 | 1 | 2 | 3 | 4 | ... | 995 | 996 | 997 | 998 | 999 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
| ... | | | | | | | | | | | |
Before applying the PBWT transform, check how compressible these data are already, using a standard compression algorithm (LZ4).
zarr.core.Array((370264, 1000), int8, chunks=(679, 1000), order=C)
compression: blosc; compression_opts: {'clevel': 1, 'cname': 'lz4', 'shuffle': 2}
nbytes: 353.1M; nbytes_stored: 10.7M; ratio: 33.1; initialized: 546/546
store: builtins.dict
So as they are, these haplotype data can be compressed down to 10.7M.
Now let’s build the PBWT.
CPU times: user 3.69 s, sys: 248 ms, total: 3.94 s
Wall time: 3.9 s
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=int8)
zarr.core.Array((370264, 1000), int8, chunks=(679, 1000), order=C)
compression: blosc; compression_opts: {'clevel': 1, 'cname': 'lz4', 'shuffle': 2}
nbytes: 353.1M; nbytes_stored: 8.2M; ratio: 43.1; initialized: 546/546
store: builtins.dict
So in this case, PBWT is more compressible than the original data, but only marginally, requiring 8.2M rather than 10.7M. This seems at odds with what is reported in Durbin (2014), i.e., that PBWT is many times more compressible than raw haplotypes, so what’s going on?
I think that a lot of the benefit in terms of compressibility reported in Durbin (2014) is actually due to the fact that the haplotype data in the PBWT are transposed relative to the original .gz representation. I.e., if haplotypes are organised one variant per row rather than one haplotype per row, and the underlying bytes are in row-major order, then the data become more compressible as you add more haplotypes, because variants tend to be rare and hence most rows will be composed almost entirely of zeros. If you then permute each row via the PBWT you get a further benefit, because PBWT tends to bring the ones and zeros together; however, the added benefit is fairly marginal, and most of the gains come simply from the transposed layout.
Note that I am using LZ4 here and not the run-length encoding used in Durbin (2014); I don’t know how much difference that would make.
To make use of the PBWT you also need the divergence array, so how compressible is that?
array([[ 0, 0, 0, ..., 0, 0, 0],
[ 1, 0, 0, ..., 0, 0, 1],
[ 2, 0, 0, ..., 0, 1, 2],
...,
[370261, 369928, 369973, ..., 370259, 370260, 370261],
[370262, 369928, 369973, ..., 370260, 370261, 370262],
[370263, 369928, 369973, ..., 370261, 370262, 370263]], dtype=uint32)
zarr.core.Array((370264, 1000), uint32, chunks=(679, 1000), order=C)
compression: blosc; compression_opts: {'clevel': 1, 'cname': 'lz4', 'shuffle': 1}
nbytes: 1.4G; nbytes_stored: 249.8M; ratio: 5.7; initialized: 546/546
store: builtins.dict
The divergence array is not so compressible with this compression configuration. However, Durbin (2014) suggests using a delta filter. Let’s try this via the Python LZMA library.
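For reference, a delta filter can be configured via the standard library lzma module along the following lines (here div stands for a NumPy array of uint32 divergence values, a hypothetical name; in the notebook the data are actually stored in a zarr array with an LZMA compressor, as shown in the output below):

```python
import lzma

filters = [
    dict(id=lzma.FILTER_DELTA, dist=4),   # delta filter over 4-byte (uint32) values
    dict(id=lzma.FILTER_LZMA2, preset=1),
]
# div is assumed to be a NumPy array of uint32 divergence values
compressed = lzma.compress(div.tobytes(), format=lzma.FORMAT_XZ, filters=filters)
```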
CPU times: user 29.3 s, sys: 12 ms, total: 29.3 s
Wall time: 29.3 s
zarr.core.Array((370264, 1000), uint32, chunks=(679, 1000), order=C)
compression: lzma; compression_opts: {'check': 0, 'filters': [{'id': 3, 'dist': 4}, {'id': 33, 'preset': 1}], 'format': 1, 'preset': None}
nbytes: 1.4G; nbytes_stored: 16.0M; ratio: 88.5; initialized: 546/546
store: builtins.dict
Unfortunately LZMA is painfully slow; however, it does at least demonstrate that the divergence arrays are also highly compressible in principle, here taking 16.0M.
The raw .ipynb for this post is here.