In phase 1 of the Ag1000G project we have whole genome sequence data for mosquitoes from 9 African countries. As part of our analysis of population structure, I recently needed to calculate average FST between each pair of populations. I also needed to calculate FST in windows over the genome, to look for genome regions that are particularly differentiated between certain populations.

FST is a statistic which seems simple at first yet quickly becomes very technical when you start reading the literature. I asked around my lab for advice and George pointed me to Bhatia et al. (2013) which provides some clear advice on how to estimate FST. However, Bhatia et al. were working with relatively well-studied human populations, and mosquito population genetics can get pretty extreme by comparison, so I didn’t want to take anything for granted.

To help explore the impact of different FST estimators and SNP ascertainment schemes, I implemented both the Weir & Cockerham estimator and the Hudson estimator in scikit-allel. This post gives some examples of using these functions with large-scale SNP data, along with some practical experiences from applying them to mosquito populations.

Setup

import numpy as np
import h5py
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('white')
sns.set_style('ticks')
import bcolz
import pandas
import allel; print('scikit-allel', allel.__version__)
scikit-allel 1.0.3

I have a copy of the Ag1000G phase 1 AR3 data release on a local drive. The SNP genotype data is available in an HDF5 file.

callset_fn = 'data/2015-09-21/ag1000g.phase1.ar3.pass.h5'
callset = h5py.File(callset_fn, mode='r')
callset
<HDF5 file "ag1000g.phase1.ar3.pass.h5" (mode r)>

Let’s work with chromosome arm 3L.

chrom = '3L'
# load all variant positions
pos_all = allel.SortedIndex(callset[chrom]['variants']['POS'])
pos_all
<SortedIndex shape=(9643193,) dtype=int32>

There are 9,643,193 SNPs on this chromosome, genotyped in 765 individuals, so this is a relatively big dataset, too big to work with in memory uncompressed. To cope with larger datasets, scikit-allel can run computations directly against the data on disk, and can also make use of compression to reduce data size.

genotype_all = allel.GenotypeChunkedArray(callset[chrom]['calldata']['genotype'])
genotype_all
<GenotypeChunkedArray shape=(9643193, 765, 2) dtype=int8 chunks=(6553, 10, 2) nbytes=13.7G cbytes=548.0M cratio=25.7 compression=gzip compression_opts=3 values=h5py._hl.dataset.Dataset>

This array would be 13.7G uncompressed, but genotype data compress very well, so the actual size on disk is only 548M.
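
As a quick check on the 13.7G figure, we can compute the uncompressed size directly from the array's shape and dtype (each int8 genotype call takes one byte):

# number of elements times bytes per element; ~13.7 GiB
np.prod(genotype_all.shape) * genotype_all.dtype.itemsize
14754085290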

There is also a table of sample metadata which we’ll need because it tells us which mosquito comes from which population.

df_samples = pandas.read_csv('data/2015-09-21/samples.meta.txt', sep='\t', index_col='index')
df_samples.head()
ox_code src_code sra_sample_accession population country region contributor contact year m_s sex n_sequences mean_coverage latitude longitude
index
0 AB0085-C BF2-4 ERS223996 BFS Burkina Faso Pala Austin Burt Sam O'Loughlin 2012 S F 89905852 28.01 11.150 -4.235
1 AB0087-C BF3-3 ERS224013 BFM Burkina Faso Bana Austin Burt Sam O'Loughlin 2012 M F 116706234 36.76 11.233 -4.472
2 AB0088-C BF3-5 ERS223991 BFM Burkina Faso Bana Austin Burt Sam O'Loughlin 2012 M F 112090460 23.30 11.233 -4.472
3 AB0089-C BF3-8 ERS224031 BFM Burkina Faso Bana Austin Burt Sam O'Loughlin 2012 M F 145350454 41.36 11.233 -4.472
4 AB0090-C BF3-10 ERS223936 BFM Burkina Faso Bana Austin Burt Sam O'Loughlin 2012 M F 105012254 34.64 11.233 -4.472

The ‘index’ column in this table corresponds to the order of columns in the genotype array.
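
As a quick sanity check, the number of rows in the sample metadata table should match the number of columns in the genotype array:

# one metadata row per mosquito, one genotype column per mosquito
len(df_samples) == genotype_all.shape[1]
True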

Let’s pick two populations to work with.

pop1 = 'BFM'
pop2 = 'AOM'
n_samples_pop1 = np.count_nonzero(df_samples.population == pop1)
n_samples_pop2 = np.count_nonzero(df_samples.population == pop2)
print(pop1, n_samples_pop1, pop2, n_samples_pop2)
BFM 69 AOM 60

I’ve chosen BFM (Anopheles coluzzii from Burkina Faso) and AOM (Anopheles coluzzii from Angola) because FST is reasonably high between these two populations.

Now compute allele counts in each population. This takes a minute or so.

# dictionary mapping population names to sample indices
subpops = {
    pop1: df_samples[df_samples.population == pop1].index,
    pop2: df_samples[df_samples.population == pop2].index,
}
# allele counts
acs = genotype_all.count_alleles_subpops(subpops)
acs
<AlleleCountsChunkedTable shape=(9643193,) dtype=[('AOM', '<i4', (4,)), ('BFM', '<i4', (4,))] nbytes=294.3M cbytes=23.5M cratio=12.5 values=allel.chunked.storage_zarr.ZarrTable>
          AOM           BFM
0         [120 0 0 0]   [138 0 0 0]
1         [120 0 0 0]   [138 0 0 0]
2         [120 0 0 0]   [138 0 0 0]
...       ...           ...
9643190   [120 0 0 0]   [138 0 0 0]
9643191   [120 0 0 0]   [138 0 0 0]
9643192   [120 0 0 0]   [138 0 0 0]

Finally, we can filter out variants that aren’t segregating in the union of our two populations. Let’s also filter out multiallelic variants for simplicity.

acu = allel.AlleleCountsArray(acs[pop1][:] + acs[pop2][:])
flt = acu.is_segregating() & (acu.max_allele() == 1)
print('retaining', np.count_nonzero(flt), 'SNPs')
retaining 3177369 SNPs
pos = pos_all.compress(flt)
ac1 = allel.AlleleCountsArray(acs[pop1].compress(flt, axis=0)[:, :2])
ac2 = allel.AlleleCountsArray(acs[pop2].compress(flt, axis=0)[:, :2])
genotype = genotype_all.compress(flt, axis=0)
genotype
<GenotypeChunkedArray shape=(3177369, 765, 2) dtype=int8 chunks=(776, 765, 2) nbytes=4.5G cbytes=412.7M cratio=11.2 compression=blosc compression_opts={'cname': 'lz4', 'clevel': 5, 'shuffle': 1} values=zarr.core.Array>

Note that the new genotype array we've created is held in memory; however, we're using compressed arrays to store these intermediate data, so the amount of memory required is small. By the way, don't be confused by the name of the 'compress' method: it comes from numpy and means making a selection along a particular axis. It has nothing to do with data compression, which scikit-allel handles automatically.
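
To illustrate, here is numpy's compress at work on a toy array (a minimal example, unrelated to the Ag1000G data):

x = np.array([[1, 2], [3, 4], [5, 6]])
# select rows 0 and 2 via a boolean condition; no data compression involved
np.compress([True, False, True], x, axis=0)
array([[1, 2],
       [5, 6]])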

Comparing FST estimators

Per-SNP estimates

Let’s first compute the per-SNP FST value from each of the two estimators. The Weir & Cockerham estimator takes a little longer because it has to revisit the genotype data. The Hudson estimator is faster because it only needs the allele counts, which we’ve already computed.

# sample indices for population 1
pop1_idx = subpops[pop1]
# sample indices for population 2
pop2_idx = subpops[pop2]
# a, b and c are the Weir & Cockerham variance components, with one column
# per allele; for biallelic SNPs the columns are equivalent, so take the first
a, b, c = allel.stats.weir_cockerham_fst(genotype, subpops=[pop1_idx, pop2_idx], max_allele=1)
snp_fst_wc = (a / (a + b + c))[:, 0]
snp_fst_wc
array([-0.00102459,  0.01277143,  0.08574083, ..., -0.00102459,
        0.0554781 ,  0.08247673])
# num and den are the per-SNP numerator and denominator of the Hudson estimator
num, den = allel.stats.hudson_fst(ac1, ac2)
snp_fst_hudson = num / den
snp_fst_hudson
array([ 0.        ,  0.01459854,  0.08403361, ...,  0.        ,
        0.05042017,  0.07563025])
fig, ax = plt.subplots(figsize=(5, 5))
sns.despine(ax=ax, offset=5)
ax.plot(snp_fst_hudson, snp_fst_wc, color='k', marker='.', linestyle=' ')
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_xlabel('Hudson $F_{ST}$')
ax.set_ylabel('Weir & Cockerham $F_{ST}$')
ax.set_title('%s (%s) vs %s (%s), SNP $F_{ST}$' % (pop1, n_samples_pop1, pop2, n_samples_pop2));

[Figure: BFM (69) vs AOM (60), per-SNP FST: Hudson vs Weir & Cockerham]

With a couple of exceptions, the two estimators are virtually identical across all SNPs. However, one thing Bhatia et al. warn about is that the Weir & Cockerham estimator can give different results when sample sizes are unequal. We've chosen two populations with similar sample sizes, but what happens if we fake one of the populations to have a much smaller sample size?

# keep only 20 samples from first population
pop1_idx_ds = subpops[pop1][:20]
a, b, c = allel.stats.weir_cockerham_fst(genotype, subpops=[pop1_idx_ds, pop2_idx], max_allele=1)
# there may be some non-segregating variants after down-sampling, suppress errors about zero division
with np.errstate(divide='ignore', invalid='ignore'):
    snp_fst_wc_ds = (a / (a + b + c))[:, 0]
snp_fst_wc_ds
array([        nan,  0.03306823,  0.03605869, ...,         nan,
        0.02435467,  0.04264706])
# count SNPs yielding NaN (not segregating in the down-sampled samples)
np.count_nonzero(np.isnan(snp_fst_wc_ds))
1105405
# recompute allele counts for downsampled population
ac1_ds = genotype.count_alleles(subpop=pop1_idx_ds, max_allele=1)
num, den = allel.stats.hudson_fst(ac1_ds, ac2)
# there may be some non-segregating variants after down-sampling, suppress errors about zero division
with np.errstate(divide='ignore', invalid='ignore'):
    snp_fst_hudson_ds = num / den
snp_fst_hudson_ds
array([        nan,  0.        ,  0.08403361, ...,         nan,
        0.05042017,  0.07563025])
fig, ax = plt.subplots(figsize=(5, 5))
sns.despine(ax=ax, offset=5)
ax.plot(snp_fst_hudson_ds, snp_fst_wc_ds, color='k', marker='.', linestyle=' ')
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_xlabel('Hudson $F_{ST}$')
ax.set_ylabel('Weir & Cockerham $F_{ST}$')
ax.set_title('%s (20) vs %s (%s), SNP $F_{ST}$' % (pop1, pop2, n_samples_pop2));

[Figure: BFM (20) vs AOM (60), per-SNP FST: Hudson vs Weir & Cockerham]

fig, ax = plt.subplots(figsize=(5, 5))
sns.despine(ax=ax, offset=5)
ax.plot(snp_fst_hudson, snp_fst_hudson_ds, color='k', marker='.', linestyle=' ')
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_xlabel('Hudson $F_{ST}$')
ax.set_ylabel('Hudson $F_{ST}$ (one population downsampled)')
ax.set_title('%s vs %s, SNP $F_{ST}$' % (pop1, pop2));

[Figure: BFM vs AOM, per-SNP Hudson FST: full samples vs one population downsampled]

fig, ax = plt.subplots(figsize=(5, 5))
sns.despine(ax=ax, offset=5)
ax.plot(snp_fst_wc, snp_fst_wc_ds, color='k', marker='.', linestyle=' ')
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_xlabel('Weir & Cockerham $F_{ST}$')
ax.set_ylabel('Weir & Cockerham $F_{ST}$ (one population downsampled)')
ax.set_title('%s vs %s, SNP $F_{ST}$' % (pop1, pop2));

[Figure: BFM vs AOM, per-SNP Weir & Cockerham FST: full samples vs one population downsampled]

When the sample sizes are unequal, the correspondence between the two estimators is clearly much weaker. Also, the Weir & Cockerham estimator appears to differ systematically with and without one population down-sampled.

Chromosome average estimates

Now compute chromosome-wide average FST with standard errors approximated via a block-jackknife.
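
For intuition, here is a minimal sketch of how a block-jackknife works for a ratio estimator like FST, using the per-SNP numerator and denominator from the Hudson estimator. The helper block_jackknife_ratio is hypothetical, just for illustration; the scikit-allel blockwise functions below handle all of this internally (and their weighting details may differ):

# illustrative sketch only; block_jackknife_ratio is not part of scikit-allel
def block_jackknife_ratio(num, den, blen):
    # sum numerator and denominator components within non-overlapping blocks
    n_blocks = len(num) // blen
    num_b = num[:n_blocks * blen].reshape(n_blocks, blen).sum(axis=1)
    den_b = den[:n_blocks * blen].reshape(n_blocks, blen).sum(axis=1)
    # the overall estimate is the ratio of sums over all blocks
    estimate = num_b.sum() / den_b.sum()
    # leave-one-block-out pseudo-estimates
    loo = (num_b.sum() - num_b) / (den_b.sum() - den_b)
    # delete-one jackknife standard error
    se = np.sqrt((n_blocks - 1) / n_blocks * np.sum((loo - loo.mean()) ** 2))
    return estimate, se

# e.g., this should give values very close to blockwise_hudson_fst below
# block_jackknife_ratio(*allel.stats.hudson_fst(ac1, ac2), blen=10000)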

fst_wc, se_wc, vb_wc, _ = allel.stats.blockwise_weir_cockerham_fst(genotype, subpops=[pop1_idx, pop2_idx], 
                                                                   blen=10000, max_allele=1)
print('%.04f +/- %.04f (Weir & Cockerham)' % (fst_wc, se_wc))
0.1057 +/- 0.0030 (Weir & Cockerham)
fst_hudson, se_hudson, vb_hudson, _ = allel.stats.blockwise_hudson_fst(ac1, ac2, blen=10000)
print('%.04f +/- %.04f (Hudson)' % (fst_hudson, se_hudson))
0.1065 +/- 0.0030 (Hudson)

The two estimates are very close, well within one standard error.

How about with one population downsampled to 20?

fst_wc_ds, se_wc_ds, _, _ = allel.stats.blockwise_weir_cockerham_fst(genotype, subpops=[pop1_idx_ds, pop2_idx], 
                                                                     blen=10000, max_allele=1)
print('%.04f +/- %.04f (Weir & Cockerham)' % (fst_wc_ds, se_wc_ds))
0.1120 +/- 0.0030 (Weir & Cockerham)
fst_hudson_ds, se_hudson_ds, _, _ = allel.stats.blockwise_hudson_fst(ac1_ds, ac2, blen=10000)
print('%.04f +/- %.04f (Hudson)' % (fst_hudson_ds, se_hudson_ds))
0.1057 +/- 0.0030 (Hudson)

The two estimates are now separated by about two standard errors, with the Weir & Cockerham estimate inflated relative to the estimate obtained with the full samples from both populations.

SNP ascertainment

Another issue that Bhatia et al. discuss is SNP ascertainment: how you choose which SNPs to use when estimating FST can make a difference. As I understand it, when computing average FST you want to use a set of SNPs that were segregating in the ancestral population, because changes in allele frequency at these SNPs tell you something about genetic drift.

Bhatia et al. recommend ascertaining SNPs by choosing SNPs that are segregating in a third "outgroup" population. However, we don't really have an obvious outgroup population in Ag1000G, so instead we have four choices: (1) choose SNPs segregating in the first population; (2) choose SNPs segregating in the second population; (3) choose SNPs segregating in either population; (4) choose SNPs segregating in both populations.

Let’s explore the impact of different ascertainment schemes, using the Hudson estimator.

def compute_fst(ac1, ac2, scheme):

    # ascertain SNPs according to the chosen scheme
    if scheme == 'first':
        loc_asc = ac1.is_segregating()
    elif scheme == 'second':
        loc_asc = ac2.is_segregating()
    elif scheme == 'either':
        loc_asc = ac1.is_segregating() | ac2.is_segregating()
    elif scheme == 'both':
        loc_asc = ac1.is_segregating() & ac2.is_segregating()
    else:
        raise ValueError('unknown ascertainment scheme: %r' % scheme)
    n_snps = np.count_nonzero(loc_asc)

    # apply the ascertainment filter to both allele count arrays
    ac1 = ac1.compress(loc_asc, axis=0)
    ac2 = ac2.compress(loc_asc, axis=0)

    fst, se, _, _ = allel.stats.blockwise_hudson_fst(ac1, ac2, blen=10000)

    # human-readable descriptions of each scheme, for the printout
    desc = {'first': 'the first population', 'second': 'the second population',
            'either': 'either population', 'both': 'both populations'}
    print('%.04f +/- %.04f (using %s SNPs segregating in %s)' % (fst, se, n_snps, desc[scheme]))

for scheme in 'first', 'second', 'either', 'both':
    compute_fst(ac1, ac2, scheme)
0.1027 +/- 0.0031 (using 2797341 SNPs segregating in the first population)
0.1140 +/- 0.0040 (using 1086498 SNPs segregating in the second population)
0.1065 +/- 0.0030 (using 3177369 SNPs segregating in either population)
0.1101 +/- 0.0045 (using 706470 SNPs segregating in both populations)

The spread of values here is more than three standard errors, so clearly ascertainment makes a difference. Here I'd be inclined to use SNPs segregating in both populations, as it is the stricter criterion; however, comments are very welcome.

Genome plot

Finally, let’s plot FST over the chromosome, to see if any regions are particularly differentiated.

def plot_fst(ac1, ac2, pos, blen=2000):
    
    fst, se, vb, _ = allel.stats.blockwise_hudson_fst(ac1, ac2, blen=blen)
    
    # use the per-block average Fst as the Y coordinate
    y = vb
    
    # use the block centres as the X coordinate
    x = allel.stats.moving_statistic(pos, statistic=lambda v: (v[0] + v[-1]) / 2, size=blen)
    
    # plot
    fig, ax = plt.subplots(figsize=(10, 4))
    sns.despine(ax=ax, offset=5)
    ax.plot(x, y, 'k-', lw=.5)
    ax.set_ylabel('$F_{ST}$')
    ax.set_xlabel('Chromosome %s position (bp)' % chrom)
    ax.set_xlim(0, pos.max())
    
plot_fst(ac1, ac2, pos)

[Figure: Hudson FST in blocks of 2,000 SNPs across chromosome 3L]

This plot highlights some genome regions where FST is higher than the chromosome-wide average, which would be interesting to follow up.
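
As a follow-up sketch, we could list the centres of blocks where FST stands well above the bulk of the distribution. The cut-off below is arbitrary, purely for illustration:

# recompute per-block Fst values and block centres, as in plot_fst above
blen = 2000
fst, se, vb, _ = allel.stats.blockwise_hudson_fst(ac1, ac2, blen=blen)
x = allel.stats.moving_statistic(pos, statistic=lambda v: (v[0] + v[-1]) / 2, size=blen)
# flag blocks more than 3 standard deviations above the mean block value
threshold = vb.mean() + 3 * vb.std()
for centre, value in zip(x[vb > threshold], vb[vb > threshold]):
    print('block centred at %.0f bp: Fst = %.3f' % (centre, value))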

Conclusions

Hudson’s FST estimator is more robust to unequal sample sizes, and faster to compute because it only requires allele counts as input.

SNP ascertainment also makes a difference. It’s probably a good idea to try different ascertainment schemes to see what impact they have on the results.

Further reading

Bhatia, G., Patterson, N., Sankararaman, S. and Price, A.L. (2013) Estimating and interpreting FST: the impact of rare variants. Genome Research, 23(9), 1514-1521.

import datetime
print(datetime.datetime.now().isoformat())
2016-11-01T19:32:06.817362