This post has some examples of analysing a genetic cross, using scikit-allel and general scientific Python libraries (NumPy, SciPy, pandas, matplotlib, etc.). As usual, if you spot any errors or have any suggestions, please drop a comment below.

Setup

import numpy as np
import scipy
import pandas
import h5py
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
sns.set_style('ticks')
import allel; print('scikit-allel', allel.__version__)
scikit-allel 1.0.3

I’m going to use data from the Ag1000G project phase 1 data releases (AR3, AR3.1), which include genotype calls for four genetic crosses. Each cross involves two parents (a mother and a father) and up to 20 offspring (progeny). These are mosquito crosses, but mosquitoes are diploid (like us), so the genetics are the same as if analysing a cross or family of any other diploid species.

Here’s some information about the crosses.

samples = pandas.read_csv('data/phase1.AR3.1/samples/cross.samples.meta.txt', sep='\t')
samples.head()
ox_code cross role n_reads median_cov mean_cov sex colony_id
0 AD0231-C 29-2 parent 451.762 20.0 19.375 F ghana
1 AD0232-C 29-2 parent 572.326 25.0 24.370 M kisumu
2 AD0234-C 29-2 progeny 489.057 16.0 15.742 F NaN
3 AD0235-C 29-2 progeny 539.649 17.0 17.364 F NaN
4 AD0236-C 29-2 progeny 537.237 17.0 17.284 F NaN
samples.cross.value_counts()
46-9    22
29-2    22
36-9    20
42-4    16
Name: cross, dtype: int64

So there are four crosses. The two largest (crosses ‘29-2’ and ‘46-9’) each have 22 individuals (2 parents, 20 progeny), and the smallest (‘42-4’) has 16 individuals (2 parents, 14 progeny).

All individuals in all crosses have been sequenced on Illumina HiSeq machines, and then have had genotypes called at variant sites discovered in a cohort of wild specimens. The genotype data were originally in VCF format, however for ease of analysis we’ve converted the data to HDF5 format.

Open the file containing genotype data for chromosome arm 3R.

callset = h5py.File('data/phase1.AR3/variation/crosses/ar3/hdf5/ag1000g.crosses.phase1.ar3sites.3R.h5',
                    mode='r')
callset
<HDF5 file "ag1000g.crosses.phase1.ar3sites.3R.h5" (mode r)>

To analyse your own data using the examples shown below, you would need to convert the genotype data to either NumPy or HDF5 format. If you have the data in VCF format then you can use the vcfnp utility to perform the conversion. There is some documentation in the vcfnp README but please feel free to email me if you run into any difficulty. This data conversion is the slow, painful step, if you can get over it then the rest should be relatively plain sailing.

Here I am going to start from unphased genotype data. If you have already phased the data that’s fine, convert to NumPy or HDF5 as you would for unphased data.

In total I have genotype calls in 80 individuals at 22,632,425 SNPs on chromosome 3R. However, I’m only going to analyse data for a single cross, ‘29-2’, between a mother from the ‘Ghana’ colony and a father from the ‘Kisumu’ colony. I can subset out the genotype data for just this cross, and keep only SNPs that are segregating in this cross. I’m also only going to keep SNPs that passed all quality filters.

genotypes = allel.GenotypeChunkedArray(callset['3R/calldata/genotype'])
genotypes
<GenotypeChunkedArray shape=(22632425, 80, 2) dtype=int8 chunks=(6553, 10, 2) nbytes=3.4G cbytes=131.6M cratio=26.2 compression=gzip compression_opts=1 values=h5py._hl.dataset.Dataset>
01234...7576777879
01/10/10/10/10/1...0/10/10/10/10/1
10/00/00/00/00/0...0/00/00/00/00/0
20/00/00/00/00/0...0/00/00/00/00/0
......
22632422./../../../../....0/0./../.0/00/0
22632423./../../../../....0/0./../../.0/0
22632424./../../../../...../../../../.1/1
# locate the indices of the samples within the callset
sample_indices = samples[samples.cross == '29-2'].index.values.tolist()
sample_indices
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
# do an allele count to find segregating variants
ac = genotypes.count_alleles(max_allele=3, subpop=sample_indices)[:]
ac
<AlleleCountsArray shape=(22632425, 4) dtype=int32>
0123
01925 0 0
144 0 0 0
244 0 0 0
......
226324220000
226324230000
226324240000
# locate segregating SNPs
loc_seg = ac.is_segregating()
loc_seg
array([ True, False, False, ..., False, False, False], dtype=bool)
# how many SNPs are segregating within the cross?
np.count_nonzero(loc_seg)
2142258
# locate SNPs that passed all quality filters
loc_pass = callset['3R/variants/FILTER_PASS'][:]
loc_pass
array([False, False, False, ..., False, False, False], dtype=bool)
# perform the subset and load the results into memory uncompressed
genotypes_cross_pass = genotypes.subset(loc_seg & loc_pass, sample_indices)[:]
genotypes_cross_pass
<GenotypeArray shape=(709399, 22, 2) dtype=int8>
01234...1718192021
0./.0/00/10/00/0...0/00/00/00/00/0
1./.0/00/00/00/0...0/00/00/00/00/0
20/10/00/10/10/0...0/00/00/00/00/0
......
7093961/10/00/10/10/1...0/10/10/10/10/1
7093970/01/10/10/10/1...0/10/10/10/10/1
7093980/01/10/10/10/1...0/10/10/10/10/1

Now I have an array of genotype calls at 709,399 segregating SNPs in 22 individuals. The mother is the first column, the father is the second column, and the progeny are the remaining columns. You’ll notice that the mother’s genotype call is missing at the first two SNPs: we could remove these now, but we’ll leave them in for the moment, just to check that phasing is robust to some missing data.

Phasing by transmission

I’m starting from unphased genotype calls, so the first thing to do is phase the calls to generate haplotypes. There are several options for phasing a cross. Here I’m going to use the phase_by_transmission() function from scikit-allel, because it’s convenient and fast (couple of seconds). We’ve found this function works well for crosses with relatively large numbers of progeny. However, if you have a smaller family with only a couple of progeny, and/or you have a more complicated pedigree with multiple generations, try phasing with SHAPEIT2 + duoHMM or MERLIN.

genotypes_cross_phased = allel.phase_by_transmission(genotypes_cross_pass, window_size=100)
genotypes_cross_phased
<GenotypeArray shape=(709399, 22, 2) dtype=int8>
01234...1718192021
0./.0/00/10/00/0...0/00/00/00/00/0
1./.0/00/00/00/0...0/00/00/00/00/0
20|10|01|01|00|0...0|00|00|00|00|0
......
7093961|10|01|01|01|0...1|01|01|01|01|0
7093970|01|10|10|10|1...0|10|10|10|10|1
7093980|01|10|10|10|1...0|10|10|10|10|1

Notice that most of the genotype calls in the snippet shown above now have a pipe character (‘|’) as the allele separator, indicating the call is phased. The first two SNPs could not be phased, however, because the genotype call for one of the parents is missing.

Before we go further, let’s subset down to the SNPs where all individuals could be phased.

# locate variants where all genotype calls were phased
loc_phased_all = np.all(genotypes_cross_phased.is_phased, axis=1)
loc_phased_all
array([False, False,  True, ...,  True,  True,  True], dtype=bool)
genotypes_cross_phased_all = genotypes_cross_phased[loc_phased_all]
genotypes_cross_phased_all
<GenotypeArray shape=(629086, 22, 2) dtype=int8>
01234...1718192021
00|10|01|01|00|0...0|00|00|00|00|0
10|10|01|01|00|0...0|00|01|00|00|0
20|20|02|02|00|0...2|00|02|00|00|0
......
6290831|10|01|01|01|0...1|01|01|01|01|0
6290840|01|10|10|10|1...0|10|10|10|10|1
6290850|01|10|10|10|1...0|10|10|10|10|1

Note we have a few SNPs here which have a “0|2” or “2|0” genotype call. These were multi-allelic in the original callset.

Visualising transmission and recombination

Now we have phased data, let’s plot the transmission of alleles from parents to progeny. This is a useful diagnostic for assessing the quality of the phasing, and also gives an indication of how much recombination has occurred.

Before plotting, I’m going to separate out the data into maternal and paternal haplotypes. The maternal haplotypes are the two haplotypes carried by the mother, and the haplotype in each of the progeny inherited from the mother. The paternal haplotypes are the same but for the father.

# pull out mother's genotypes from the first column
genotypes_mother = genotypes_cross_phased_all[:, 0]
# convert to haplotype array
haplotypes_mother = genotypes_mother.to_haplotypes()
# pull out maternal haplotypes from the progeny
haplotypes_progeny_maternal = allel.HaplotypeArray(genotypes_cross_phased_all[:, 2:, 0])
# stack mother's haplotypes alongside haplotypes she transmitted to her progeny
haplotypes_maternal = haplotypes_mother.concatenate(haplotypes_progeny_maternal, axis=1)
haplotypes_maternal
<HaplotypeArray shape=(629086, 22) dtype=int8>
01234...1718192021
001110...00000
101110...00100
202220...20200
......
62908311111...11111
62908400000...00000
62908500000...00000

Let’s fix on the mother for a moment. The mother has two haplotypes (displayed in the first two column above). Because recombination occurs during gamete formation, each haplotype the mother passes on to her progeny is a unique mosaic of her own two haplotypes. For any SNP where the mother’s two haplotypes carry a different allele, we can “paint” the maternal haplotypes within the progeny according to which of the mother’s two alleles were inherited, using the paint_transmission() function.

painting_maternal = allel.paint_transmission(haplotypes_mother, haplotypes_progeny_maternal)
painting_maternal
array([[2, 2, 1, ..., 1, 1, 1],
       [2, 2, 1, ..., 2, 1, 1],
       [2, 2, 1, ..., 2, 1, 1],
       ..., 
       [4, 4, 4, ..., 4, 4, 4],
       [3, 3, 3, ..., 3, 3, 3],
       [3, 3, 3, ..., 3, 3, 3]], dtype=uint8)

This new “painting” array is an array of integer codes, where each number means something. The meanings are given in the help text for the paint_transmission() function:

help(allel.paint_transmission)
Help on function paint_transmission in module allel.stats.mendel:

paint_transmission(parent_haplotypes, progeny_haplotypes)
    Paint haplotypes inherited from a single diploid parent according to
    their allelic inheritance.
    
    Parameters
    ----------
    parent_haplotypes : array_like, int, shape (n_variants, 2)
        Both haplotypes from a single diploid parent.
    progeny_haplotypes : array_like, int, shape (n_variants, n_progeny)
        Haplotypes found in progeny of the given parent, inherited from the
        given parent. I.e., haplotypes from gametes of the given parent.
    
    Returns
    -------
    painting : ndarray, uint8, shape (n_variants, n_progeny)
        An array of integers coded as follows: 1 = allele inherited from
        first parental haplotype; 2 = allele inherited from second parental
        haplotype; 3 = reference allele, also carried by both parental
        haplotypes; 4 = non-reference allele, also carried by both parental
        haplotypes; 5 = non-parental allele; 6 = either or both parental
        alleles missing; 7 = missing allele; 0 = undetermined.
    
    Examples
    --------
    >>> import allel
    >>> haplotypes = allel.HaplotypeArray([
    ...     [0, 0, 0, 1, 2, -1],
    ...     [0, 1, 0, 1, 2, -1],
    ...     [1, 0, 0, 1, 2, -1],
    ...     [1, 1, 0, 1, 2, -1],
    ...     [0, 2, 0, 1, 2, -1],
    ...     [0, -1, 0, 1, 2, -1],
    ...     [-1, 1, 0, 1, 2, -1],
    ...     [-1, -1, 0, 1, 2, -1],
    ... ], dtype='i1')
    >>> painting = allel.stats.paint_transmission(haplotypes[:, :2],
    ...                                           haplotypes[:, 2:])
    >>> painting
    array([[3, 5, 5, 7],
           [1, 2, 5, 7],
           [2, 1, 5, 7],
           [5, 4, 5, 7],
           [1, 5, 2, 7],
           [6, 6, 6, 7],
           [6, 6, 6, 7],
           [6, 6, 6, 7]], dtype=uint8)

We are particularly interested in plotting the “1” and “2” values, because these occur where the mother’s haplotypes carried two different alleles, and so we have information about which allele has been transmitted.

We can do the same for the father.

genotypes_father = genotypes_cross_phased_all[:, 1]
haplotypes_father = genotypes_father.to_haplotypes()
haplotypes_progeny_paternal = allel.HaplotypeArray(genotypes_cross_phased_all[:, 2:, 1])
haplotypes_paternal = haplotypes_father.concatenate(haplotypes_progeny_paternal, axis=1)
haplotypes_paternal
<HaplotypeArray shape=(629086, 22) dtype=int8>
01234...1718192021
000000...00000
100000...00000
200000...00000
......
62908300000...00000
62908411111...11111
62908511111...11111
painting_paternal = allel.paint_transmission(haplotypes_father, haplotypes_progeny_paternal)
painting_paternal
array([[3, 3, 3, ..., 3, 3, 3],
       [3, 3, 3, ..., 3, 3, 3],
       [3, 3, 3, ..., 3, 3, 3],
       ..., 
       [3, 3, 3, ..., 3, 3, 3],
       [4, 4, 4, ..., 4, 4, 4],
       [4, 4, 4, ..., 4, 4, 4]], dtype=uint8)

Now we have these “paintings”, we can plot them.

# create a colour palette
palette = sns.color_palette('colorblind')
sns.palplot(palette)

png

def plot_transmission(painting, title=None):
    
    # set figure height depending on number of haplotypes
    fig, ax = plt.subplots(figsize=(12, .2 * painting.shape[1]))
    
    # map painting codes onto colours
    cmap = mpl.colors.ListedColormap([
        'grey',      # 0 = undetermined
        palette[0],  # 1 = allele inherited from first parental haplotype
        palette[2],  # 2 = allele inherited from second parental haplotype
        palette[1],  # 3 = reference allele, also carried by both parental haplotypes
        palette[3],  # 4 = non-reference allele, also carried by both parental haplotypes
        'black',     # 5 = non-parental allele (i.e., Mendelian error)
        palette[5],  # 6 = either or both parental alleles missing
        'white',     # 7 = missing allele
    ])
    
    # plot painting
    ax.pcolormesh(painting.T, cmap=cmap, vmin=0, vmax=7)
    
    # tidy up axes
    ax.set_yticks(np.arange(painting.shape[1]) + .5)
    ax.set_yticklabels(range(painting.shape[1]))
    ax.set_ylabel('Progeny haplotypes')
    ax.set_xlabel('Variants')
    if title:
        ax.set_title(title)

Run the plot on the maternal progeny haplotypes. I’ll plot only every 10th variant as there are >600,000 SNPs here which is way more that can be visualised and slows down plotting.

plot_transmission(painting_maternal[::10], title='Maternal haplotypes')

png

Recombination events can be seen in the switches between blue and red within a single progeny haplotype. However, there are quite a few sites here that do not segregate between the mother’s two haplotypes, shown in green and pink. These aren’t really interesting, so we can get rid of them by subsetting the painting to only sites that are heterozygous in the mother.

plot_transmission(painting_maternal[genotypes_mother.is_het()][::10], 
                  title='Maternal haplotypes (sites heterozygous in mother)')

png

Note that there are some SNPs where all progeny are blue, and some SNPs where all progeny are red. These are unlikely to be true as they would imply recombination in multiple progeny at the same location. It’s more likely these are genotyping errors in the parent.

There are also some sites where a single progeny switches from blue to red then straight back to blue, or vice versa. It is possible these are gene conversion events, however they may also be genotyping errors in the progeny.

Let’s take a look at the paternal haplotypes.

plot_transmission(painting_paternal[::10], title='Paternal haplotypes')

png

plot_transmission(painting_paternal[genotypes_father.is_het()][::10], 
                  title='Paternal haplotypes (sites heterozygous in father)')

png

What’s going on with the father? It turns out that the father in this cross comes from a mosquito colony (‘Kisumu’) that was colonised back in the 1970s. It’s been maintained in colony ever since, and colonies are kept in small cages, so inbreeding is inevitable. The father has long runs of homozygosity, and within these runs there are no SNPs that are informative for recombination. Let’s take a look at heterozygosity in the parents to confirm.

# load variant physical positions
pos = callset['3R/variants/POS'][:]
pos_cross_pass = pos[loc_seg & loc_pass]
def plot_density(pos, window_size=100000, title=None):
    
    fig, ax = plt.subplots(figsize=(12, 2))
    sns.despine(ax=ax, offset=5)
    y, windows = allel.windowed_count(pos, size=window_size)
    x = np.mean(windows, axis=1)
    ax.plot(x, y/window_size)
    ax.set_ylabel('Density (bp$^{-1}$)')
    ax.set_xlabel('Position (bp)')
    if title:
        ax.set_title(title)
    
plot_density(pos_cross_pass[genotypes_cross_pass[:, 0].is_het()],
             title='Heterozygosity in mother')

png

plot_density(pos_cross_pass[genotypes_cross_pass[:, 1].is_het()],
             title='Heterozygosity in father')

png

Mendelian errors

Another thing you can do with genotype calls from a cross is to look for calls in one or more progeny that are not consistent with Mendelian transmission, also known as Mendelian errors. Let’s locate Mendelian errors using the mendel_errors() function. We’ll run this on a different set of calls, again keeping only variants that segregate within the cross, but this time we won’t exclude SNPs that didn’t pass quality filters.

genotypes_cross = genotypes.subset(loc_seg, sample_indices)
genotypes_cross
<GenotypeChunkedArray shape=(2142258, 22, 2) dtype=int8 chunks=(8369, 22, 2) nbytes=89.9M cbytes=25.1M cratio=3.6 compression=blosc compression_opts={'clevel': 5, 'shuffle': 1, 'cname': 'lz4'} values=zarr.core.Array>
01234...1718192021
01/10/10/10/10/1...1/10/01/10/10/1
10/00/00/00/10/0...0/00/10/00/10/0
20/00/00/00/00/0...0/00/00/00/00/0
......
2142255./.1/10/0./../....0/0./../../../.
2142256./.0/00/0./../....0/0./../../../.
2142257./.0/0./../../....0/0./../../../.
genotypes_parents = genotypes_cross[:, :2]
genotypes_progeny = genotypes_cross[:, 2:]
mendel_errors = allel.mendel_errors(genotypes_parents, genotypes_progeny)
mendel_errors
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
mendel_errors.shape
(2142258, 20)

The mendel_errors array has one element for each progeny genotype call, and counts the number of non-Mendelian alleles present (up to 2). This can be summarized in various ways, e.g., by individual.

mendel_errors_by_sample = np.sum(mendel_errors, axis=0)
mendel_errors_by_sample
array([150437, 147810, 145152, 144064, 147128, 146949, 131825, 146373,
       128950, 148635, 138708, 142210, 145581, 135932, 139459, 136204,
       140298, 149186, 144172, 139478])
# plot errors by sample
fig, ax = plt.subplots(figsize=(10, 5))
sns.despine(ax=ax, offset=5)
left = list(range(len(mendel_errors_by_sample)))
height = mendel_errors_by_sample
ax.bar(left, height)
ax.set_xticks(left)
ax.set_xlabel('Progeny')
ax.set_ylabel('No. errors')
ax.set_title('Mendel errors by individual');

png

We could also look at the density of Mendel errors over the genome.

mendel_errors_by_variant = np.sum(mendel_errors, axis=1)
mendel_errors_by_variant
array([ 1, 10,  5, ...,  0,  0,  0])
pos_cross = pos[loc_seg]
fig, ax = plt.subplots(figsize=(12, 2))
sns.despine(ax=ax, offset=5)
y, windows, _ = allel.windowed_statistic(pos_cross, values=mendel_errors_by_variant, statistic=np.sum, size=100000)
x = np.mean(windows, axis=1)
ax.plot(x, y)
ax.set_xlabel('Position (bp)')
ax.set_ylabel('No. errors')
ax.set_title('Mendel errors by genome position');

png

Another analysis we can do is to study the association between rate of Mendel error and variant annotations like QD (quality by depth) and MQ (mapping quality). This can be useful when designing a variant filtering strategy.

# load some variant annotations
vtbl = allel.VariantChunkedTable(callset['3R/variants'], names=['POS', 'QD', 'MQ'])
# subset annotations to sites segregating in the cross
vtbl_cross = vtbl.compress(loc_seg)[:]
vtbl_cross
<VariantTable shape=(2142258,) dtype=(numpy.record, [('POS', '<i4'), ('QD', '<f2'), ('MQ', '<f2')])>
POSQDMQ
01322.73451.5
1661.320351.625
2940.2800351.188
......
2142255532003690.01000214.047
2142256532003910.700210.609
2142257532003960.680189.7734
def plot_variant_me(f, bins):

    fig, ax = plt.subplots(figsize=(7, 5))
    sns.despine(ax=ax, offset=5)
    x = vtbl_cross[f]

    # plot a histogram
    ax.hist(x, bins=bins)
    ax.set_xlabel(f)
    ax.set_ylabel('No. variants')

    # plot Mendel errors
    ax = ax.twinx()
    sns.despine(ax=ax, bottom=True, left=True, right=False, offset=10)
    values = mendel_errors_by_variant
    with np.errstate(over='ignore'):
        # binned_statistic generates an annoying overflow warning which we can ignore
        y1, _, _ = scipy.stats.binned_statistic(x, values, statistic=np.mean, bins=bins)
    bx = (bins[1:] + bins[:-1]) / 2
    ax.plot(bx, y1, color='k', label='Mendel errors')
    ax.set_ylabel('Mendel error rate')
    ax.legend(loc='upper right')
plot_variant_me('QD', bins=np.arange(0, 40, 1))

png

plot_variant_me('MQ', bins=np.arange(0, 60, 1))

png

def plot_joint_variant_me(f1, f2, gridsize=20, mincnt=20, extent=None):
    fig, ax = plt.subplots()
    sns.despine(ax=ax, offset=5)
    x = vtbl_cross[f1]
    y = vtbl_cross[f2]
    C = mendel_errors_by_variant
    im = ax.hexbin(x, y, C=C, reduce_C_function=np.mean, mincnt=mincnt, extent=extent,
                   gridsize=gridsize, cmap='jet')
    fig.colorbar(im)
    ax.set_xlabel(f1)
    ax.set_ylabel(f2)
    ax.set_title('Variant %s versus %s and Mendel error rate' % (f1, f2))
plot_joint_variant_me('QD', 'MQ', extent=(0, 40, 0, 60))

png

Further reading