Mendelian transmission
This post has some examples of analysing a genetic cross, using scikit-allel and general scientific Python libraries (NumPy, SciPy, pandas, matplotlib, etc.). As usual, if you spot any errors or have any suggestions, please drop a comment below.
Setup
scikit-allel 1.0.3
I’m going to use data from the Ag1000G project phase 1 data releases (AR3, AR3.1), which include genotype calls for four genetic crosses. Each cross involves two parents (a mother and a father) and up to 20 offspring (progeny). These are mosquito crosses, but mosquitoes are diploid (like us), so the genetics are the same as if analysing a cross or family of any other diploid species.
Here’s some information about the crosses.
ox_code | cross | role | n_reads | median_cov | mean_cov | sex | colony_id | |
---|---|---|---|---|---|---|---|---|
0 | AD0231-C | 29-2 | parent | 451.762 | 20.0 | 19.375 | F | ghana |
1 | AD0232-C | 29-2 | parent | 572.326 | 25.0 | 24.370 | M | kisumu |
2 | AD0234-C | 29-2 | progeny | 489.057 | 16.0 | 15.742 | F | NaN |
3 | AD0235-C | 29-2 | progeny | 539.649 | 17.0 | 17.364 | F | NaN |
4 | AD0236-C | 29-2 | progeny | 537.237 | 17.0 | 17.284 | F | NaN |
46-9 22
29-2 22
36-9 20
42-4 16
Name: cross, dtype: int64
So there are four crosses. The two largest (crosses ‘29-2’ and ‘46-9’) each have 22 individuals (2 parents, 20 progeny), and the smallest (‘42-4’) has 16 individuals (2 parents, 14 progeny).
All individuals in all crosses have been sequenced on Illumina HiSeq machines, and then have had genotypes called at variant sites discovered in a cohort of wild specimens. The genotype data were originally in VCF format, however for ease of analysis we’ve converted the data to HDF5 format.
Open the file containing genotype data for chromosome arm 3R.
<HDF5 file "ag1000g.crosses.phase1.ar3sites.3R.h5" (mode r)>
To analyse your own data using the examples shown below, you would need to convert the genotype data to either NumPy or HDF5 format. If you have the data in VCF format then you can use the vcfnp utility to perform the conversion. There is some documentation in the vcfnp README but please feel free to email me if you run into any difficulty. This data conversion is the slow, painful step, if you can get over it then the rest should be relatively plain sailing.
Here I am going to start from unphased genotype data. If you have already phased the data that’s fine, convert to NumPy or HDF5 as you would for unphased data.
In total I have genotype calls in 80 individuals at 22,632,425 SNPs on chromosome 3R. However, I’m only going to analyse data for a single cross, ‘29-2’, between a mother from the ‘Ghana’ colony and a father from the ‘Kisumu’ colony. I can subset out the genotype data for just this cross, and keep only SNPs that are segregating in this cross. I’m also only going to keep SNPs that passed all quality filters.
0 | 1 | 2 | 3 | 4 | ... | 75 | 76 | 77 | 78 | 79 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1/1 | 0/1 | 0/1 | 0/1 | 0/1 | ... | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | |
1 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
2 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
... | ... | |||||||||||
22632422 | ./. | ./. | ./. | ./. | ./. | ... | 0/0 | ./. | ./. | 0/0 | 0/0 | |
22632423 | ./. | ./. | ./. | ./. | ./. | ... | 0/0 | ./. | ./. | ./. | 0/0 | |
22632424 | ./. | ./. | ./. | ./. | ./. | ... | ./. | ./. | ./. | ./. | 1/1 |
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
0 | 1 | 2 | 3 | ||
---|---|---|---|---|---|
0 | 19 | 25 | 0 | 0 | |
1 | 44 | 0 | 0 | 0 | |
2 | 44 | 0 | 0 | 0 | |
... | ... | ||||
22632422 | 0 | 0 | 0 | 0 | |
22632423 | 0 | 0 | 0 | 0 | |
22632424 | 0 | 0 | 0 | 0 |
array([ True, False, False, ..., False, False, False], dtype=bool)
2142258
array([False, False, False, ..., False, False, False], dtype=bool)
0 | 1 | 2 | 3 | 4 | ... | 17 | 18 | 19 | 20 | 21 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ./. | 0/0 | 0/1 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
1 | ./. | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
2 | 0/1 | 0/0 | 0/1 | 0/1 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
... | ... | |||||||||||
709396 | 1/1 | 0/0 | 0/1 | 0/1 | 0/1 | ... | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | |
709397 | 0/0 | 1/1 | 0/1 | 0/1 | 0/1 | ... | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | |
709398 | 0/0 | 1/1 | 0/1 | 0/1 | 0/1 | ... | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 |
Now I have an array of genotype calls at 709,399 segregating SNPs in 22 individuals. The mother is the first column, the father is the second column, and the progeny are the remaining columns. You’ll notice that the mother’s genotype call is missing at the first two SNPs: we could remove these now, but we’ll leave them in for the moment, just to check that phasing is robust to some missing data.
Phasing by transmission
I’m starting from unphased genotype calls, so the first thing to do is phase the calls to generate haplotypes. There are several options for phasing a cross. Here I’m going to use the phase_by_transmission()
function from scikit-allel, because it’s convenient and fast (couple of seconds). We’ve found this function works well for crosses with relatively large numbers of progeny. However, if you have a smaller family with only a couple of progeny, and/or you have a more complicated pedigree with multiple generations, try phasing with SHAPEIT2 + duoHMM or MERLIN.
0 | 1 | 2 | 3 | 4 | ... | 17 | 18 | 19 | 20 | 21 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ./. | 0/0 | 0/1 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
1 | ./. | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
2 | 0|1 | 0|0 | 1|0 | 1|0 | 0|0 | ... | 0|0 | 0|0 | 0|0 | 0|0 | 0|0 | |
... | ... | |||||||||||
709396 | 1|1 | 0|0 | 1|0 | 1|0 | 1|0 | ... | 1|0 | 1|0 | 1|0 | 1|0 | 1|0 | |
709397 | 0|0 | 1|1 | 0|1 | 0|1 | 0|1 | ... | 0|1 | 0|1 | 0|1 | 0|1 | 0|1 | |
709398 | 0|0 | 1|1 | 0|1 | 0|1 | 0|1 | ... | 0|1 | 0|1 | 0|1 | 0|1 | 0|1 |
Notice that most of the genotype calls in the snippet shown above now have a pipe character (‘|
’) as the allele separator, indicating the call is phased. The first two SNPs could not be phased, however, because the genotype call for one of the parents is missing.
Before we go further, let’s subset down to the SNPs where all individuals could be phased.
array([False, False, True, ..., True, True, True], dtype=bool)
0 | 1 | 2 | 3 | 4 | ... | 17 | 18 | 19 | 20 | 21 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0|1 | 0|0 | 1|0 | 1|0 | 0|0 | ... | 0|0 | 0|0 | 0|0 | 0|0 | 0|0 | |
1 | 0|1 | 0|0 | 1|0 | 1|0 | 0|0 | ... | 0|0 | 0|0 | 1|0 | 0|0 | 0|0 | |
2 | 0|2 | 0|0 | 2|0 | 2|0 | 0|0 | ... | 2|0 | 0|0 | 2|0 | 0|0 | 0|0 | |
... | ... | |||||||||||
629083 | 1|1 | 0|0 | 1|0 | 1|0 | 1|0 | ... | 1|0 | 1|0 | 1|0 | 1|0 | 1|0 | |
629084 | 0|0 | 1|1 | 0|1 | 0|1 | 0|1 | ... | 0|1 | 0|1 | 0|1 | 0|1 | 0|1 | |
629085 | 0|0 | 1|1 | 0|1 | 0|1 | 0|1 | ... | 0|1 | 0|1 | 0|1 | 0|1 | 0|1 |
Note we have a few SNPs here which have a “0|2
” or “2|0
” genotype call. These were multi-allelic in the original callset.
Visualising transmission and recombination
Now we have phased data, let’s plot the transmission of alleles from parents to progeny. This is a useful diagnostic for assessing the quality of the phasing, and also gives an indication of how much recombination has occurred.
Before plotting, I’m going to separate out the data into maternal and paternal haplotypes. The maternal haplotypes are the two haplotypes carried by the mother, and the haplotype in each of the progeny inherited from the mother. The paternal haplotypes are the same but for the father.
0 | 1 | 2 | 3 | 4 | ... | 17 | 18 | 19 | 20 | 21 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | |
1 | 0 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | |
2 | 0 | 2 | 2 | 2 | 0 | ... | 2 | 0 | 2 | 0 | 0 | |
... | ... | |||||||||||
629083 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | |
629084 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | |
629085 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
Let’s fix on the mother for a moment. The mother has two haplotypes (displayed in the first two column above). Because recombination occurs during gamete formation, each haplotype the mother passes on to her progeny is a unique mosaic of her own two haplotypes. For any SNP where the mother’s two haplotypes carry a different allele, we can “paint” the maternal haplotypes within the progeny according to which of the mother’s two alleles were inherited, using the paint_transmission()
function.
array([[2, 2, 1, ..., 1, 1, 1],
[2, 2, 1, ..., 2, 1, 1],
[2, 2, 1, ..., 2, 1, 1],
...,
[4, 4, 4, ..., 4, 4, 4],
[3, 3, 3, ..., 3, 3, 3],
[3, 3, 3, ..., 3, 3, 3]], dtype=uint8)
This new “painting” array is an array of integer codes, where each number means something. The meanings are given in the help text for the paint_transmission()
function:
Help on function paint_transmission in module allel.stats.mendel:
paint_transmission(parent_haplotypes, progeny_haplotypes)
Paint haplotypes inherited from a single diploid parent according to
their allelic inheritance.
Parameters
----------
parent_haplotypes : array_like, int, shape (n_variants, 2)
Both haplotypes from a single diploid parent.
progeny_haplotypes : array_like, int, shape (n_variants, n_progeny)
Haplotypes found in progeny of the given parent, inherited from the
given parent. I.e., haplotypes from gametes of the given parent.
Returns
-------
painting : ndarray, uint8, shape (n_variants, n_progeny)
An array of integers coded as follows: 1 = allele inherited from
first parental haplotype; 2 = allele inherited from second parental
haplotype; 3 = reference allele, also carried by both parental
haplotypes; 4 = non-reference allele, also carried by both parental
haplotypes; 5 = non-parental allele; 6 = either or both parental
alleles missing; 7 = missing allele; 0 = undetermined.
Examples
--------
>>> import allel
>>> haplotypes = allel.HaplotypeArray([
... [0, 0, 0, 1, 2, -1],
... [0, 1, 0, 1, 2, -1],
... [1, 0, 0, 1, 2, -1],
... [1, 1, 0, 1, 2, -1],
... [0, 2, 0, 1, 2, -1],
... [0, -1, 0, 1, 2, -1],
... [-1, 1, 0, 1, 2, -1],
... [-1, -1, 0, 1, 2, -1],
... ], dtype='i1')
>>> painting = allel.stats.paint_transmission(haplotypes[:, :2],
... haplotypes[:, 2:])
>>> painting
array([[3, 5, 5, 7],
[1, 2, 5, 7],
[2, 1, 5, 7],
[5, 4, 5, 7],
[1, 5, 2, 7],
[6, 6, 6, 7],
[6, 6, 6, 7],
[6, 6, 6, 7]], dtype=uint8)
We are particularly interested in plotting the “1” and “2” values, because these occur where the mother’s haplotypes carried two different alleles, and so we have information about which allele has been transmitted.
We can do the same for the father.
0 | 1 | 2 | 3 | 4 | ... | 17 | 18 | 19 | 20 | 21 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | |
1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | |
2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | |
... | ... | |||||||||||
629083 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | |
629084 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | |
629085 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 |
array([[3, 3, 3, ..., 3, 3, 3],
[3, 3, 3, ..., 3, 3, 3],
[3, 3, 3, ..., 3, 3, 3],
...,
[3, 3, 3, ..., 3, 3, 3],
[4, 4, 4, ..., 4, 4, 4],
[4, 4, 4, ..., 4, 4, 4]], dtype=uint8)
Now we have these “paintings”, we can plot them.
Run the plot on the maternal progeny haplotypes. I’ll plot only every 10th variant as there are >600,000 SNPs here which is way more that can be visualised and slows down plotting.
Recombination events can be seen in the switches between blue and red within a single progeny haplotype. However, there are quite a few sites here that do not segregate between the mother’s two haplotypes, shown in green and pink. These aren’t really interesting, so we can get rid of them by subsetting the painting to only sites that are heterozygous in the mother.
Note that there are some SNPs where all progeny are blue, and some SNPs where all progeny are red. These are unlikely to be true as they would imply recombination in multiple progeny at the same location. It’s more likely these are genotyping errors in the parent.
There are also some sites where a single progeny switches from blue to red then straight back to blue, or vice versa. It is possible these are gene conversion events, however they may also be genotyping errors in the progeny.
Let’s take a look at the paternal haplotypes.
What’s going on with the father? It turns out that the father in this cross comes from a mosquito colony (‘Kisumu’) that was colonised back in the 1970s. It’s been maintained in colony ever since, and colonies are kept in small cages, so inbreeding is inevitable. The father has long runs of homozygosity, and within these runs there are no SNPs that are informative for recombination. Let’s take a look at heterozygosity in the parents to confirm.
Mendelian errors
Another thing you can do with genotype calls from a cross is to look for calls in one or more progeny that are not consistent with Mendelian transmission, also known as Mendelian errors. Let’s locate Mendelian errors using the mendel_errors()
function. We’ll run this on a different set of calls, again keeping only variants that segregate within the cross, but this time we won’t exclude SNPs that didn’t pass quality filters.
0 | 1 | 2 | 3 | 4 | ... | 17 | 18 | 19 | 20 | 21 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1/1 | 0/1 | 0/1 | 0/1 | 0/1 | ... | 1/1 | 0/0 | 1/1 | 0/1 | 0/1 | |
1 | 0/0 | 0/0 | 0/0 | 0/1 | 0/0 | ... | 0/0 | 0/1 | 0/0 | 0/1 | 0/0 | |
2 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
... | ... | |||||||||||
2142255 | ./. | 1/1 | 0/0 | ./. | ./. | ... | 0/0 | ./. | ./. | ./. | ./. | |
2142256 | ./. | 0/0 | 0/0 | ./. | ./. | ... | 0/0 | ./. | ./. | ./. | ./. | |
2142257 | ./. | 0/0 | ./. | ./. | ./. | ... | 0/0 | ./. | ./. | ./. | ./. |
array([[0, 0, 0, ..., 0, 0, 0],
[0, 1, 0, ..., 0, 1, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]])
(2142258, 20)
The mendel_errors
array has one element for each progeny genotype call, and counts the number of non-Mendelian alleles present (up to 2). This can be summarized in various ways, e.g., by individual.
array([150437, 147810, 145152, 144064, 147128, 146949, 131825, 146373,
128950, 148635, 138708, 142210, 145581, 135932, 139459, 136204,
140298, 149186, 144172, 139478])
We could also look at the density of Mendel errors over the genome.
array([ 1, 10, 5, ..., 0, 0, 0])
Another analysis we can do is to study the association between rate of Mendel error and variant annotations like QD (quality by depth) and MQ (mapping quality). This can be useful when designing a variant filtering strategy.
POS | QD | MQ | ||
---|---|---|---|---|
0 | 13 | 22.734 | 51.5 | |
1 | 66 | 1.3203 | 51.625 | |
2 | 94 | 0.28003 | 51.188 | |
... | ... | |||
2142255 | 53200369 | 0.010002 | 14.047 | |
2142256 | 53200391 | 0.7002 | 10.609 | |
2142257 | 53200396 | 0.68018 | 9.7734 |
Further reading
- scikit-allel API docs (Mendelian inheritance module)
- J. O’Connell et al. (2013) A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genetics. 10(4) doi:10.1371/journal.pgen.1004234
- MERLIN tutorial