Selecting variants and samples
A couple of people have recently asked me about how to use scikit-allel to select data from a variant call set for a particular set of variants. This could be variants for a specific genome region (e.g., a gene), or variants matching a particular set of filters. This notebook gives a couple of examples, using data from the (human) 1000 genomes project phase 3.
Update 2018-04-19: Added a sub-section on selecting samples.
Here’s the Python packages we’ll need. If you have earlier versions, you’ll need to upgrade.
'1.1.10'
'2.2.0'
'0.5.5'
'1.13.3'
Extract data from a VCF
The source data comes as a VCF file, which I’ve downloaded to the local disk:
-rw-r--r-- 1 aliman aliman 205M Jun 20 2017 data/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
I’m going to use data from chromosome 22 only for illustration. I’m also going to extract the data and convert to Zarr format, which will make life easier downstream. To do the conversion I’m going to use the vcf_to_zarr() function from scikit-allel. This is a one-off operation, once the data have been converted to Zarr they can be loaded directly from Zarr the next time you want to do some analysis. The conversion takes about 3 minutes on my computer.
[vcf_to_zarr] 65536 rows in 11.83s; chunk in 11.83s (5539 rows/s); 22 :18539397
[vcf_to_zarr] 131072 rows in 25.01s; chunk in 13.18s (4972 rows/s); 22 :21016127
[vcf_to_zarr] 196608 rows in 38.05s; chunk in 13.04s (5023 rows/s); 22 :23236362
[vcf_to_zarr] 262144 rows in 49.28s; chunk in 11.22s (5838 rows/s); 22 :25227844
[vcf_to_zarr] 327680 rows in 62.99s; chunk in 13.71s (4780 rows/s); 22 :27285434
[vcf_to_zarr] 393216 rows in 75.12s; chunk in 12.14s (5399 rows/s); 22 :29572822
[vcf_to_zarr] 458752 rows in 87.23s; chunk in 12.11s (5411 rows/s); 22 :31900536
[vcf_to_zarr] 524288 rows in 99.03s; chunk in 11.80s (5554 rows/s); 22 :34069864
[vcf_to_zarr] 589824 rows in 112.06s; chunk in 13.03s (5028 rows/s); 22 :36053392
[vcf_to_zarr] 655360 rows in 124.48s; chunk in 12.41s (5279 rows/s); 22 :38088395
[vcf_to_zarr] 720896 rows in 137.07s; chunk in 12.60s (5203 rows/s); 22 :40216200
[vcf_to_zarr] 786432 rows in 147.83s; chunk in 10.76s (6092 rows/s); 22 :42597446
[vcf_to_zarr] 851968 rows in 161.38s; chunk in 13.55s (4835 rows/s); 22 :44564263
[vcf_to_zarr] 917504 rows in 173.08s; chunk in 11.70s (5601 rows/s); 22 :46390672
[vcf_to_zarr] 983040 rows in 184.95s; chunk in 11.87s (5522 rows/s); 22 :48116697
[vcf_to_zarr] 1048576 rows in 199.16s; chunk in 14.21s (4611 rows/s); 22 :49713436
[vcf_to_zarr] 1103547 rows in 209.81s; chunk in 10.65s (5160 rows/s)
[vcf_to_zarr] all done (5252 rows/s)
Let’s open the Zarr data and do a little bit of poking around to see what’s there.
- /
- 22
- calldata
- GT (1103547, 2504, 2) int8
- samples (2504,) object
- variants
- AA (1103547,) object
- AC (1103547, 3) int32
- AF (1103547, 3) float32
- AFR_AF (1103547, 3) float32
- ALT (1103547, 3) object
- AMR_AF (1103547, 3) float32
- AN (1103547,) int32
- CHROM (1103547,) object
- CIEND (1103547, 2) int32
- CIPOS (1103547, 2) int32
- CS (1103547,) object
- DP (1103547,) int32
- EAS_AF (1103547, 3) float32
- END (1103547,) int32
- EUR_AF (1103547, 3) float32
- EX_TARGET (1103547,) bool
- FILTER_PASS (1103547,) bool
- ID (1103547,) object
- IMPRECISE (1103547,) bool
- MC (1103547,) object
- MEINFO (1103547, 4) object
- MEND (1103547,) int32
- MLEN (1103547,) int32
- MSTART (1103547,) int32
- MULTI_ALLELIC (1103547,) bool
- NS (1103547,) int32
- POS (1103547,) int32
- QUAL (1103547,) float32
- REF (1103547,) object
- SAS_AF (1103547, 3) float32
- SVLEN (1103547,) int32
- SVTYPE (1103547,) object
- TSD (1103547,) object
- VT (1103547,) object
- is_snp (1103547,) bool
- numalt (1103547,) int32
- svlen (1103547, 3) int32
- calldata
- 22
The tree()
method shows us how the data are organised hierarchically within the Zarr store. Hopefully you can see that there is a root group indicated by a slash (‘/’), then below that there is a group named ‘22’ containing all of the data from Chromosome 22, then there is a group called ‘calldata’ and another called ‘variants’.
Within the ‘calldata’ group there is an array named ‘GT’ which has the genotypes.
Within the ‘variants’ group there are arrays named ‘CHROM’, ‘POS’, ‘DP’, etc., containing information about the variants that have been called.
Let’s get some diagnostics on the Zarr genotype array.
Name | /22/calldata/GT |
---|---|
Type | zarr.core.Array |
Data type | int8 |
Shape | (1103547, 2504, 2) |
Chunk shape | (65536, 64, 2) |
Order | C |
Read-only | True |
Compressor | Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0) |
Store type | zarr.storage.DirectoryStore |
No. bytes | 5526563376 (5.1G) |
No. bytes stored | 293489697 (279.9M) |
Storage ratio | 18.8 |
Chunks initialized | 680/680 |
This tells us the uncompressed size of the genotype array is 5.1 gigabytes. The actual size on disk is much smaller (279.9 megabytes) because the data are compressed.
Loading data for a gene
If you want to work with data from a single gene, or any other contiguous region within a chromosome, here’s how to do it.
First you need the positions of the variants, wrapped as a scikit-allel SortedIndex
:
0 | 1 | 2 | 3 | 4 | ... | 1103542 | 1103543 | 1103544 | 1103545 | 1103546 |
---|---|---|---|---|---|---|---|---|---|---|
16050075 | 16050115 | 16050213 | 16050319 | 16050527 | ... | 51241342 | 51241386 | 51244163 | 51244205 | 51244237 |
The numbers at the top are the variant indices (starting from 0). The numbers at the bottom are the variant positions as genomic coordinates (i.e., number of base pairs from the chromosome start, starting from 1).
To extract data for a chromosome region, you can use the pos
object to translate genomic coordinates into variant indices. For example, if you want to locate data for the region from position 20,000,000 to 20,100,000, you can do this:
slice(108029, 111127, None)
The loc_region
object is a slice, which is simply a way of storing a start and stop index. Here the start index is 108,029 and the stop index is 111,127. So, we need data starting from the 108,030th variant up to but not including the 111,128th variant (remembering that indices start from zero).
We can now use this slice to extract genotypes for our genome region of interest:
0 | 1 | 2 | 3 | 4 | ... | 2499 | 2500 | 2501 | 2502 | 2503 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
1 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
2 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
... | ... | |||||||||||
3095 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
3096 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
3097 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
Note that gt_region
is a genotype array with 3,098 variants and 2,504 samples.
Breaking this down, the gt_zarr[loc_region]
code loads data for the requested slice of the genotype data from disk into memory as a numpy array. For convenience, I have wrapped this numpy array using the allel.GenotypeArray
class, which provides a nicer visual representation of the data and gives some useful methods.
Filtering variants
Filtering variants is a very common task. Each analysis typically needs a different set of variants to work with. For example, you may need to filter variants based on some metrics of quality, or on other conditions like allele frequency.
When filtering variants, you first need to identify which variants you require. To help with this, I’m first going to load up some variant information.
The ‘MULTI_ALLELIC’ array in this callset is a Boolean array indicating whether a variants is multi-allelic (has more than one alternate allele) or not. The following code loads this array from disk into memory:
array([False, False, False, ..., False, False, False], dtype=bool)
The ‘AFR_AF’ array in this callset has alternate allele frequencies for the African samples within the cohort:
array([[ 0. , nan, nan],
[ 0.0234, nan, nan],
[ 0.0272, nan, nan],
...,
[ 0. , nan, nan],
[ 0. , nan, nan],
[ 0. , nan, nan]], dtype=float32)
Here the afr_af
array has multiple columns, one for each alternate allele.
Let’s locate variants that are not multi-allelic and have an African allele frequency above 5%:
array([False, False, False, ..., False, False, False], dtype=bool)
How many variants match our query?
138275
Now, to extract genotype data for these variants, there are a couple of ways to do it.
If the full genotype array is not too big, you can start by loading the whole lot into memory. In this case, the genotype array is 5.1G, and I have enough RAM on my laptop to handle that, so let’s do it:
0 | 1 | 2 | 3 | 4 | ... | 2499 | 2500 | 2501 | 2502 | 2503 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
1 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
2 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
... | ... | |||||||||||
1103544 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
1103545 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
1103546 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
Now to extract genotypes for the selection, you can use the compress()
method, e.g.:
0 | 1 | 2 | 3 | 4 | ... | 2499 | 2500 | 2501 | 2502 | 2503 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0/1 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 1/0 | 0/1 | 0/0 | 0/0 | 0/0 | |
1 | 0/1 | 0/0 | 0/0 | 1/0 | 1/0 | ... | 1/1 | 0/1 | 1/1 | 0/0 | 1/1 | |
2 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
... | ... | |||||||||||
138272 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
138273 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
138274 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 1/0 | 0/0 | 0/0 | 0/0 |
Alternatively, if your data are larger and/or your computer doesn’t have much RAM, there is another way to do this. This makes use of a Python package called Dask, which allows you to run computations without loading all data into memory. To use Dask, we can start by wrapping the full genotype array with the allel.GenotypeDaskArray
class:
0 | 1 | 2 | 3 | 4 | ... | 2499 | 2500 | 2501 | 2502 | 2503 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
1 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
2 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
... | ... | |||||||||||
1103544 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
1103545 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
1103546 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
Now we can apply the selection, using almost the same syntax, except that when working via Dask we need to call the compute()
method to get the final result:
0 | 1 | 2 | 3 | 4 | ... | 2499 | 2500 | 2501 | 2502 | 2503 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0/1 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 1/0 | 0/1 | 0/0 | 0/0 | 0/0 | |
1 | 0/1 | 0/0 | 0/0 | 1/0 | 1/0 | ... | 1/1 | 0/1 | 1/1 | 0/0 | 1/1 | |
2 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
... | ... | |||||||||||
138272 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
138273 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
138274 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 1/0 | 0/0 | 0/0 | 0/0 |
Selecting samples
Another common requirement is selecting data for a specific subset of samples. For example, we might want to select samples from a specific population or set of populations, or we might want to exclude samples with poor data.
Here’s an example of selecting samples from a specific population. To do this we first need to know which population the samples belong to. For the 1000 genomes phase 3 dataset, there is a tab-delimited file with this information, which I’ve downloaded to my computer:
sample pop super_pop gender
HG00096 GBR EUR male
HG00097 GBR EUR female
HG00099 GBR EUR female
HG00100 GBR EUR female
HG00101 GBR EUR male
HG00102 GBR EUR female
HG00103 GBR EUR male
HG00105 GBR EUR male
HG00106 GBR EUR female
Let’s load this into a pandas DataFrame:
sample | pop | super_pop | |
---|---|---|---|
0 | HG00096 | GBR | EUR |
1 | HG00097 | GBR | EUR |
2 | HG00099 | GBR | EUR |
3 | HG00100 | GBR | EUR |
4 | HG00101 | GBR | EUR |
Out of interest, how many samples are there from each population?
sample | ||
---|---|---|
super_pop | pop | |
AFR | ACB | 96 |
ASW | 61 | |
ESN | 99 | |
GWD | 113 | |
LWK | 99 | |
MSL | 85 | |
YRI | 108 | |
AMR | CLM | 94 |
MXL | 64 | |
PEL | 85 | |
PUR | 104 | |
EAS | CDX | 93 |
CHB | 103 | |
CHS | 105 | |
JPT | 104 | |
KHV | 99 | |
EUR | CEU | 99 |
FIN | 99 | |
GBR | 91 | |
IBS | 107 | |
TSI | 107 | |
SAS | BEB | 86 |
GIH | 103 | |
ITU | 102 | |
PJL | 96 | |
STU | 102 |
Before we can use this information to select samples from the genotype data, we need to match up the order of samples between this panel file and the callset.
Here’s the sample IDs in the order they appeared in the original VCF, and thus in the order that corresponds to columns in the genotype array:
array(['HG00096', 'HG00097', 'HG00099', ..., 'NA21142', 'NA21143',
'NA21144'], dtype=object)
Are they in the same order as given in the panel file?
True
This is the ideal situation, because samples are given in the same order in the panel file and in the original VCF. This might not be the case with your dataset, however, and so it’s important to check before going any further. If data are not in the same order, you can add a column to the dataframe with the index of each sample as it appears in the callset, e.g.:
sample | pop | super_pop | callset_index | |
---|---|---|---|---|
0 | HG00096 | GBR | EUR | 0 |
1 | HG00097 | GBR | EUR | 1 |
2 | HG00099 | GBR | EUR | 2 |
3 | HG00100 | GBR | EUR | 3 |
4 | HG00101 | GBR | EUR | 4 |
Here you can see that the values in the ‘callset_index’ column are the same as the values in the dataframe index (first column) because the samples in the panel file and the callset are ordered the same.
Now, we can obtain the indices of the samples within a particular population. Let’s locate all of the African samples:
array([ 673, 674, 675, 676, 677, 678, 679, 680, 683, 684, 685,
686, 687, 712, 713, 727, 728, 729, 730, 731, 739, 740,
741, 742, 743, 761, 762, 763, 764, 788, 792, 793, 794,
812, 813, 854, 855, 867, 868, 869, 870, 879, 880, 881,
883, 884, 885, 886, 887, 888, 889, 890, 891, 892, 893,
894, 895, 933, 934, 936, 937, 938, 939, 940, 941, 942,
943, 944, 945, 946, 947, 948, 949, 950, 951, 952, 953,
954, 955, 956, 957, 962, 963, 964, 965, 966, 967, 968,
973, 974, 975, 976, 977, 978, 979, 980, 981, 982, 983,
984, 985, 986, 987, 988, 989, 990, 991, 992, 993, 994,
995, 996, 997, 998, 999, 1005, 1006, 1007, 1008, 1009, 1010,
1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019, 1020, 1031,
1032, 1033, 1034, 1035, 1036, 1050, 1051, 1052, 1053, 1054, 1055,
1065, 1066, 1067, 1068, 1069, 1070, 1071, 1072, 1073, 1086, 1087,
1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098,
1099, 1100, 1101, 1102, 1103, 1104, 1105, 1106, 1107, 1108, 1109,
1110, 1111, 1112, 1113, 1114, 1115, 1116, 1117, 1118, 1119, 1120,
1121, 1122, 1123, 1124, 1125, 1126, 1127, 1128, 1129, 1130, 1131,
1132, 1133, 1134, 1135, 1136, 1137, 1138, 1139, 1140, 1141, 1142,
1143, 1154, 1155, 1156, 1157, 1158, 1159, 1160, 1161, 1162, 1163,
1164, 1165, 1166, 1167, 1168, 1169, 1170, 1171, 1172, 1173, 1174,
1175, 1176, 1177, 1178, 1179, 1180, 1181, 1182, 1183, 1184, 1185,
1186, 1187, 1188, 1189, 1190, 1191, 1192, 1193, 1194, 1195, 1196,
1197, 1198, 1199, 1200, 1201, 1202, 1203, 1204, 1205, 1206, 1207,
1208, 1209, 1210, 1211, 1212, 1213, 1214, 1215, 1216, 1217, 1218,
1219, 1220, 1221, 1222, 1223, 1224, 1225, 1226, 1227, 1228, 1229,
1230, 1231, 1232, 1233, 1234, 1235, 1236, 1237, 1244, 1245, 1246,
1247, 1248, 1249, 1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257,
1258, 1259, 1260, 1261, 1262, 1263, 1264, 1265, 1266, 1267, 1268,
1269, 1270, 1271, 1272, 1273, 1274, 1275, 1276, 1277, 1278, 1279,
1280, 1281, 1282, 1283, 1284, 1285, 1286, 1287, 1288, 1289, 1290,
1291, 1292, 1293, 1294, 1295, 1296, 1297, 1298, 1299, 1300, 1301,
1302, 1303, 1304, 1305, 1306, 1307, 1308, 1309, 1310, 1311, 1312,
1313, 1314, 1315, 1316, 1317, 1321, 1322, 1323, 1324, 1325, 1326,
1327, 1328, 1329, 1330, 1331, 1332, 1333, 1334, 1335, 1336, 1337,
1338, 1339, 1340, 1341, 1342, 1343, 1344, 1345, 1755, 1756, 1757,
1758, 1759, 1760, 1761, 1762, 1763, 1764, 1765, 1766, 1767, 1768,
1769, 1770, 1771, 1772, 1773, 1877, 1878, 1879, 1880, 1881, 1882,
1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893,
1894, 1895, 1896, 1897, 1898, 1899, 1900, 1901, 1902, 1903, 1904,
1905, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985,
1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 2028, 2029, 2030,
2031, 2032, 2033, 2034, 2035, 2036, 2037, 2038, 2039, 2040, 2041,
2042, 2043, 2044, 2045, 2046, 2047, 2048, 2049, 2050, 2051, 2052,
2053, 2054, 2055, 2056, 2057, 2058, 2059, 2060, 2061, 2062, 2063,
2064, 2065, 2066, 2067, 2068, 2069, 2070, 2071, 2072, 2073, 2074,
2075, 2076, 2077, 2078, 2079, 2080, 2081, 2082, 2083, 2084, 2085,
2086, 2087, 2088, 2089, 2090, 2091, 2092, 2093, 2094, 2095, 2096,
2097, 2098, 2099, 2100, 2101, 2102, 2103, 2104, 2105, 2106, 2107,
2108, 2109, 2110, 2111, 2112, 2113, 2114, 2115, 2116, 2117, 2118,
2119, 2120, 2121, 2122, 2123, 2124, 2125, 2126, 2127, 2128, 2129,
2130, 2131, 2132, 2133, 2134, 2135, 2136, 2137, 2138, 2139, 2140,
2141, 2142, 2143, 2144, 2145, 2146, 2147, 2148, 2149, 2150, 2151,
2152, 2153, 2154, 2155, 2156, 2157, 2158, 2159, 2160, 2161, 2162,
2163, 2164, 2165, 2166, 2167, 2168, 2169, 2189, 2190, 2191, 2192,
2193, 2194, 2195, 2196, 2242, 2243, 2244, 2245, 2246, 2247, 2248,
2249, 2250, 2251, 2252, 2253, 2254, 2255, 2256, 2257, 2258, 2259,
2260, 2261, 2262, 2263, 2264, 2265, 2266, 2267, 2268, 2269, 2270,
2271, 2272, 2273, 2274, 2275, 2276, 2277, 2278, 2279, 2280, 2281,
2282, 2283, 2284, 2285, 2286, 2287, 2288, 2289, 2290, 2291, 2292,
2293])
So loc_samples_afr
is an array of sample indices. How many samples?
661
We can use this to extract columns from the genotype data. E.g., let’s apply this selection to the genotype array we previously obtained after selecting variants:
0 | 1 | 2 | 3 | 4 | ... | 656 | 657 | 658 | 659 | 660 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
1 | 0/0 | 0/0 | 1/1 | 1/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
2 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
... | ... | |||||||||||
138272 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
138273 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 1/0 | |
138274 | 1/1 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 1/1 |
Here we use the take()
method instead of compress()
because we are using indices rather than a Boolean array to make the selection, and we use axis=1
because we are selecting from the second axis, i.e., columns, i.e., samples.
Here’s another way to do it, using the subset()
method to apply both the variant and sample selections simultaneously, and using Dask to avoid loading the whole genotype array into memory:
0 | 1 | 2 | 3 | 4 | ... | 656 | 657 | 658 | 659 | 660 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
1 | 0/0 | 0/0 | 1/1 | 1/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
2 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
... | ... | |||||||||||
138272 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
138273 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 1/0 | |
138274 | 1/1 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 1/1 |
Further reading
Hopefully these examples have been helpful. For further information and more examples, the following may be useful:
- Extracting data from VCF files
- A tour of scikit-allel (N.B., this is an older article and uses slightly different techniques from the ones used here, but still relevant.)
- Estimating FST
- Principal components analysis