Querying multidimensional data with xarray
It felt great getting my first pull request to the xarray project merged this week. The PR adds a query()
method to the Dataset
class, which provides the ability to apply a selection to all arrays within a dataset, based on evaluating a query expression against one or more of those arrays. This is analogous to the query()
method that the pandas DataFrame
class provides, but for multidimensional data. In fact it leverages a lot on functionality already available in pandas and xarray, the actual implementation was straightforward.
What’s the use case? In our mosquito population genomics work we have datasets comprising genotype calls at ~100 million variants in ~10,000 mosquito samples. However, not all of those variants are reliable, and we very often begin an analysis by selecting a subset of high quality variants, or variants above a given allele frequency, or some combination of multiple criteria. We also often want to analyse a subset of mosquito samples, e.g., samples coming from a particular country, collected in a given year. Here’s an illustration with a dummy dataset:
<xarray.Dataset> Dimensions: (alleles: 2, ploidy: 2, samples: 4, variants: 5) Coordinates: variant_position (variants) int64 1 3 7 12 25 sample_id (samples) <U3 'foo' 'bar' 'baz' 'qux' Dimensions without coordinates: alleles, ploidy, samples, variants Data variables: variant_alleles (variants, alleles) <U1 'A' 'T' 'C' 'A' ... 'G' 'C' 'T' variant_MQ (variants) int64 45 34 12 50 55 variant_QD (variants) float64 23.4 3.2 34.9 7.6 15.7 sample_country (samples) <U12 'Burkina Faso' 'Burkina Faso' ... 'Angola' sample_year (samples) int64 2019 2018 2018 2021 call_genotype (variants, samples, ploidy) int64 0 0 0 0 1 ... 0 0 1 1 1
- alleles: 2
- ploidy: 2
- samples: 4
- variants: 5
- variant_position(variants)int641 3 7 12 25
array([ 1, 3, 7, 12, 25])
- sample_id(samples)<U3'foo' 'bar' 'baz' 'qux'
array(['foo', 'bar', 'baz', 'qux'], dtype='<U3')
- variant_alleles(variants, alleles)<U1'A' 'T' 'C' 'A' ... 'A' 'G' 'C' 'T'
array([['A', 'T'], ['C', 'A'], ['G', 'T'], ['A', 'G'], ['C', 'T']], dtype='<U1')
- variant_MQ(variants)int6445 34 12 50 55
array([45, 34, 12, 50, 55])
- variant_QD(variants)float6423.4 3.2 34.9 7.6 15.7
array([23.4, 3.2, 34.9, 7.6, 15.7])
- sample_country(samples)<U12'Burkina Faso' ... 'Angola'
array(['Burkina Faso', 'Burkina Faso', 'Cameroon', 'Angola'], dtype='<U12')
- sample_year(samples)int642019 2018 2018 2021
array([2019, 2018, 2018, 2021])
- call_genotype(variants, samples, ploidy)int640 0 0 0 1 0 1 1 ... 1 0 1 0 0 1 1 1
array([[[0, 0], [0, 0], [1, 0], [1, 1]], [[1, 1], [1, 1], [1, 0], [0, 0]], [[0, 1], [0, 0], [1, 1], [0, 0]], [[0, 0], [1, 0], [0, 0], [0, 1]], [[1, 0], [1, 0], [0, 1], [1, 1]]])
Now we can make a query to select variants. E.g., we could select variants where MQ and QD are above some threshold.
<xarray.Dataset> Dimensions: (alleles: 2, ploidy: 2, samples: 4, variants: 3) Coordinates: variant_position (variants) int64 1 12 25 sample_id (samples) <U3 'foo' 'bar' 'baz' 'qux' Dimensions without coordinates: alleles, ploidy, samples, variants Data variables: variant_alleles (variants, alleles) <U1 'A' 'T' 'A' 'G' 'C' 'T' variant_MQ (variants) int64 45 50 55 variant_QD (variants) float64 23.4 7.6 15.7 sample_country (samples) <U12 'Burkina Faso' 'Burkina Faso' ... 'Angola' sample_year (samples) int64 2019 2018 2018 2021 call_genotype (variants, samples, ploidy) int64 0 0 0 0 1 ... 0 0 1 1 1
- alleles: 2
- ploidy: 2
- samples: 4
- variants: 3
- variant_position(variants)int641 12 25
array([ 1, 12, 25])
- sample_id(samples)<U3'foo' 'bar' 'baz' 'qux'
array(['foo', 'bar', 'baz', 'qux'], dtype='<U3')
- variant_alleles(variants, alleles)<U1'A' 'T' 'A' 'G' 'C' 'T'
array([['A', 'T'], ['A', 'G'], ['C', 'T']], dtype='<U1')
- variant_MQ(variants)int6445 50 55
array([45, 50, 55])
- variant_QD(variants)float6423.4 7.6 15.7
array([23.4, 7.6, 15.7])
- sample_country(samples)<U12'Burkina Faso' ... 'Angola'
array(['Burkina Faso', 'Burkina Faso', 'Cameroon', 'Angola'], dtype='<U12')
- sample_year(samples)int642019 2018 2018 2021
array([2019, 2018, 2018, 2021])
- call_genotype(variants, samples, ploidy)int640 0 0 0 1 0 1 1 ... 1 0 1 0 0 1 1 1
array([[[0, 0], [0, 0], [1, 0], [1, 1]], [[0, 0], [1, 0], [0, 0], [0, 1]], [[1, 0], [1, 0], [0, 1], [1, 1]]])
We could also select samples, e.g., collected in Burkina Faso since 2018.
<xarray.Dataset> Dimensions: (alleles: 2, ploidy: 2, samples: 1, variants: 5) Coordinates: variant_position (variants) int64 1 3 7 12 25 sample_id (samples) <U3 'foo' Dimensions without coordinates: alleles, ploidy, samples, variants Data variables: variant_alleles (variants, alleles) <U1 'A' 'T' 'C' 'A' ... 'G' 'C' 'T' variant_MQ (variants) int64 45 34 12 50 55 variant_QD (variants) float64 23.4 3.2 34.9 7.6 15.7 sample_country (samples) <U12 'Burkina Faso' sample_year (samples) int64 2019 call_genotype (variants, samples, ploidy) int64 0 0 1 1 0 1 0 0 1 0
- alleles: 2
- ploidy: 2
- samples: 1
- variants: 5
- variant_position(variants)int641 3 7 12 25
array([ 1, 3, 7, 12, 25])
- sample_id(samples)<U3'foo'
array(['foo'], dtype='<U3')
- variant_alleles(variants, alleles)<U1'A' 'T' 'C' 'A' ... 'A' 'G' 'C' 'T'
array([['A', 'T'], ['C', 'A'], ['G', 'T'], ['A', 'G'], ['C', 'T']], dtype='<U1')
- variant_MQ(variants)int6445 34 12 50 55
array([45, 34, 12, 50, 55])
- variant_QD(variants)float6423.4 3.2 34.9 7.6 15.7
array([23.4, 3.2, 34.9, 7.6, 15.7])
- sample_country(samples)<U12'Burkina Faso'
array(['Burkina Faso'], dtype='<U12')
- sample_year(samples)int642019
array([2019])
- call_genotype(variants, samples, ploidy)int640 0 1 1 0 1 0 0 1 0
array([[[0, 0]], [[1, 1]], [[0, 1]], [[0, 0]], [[1, 0]]])
Also, both of these queries could be applied at the same time.
<xarray.Dataset> Dimensions: (alleles: 2, ploidy: 2, samples: 1, variants: 3) Coordinates: variant_position (variants) int64 1 12 25 sample_id (samples) <U3 'foo' Dimensions without coordinates: alleles, ploidy, samples, variants Data variables: variant_alleles (variants, alleles) <U1 'A' 'T' 'A' 'G' 'C' 'T' variant_MQ (variants) int64 45 50 55 variant_QD (variants) float64 23.4 7.6 15.7 sample_country (samples) <U12 'Burkina Faso' sample_year (samples) int64 2019 call_genotype (variants, samples, ploidy) int64 0 0 0 0 1 0
- alleles: 2
- ploidy: 2
- samples: 1
- variants: 3
- variant_position(variants)int641 12 25
array([ 1, 12, 25])
- sample_id(samples)<U3'foo'
array(['foo'], dtype='<U3')
- variant_alleles(variants, alleles)<U1'A' 'T' 'A' 'G' 'C' 'T'
array([['A', 'T'], ['A', 'G'], ['C', 'T']], dtype='<U1')
- variant_MQ(variants)int6445 50 55
array([45, 50, 55])
- variant_QD(variants)float6423.4 7.6 15.7
array([23.4, 7.6, 15.7])
- sample_country(samples)<U12'Burkina Faso'
array(['Burkina Faso'], dtype='<U12')
- sample_year(samples)int642019
array([2019])
- call_genotype(variants, samples, ploidy)int640 0 0 0 1 0
array([[[0, 0]], [[0, 0]], [[1, 0]]])
Of course the same thing can be achieved by other means using the existing isel()
method, but I’ve found having this query functionality can be useful for several reasons. It is concise and relatively easy to explain to newer users. It is also convenient to be able to have the query as a string, as this can be stored in metadata files if you ever need to keep a record of queries applied for different analyses.
I’m looking forward to using this new feature, and making more use of xarray generally, it’s a great package for managing scientific data of all different shapes and sizes. The xarray docs are here if you’d like to learn more about it.