Querying multidimensional data with xarray

It felt great getting my first pull request to the xarray project merged this week. The PR adds a query() method to the Dataset class, which provides the ability to apply a selection to all arrays within a dataset, based on evaluating a query expression against one or more of those arrays. This is analogous to the query() method that the pandas DataFrame class provides, but for multidimensional data. In fact it leverages a lot on functionality already available in pandas and xarray, the actual implementation was straightforward.

What’s the use case? In our mosquito population genomics work we have datasets comprising genotype calls at ~100 million variants in ~10,000 mosquito samples. However, not all of those variants are reliable, and we very often begin an analysis by selecting a subset of high quality variants, or variants above a given allele frequency, or some combination of multiple criteria. We also often want to analyse a subset of mosquito samples, e.g., samples coming from a particular country, collected in a given year. Here’s an illustration with a dummy dataset:

!pip install -q -U git+git://github.com/pydata/xarray.git

import numpy as np
import xarray as xr

# dimension names
DIM_VARIANTS = "variants"
DIM_SAMPLES = "samples"
DIM_PLOIDY = "ploidy"
DIM_ALLELES = "alleles"

# coordinate arrays
coords = dict(

    # genomic positions of the variants
    variant_position = ([DIM_VARIANTS], 
                        np.array([1, 3, 7, 12, 25])),

    # sample identifiers
    sample_id = ([DIM_SAMPLES], np.array(['foo', 'bar', 'baz', 'qux'])),

)

# data variables
data_vars = dict(

    # reference and alternate alleles (these are biallelic SNPs)
    variant_alleles = ([DIM_VARIANTS, DIM_ALLELES],
                       np.array([['A', 'T'], 
                                 ['C', 'A'], 
                                 ['G', 'T'], 
                                 ['A', 'G'], 
                                 ['C', 'T']])),
                 
    # variant average mapping quality
    variant_MQ = ([DIM_VARIANTS], 
                  np.array([45, 34, 12, 50, 55])),

    # variant quality by depth
    variant_QD = ([DIM_VARIANTS],
                  np.array([23.4, 3.2, 34.9, 7.6, 15.7])),

    # country of collection
    sample_country = ([DIM_SAMPLES], 
                      np.array(['Burkina Faso', 'Burkina Faso', 'Cameroon', 'Angola'])),
    
    # year of collection
    sample_year = ([DIM_SAMPLES],
                   np.array([2019, 2018, 2018, 2021])),

    # simulate some genotype calls
    call_genotype = ([DIM_VARIANTS, DIM_SAMPLES, DIM_PLOIDY],
                     np.random.randint(0, 2, size=(5, 4, 2))),
)

ds = xr.Dataset(data_vars=data_vars, coords=coords)
ds

<xarray.Dataset>
Dimensions:           (alleles: 2, ploidy: 2, samples: 4, variants: 5)
Coordinates:
    variant_position  (variants) int64 1 3 7 12 25
    sample_id         (samples) <U3 'foo' 'bar' 'baz' 'qux'
Dimensions without coordinates: alleles, ploidy, samples, variants
Data variables:
    variant_alleles   (variants, alleles) <U1 'A' 'T' 'C' 'A' ... 'G' 'C' 'T'
    variant_MQ        (variants) int64 45 34 12 50 55
    variant_QD        (variants) float64 23.4 3.2 34.9 7.6 15.7
    sample_country    (samples) <U12 'Burkina Faso' 'Burkina Faso' ... 'Angola'
    sample_year       (samples) int64 2019 2018 2018 2021
    call_genotype     (variants, samples, ploidy) int64 0 0 0 0 1 ... 0 0 1 1 1

Now we can make a query to select variants. E.g., we could select variants where MQ and QD are above some threshold.

ds.query(variants="variant_MQ > 30 and variant_QD > 5")

<xarray.Dataset>
Dimensions:           (alleles: 2, ploidy: 2, samples: 4, variants: 3)
Coordinates:
    variant_position  (variants) int64 1 12 25
    sample_id         (samples) <U3 'foo' 'bar' 'baz' 'qux'
Dimensions without coordinates: alleles, ploidy, samples, variants
Data variables:
    variant_alleles   (variants, alleles) <U1 'A' 'T' 'A' 'G' 'C' 'T'
    variant_MQ        (variants) int64 45 50 55
    variant_QD        (variants) float64 23.4 7.6 15.7
    sample_country    (samples) <U12 'Burkina Faso' 'Burkina Faso' ... 'Angola'
    sample_year       (samples) int64 2019 2018 2018 2021
    call_genotype     (variants, samples, ploidy) int64 0 0 0 0 1 ... 0 0 1 1 1

We could also select samples, e.g., collected in Burkina Faso since 2018.

ds.query(samples="sample_country == 'Burkina Faso' and sample_year > 2018")

<xarray.Dataset>
Dimensions:           (alleles: 2, ploidy: 2, samples: 1, variants: 5)
Coordinates:
    variant_position  (variants) int64 1 3 7 12 25
    sample_id         (samples) <U3 'foo'
Dimensions without coordinates: alleles, ploidy, samples, variants
Data variables:
    variant_alleles   (variants, alleles) <U1 'A' 'T' 'C' 'A' ... 'G' 'C' 'T'
    variant_MQ        (variants) int64 45 34 12 50 55
    variant_QD        (variants) float64 23.4 3.2 34.9 7.6 15.7
    sample_country    (samples) <U12 'Burkina Faso'
    sample_year       (samples) int64 2019
    call_genotype     (variants, samples, ploidy) int64 0 0 1 1 0 1 0 0 1 0

Also, both of these queries could be applied at the same time.

ds.query(
    variants="variant_MQ > 30 and variant_QD > 5",
    samples="sample_country == 'Burkina Faso' and sample_year > 2018"
)

<xarray.Dataset>
Dimensions:           (alleles: 2, ploidy: 2, samples: 1, variants: 3)
Coordinates:
    variant_position  (variants) int64 1 12 25
    sample_id         (samples) <U3 'foo'
Dimensions without coordinates: alleles, ploidy, samples, variants
Data variables:
    variant_alleles   (variants, alleles) <U1 'A' 'T' 'A' 'G' 'C' 'T'
    variant_MQ        (variants) int64 45 50 55
    variant_QD        (variants) float64 23.4 7.6 15.7
    sample_country    (samples) <U12 'Burkina Faso'
    sample_year       (samples) int64 2019
    call_genotype     (variants, samples, ploidy) int64 0 0 0 0 1 0

Of course the same thing can be achieved by other means using the existing isel() method, but I’ve found having this query functionality can be useful for several reasons. It is concise and relatively easy to explain to newer users. It is also convenient to be able to have the query as a string, as this can be stored in metadata files if you ever need to keep a record of queries applied for different analyses.

I’m looking forward to using this new feature, and making more use of xarray generally, it’s a great package for managing scientific data of all different shapes and sizes. The xarray docs are here if you’d like to learn more about it.