This post replicates the neighbour-joining tree analysis from the paper by Kamau et al. (2024) on Anopheles coluzzii in northwestern Kenya, and demonstrates the plot_njt() function in the malariagen_data Python package.


We recently worked with Luna Kamau's group at KEMRI to analyse genomic data from a survey of Anopheles mosquitoes across Kenya, which turned up the unexpected discovery of Anopheles coluzzii in northwestern Kenya. The finding was published earlier this year in Kamau et al. 2024, and Kelly Bennett from my team who led the analysis presented the results at a recent MalariaGEN journal club.

In this post I thought I'd replicate one of the analyses and figures from the paper, which is a neighbour-joining tree investigating how the Kenyan An. coluzzii were genetically related to An. coluzzii from countries in West and Central Africa. It's also a good opportunity to demonstrate the plot_njt() function in the malariagen_data Python API, which was recently added and used for this analysis.

Setup

To begin with, set up access to data from the Malaria Vector Genome Observatory.

import malariagen_data
ag3 = malariagen_data.Ag3(
    results_cache="results_cache",
)

The data are stored in Google Cloud Storage (GCS) in the us-central1 region. You can run this analysis on Google Colab, although you might need a bit of patience as Colab VMs can be in a different US region so may not be super close to the data, and also don't have a huge amount of compute power. I ran this analysis on a Vertex AI Workbench VM with 32 vCPU and 32 GB of RAM (e2-highcpu-32) and it took a couple of minutes. If you'd like to run this analysis yourself, see Vector Observatory Data Access for more info.

Sample selection

The samples we're interested in are in sample set 1274-VO-KE-KAMAU-VMF00246 which comes from Luna Kamau's study 1274-VO-KE-KAMAU. This sample set is included in the Ag3.9 release. I'm going to also include samples from the Anopheles gambiae 1000 Genomes Project (included in the Ag3.0 release) and from Fontaine et al. (2015) (included in Ag3.10) for comparison.

sample_sets = [
    # Samples from Kenya from Luna Kamau at KEMRI.
    "1274-VO-KE-KAMAU-VMF00246",
    # Ag1000G sample sets for comparison.
    "AG1000G-BF-A",
    "AG1000G-ML-A",
    "AG1000G-GH",
    "AG1000G-CI",
    "AG1000G-CF",
    "AG1000G-CM-B",
    "AG1000G-CM-C",
    "AG1000G-AO",
    # Samples from Fontaine et al. (2015).
    "fontaine-2015-rebuild",
]

For this analysis we are only interested in Anopheles coluzzii, so define a sample metadata query to select this taxon.

sample_query = "taxon == 'coluzzii'"

Take a look at how many samples will be included in the analysis.

df_count = ag3.count_samples(
    sample_sets=sample_sets,
    sample_query=sample_query,
    index=["sample_set", "country", "admin1_name"],
)
df_count
taxon coluzzii
sample_set country admin1_name
1274-VO-KE-KAMAU-VMF00246 Kenya Turkana 26
AG1000G-AO Angola Luanda 81
AG1000G-BF-A Burkina Faso Hauts-Bassins 82
AG1000G-CF Central African Republic Bangui 18
AG1000G-CI Cote d'Ivoire Lagunes 80
AG1000G-CM-B Cameroon Far North 2
North 5
AG1000G-CM-C Cameroon Adamaoua 2
Centre 12
Littoral 2
North 1
South 2
AG1000G-GH Ghana Central Region 25
Eastern Region 1
Greater Accra Region 14
Western Region 24
AG1000G-ML-A Mali Koulikouro 27
fontaine-2015-rebuild Burkina Faso Hauts-Bassins 3
Cameroon Centre 4
South 4

How many samples in total?

df_count["coluzzii"].sum()
np.int64(415)

Neighbour-joining tree analysis

For this analysis, I'm going to use data from Chromosome 3 (contig="3RL"). I'll request 100,000 SNPs, requiring SNPs to have no missing genotype calls (max_missing_an=0) and a minimum minor allele count of 10 (min_minor_ac=10) because rare variants are less informative. All SNPs meeting these criteria will be found and then thinned to approximately the requested number.

# Define the order that countries should appear in the legend.
category_orders = {
    "country": [
        "Kenya",
        "Burkina Faso",
        "Mali",
        "Cote d'Ivoire",
        "Ghana",
        "Central African Republic",
        "Cameroon", 
        "Angola",
    ],
}

# Compute and plot the neighbour-joining tree.
fig = ag3.plot_njt(
    metric="cityblock",
    title="Anopheles coluzzii",
    width=800,
    height=800,
    show=False,
    color="country",
    marker_size=5,
    line_width=1,
    category_orders=category_orders,
    region="3RL",
    n_snps=100_000,
    sample_sets=sample_sets,
    sample_query=sample_query,
    site_mask="gamb_colu",
    max_missing_an=0,
    min_minor_ac=10,
)

# Tidy up the plot.
fig.update_layout(
    legend=dict(
        title="Country",
    ),
    title_font=dict(
        size=20,
    ),
)

# Show the plot.
fig

This plot is interactive, you can hover over the leaf nodes (circle markers) which represent the individual mosquitoes and get more information about them such as location and date of sampling. The legend is also interactive, you can click on legend entries to show or hide different countries.

The Kenyan mosquitoes are plotted in blue, and are most closely connected to mosquitoes from Burkina Faso, Mali and northern Cameroon.

Topography and climate of northwestern Kenya

The Kenyan An. coluzzii came from Turkana which is in the northwestern part of the country, bordering South Sudan. More work is needed to understand how much connectivity and migration of An. coluzzii mosquitoes between Kenya and other countries, but I found it interesting to look at some maps of topography and climate and see how Turkana relates to surrounding areas.

Topographically, northwestern Kenya is more connected with South Sudan than with other regions of Kenya. Here's a map of topography: