Anopheles coluzzii in Kenya - neighbour-joining tree analysis
This post replicates the neighbour-joining tree analysis from the paper by Kamau et al. (2024) on Anopheles coluzzii in northwestern Kenya, and demonstrates the plot_njt()
function in the malariagen_data
Python package.
We recently worked with Luna Kamau's group at KEMRI to analyse genomic data from a survey of Anopheles mosquitoes across Kenya, which turned up the unexpected discovery of Anopheles coluzzii in northwestern Kenya. The finding was published earlier this year in Kamau et al. 2024, and Kelly Bennett from my team who led the analysis presented the results at a recent MalariaGEN journal club.
In this post I thought I'd replicate one of the analyses and figures from the paper, which is a neighbour-joining tree investigating how the Kenyan An. coluzzii were genetically related to An. coluzzii from countries in West and Central Africa. It's also a good opportunity to demonstrate the plot_njt() function in the malariagen_data Python API, which was recently added and used for this analysis.
Setup¶
To begin with, set up access to data from the Malaria Vector Genome Observatory.
import malariagen_data
ag3 = malariagen_data.Ag3(
results_cache="results_cache",
)
The data are stored in Google Cloud Storage (GCS) in the us-central1 region. You can run this analysis on Google Colab, although you might need a bit of patience as Colab VMs can be in a different US region so may not be super close to the data, and also don't have a huge amount of compute power. I ran this analysis on a Vertex AI Workbench VM with 32 vCPU and 32 GB of RAM (e2-highcpu-32) and it took a couple of minutes. If you'd like to run this analysis yourself, see Vector Observatory Data Access for more info.
Sample selection¶
The samples we're interested in are in sample set 1274-VO-KE-KAMAU-VMF00246
which comes from Luna Kamau's study 1274-VO-KE-KAMAU. This sample set is included in the Ag3.9 release. I'm going to also include samples from the Anopheles gambiae 1000 Genomes Project (included in the Ag3.0 release) and from Fontaine et al. (2015) (included in Ag3.10) for comparison.
sample_sets = [
# Samples from Kenya from Luna Kamau at KEMRI.
"1274-VO-KE-KAMAU-VMF00246",
# Ag1000G sample sets for comparison.
"AG1000G-BF-A",
"AG1000G-ML-A",
"AG1000G-GH",
"AG1000G-CI",
"AG1000G-CF",
"AG1000G-CM-B",
"AG1000G-CM-C",
"AG1000G-AO",
# Samples from Fontaine et al. (2015).
"fontaine-2015-rebuild",
]
For this analysis we are only interested in Anopheles coluzzii, so define a sample metadata query to select this taxon.
sample_query = "taxon == 'coluzzii'"
Take a look at how many samples will be included in the analysis.
df_count = ag3.count_samples(
sample_sets=sample_sets,
sample_query=sample_query,
index=["sample_set", "country", "admin1_name"],
)
df_count
How many samples in total?
df_count["coluzzii"].sum()
Neighbour-joining tree analysis¶
For this analysis, I'm going to use data from Chromosome 3 (contig="3RL"
). I'll request 100,000 SNPs, requiring SNPs to have no missing genotype calls (max_missing_an=0
) and a minimum minor allele count of 10 (min_minor_ac=10
) because rare variants are less informative. All SNPs meeting these criteria will be found and then thinned to approximately the requested number.
# Define the order that countries should appear in the legend.
category_orders = {
"country": [
"Kenya",
"Burkina Faso",
"Mali",
"Cote d'Ivoire",
"Ghana",
"Central African Republic",
"Cameroon",
"Angola",
],
}
# Compute and plot the neighbour-joining tree.
fig = ag3.plot_njt(
metric="cityblock",
title="Anopheles coluzzii",
width=800,
height=800,
show=False,
color="country",
marker_size=5,
line_width=1,
category_orders=category_orders,
region="3RL",
n_snps=100_000,
sample_sets=sample_sets,
sample_query=sample_query,
site_mask="gamb_colu",
max_missing_an=0,
min_minor_ac=10,
)
# Tidy up the plot.
fig.update_layout(
legend=dict(
title="Country",
),
title_font=dict(
size=20,
),
)
# Show the plot.
fig
This plot is interactive, you can hover over the leaf nodes (circle markers) which represent the individual mosquitoes and get more information about them such as location and date of sampling. The legend is also interactive, you can click on legend entries to show or hide different countries.
The Kenyan mosquitoes are plotted in blue, and are most closely connected to mosquitoes from Burkina Faso, Mali and northern Cameroon.
Topography and climate of northwestern Kenya¶
The Kenyan An. coluzzii came from Turkana which is in the northwestern part of the country, bordering South Sudan. More work is needed to understand how much connectivity and migration of An. coluzzii mosquitoes between Kenya and other countries, but I found it interesting to look at some maps of topography and climate and see how Turkana relates to surrounding areas.
Topographically, northwestern Kenya is more connected with South Sudan than with other regions of Kenya. Here's a map of topography:
Climate is also interesting, because northwestern Kenya is very arid for at least part of the year, similar to Mali, Burkina Faso, northern Ghana and northern Cameroon. E.g., here's a map of IPCC climate data from January which is the middle of the dry season north of the equator:
An. coluzzii is known for its ability to survive in regions with a longer dry season.
Neighbour-joining tree implementation notes¶
The implementation of the plot_njt()
function in the malariagen_data
package uses biotite
under the hood to compute the neighbour-joining tree from a distance matrix. There are also implementation of the neighbour-joining algorithm in BioPython and scikit-bio, but I did some benchmarking and biotite
was the fastest.
To create an interactive plot of the resulting tree, I implemented the equal angles layout, which is my favourite layout for unrooted trees, and plotting the results using Plotly. It would be amazing if there was a Python library that made interactive plots of different types of trees and layouts, like ggtree but using Plotly or Bokeh, but I couldn't find anything yet. Interactivity is so useful for exploratory work, being able to hover over samples and zoom in and out really helps when you're trying to make sense of complicated datasets.