A coding assistant for genomic data analysis?
Large genomic datasets like the vector observatory can be difficult to analyse. To make analysing these data easier, we've created analytical software and an online training course, and these have definitely helped. But many potential users of these data come from backgrounds without much experience of genomics or programming, and so writing the right code can still be a challenge.
AI coding assistants are getting very good. E.g., I recently saw a talk and paper by Josh Proctor on using an LLM to generate disease modelling code which performed scarily well. Would it be possible to create a specialised AI assistant to support generation of code to analyse genomic data on malaria mosquitoes?
Use case: investigating insecticide resistance
Consider a use case where an analyst would like to investigate evidence for insecticide resistance in mosquito populations within a country of interest. Assume the analyst has good knowledge of vector biology but patchy knowledge of genomics and the molecular biology of insecticide resistance. Assume the analyst has completed basic training in vector genomics data analysis but is not an experienced programmer and finds writing data analysis code challenging.
Prompt refinement
Typically an analyst might start with a relatively broad question like, "Is there resistance to pyrethroid insecticides in Uganda?" But there are multiple possible mechanisms of pyrethroid resistance that need to be considered, and there are also multiple species of malaria vector present in Uganda. So this question is likely to need some refinement. If I were assisting this user, I would ask a series of questions to narrow down the focus, and to help the user become aware of the relevant data that are available. I would then be able to identify a smaller set of analysis function calls and parameter values that are relevant.
Would it be possible to create a chatbot assistant to help with this refinement? E.g., here is a hypothetical conversation between a user and an assistant that would refine the scope of the initial question to reach a point where code generation would be easier:
| User | Assistant |
|---|---|
| Is there resistance to pyrethroid insecticides in Uganda? | |
| | Which mosquito species are you interested in? In Uganda there are data available for Anopheles gambiae, Anopheles arabiensis and Anopheles funestus. |
| Anopheles gambiae. | |
| | Which molecular mechanism of pyrethroid resistance are you interested in? E.g., target-site resistance, Cyp-mediated metabolic resistance, or other resistance mechanisms? |
| Cyp-mediated metabolic resistance. | |
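One way this refinement step could be implemented is simply to give a general-purpose LLM a system prompt instructing it to ask clarifying questions before suggesting any code. Here is a minimal sketch of that idea; the `chat()` callable and the stopping rule are purely hypothetical placeholders for whatever LLM API and dialogue logic would actually be used:

```python
# A minimal sketch of a prompt-refinement loop. The chat() callable is a
# hypothetical stand-in for an LLM API; the stopping rule is deliberately crude.
SYSTEM_PROMPT = """
You are an assistant helping analysts explore malaria vector genomic data.
Before suggesting any code, ask clarifying questions one at a time until you
know: (1) the country or region of interest; (2) the mosquito species, e.g.,
Anopheles gambiae, Anopheles arabiensis or Anopheles funestus; and (3) the
resistance mechanism of interest, e.g., target-site resistance or Cyp-mediated
metabolic resistance. Then summarise the refined question in one sentence.
"""


def refine_question(chat, user_question):
    """Run the clarification loop until the assistant stops asking questions."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_question},
    ]
    while True:
        reply = chat(messages)  # hypothetical LLM call
        print("Assistant:", reply)
        if "?" not in reply:  # crude: assume no question mark means we're done
            return messages, reply
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": input("You: ")})
```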
Code generation
If we can get to this point, we probably have just enough information to start suggesting some code. E.g., we know that copy number amplification of Cyp genes has been associated with pyrethroid resistance. If I were assisting the analyst, I might suggest analysing CNV frequencies at a selection of Cyp genes which have previously been linked to pyrethroid resistance. Here is some code which sets up an API client to access data on the Anopheles gambiae complex, then uses two function calls to compute and visualise gene CNV frequencies in Uganda, grouping mosquitoes by year and level 1 administrative unit.
```python
# Set up the API to access data from the Anopheles gambiae complex.
import malariagen_data

ag3 = malariagen_data.Ag3()

# Define genes of interest.
interesting_cyp_genes = [
    "AGAP002862",  # Cyp6aa1
    "AGAP013128",  # Cyp6aa2
    "AGAP002865",  # Cyp6p3
    "AGAP000818",  # Cyp9k1
    "AGAP008212",  # Cyp6m2
    "AGAP008218",  # Cyp6z2
]

# Compute gene CNV frequencies.
df_cyp_cnv_frq = ag3.gene_cnv_frequencies(
    region=interesting_cyp_genes,
    sample_query="country == 'Uganda' and taxon == 'gambiae'",
    cohorts="admin1_year",
)

# Visualise CNV frequencies as a heatmap.
ag3.plot_frequencies_heatmap(df_cyp_cnv_frq)
```
This is a useful starting point for exploratory data analysis, because we can see, for example, that amplifications of Cyp6aa1 and Cyp9k1 are at very high frequency or fixed in almost all regions and years. Both of these genes are associated with pyrethroid resistance, and so this combination implies very strong metabolic resistance.
For Uganda there are data for multiple years and regions, and the user might also be interested in changes over time. I might therefore be tempted to suggest some code to visualise frequency time series, e.g.:
```python
# Compute gene CNV frequencies.
ds_cyp_cnv_frq = ag3.gene_cnv_frequencies_advanced(
    region=interesting_cyp_genes,
    sample_query="country == 'Uganda' and taxon == 'gambiae'",
    area_by="admin1_iso",
    period_by="year",
)

# Plot frequency time series.
ag3.plot_frequencies_time_series(
    ds_cyp_cnv_frq,
    height=900,  # need enough height for three sub-plots
)
```
These time series give us some more information, e.g., they show that frequencies of Cyp6aa1 and Cyp9k1 have generally increased after 2012, although there are some geographical differences between the three regions of Uganda for which we have data.
Training data / examples
What training data or existing examples are there that could be used to train an assistant?
Probably the best source would be the learning materials for the data analysis training course we've developed. These materials comprise a collection of Jupyter notebooks with code examples woven in with explanatory text.
We also create many Jupyter notebooks as part of previous and ongoing studies where we're collaborating with scientific colleagues to analyse real datasets. These tend not to have nearly as much explanatory text; rather, they are focused on analysing data to produce tables and figures for reports and papers. However, if we knew they were going to be used to train an assistant, we could potentially go through and annotate some with more explanatory text.
In general, I wonder what the best way of using Jupyter notebooks to train an assistant would be.
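One option, sketched below, would be to split each notebook into (explanation, code) pairs, which could then feed either a fine-tuning dataset or a retrieval index. The pairing heuristic here is just an assumption about how our notebooks are typically structured:

```python
import nbformat


def notebook_to_examples(path):
    """Pair each run of markdown cells with the code cell that follows.

    A rough heuristic for turning training-course notebooks into
    (explanation, code) examples; real notebooks may need smarter chunking.
    """
    nb = nbformat.read(path, as_version=4)
    examples = []
    explanation = []
    for cell in nb.cells:
        if cell.cell_type == "markdown":
            explanation.append(cell.source)
        elif cell.cell_type == "code" and explanation:
            examples.append({
                "explanation": "\n\n".join(explanation),
                "code": cell.source,
            })
            explanation = []
    return examples
```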
For the API itself, there are docstrings and type annotations for all public methods. The docstrings generally don't include examples, but will have some brief text explaining what the methods are used for. These are also used to generate an API documentation website. What would be the best way to make use of this?
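One simple possibility might be to harvest the signatures and docstrings programmatically and treat each public method as a retrievable documentation snippet, roughly like this (just a sketch; decorated or wrapped methods might need special handling):

```python
import inspect

import malariagen_data


def api_snippets(cls=malariagen_data.Ag3):
    """Collect one (signature + docstring) snippet per public method."""
    snippets = []
    for name, member in inspect.getmembers(cls, callable):
        if name.startswith("_"):
            continue
        doc = inspect.getdoc(member) or ""
        try:
            sig = str(inspect.signature(member))
        except (TypeError, ValueError):
            sig = "(...)"
        snippets.append(f"{name}{sig}\n{doc}")
    return snippets
```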
Finally, I wonder how an assistant could learn more about the content of the actual data. E.g., how could we teach an assistant something about the samples that have been sequenced from different places, time points and mosquito species? This information is very useful when trying to craft parameter values for API calls. I guess this might be roughly analogous to problems where you want to train an assistant to help generate SQL queries for a specific database.
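One possibility would be to give the assistant a compact, pre-computed summary of the sample metadata, much as a text-to-SQL assistant is given the database schema plus some summary statistics. For example, something along these lines (assuming the standard country, taxon and year columns in the sample metadata, and the `ag3` client set up earlier):

```python
# Summarise which samples exist by country, taxon and year, producing a
# compact table that could be included in the assistant's context.
df_samples = ag3.sample_metadata()
df_sample_summary = (
    df_samples
    .groupby(["country", "taxon", "year"])
    .size()
    .reset_index(name="n_samples")
)
print(df_sample_summary.to_string(index=False))
```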
Learning methods
What would be the best way to train a generative model?
If we have a library of Jupyter notebooks with text and code examples, could this be used for some form of retrieval-augmented generation? I.e., the user's question is used to retrieve the most relevant snippets of notebooks, which are then included in the prompt?
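E.g., a first pass at the retrieval step might look something like the sketch below, using a simple TF-IDF index over (explanation, code) pairs like those extracted from notebooks above; an embedding model could equally be substituted:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_index(examples):
    """Index notebook examples by their explanatory text."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform([ex["explanation"] for ex in examples])
    return vectorizer, matrix


def retrieve(question, examples, vectorizer, matrix, k=3):
    """Return the k examples most similar to the user's question."""
    scores = cosine_similarity(vectorizer.transform([question]), matrix).ravel()
    top = scores.argsort()[::-1][:k]
    return [examples[i] for i in top]


def build_prompt(question, retrieved):
    """Assemble the retrieved examples plus the question into a prompt."""
    parts = [ex["explanation"] + "\n\n" + ex["code"] for ex in retrieved]
    return "\n\n---\n\n".join(parts) + f"\n\nUser question: {question}\n\nCode:"
```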
What about few-shot learning or fine-tuning? For one specific use case like the one described above, perhaps few-shot learning would be sufficient? But to cover a breadth of different use cases, perhaps fine-tuning is needed?
Integration with Google Colab or JupyterLab?
Colab has code generation integrated within the notebook UI, which is exactly where you'd want this assistant to be:
...is there a way to achieve this, ideally for Colab but alternatively for any JupyterLab interface?
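I don't know of a supported way to plug a custom model into Colab's built-in code generation, but for any IPython-based notebook one low-tech option would be a custom cell magic that sends the cell text plus retrieved context to the model and drops the suggested code into a new cell for review. A rough sketch, where `generate_code()` is a hypothetical stand-in for the retrieval-plus-LLM step:

```python
from IPython import get_ipython
from IPython.core.magic import register_cell_magic


@register_cell_magic
def assistant(line, cell):
    """Treat the cell contents as a natural-language request and
    pre-populate the next cell with suggested analysis code."""
    prompt = cell.strip()
    code = generate_code(prompt)  # hypothetical: retrieval + LLM call
    get_ipython().set_next_input(code, replace=False)
```

A user could then write `%%assistant` at the top of a cell, describe the analysis in plain language, and review the suggested code before running it.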
Challenges and subtleties
If it were possible to generate code examples like the ones above, that would definitely get us some way towards answering the original question, but it's worth mentioning that there are a lot of variations and other analyses we might suggest, as well as some subtleties.
E.g., if we wanted to analyse Anopheles funestus instead of Anopheles gambiae, we would need to provide a different set of gene identifiers for the interesting Cyp genes to include.
Also, there are actually 107 cytochrome P450 (Cyp) genes in the Anopheles gambiae genome in total, and although only some of these have been associated with resistance previously, it's possible that prior knowledge is incomplete. Should we analyse all 107 genes, or just the smaller set of validated genes? Furthermore, some studies are also starting to find that SNPs in Cyp genes are markers of resistance. Should we analyse SNPs as well as CNVs? These are tricky decisions, because we could come up with some analysis code that is perfectly sensible from a biological point of view, but would overwhelm the user with data. E.g., there are often dozens of SNPs within each Cyp gene, even after filtering out low-frequency variants, and 107 genes each with dozens of SNPs is hard to visualise.
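To make the data-volume point concrete, here is roughly what a SNP-based extension of the analysis above might look like for a single gene; the transcript identifier AGAP002862-RA for Cyp6aa1 is an assumption on my part, and even with invariant sites dropped the resulting table can easily run to dozens of rows per gene:

```python
# SNP allele frequencies for a single Cyp gene (Cyp6aa1).
# Assumes the transcript identifier AGAP002862-RA.
df_cyp6aa1_snp_frq = ag3.snp_allele_frequencies(
    transcript="AGAP002862-RA",
    sample_query="country == 'Uganda' and taxon == 'gambiae'",
    cohorts="admin1_year",
)
ag3.plot_frequencies_heatmap(df_cyp6aa1_snp_frq)
```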
These are just a couple of the potential challenges involved in deciding what the most appropriate code examples to generate would be in order to address the original question; there are others.
How does Gemini do now?
For fun, how does vanilla Gemini do now without any additional context or specialisation?
Starting with something simple:
Good job. Try something a little more specific:
Nailed that one too. OK, now for a challenge:
The first couple of lines are not far off. In the `sample_metadata()` call the parameter `where` should be `sample_query`, although to be fair the parameter value is correct.
In the line filtering down to Anopheles gambiae samples, the "species" field should be "taxon", but otherwise that code would work.
From here on it's making stuff up. E.g., there is no function `gene_variants()`. Although Gemini does seem to know that there is a gene called "Vgsc" which you would want to analyse, which is correct as Vgsc is the target site of pyrethroid insecticides.
How about a challenging but more specific prompt:
Lots of mistakes here too. E.g., there's no function `gene_metadata()`. There are functions `sample_metadata()` and `gene_cnv()` but these haven't been called correctly. The attempt to calculate CNV frequencies is entertaining!
Clearly a lot of room for improvement, but also a lot of base knowledge that could be built on.
Last word
Hopefully this use case is a helpful starting point for thinking about what sort of assistant would be valuable and if/how it could be implemented. Suggestions and thoughts very welcome.