Training course in data analysis for genomic surveillance of African malaria vectors
This post introduces a new training course and explains some of the rationale and technology choices we made to create, publish and deliver the course online.
Part of our mission is to make malaria genomics as accessible as possible. Over the past year my team and I have been working hard to develop a training course that introduces scientists to the elements of mosquito genomic data analysis. The goal was to create a course that can help someone get as quickly as possible from little or no prior experience of genomic data analysis to a point where they can independently plan and run a suite of common analysis methods required for surveillance of mosquito populations. This post gives an introduction to the scope and structure of the course, along with some discussion of choices we made to develop, publish and deliver the course online.
Scope and structure
The course is structured into 8 workshops, each of which requires around 6-8 hours of learning time and so can be delivered in a day. Each workshop is focused around a single topic, such as population structure or detecting genes under recent positive selection. The final workshop then focuses on how to plan, execute and present a complete analysis. Here's the full workshop programme:
The topics we cover include some basic population genetics, such as analysing population structure, identifying cryptic species and quantifying genetic diversity, and evolutionary threats to malaria vector control, such as analysing insecticide resistance mutations and how they spread between mosquito populations. There is obviously a very applied focus for these topics, as most members of our community are interested in using genomics to help monitor and control malaria vectors better.
Each workshop is divided into 4 modules which cover the topic from different angles. The modules have different themes. Everything builds towards the Analysis module, where we learn how to run one or more commonly needed analysis methods on some real data from the Malaria Vector Genome Observatory. However, I wanted to also include modules on the underlying Tools & Technology, Biology and Data so that trainees get some insight into everything that goes into that analysis. E.g., here are the modules from Workshop 2:
Some workshops also have a Journal Club module, where an existing paper relevant to the topic is presented by one of the authors, so trainees see how the ideas and methods have been applied.
Each module comprises a lecture video in French or English plus a lecture notebook which includes fully worked and executable code examples. After watching the lecture video, trainees launch the notebook in Google Colab and execute all the code examples for themselves using real genomic data, attempting some additional exercises along the way.
Executable lecture notebooks
The lecture notebooks from all workshops and modules have been used to build the training course website. This website was built using Jupyter Book, a relatively new technology that fortuitously arrived just before we started this project, and has been great. The source code for the training website is hosted on GitHub and the website is built and deployed to GitHub Pages. If you browse into the docs folder you'll see a file called _toc.yml which defines the table of contents for the site, and _config.yml which provides the site configuration. There is then one folder for each workshop (e.g., workshop-2), and inside each folder are some Jupyter notebooks, one for each workshop module. Jupyter Book renders each notebook to an HTML file and organises all the files into a static website according to the table of contents you define. As course developers, all we needed to do was create the notebooks and update the table of contents, and everything would be rendered and published automatically via a GitHub action.
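To make the layout described above concrete, here is a minimal sketch that writes out a table of contents in Jupyter Book's jb-book format for 8 workshops of 4 modules each. The root page name and the notebook file names (index, module-1 and so on) are hypothetical placeholders, not the names used in the real course repository, and the real table of contents may well be organised differently.

```python
# Sketch: generate the text of a Jupyter Book _toc.yml for 8 workshops x 4 modules.
# The root page ("index") and notebook names ("module-1", ...) are hypothetical.

N_WORKSHOPS = 8
MODULES_PER_WORKSHOP = 4


def build_toc(n_workshops=N_WORKSHOPS, n_modules=MODULES_PER_WORKSHOP):
    """Return the text of a _toc.yml using the jb-book format."""
    lines = ["format: jb-book", "root: index", "parts:"]
    for w in range(1, n_workshops + 1):
        # One part per workshop, matching one folder per workshop.
        lines.append(f"  - caption: Workshop {w}")
        lines.append("    chapters:")
        for m in range(1, n_modules + 1):
            # One notebook per module; file paths omit the .ipynb extension.
            lines.append(f"      - file: workshop-{w}/module-{m}")
    return "\n".join(lines) + "\n"


toc_text = build_toc()
print(toc_text)
```

With a file like this in place, adding a module amounts to dropping a notebook into the workshop folder and adding one more file entry.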
A super nice feature of Jupyter Book is that it adds a button to each page to launch the notebook in Google Colab:
In one click, anyone following the training course can have the code examples opened and ready to run in Google Colab. All examples use real data which is hosted in Google Cloud Storage, and use the malariagen_data Python API which accesses data directly from GCS, so there is no need to move any data around before starting an analysis, and a single pip install command brings in all the software needed. This makes the transition from passive learning to active learning via practical, hands-on analysis about as easy as you can make it.
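As a rough illustration of the two pieces involved, the sketch below builds a Colab link for a notebook hosted on GitHub and shows what a first data access call might look like. The organisation, repository and notebook path are hypothetical placeholders. The data access function assumes that malariagen_data has been installed with pip and that the session has network access, so the import is deferred and the function is not called here.

```python
# Sketch only: the repository coordinates used below are hypothetical placeholders.

COLAB_BASE = "https://colab.research.google.com/github"


def colab_url(org, repo, branch, notebook_path):
    """Build a link that opens a GitHub-hosted notebook in Google Colab."""
    return f"{COLAB_BASE}/{org}/{repo}/blob/{branch}/{notebook_path}"


def load_sample_metadata():
    """Fetch sample metadata directly from cloud storage via malariagen_data.

    Assumes `pip install malariagen_data` has been run and that the session
    has network access; the import is deferred until the function is called.
    """
    import malariagen_data

    ag3 = malariagen_data.Ag3()  # entry point for the Anopheles gambiae complex data
    return ag3.sample_metadata()  # pandas DataFrame with one row per sample


url = colab_url(
    org="example-org",
    repo="example-course",
    branch="main",
    notebook_path="docs/workshop-2/module-1.ipynb",
)
print(url)
```

In a Colab session, calling load_sample_metadata() after the install step should return a table of samples without any data having been downloaded or copied beforehand.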
English and French
A substantial fraction of our community is based in francophone Africa, and we really wanted to do something to make the course accessible to different language groups. I'm extremely fortunate to have a native French speaker, Jon Brenas, in my team, as well as support from Eric Lucas at LSTM who is also bilingual. So we decided to translate and rerecord all of the lecture videos from English into French. There are 32 modules in total, each with a lecture of 30-50 minutes in length, often with very technical language. Jon was also often working to tight deadlines, as we typically only managed to finish the English content for each workshop about a week before we were due to deliver the workshop to the first cohort of trainees. It was a heroic effort, and Jon is a legend.
We decided not to translate all of the text in the lecture notebooks as well, as this would have meant maintaining two different versions of each notebook, which would have been too much. However, we did translate all of the practical exercises into French within each notebook. At some point it would be amazing if Jupyter Book had some kind of support for internationalisation, so you could have text for different languages within the same notebook/page.
Running the course online
All of the lecture videos and notebooks are available via the course website, so anyone can follow the course in their own time via self-directed learning. But we also wanted to run the course as a series of online workshops, to provide a more structured experience with support from experienced teaching assistants. To make this happen we worked together with Vikki Simpson's team at GSU, particularly the wonderful Paballo Chauke, and with the PAMCA Anopheles genomics team led by Elijah Juma.
We decided to deliver the workshops via Zoom, because it was the platform we were most familiar with, and because it supports breakout rooms and video sharing. We also decided to prerecord all of the lecture videos, primarily so we could translate them into French, but also so that we could easily rerun the course multiple times in future, and publish them on YouTube to provide a self-directed learning option. At the beginning of each workshop, all trainees logged into Zoom, and we started with a plenary opening session of around 30 minutes where we welcomed everyone, gave a high-level introduction to the main topic of the workshop, and gave some logistical information about how the workshop would be run. For each module, we then split the 50 trainees into 5 pre-assigned breakout rooms, with around 10 trainees and 2 teaching assistants per room. One of the teaching assistants played the lecture video, and then we allowed 30-50 minutes for trainees to run the lecture notebook for themselves and attempt the practical exercises, asking questions or requesting support from TAs as needed. We ran two modules back-to-back in the morning, had a break for lunch, then ran two more modules in the afternoon, before bringing everyone back to the main room to give a short wrap-up talk and provide some final information to trainees.
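For anyone planning something similar, the pre-assignment of trainees to breakout rooms can be sketched in a few lines. This is only an illustration using a simple round-robin split and made-up trainee identifiers. It is not a description of how rooms were actually assigned for the course, where other factors such as language could reasonably have played a part.

```python
# Sketch: split a cohort evenly across breakout rooms using round-robin assignment.
# Trainee identifiers are made up for illustration.


def assign_rooms(trainees, n_rooms=5):
    """Distribute trainees across n_rooms so room sizes differ by at most one."""
    rooms = [[] for _ in range(n_rooms)]
    for i, trainee in enumerate(trainees):
        rooms[i % n_rooms].append(trainee)
    return rooms


trainees = [f"trainee-{i:02d}" for i in range(1, 51)]  # a cohort of 50
rooms = assign_rooms(trainees)

for number, members in enumerate(rooms, start=1):
    print(f"Room {number}: {len(members)} trainees")
```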