This post introduces malariagen.net/vobs - a new web page for the Malaria Vector Genome Observatory and an entry point to data, training, software, research and more.


The main focus for my team and me over the last few years has been building the Malaria Vector Genome Observatory, a collaborative effort to create large-scale open data resources for research and surveillance of mosquitoes carrying malaria. The observatory is more than data, however, because we've also built software tools and training resources to help scientists in our community access and analyse these data. On top of that, we've collaborated with a good number of other groups to analyse and publish research using these data.

Because the observatory is a large and relatively complex project, it's been difficult to communicate an overview of the whole project, and it's also been difficult for users and interested folks to find and navigate relevant information. Part of the problem was that the observatory didn't have a single web page giving an overview and entry point to all of these different parts, so we decided to make one, working together with the MalariaGEN communications team and the Sanger Institute web team. The result is now live at malariagen.net/vobs. This post gives a breakdown of the different parts and why they're there.

screenshot

Here's the page header and navigation bar. Data is the main focus of the observatory, so this is front and centre, but there are five other components I wanted to highlight. Training and the API are key resources to help folks use the data. All data from the observatory is in Google Cloud, so I also wanted to include something about Compute resources that can be used to access data. Of course, the observatory is the result of contributions by many different Partner groups, and that needs to be represented. And there's an increasing amount of new Research coming out.

Problem statement

screenshot

Next is a high-level statement of the big questions the observatory is aiming to help solve. This is mainly for people who've never heard of the observatory before, to give a sense of why it's relevant. But it's super hard to boil so much complexity down into a few short sentences, so I struggled with this. The different groups contributing to the observatory have a very broad range of interests and questions they're trying to answer, spanning everything from basic biology and blue-sky research through to very applied, short-term disease control issues. That said, these three questions hopefully do a decent job of capturing the major focus of the vector observatory community and why we're interested in genomics.

Call to action

screenshot

This is the call to action, a single statement trying to capture the central goal of the observatory. This was even harder to write than the problem statement! But the fun part was creating an image to go behind it. This was the first time I'd tried to use generative AI for something serious (see the section on generative AI images later in this post).

At a glance

screenshot

Next is an overview of the six main components of the observatory, which correspond to the six sections of the page below. This partly repeats information from the navigation bar, but I wanted to reinforce this structure and provide a sentence for each component to explain what it is. This also serves as another menu, so people can click through to the section they're most interested in.

Data

screenshot

Now we get into the main content. The first section covers data, giving some numbers to convey the size of the data resource and providing links through to the data user guide, a documentation site with more technical information about the data and how to access it.

Training

screenshot

Training comes next, because this is a key entry point for many first-time users of the data resource. The main link is through to the data analysis training course website.

API

screenshot

We've built a Python API to access the data, but it's become much more than a simple access tool. There's a growing range of analytical and data visualisation functions, so I made a gallery of screenshots of various plots you can make with it to try and convey this capability. The bottom of the section has a link through to the API docs website.

Compute

screenshot

The Compute section highlights three services, all of which provide access to compute resources inside Google Cloud and so have good locality with the data. I put Colab first because it's free and so provides the best starting point for most users. But I also wanted to highlight some other options for folks who need more compute power. There are more options too, like Vertex AI Workbench, but I didn't want to overload the section.

Partners

screenshot

Lots of people have contributed to the observatory, and it's been a big community effort. I'd ultimately like to put an interactive map here, which would also be a bit easier to update as new contributors join.

Research

screenshot

A key output from the observatory is new research, and so I wanted to highlight this. We're trying to keep this up to date, so this can be a reference point for papers using data from the observatory.

Implementation notes

Prototyping with Jekyll

I built a first prototype of the landing page using GitHub Pages, starting from this one page creative Jekyll theme. It's a pretty old theme now and uses an older version of Bootstrap, but it's free and worked well for prototyping purposes. I was actually away from my computer when I built this, but I really had a bee in my bonnet and wanted to make something, so I built it on my phone with the GitHub file editor - not the easiest thing I've ever done, but sometimes you get an idea and it's hard to put it down! Working on my phone at least forced me to learn something about the Bootstrap grid system and how to do mobile-first responsive design. The production landing page was then built by the MalariaGEN comms team (Jon and Sree) and the Sanger web team, using WordPress, which hosts the MalariaGEN website.

Generative AI images

I mentioned above that I used generative AI to create one of the images. A colleague had pointed me at MidJourney previously and I'd had a tinker, but this seemed like a good excuse to try it out more extensively. I wanted something cool that gave a sense of scope and scale. The two themes I tried to prompt around were "A galaxy made of DNA" (riffing on the analogy with space observatories) and "A sea of data" (conveying that, hey, there's a shed-load of data here). Here are my two favourites.

A galaxy made from DNA

screenshot

A vast swirling colourful sea of data

screenshot

Bias

On the slightly less fun side, I also learned just how bad the gender and racial biases are in MidJourney. For example, when I prompted with "computer scientist", almost all images showed a bearded white man unless I explicitly added information about diversity to the prompt. There's a lot written about this problem, e.g., here's one article I found about gender bias in MidJourney. Bottom line, if you're using generative AI, bear this in mind.

Future work

There's more to be done here, but hopefully this is a useful page for now. A couple of things in particular I'd like to add...

Contributing to the observatory

A question that comes up often is how people can contribute to the observatory, so I think we need some documentation about that, plus a section on the landing page so it's easy to find.

Technical and scientific updates

We also need some way to surface and communicate updates, particularly about new data releases, new data features, new API releases, new research, etc., so there's an easy way to see what's new and get notified about important changes. We might need to make a separate site for that, and link through from the landing page, so it doesn't get cluttered.