The Anopheles gambiae voltage-gated sodium channel gene (a.k.a. vgsc, para, AgNaV) is the target for DDT and pyrethroid insecticides. Mutations in this gene cause insecticide resistance, so it’s an important gene for malaria vector control.
In 2007, Emyr Davies et al. published a complete cDNA sequence for the An. gambiae vgsc gene. They inferred 35 exons and found evidence for alternative splicing involving at least five optional exons and two sets of mutually exclusive exons.
The canonical source for An. gambiae gene annotations is VectorBase. The AgamP4.4 gene annotations include three transcripts for vgsc. However, these transcripts were derived from a different source and do not represent the larger set of exons and splice variants reported by Davies et al.
In our analyses of vgsc for the Ag1000G project we’ve been using the AgamP4.4 gene model. I was concerned we could be missing important functional variation, so I went back to the Davies et al. paper and constructed a GFF file with a set of 10 putative transcripts based on the cDNAs they observed.
The first part of this article compares the exons and splice variants observed by Davies et al. with the transcripts in the AgamP4.4 gene annotations. At the end of the article I’ll explain the steps I went through to build a GFF from the information given in the Davies et al. paper.
If there are any mistakes in what’s below, or if there are other sources of information on vgsc splice variation in An. gambiae that we should also be considering, I’d be very grateful if you could send me an email or leave a comment at the bottom of the article.
This article was generated from a Jupyter notebook. It includes some Python code used to load data and generate plots. If you’re only interested in the biology you can safely skip over the code.
Identify the exons that are optional or variable in size between different transcripts, to highlight in plots below.
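The original notebook code for this step is not shown here, so the following is only a minimal sketch of one way to do it. It assumes each transcript is held as a list of (start, end) exon coordinates; the transcript IDs and coordinates are toy values, and `variable_exons` is a hypothetical helper name. An exon is flagged if it does not appear with identical coordinates in every transcript, which catches both optional exons and exons that vary in size.

```python
from collections import Counter

# Hypothetical toy data: transcript ID -> list of (start, end) exon coordinates.
transcripts = {
    "T1": [(100, 200), (300, 400), (500, 600)],
    "T2": [(100, 200), (500, 600)],              # skips the middle exon
    "T3": [(100, 200), (300, 380), (500, 600)],  # shorter middle exon
}


def variable_exons(transcripts):
    """Return the exons that are not present, with identical coordinates,
    in every transcript (i.e. optional or variable in size)."""
    # Count in how many transcripts each distinct (start, end) pair occurs.
    counts = Counter(
        exon for exons in transcripts.values() for exon in set(exons)
    )
    n_transcripts = len(transcripts)
    return {exon for exon, n in counts.items() if n < n_transcripts}


highlight = variable_exons(transcripts)
print(sorted(highlight))  # [(300, 380), (300, 400)]
```

The resulting set can then be used to pick a highlight colour when drawing each exon.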
Compare AgamP4.4 and Davies gene models
Here’s a plot of the entire gene, showing all transcripts together. Exons that are either optional or variable in size between transcripts are highlighted in red.
Let’s work through the gene in detail, taking a few exons at a time. In the plots below, the text above each exon (e.g., “3 (156)”) shows the index of the exon within the transcript (e.g., 3rd exon) and the exon length (e.g., 156 bp long). The text within the exons shows the exon number according to Davies et al. supplementary table S1, along with a lower-case letter if the exon corresponds to a variable sequence previously identified in Drosophila (for a review of vgsc studies across insect species see Dong et al. (2014)).
In the text below I will refer to exons using the exon numbering according to Davies et al. table S1.
Exons 1, 2 (j), 3
Exon 2 (also known as optional exon j) is observed in Davies cDNA C8 but not in any AgamP4.4 transcripts.
Exon 3 is longer in all Davies cDNAs (156 bp) than the AgamP4.4 transcripts (138 bp).
Exons 4-6
Exon 5 is not present in Davies cDNA C3. Davies et al. mention that this exon is also optional in the German cockroach (Blattella germanica) but should render the channel non-functional because it would eliminate a key region of the voltage sensor.
Exons 7-10
Exon 10 is missing in Davies cDNA C5. Davies et al. state this should also render the channel non-functional because it would eliminate a key region of the channel pore.
Exons 11 (i+), 12, 13 (a), 14
Exon 12 is present in Davies cDNA C1 but is not in any other transcripts.
Exon 13 (also known as optional exon a) is present in Davies cDNA C1 and in all AgamP4.4 transcripts but missing from other Davies cDNAs.
Exons 15-17
No splice variation.
Exons 18 (b+), 19, 20 (c/d)
For exon 20, Davies et al. find alternative exon c in the genomic sequence but do not observe it in any cDNAs (the “Davies-C1N9ck” transcript is a hypothetical transcript I’ve invented to represent the alternative splice variants for which Davies et al. only find genomic evidence). Two AgamP4.4 transcripts use exon c and one uses exon d.
Exon 23 has an optional region (f) which is missing in Davies cDNAs C5 and C7. All AgamP4.4 transcripts include this region.
Exon 24 has an optional region (h) which is missing in some Davies cDNAs and in two of the three AgamP4.4 transcripts.
Exon 27 (k/l)
For exon 27, Davies et al. find a potential mutually exclusive alternative exon (k) within the genomic DNA sequence, although all their cDNAs use exon l, as do all AgamP4.4 transcripts.
Exon 29 is slightly shorter in all of the Davies transcripts than in the AgamP4.4 transcripts.
Methods
Exon coordinates
Davies et al. Table S1 gives coordinates for all of the exons they infer, both from the cDNAs and from comparative analysis of the genome sequence. The genomic coordinates are based on an earlier version of the An. gambiae reference sequence and do not match the current (AgamP3/4) coordinates for the vgsc gene. The Davies coordinates look like they’re based on a region of the reference sequence that was subsequently inverted. I therefore transformed the exon coordinates to the AgamP3/4 reference sequence by taking the AgamP4.4 start coordinate of 2,358,158 for the vgsc gene as an anchor and then applying the relative exon positions given by Davies.
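As a rough sketch of this transformation (not the original notebook code), the function below maps a coordinate from the older assembly onto AgamP3/4 by computing its offset from the gene start and adding that offset to 2,358,158. The function names and the old-assembly coordinates in the example are hypothetical placeholders; only the AgamP4.4 start coordinate comes from the text. The `inverted` flag reflects the assumption that positions run in the opposite direction in the older assembly.

```python
AGAMP4_GENE_START = 2358158  # AgamP4.4 start coordinate of vgsc (from the text)


def to_agamp(old_pos, old_gene_start, inverted=True):
    """Map a single position from the old assembly to AgamP3/4.

    If the region is inverted in the old assembly, positions decrease
    as we move along the gene, so the offset is old_gene_start - old_pos.
    """
    if inverted:
        offset = old_gene_start - old_pos
    else:
        offset = old_pos - old_gene_start
    return AGAMP4_GENE_START + offset


def transform_exon(old_a, old_b, old_gene_start, inverted=True):
    """Map both ends of an exon and return them as (start, end), start <= end."""
    a = to_agamp(old_a, old_gene_start, inverted)
    b = to_agamp(old_b, old_gene_start, inverted)
    return min(a, b), max(a, b)


# Hypothetical old-assembly coordinates, chosen only for illustration.
OLD_GENE_START = 1_000_000
print(transform_exon(1_000_000, 999_854, OLD_GENE_START))  # (2358158, 2358304)
```

Because only relative positions are used, the result depends on the anchor being right, which is why the sequence cross-check described next matters.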
I cross-checked the coordinates by comparing the DNA sequence for each exon obtained using the genomic coordinates and the AgamP3/4 reference sequence against the DNA sequence obtained using the mRNA coordinates from Davies et al. Table S1 and the Davies complete cDNA sequence in GenBank. For the optional exons, I also compared with the amino acid sequences given in Table S2.
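The cross-check can be sketched as follows. This is an illustration with toy sequences rather than the real reference or GenBank record, and the helper names are hypothetical. It assumes 1-based inclusive coordinates and that the gene lies on the forward strand, so no reverse complement is needed; both of these would need confirming against the real data.

```python
def slice_1based(seq, start, end, first_coord=1):
    """Extract seq[start..end] using 1-based inclusive coordinates.

    first_coord is the coordinate of the first base in seq, which is
    useful when seq is only a window of a chromosome.
    """
    return seq[start - first_coord : end - first_coord + 1]


def exon_matches(ref_seq, ref_start, ref_end, cdna_seq, mrna_start, mrna_end,
                 ref_first_coord=1):
    """True if the exon sequence from the genome equals that from the cDNA."""
    genomic = slice_1based(ref_seq, ref_start, ref_end, ref_first_coord).upper()
    mrna = slice_1based(cdna_seq, mrna_start, mrna_end).upper()
    return genomic == mrna


# Toy data: two exons separated by an intron (GTAAGT), preceded by 4 bp of flank.
ref = "TTTT" + "ATGGCC" + "GTAAGT" + "GATTAA"
cdna = "ATGGCC" + "GATTAA"

print(exon_matches(ref, 5, 10, cdna, 1, 6))    # True
print(exon_matches(ref, 17, 22, cdna, 7, 12))  # True
print(exon_matches(ref, 16, 21, cdna, 7, 12))  # False: off by one
```

A mismatch like the last case is the kind of signal that would prompt a manual correction to the coordinates.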
To obtain the best possible concordance between all sources I made the following manual corrections to the exon coordinates:
I changed the mRNA coordinates for exon 13 to 1645-1707 as the coordinates given in Table S1 are out of sequence and look like a mistaken repetition of the coordinates for exon 15.
I changed the end coordinate for exon 20c so the translated DNA sequence matched Table S2 and the exon length matched 20d.
I changed the start and end coordinates for exon 27k so the translated DNA sequence matched Table S2 and the exon length matched 27l.
Here are the AgamP3/4 coordinates for all exons after applying these transformations:
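| exon | seqid | start | end | phase |
|------|-------|---------|---------|-------|
| 1 | 2L | 2358158 | 2358304 | 0 |
| 2j | 2L | 2359640 | 2359672 | 0 |
| 3 | 2L | 2361989 | 2362144 | 0 |
| 4 | 2L | 2381065 | 2381270 | 0 |
| 5 | 2L | 2382270 | 2382398 | 1 |
| 6 | 2L | 2385694 | 2385785 | 1 |
| 7 | 2L | 2390129 | 2390341 | 2 |
| 8 | 2L | 2390425 | 2390485 | 2 |
| 9 | 2L | 2390594 | 2390738 | 1 |
| 10 | 2L | 2391156 | 2391320 | 0 |
| 11i+ | 2L | 2399898 | 2400173 | 0 |
| 12 | 2L | 2401549 | 2401569 | 0 |
| 13a | 2L | 2402447 | 2402509 | 0 |
| 14 | 2L | 2403086 | 2403269 | 0 |
| 15 | 2L | 2407622 | 2407818 | 2 |
| 16 | 2L | 2407894 | 2407993 | 0 |
| 17 | 2L | 2408071 | 2408139 | 2 |
| 18b+ | 2L | 2416794 | 2417071 | 2 |
| 19 | 2L | 2417185 | 2417358 | 0 |
| 20c | 2L | 2417637 | 2417799 | 0 |
| 20d | 2L | 2421385 | 2421547 | 2 |
| 21 | 2L | 2422468 | 2422655 | 1 |
| 22 | 2L | 2422713 | 2422920 | 2 |
| 23f+ | 2L | 2424207 | 2424418 | 2 |
| 23f- | 2L | 2424237 | 2424418 | 1 |
| 24h+ | 2L | 2424651 | 2424870 | 2 |
| 24h- | 2L | 2424729 | 2424870 | 0 |
| 25 | 2L | 2424946 | 2425211 | 1 |
| 26 | 2L | 2425278 | 2425451 | 2 |
| 27k | 2L | 2425770 | 2425892 | 2 |
| 27l | 2L | 2427988 | 2428110 | 2 |
| 28 | 2L | 2429097 | 2429219 | 2 |
| 29 | 2L | 2429282 | 2429476 | 2 |
| 30 | 2L | 2429556 | 2429801 | 2 |
| 31 | 2L | 2429872 | 2430142 | 2 |
| 32 | 2L | 2430224 | 2430528 | 1 |
| 33 | 2L | 2430601 | 2431617 | 2 |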
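A quick sanity check on these coordinates is to compute exon lengths and confirm they agree with statements made elsewhere in the article: exon 3 should be 156 bp, and each pair of mutually exclusive exons (20c/20d, 27k/27l) should have matching lengths. The sketch below copies a handful of rows from the coordinates above and assumes the start and end values are 1-based and inclusive, so length is end - start + 1.

```python
# A subset of the exon coordinates listed above: exon -> (start, end) on 2L.
exons = {
    "3": (2361989, 2362144),
    "20c": (2417637, 2417799),
    "20d": (2421385, 2421547),
    "27k": (2425770, 2425892),
    "27l": (2427988, 2428110),
}


def exon_length(start, end):
    """Length in bp, assuming 1-based inclusive coordinates."""
    return end - start + 1


lengths = {name: exon_length(*coords) for name, coords in exons.items()}
for name, length in lengths.items():
    print(f"exon {name}: {length} bp")

# Mutually exclusive alternatives should be the same size.
print(lengths["20c"] == lengths["20d"])  # True
print(lengths["27k"] == lengths["27l"])  # True
```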
Inferring transcripts from Davies et al. cDNAs
Davies et al. (Figure 2) report two sets of cDNAs, one set (prefix ‘C’) covering the first two domains of the protein (exons 1-22) and a second set (prefix ‘N’) covering the second two domains (exons 23-33). Because they did not have any cDNAs covering the entire vgsc gene, they could not infer any complete transcripts. To make it easier to compare the results from Davies et al. with the AgamP4.4 gene annotations, I invented 9 putative transcripts by combining all unique combinations of exon usage observed in the first (C) set of cDNAs with all unique combinations from the second (N) set of cDNAs. I also invented a 10th transcript (“Davies-C1N9ck”) to represent the c and k exons inferred from the genomic sequence.
Note that there is a bit of inconsistency in Davies et al. between the text and Figure 2C regarding whether exon 10 is also missing in cDNA C3. In Figure 2C it looks like C3 might be missing some other bits as well, and so maybe C3 was incomplete or data quality was poor. To construct the putative transcripts I have assumed C3 is as described in the text, i.e., is missing exon 5 but is otherwise complete.
I formed the transcript IDs by concatenating the cDNA IDs from the first and second regions of the gene. For example, “Davies-C1N2” is a putative transcript assuming the exon usage observed in cDNAs C1 and N2.
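The combination and naming scheme can be sketched as below. This is not the original notebook code. The cDNA IDs and exon lists are truncated toy placeholders rather than the full exon usage from Davies et al. Figure 2. In particular, the exon lists for the N cDNAs are invented for illustration, and `combine_cdnas` is a hypothetical helper name.

```python
from itertools import product

# Toy placeholders: cDNA ID -> ordered list of exon names.
# C3 lacks exon 5, as described in the text; everything else is illustrative.
c_cdnas = {
    "C1": ["1", "3", "4", "5", "6"],
    "C3": ["1", "3", "4", "6"],
}
n_cdnas = {
    "N2": ["23f+", "24h-", "25"],  # hypothetical exon usage
    "N9": ["23f+", "24h+", "25"],  # hypothetical exon usage
}


def combine_cdnas(c_cdnas, n_cdnas, prefix="Davies-"):
    """Pair every C cDNA with every N cDNA to form putative transcripts.

    The transcript ID is the prefix plus the two cDNA IDs concatenated,
    and the exon list is the C exons followed by the N exons.
    """
    transcripts = {}
    for (c_id, c_exons), (n_id, n_exons) in product(
        c_cdnas.items(), n_cdnas.items()
    ):
        transcripts[f"{prefix}{c_id}{n_id}"] = list(c_exons) + list(n_exons)
    return transcripts


transcripts = combine_cdnas(c_cdnas, n_cdnas)
for transcript_id, exon_names in transcripts.items():
    print(transcript_id, exon_names)
```

With two C and two N cDNAs this yields four putative transcripts; cDNAs with identical exon usage would need to be de-duplicated first so that only unique combinations are paired.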