Genes and SNP effects

This notebook illustrates how to use data from Ag1000G phase 3 to investigate SNPs in a gene of interest. This includes computing the effects of SNPs on the protein sequence, and computing allele frequencies in different mosquito populations.

As usual this notebook is executable, click the launch icon above () to run it for yourself. Note in particular that, although this notebook uses the Rdl gene as an example, if you run this notebook yourself you can change this to investigate any gene you are interested in.

Setup

Install the packages we’ll need. If you’re working on MyBinder, you can skip this step, packages will already be installed.

# recommended package installs for Google Colab
!pip install -q \
    zarr==2.6.1 \
    fsspec==0.8.7 \
    gcsfs==0.7.2 \
    dask==2021.03.0 \
    xarray==0.18.0 \
    scikit-allel==1.3.5 \
    bokeh==2.3.2 \
    malariagen_data==0.8.0

Import packages.

import numpy as np
import bokeh.plotting
import bokeh.models
import bokeh.layouts
import bokeh.io
import bokeh.palettes
import pandas as pd
import malariagen_data

In this notebook we’ll be making some interactive plots using the bokeh package. Previously, in some of the other analysis notebooks, we used a different plotting package called plotly. We’re using bokeh here rather than plotly simply because for some of the plots we want to create here, bokeh has slightly better functionality. Each of these plotting libraries has its own strengths and limitations, and it’s worth having some familiarity with both, so you can pick the best tool for the job at hand.

Set up plotting with bokeh in the notebook.

bokeh.io.output_notebook()

Setup access to MalariaGEN data in Google Cloud.

ag3 = malariagen_data.Ag3("gs://vo_agam_release/")

Preamble: gene annotations

In this notebook we’ll be making use of gene annotations. Gene annotations provide information on which regions of the genome contain DNA sequences that encode genes, which are transcribed and spliced into messenger RNA (mRNA) and then translated to make proteins.

Gene annotations for the Anopheles gambiae (AgamP4) genome are available from VectorBase. For convenience, we’ve added some functionality to the malariagen_data package for loading these gene annotations into a pandas data frame. Let’s take a look.

df_geneset = ag3.geneset().set_index("ID")
df_geneset
contig source type start end score strand phase Parent Name description
ID
2L 2L VectorBase chromosome 1 49364325 NaN NaN NaN NaN NaN NaN
AGAP004677 2L VectorBase gene 157348 186936 NaN - NaN NaN NaN methylenetetrahydrofolate dehydrogenase(NAD ) ...
AGAP004677-RA 2L VectorBase mRNA 157348 181305 NaN - NaN AGAP004677 NaN NaN
NaN 2L VectorBase three_prime_UTR 157348 157495 NaN - NaN AGAP004677-RA NaN NaN
NaN 2L VectorBase exon 157348 157623 NaN - NaN AGAP004677-RA AGAP004677-RB-E4 NaN
... ... ... ... ... ... ... ... ... ... ... ...
NaN Y_unplaced VectorBase five_prime_UTR 47932 48111 NaN + NaN AGAP029375-RA NaN NaN
NaN Y_unplaced VectorBase exon 47932 48138 NaN + NaN AGAP029375-RA AGAP029375-RA-E2 NaN
AGAP029375-PA Y_unplaced VectorBase CDS 48112 48138 NaN + 0.0 AGAP029375-RA NaN NaN
NaN Y_unplaced VectorBase exon 48301 48385 NaN + NaN AGAP029375-RA AGAP029375-RA-E3 NaN
AGAP029375-PA Y_unplaced VectorBase CDS 48301 48385 NaN + 0.0 AGAP029375-RA NaN NaN

196145 rows × 11 columns

Each row of this data frame is a “feature”, which is a general term meaning some kind of annotation on the reference genome. The type column tells us what kind of feature, and the seqid, start and end columns tell us where the feature is located within the reference genome sequence. The strand column tells us whether this feature is read using the forward (“+”) or reverse (“-“) strand of the DNA sequence.

For example, the second row is a “gene” feature, which occurs on chromosome arm 2L, starting at position 157,348 and ending at position 186,936. The identifier for this gene is “AGAP004677”. This gene hasn’t been given a name, but does have a description, which may tell us something about the function of the protein that this gene encodes (in this case, a methylenetetrahydrofolate dehydrogenase).

Genes

To get a bit more intuition for gene annotations, let’s make an interactive plot of all the gene annotations in the geneset. We will plot each gene as a rectangle showing it’s location on a chromosome arm.

def plot_genes(contig, width=750, height=150):

    # select the gene rows within the given contig
    df_geneset = ag3.geneset(attributes=["ID", "Name", "Parent", "description"]).set_index("ID")
    data = df_geneset.query(f"type == 'gene' and contig == '{contig}'").copy()

    # plot each gene as a rectangle - add some columns to define rectangle
    # coordinates
    data['left'] = data['start'] / 1e6  # plot in Mbp coordinates
    data['right'] = data['end'] / 1e6  # plot in Mbp coordinates
    data['bottom'] = np.where(data['strand'] == '+', 1, 0)
    data['top'] = data['bottom'] + 0.8

    # tidy up some columns for presentation
    data['Name'].fillna('', inplace=True)
    data['description'].fillna('', inplace=True)

    # determine how long the contig is
    contig_length = len(ag3.genome_sequence(contig))

    # define tooltips for hover
    tooltips = [
        ("ID", '@ID'),
        ("Name", '@Name'),
        ("Description", '@description'),
    ]

    # make a figure
    fig = bokeh.plotting.figure(
        title=f'Genes - {contig}',
        plot_width=width, 
        plot_height=height,
        tools='xpan,xzoom_in,xzoom_out,xwheel_zoom,reset,tap,hover',
        toolbar_location='above',
        active_scroll='xwheel_zoom',
        active_drag='xpan',
        tooltips=tooltips,
    )

    # add functionality to click through to vectorbase
    url = f'https://vectorbase.org/vectorbase/app/record/gene/@ID'
    taptool = fig.select(type=bokeh.models.TapTool)
    taptool.callback = bokeh.models.OpenURL(url=url)

    # now plot the genes as rectangles
    fig.quad(bottom='bottom', top='top', left='left', right='right',
             source=data, line_width=.5, fill_alpha=.5)

    # tidy up the plot
    fig.x_range = bokeh.models.Range1d(0, contig_length/1e6, bounds='auto')
    fig.xaxis.axis_label = f'Position (Mbp)'
    fig.y_range = bokeh.models.Range1d(-.5, 2.3)
    fig.ygrid.visible = False
    yticks = [0.4, 1.4]
    yticklabels = ['rev', 'fwd']
    fig.yaxis.ticker = yticks
    fig.yaxis.major_label_overrides = {k: v for k, v in zip(yticks, yticklabels)}
    fig.yaxis.axis_label = f'Strand'

    # show the plot
    bokeh.plotting.show(fig)
plot_genes("2L")

When zoomed out to an entire chromosome arm like in the plot above, it is hard to see the individual genes. However, you can zoom in to any region by using the mouse wheel or the tool icons at the top right of the plot. If you hover over a gene, you’ll hopefully see its ID, as well as its name and description if it has them. If you click on a gene, you’ll be taken to the corresponding gene page on VectorBase.

If you’d like a challenge, try find the resistance to dieldrin (Rdl) gene, it’s on 2L and starts at around position 25.36 Mbp.

Transcripts

Genes are transcribed into mRNA, which means that the DNA sequence from the gene region is “read” and used as a template to build an mRNA molecule with an equivalent nucleotide sequence. For many genes, this process of transcription is followed by splicing, which means that parts of the sequence are cut out before the mRNA is translated into protein. The parts of the gene sequence that are cut out are called “introns”. The parts of the gene sequence that are kept and included in the final mRNA molecule are called “exons”.

Mostly the exons comprise DNA sequence that is translated into an amino acid sequence, and these protein-coding segments are called “CDSs”, which stands for coding sequences. Some genes also include “UTRs” which are untranslated regions at the start and end of the gene sequence. These UTRs are included in the exons but do not get translated into protein. A UTR at the start of the gene is called a “five prime UTR” and a UTR at the end of a gene is called a “three prime UTR”.

Let’s build some intuition by looking up the transcripts for the doublesex gene, AGAP004050.

df_geneset.query("Parent == 'AGAP004050'")
contig source type start end score strand phase Parent Name description
ID
AGAP004050-RA 2R VectorBase mRNA 48703664 48788460 NaN - NaN AGAP004050 NaN NaN
AGAP004050-RB 2R VectorBase mRNA 48703664 48788460 NaN - NaN AGAP004050 NaN NaN

Note that there are two rows here, each with type “mRNA”. This means that the doublesex gene has two mRNA transcripts. This is a complexity that arises for many genes, where the gene may be transcribed and then spliced in different ways. Each transcript may include different exons, or exons may vary in length. The purpose of alternative splicing is to allow the same gene to encode different proteins, which may perform different functions. In this case, the gene is part of the sex determination pathway, and which of these two transcripts is produced will determine whether a mosquito is male or female.

Let’s make a function to visualise these transcripts.

def plot_transcript(transcript, width=700, height=120, show=True, x_range=None,
                    toolbar_location='above'):

    # find the gene
    df_geneset = ag3.geneset().set_index("ID")
    parent  = df_geneset.loc[transcript]

    # define tooltips for hover
    tooltips = [
        ("Type", '@type'),
        ("Location", '@contig:@start{,}..@end{,}'),
    ]

    # make a figure
    fig = bokeh.plotting.figure(
        title=f'Transcript - {transcript} ({parent.strand})',
        plot_width=width, 
        plot_height=height,
        tools='xpan,xzoom_in,xzoom_out,xwheel_zoom,reset,hover',
        toolbar_location=toolbar_location,
        active_scroll='xwheel_zoom',
        active_drag='xpan',
        tooltips=tooltips,
        x_range=x_range,
    )

    # find child components of the transcript
    data = df_geneset.query(f"Parent == '{transcript}'").copy()
    data['left'] = data['start'] / 1e6  # plot in Mbp coordinates
    data['right'] = data['end'] / 1e6  # plot in Mbp coordinates
    data['bottom'] = -0.4
    data['top'] = 0.4

    # plot exons
    exons = data.query("type == 'exon'")
    fig.quad(bottom='bottom', top='top', left='left', right='right',
             source=exons, fill_color=None, line_color='black', line_width=.5, 
             fill_alpha=0)
    
    # plot introns
    for l, r in zip(exons[:-1]['right'], exons[1:]['left']):
        m = (l + r) / 2
        fig.line([l, m, r], [0, .1, 0], line_width=1, line_color="black")

    # plot UTRs
    fig.quad(bottom='bottom', top='top', left='left', right='right',
                source=data.query("type == 'five_prime_UTR'"), 
                fill_color='green', line_width=0, fill_alpha=.5)
    fig.quad(bottom='bottom', top='top', left='left', right='right',
                source=data.query("type == 'three_prime_UTR'"), 
                fill_color='red', line_width=0, fill_alpha=.5)

    # plot CDSs
    fig.quad(bottom='bottom', top='top', left='left', right='right',
             source=data.query("type == 'CDS'"), 
             fill_color='blue', line_width=0, fill_alpha=.5)

    fig.yaxis.ticker = []
    fig.xaxis.axis_label = f'Position (Mbp)'
    fig.y_range = bokeh.models.Range1d(-.6, .6)

    fig.xaxis.axis_label = f'Contig {parent.contig} position (Mbp)'

    # show the figure
    if show:
        bokeh.plotting.show(fig)

    return fig

Visualise transcripts for the doublesex gene (AGAP004050).

plot_transcript("AGAP004050-RA");
plot_transcript("AGAP004050-RB");

The boxes are exons and the connecting lines are introns. CDSs are coloured in blue, five prime UTRs coloured in green, and three prime UTRs coloured in red. You can hopefully see that AGAP004050-RB includes an exon which is missing in AGAP004050-RA - this difference in splicing is what determines gender. Note this gene is on the negative (-) strand, meaning it is read from right to left.

Let’s look at another example, the resistance to dieldrin (Rdl) gene AGAP006028.

df_geneset.query("Parent == 'AGAP006028'")
contig source type start end score strand phase Parent Name description
ID
AGAP006028-RA 2L VectorBase mRNA 25363652 25434556 NaN + NaN AGAP006028 NaN NaN
AGAP006028-RB 2L VectorBase mRNA 25363652 25434556 NaN + NaN AGAP006028 NaN NaN
AGAP006028-RC 2L VectorBase mRNA 25363652 25434556 NaN + NaN AGAP006028 NaN NaN
plot_transcript("AGAP006028-RA");
plot_transcript("AGAP006028-RB");
plot_transcript("AGAP006028-RC");

Here there are three transcripts. Some exons are present in some transcripts but not others. Note that there are no untranslated regions (UTRs) in any of the transcripts, this is the case for some genes.

OK, now we’ve introduced gene annotations, let’s look at SNP effects.

SNP effects

Once we have identified a gene of interest, and chosen which transcript to study, we can compute the effect that any given SNP will have on the resulting protein sequence. The effect will depend on two things. Firstly, what is the nucleotide change? Secondly, where does the SNP fall within the transcript?

If a SNP falls within a CDS then it may change the protein sequence. However, not all SNPs in CDSs will do so, because the genetic code has some redundancy, meaning that different DNA sequences can encode the same amino acid sequence. If a SNP in a CDS does change the amino acid sequence, it is called a “non-synonymous” or “missense” SNP. If the SNP does not change the amino acid sequence, it is called a “synonymous” SNP.

If a SNP falls within a UTR then it won’t have any effect on the amino acid sequence. It might still affect the level of transcription of the gene, but that is harder to predict.

If a SNP falls within an intron, again it won’t change the amino acid sequence. It might, however, affect how the gene is spliced, if the SNP falls close to an intron/exon boundary. Any change in splicing might have a big effect on the amino acid sequence, because it could result in parts of the amino acid sequence being added or removed.

Based on this information, the effect of any given SNP can be categorised based on the predicted impact on the protein structure. Here is a list of the main effects we’ll be interested in, grouped by their predicted impact.

  • HIGH impact - may significantly disrupt or alter protein function.

    • START_LOST - SNP causes loss of the start codon. The start codon in mRNA is the first to be translated by the ribosome. If the start codon is changed to a non-start codon, no translation will take place.

    • STOP_LOST - SNP causes loss of the start codon. The stop codon in mRNA signals the end of translation. If the variant causes the stop codon to be lost, elongated proteins will be formed.

    • STOP_GAINED - SNP causes a new stop codon to be formed, and the protein may be truncated prematurely.

    • SPLICE_CORE - The SNP occurs within 2 bp of an intron boundary. This can affect splicing and result in excluded exons or included introns, altering the protein sequence.

  • MODERATE impact - may change protein function.

    • NON_SYNONYMOUS_CODING - SNP is within a CDS and causes an amino acid change. This protein-altering variant is often what we are interested in when looking at insecticide resistance mutations such as kdr.

    • SPLICE_REGION - SNP is close to a splice site so may affect exon splicing.

  • LOW impact - unlikely to change protein function.

    • SYNONYMOUS_CODING - SNP is within a CDS but does not cause an amino acid change.

    • FIVE_PRIME_UTR - SNP is within the five prime untranslated region.

    • THREE_PRIME_UTR - SNP is within the three prime untranslated region.

  • MODIFIER - very unlikely to change protein function.

    • INTRONIC - SNP falls with an intron, not near any splice regions.

To compute SNP effects for a given gene transcript, we’ve added a snp_effects() method to the Ag3 Python API. Let’s see it in action, again using the Rdl gene (AGAP006028) as an example.

# Rdl gene, first transcript - change this to investigate a different gene or transcript
transcript = "AGAP006028-RA"

# compute effects for all SNPs in chosen transcript
df_effects = ag3.snp_effects(
    transcript=transcript, 
)
df_effects
contig position ref_allele alt_allele pass_gamb_colu_arab pass_gamb_colu pass_arab effect impact ref_codon alt_codon aa_pos ref_aa alt_aa aa_change
0 2L 25363652 A C True True True START_LOST HIGH Atg Ctg 1.0 M L M1L
1 2L 25363652 A T True True True START_LOST HIGH Atg Ttg 1.0 M L M1L
2 2L 25363652 A G True True True START_LOST HIGH Atg Gtg 1.0 M V M1V
3 2L 25363653 T A True True True NON_SYNONYMOUS_CODING MODERATE aTg aAg 1.0 M K M1K
4 2L 25363653 T C True True True NON_SYNONYMOUS_CODING MODERATE aTg aCg 1.0 M T M1T
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
212710 2L 25434555 A T True True True STOP_LOST HIGH tAa tTa 556.0 * L *556L
212711 2L 25434555 A G True True True SYNONYMOUS_CODING LOW tAa tGa 556.0 * * *556*
212712 2L 25434556 A C True True True STOP_LOST HIGH taA taC 556.0 * Y *556Y
212713 2L 25434556 A T True True True STOP_LOST HIGH taA taT 556.0 * Y *556Y
212714 2L 25434556 A G True True True SYNONYMOUS_CODING LOW taA taG 556.0 * * *556*

212715 rows × 15 columns

The snp_effects() method requires a transcript parameter, and here we used “AGAP006028-RA”, which is the first transcript listed in the geneset for the Rdl gene. Recall that there are two other transcripts for this gene, and the effects may be different for different transcripts.

The data frame that is returned contains the SNP effects for every alternative allele at every position within the transcript. The effect and impact columns provide information about the likely effect of the SNP. If the SNP is non-synonymous, the aa_change column gives the amino acid substitution.

Using pandas, we can take a closer look at features of this data frame, like the counts of different effects.

df_effects.groupby(['impact', 'effect']).size()
impact    effect               
HIGH      SPLICE_CORE                  96
          START_LOST                    3
          STOP_GAINED                 197
          STOP_LOST                     7
LOW       SYNONYMOUS_CODING          1163
MODERATE  NON_SYNONYMOUS_CODING      3634
          SPLICE_REGION               240
MODIFIER  INTRONIC                 207375
dtype: int64

Note that there are SNPs here that could alter the protein function. However, it is important to note that not all of the SNPs will have been observed in the Ag3 dataset. In other words, we also need some information about the allele frequencies in order to determine if any of these SNPs are present in natural mosquito populations.

SNP allele frequencies

Here, our aim is to discover any SNPs in our gene of interest that (1) are present in one or more mosquito populations, and (2) may affect protein function. For example, if we are studying an insecticide resistance gene like Rdl or Vgsc, we want to discover non-synonymous SNPs that might be causing insecticide resistance.

To calculate allele frequencies, use the snp_allele_frequencies() method. First, we need to define some cohorts of interest. Here, a “cohort” is any group of sampled mosquitoes. These cohorts can be defined by querying the sample metadata.

For example, let’s define four cohorts.

cohorts = {
    "ug_nago_2012_gamb": "country == 'Uganda' and location == 'Nagongera' and year == 2012 and species == 'gambiae'",
    "bf_2012_gamb": "country == 'Burkina Faso' and year == 2012 and species == 'gambiae'",
    "bf_2012_colu": "country == 'Burkina Faso' and year == 2012 and species == 'coluzzii'",
    "bf_2014_gamb": "country == 'Burkina Faso' and year == 2014 and species == 'gambiae'",
    "bf_2014_colu": "country == 'Burkina Faso' and year == 2014 and species == 'coluzzii'",
}

These queries use the pandas query syntax. Here’s the full list of columns you can query on.

df_samples = ag3.sample_metadata()
df_samples.columns
Index(['sample_id', 'partner_sample_id', 'contributor', 'country', 'location',
       'year', 'month', 'latitude', 'longitude', 'sex_call', 'sample_set',
       'release', 'aim_fraction_colu', 'aim_fraction_arab',
       'species_gambcolu_arabiensis', 'species_gambiae_coluzzii', 'species'],
      dtype='object')

Now compute allele frequencies for these populations in our gene of interest (may take a minute or two).

df_af = ag3.snp_allele_frequencies(
    transcript=transcript, 
    cohorts=cohorts, 
)
df_af
contig position ref_allele alt_allele pass_gamb_colu_arab pass_gamb_colu pass_arab ug_nago_2012_gamb bf_2012_gamb bf_2012_colu bf_2014_gamb bf_2014_colu max_af
0 2L 25363656 C T True True True 0.000000 0.005102 0.000000 0.000000 0.0 0.005102
1 2L 25363659 T A True True True 0.008929 0.000000 0.000000 0.000000 0.0 0.008929
2 2L 25363670 G A True True True 0.013393 0.000000 0.000000 0.000000 0.0 0.013393
3 2L 25363674 C A True True True 0.000000 0.000000 0.006098 0.000000 0.0 0.006098
4 2L 25363681 G A True True True 0.000000 0.005102 0.000000 0.000000 0.0 0.005102
... ... ... ... ... ... ... ... ... ... ... ... ... ...
32519 2L 25434508 C T True True True 0.000000 0.010204 0.000000 0.000000 0.0 0.010204
32520 2L 25434523 C A True True True 0.000000 0.137755 0.000000 0.152174 0.0 0.152174
32521 2L 25434526 C T True True True 0.004464 0.020408 0.000000 0.010870 0.0 0.020408
32522 2L 25434532 G A True True True 0.008929 0.000000 0.000000 0.000000 0.0 0.008929
32523 2L 25434550 G A True True True 0.004464 0.000000 0.000000 0.000000 0.0 0.004464

32524 rows × 13 columns

The allele frequencies are given in columns named after the populations. E.g., the “bf_2012_gamb” column has allele frequencies for the Burkina Faso 2012 An. gambiae samples. The allele frequency is given as a fraction, e.g., 1.0 means all samples carry the alternate allele, 0.5 means half the samples carry the alternate allele, and 0 means all samples carry the reference allele.

OK, now we’re ready to analyse our gene of interest for important SNPs.

Analysis

We can use the SNP effects and allele frequencies we have calculated to investigate the genetic variation present in our gene and populations of interest. To make this analysis easier, first we will merge the effect and allele frequency data frames.

df_snps = pd.merge(df_effects, df_af)
df_snps
contig position ref_allele alt_allele pass_gamb_colu_arab pass_gamb_colu pass_arab effect impact ref_codon alt_codon aa_pos ref_aa alt_aa aa_change ug_nago_2012_gamb bf_2012_gamb bf_2012_colu bf_2014_gamb bf_2014_colu max_af
0 2L 25363656 C T True True True NON_SYNONYMOUS_CODING MODERATE tCg tTg 2.0 S L S2L 0.000000 0.005102 0.000000 0.000000 0.0 0.005102
1 2L 25363659 T A True True True NON_SYNONYMOUS_CODING MODERATE cTa cAa 3.0 L Q L3Q 0.008929 0.000000 0.000000 0.000000 0.0 0.008929
2 2L 25363670 G A True True True NON_SYNONYMOUS_CODING MODERATE Gtt Att 7.0 V I V7I 0.013393 0.000000 0.000000 0.000000 0.0 0.013393
3 2L 25363674 C A True True True NON_SYNONYMOUS_CODING MODERATE cCg cAg 8.0 P Q P8Q 0.000000 0.000000 0.006098 0.000000 0.0 0.006098
4 2L 25363681 G A True True True SYNONYMOUS_CODING LOW gcG gcA 10.0 A A A10A 0.000000 0.005102 0.000000 0.000000 0.0 0.005102
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32519 2L 25434508 C T True True True SYNONYMOUS_CODING LOW gtC gtT 540.0 V V V540V 0.000000 0.010204 0.000000 0.000000 0.0 0.010204
32520 2L 25434523 C A True True True SYNONYMOUS_CODING LOW gcC gcA 545.0 A A A545A 0.000000 0.137755 0.000000 0.152174 0.0 0.152174
32521 2L 25434526 C T True True True SYNONYMOUS_CODING LOW gaC gaT 546.0 D D D546D 0.004464 0.020408 0.000000 0.010870 0.0 0.020408
32522 2L 25434532 G A True True True SYNONYMOUS_CODING LOW ctG ctA 548.0 L L L548L 0.008929 0.000000 0.000000 0.000000 0.0 0.008929
32523 2L 25434550 G A True True True SYNONYMOUS_CODING LOW gaG gaA 554.0 E E E554E 0.004464 0.000000 0.000000 0.000000 0.0 0.004464

32524 rows × 21 columns

We can now filter the data frame, e.g., below we keep only non-synonymous SNPs. We can also remove any low frequency SNPs, where there is an allele frequency of less than 2% across all populations of interest.

df_snps_filtered = df_snps.query("effect == 'NON_SYNONYMOUS_CODING' and max_af > 0.02")
df_snps_filtered
contig position ref_allele alt_allele pass_gamb_colu_arab pass_gamb_colu pass_arab effect impact ref_codon alt_codon aa_pos ref_aa alt_aa aa_change ug_nago_2012_gamb bf_2012_gamb bf_2012_colu bf_2014_gamb bf_2014_colu max_af
6 2L 25363685 A T True True True NON_SYNONYMOUS_CODING MODERATE Agc Tgc 12.0 S C S12C 0.0 0.000000 0.048780 0.000000 0.037736 0.048780
11442 2L 25387104 A C True True True NON_SYNONYMOUS_CODING MODERATE gaA gaC 79.0 E D E79D 0.0 0.000000 0.024390 0.000000 0.028302 0.028302
30053 2L 25429235 G T True True True NON_SYNONYMOUS_CODING MODERATE Gca Tca 296.0 A S A296S 0.0 0.000000 0.634146 0.000000 0.443396 0.634146
30054 2L 25429236 C G True True True NON_SYNONYMOUS_CODING MODERATE gCa gGa 296.0 A G A296G 0.0 0.408163 0.006098 0.347826 0.000000 0.408163
31615 2L 25433170 A T False False False NON_SYNONYMOUS_CODING MODERATE Acg Tcg 345.0 T S T345S 0.0 0.000000 0.585366 0.000000 0.386792 0.585366
31616 2L 25433171 C T True True True NON_SYNONYMOUS_CODING MODERATE aCg aTg 345.0 T M T345M 0.0 0.403061 0.006098 0.358696 0.000000 0.403061
31621 2L 25433207 A T True True True NON_SYNONYMOUS_CODING MODERATE aAg aTg 357.0 K M K357M 0.0 0.000000 0.012195 0.000000 0.028302 0.028302
32514 2L 25434478 T A True True True NON_SYNONYMOUS_CODING MODERATE aaT aaA 530.0 N K N530K 0.0 0.081633 0.000000 0.076087 0.000000 0.081633
32518 2L 25434505 C G True True True NON_SYNONYMOUS_CODING MODERATE caC caG 539.0 H Q H539Q 0.0 0.147959 0.000000 0.152174 0.000000 0.152174

We can also make a plot to help visualise these data.

def plot_snps(transcript, data, width=750, height=300, palette='Category10'):

    # hover tooltips
    tooltips = [
        ("position", '@contig:@position{,}'),
        ("alleles", '@ref_allele>@alt_allele'),
        ("pass", "@pass_gamb_colu_arab, @pass_gamb_colu, @pass_arab"),
        ("impact", '@impact'),
        ("effect", '@effect'),
        ("aa_change", '@aa_change'),
        ("frequency", '@frequency{%f} (@cohort)'),
    ]

    fig1 = bokeh.plotting.figure(
        title=f'Transcript - {transcript}',
        tools='xpan,xzoom_in,xzoom_out,xwheel_zoom,reset,hover',
        active_scroll='xwheel_zoom',
        active_drag='xpan',
        plot_width=width, 
        plot_height=height, 
        tooltips=tooltips,
        toolbar_location="above")

    # set up colors
    palette = bokeh.palettes.all_palettes[palette]
    colors = palette[len(cohorts)]

    # plot allele frequencies
    for coh, color in zip(cohorts, colors):
        df = data.copy()
        # add X coordinate in Mbp
        df['x'] = df['position'] / 1e6
        df['frequency'] = df[coh]
        df['cohort'] = coh
        fig1.triangle("x", coh, 
                      size=8, 
                      color=color,
                      source=df,
                      legend_label=coh)

    # tidy up the plot
    fig1.y_range = bokeh.models.Range1d(0, 1)
    fig1.yaxis.axis_label = f'Alt allele frequency'
    fig1.xaxis.visible = False
    fig1.add_layout(fig1.legend[0], 'right')
    fig1.legend.click_policy="hide"

    # plot transcript
    fig2 = plot_transcript(transcript, width=width, height=80, show=False, 
                           x_range=fig1.x_range)
    fig2.toolbar.logo = None 
    fig2.toolbar_location = None
    fig2.title = None

    bokeh.plotting.show(bokeh.layouts.column(fig1, fig2))
plot_snps(transcript, data=df_snps_filtered)

A few points to note about this plot:

  • Each triangular marker on the plot shows the frequency of a SNP allele in one of the five cohorts we selected, and its position on the X axis shows its location within the Rdl gene.

  • Zooming in using the mouse wheel or plot toolbar we can see that SNPs only occur within CDSs. This is because we only selected SNPs with a NON_SYNONYMOUS_CODING effect.

  • Hovering the mouse pointer over one of the markers gives detailed information about that SNP, including frequency, position and amino acid change.

  • If overplotting becomes an issue due to SNPs at the same position and frequency in different cohorts, individual cohorts be turned on and off by clicking the colour legend.

  • By zooming in to around 25.429Mbp we can see the Burkina Faso An. gambiae cohorts carry a known insecticide resistance SNP A296G, conferring resistance to dieldrin. In 2012 it was at a frequency of 41%, falling to 35% in 2014. Nearby, the T345M variant can be seen at very similar frequencies in these cohorts.

  • In the Burkina Faso An. coluzzii cohorts, a different SNP A296S is at high frequency, present at 63% in 2012 and 44% in 2012. This SNP also confers dieldrin resistance. These cohorts also carry a T345S SNP at similar frequencies.

The overall picture here appears to be that there are two combinations of resistance alleles. The combination A296G+T345M is common in Burkina Faso An. gambiae, whereas the combination A296S+T345S is common in An. coluzzii. Further details of these resistance SNPs can be found in Grau-Bové et al. (2020).

The presence of these resistance alleles is interesting for two reasons.

Firstly, the tight linkage of SNPs in codons 296 and 345 strongly suggests that these amino acids have an important association within the protein structure, and that mutations in both codons are required to confer resistance.

Secondly, dieldrin hasn’t been used since the 1970s due to it’s high environmental persistance, and so the presence of these SNPs at relatively high frequency is surprising. In general, we expect that resistance alleles will disappear from mosquito populations over time, once the insecticides are no longer used. The temporal data from Burkina Faso suggests these alleles may be falling in frequency, but it is perhaps surprising that they had not already fallen to lower levels.

One final technical point, the T345S SNP fails our site filters. In general it is a good idea to only study SNPs that pass site filters, but site filters can in some cases be too conservative, and this may be the case here.

Further reading

Hopefully this notebook has been a useful introduction to analysing SNPs in genes. Don’t forget you can use this notebook to analyse any gene of interest. If you have any comments, questions or suggestions, please feel free to get in touch.

This notebook is part of the Ag1000G phase 3 user guide. See the menu at the left for more documentation and example analyses.

If you’re interested to read more about the Rdl gene, check out:

  • Grau-Bové et al. (2020) Evolution of the Insecticide Target Rdl in African Anopheles Is Driven by Interspecific and Interkaryotypic Introgression

API docs

Here’s the API docs for the functions we used in this notebook.

help(ag3.geneset)
Help on method geneset in module malariagen_data.ag3:

geneset(attributes=('ID', 'Parent', 'Name', 'description')) method of malariagen_data.ag3.Ag3 instance
    Access genome feature annotations (AgamP4.12).
    
    Parameters
    ----------
    attributes : list of str, optional
        Attribute keys to unpack into columns. Provide "*" to unpack all attributes.
    
    Returns
    -------
    df : pandas.DataFrame
help(ag3.snp_effects)
Help on method snp_effects in module malariagen_data.ag3:

snp_effects(transcript, site_mask=None, site_filters='dt_20200416') method of malariagen_data.ag3.Ag3 instance
    Compute variant effects for a gene transcript.
    
    Parameters
    ----------
    transcript : str
        Gene transcript ID (AgamP4.12), e.g., "AGAP004707-RA".
    site_mask : {"gamb_colu_arab", "gamb_colu", "arab"}, optional
        Site filters mask to apply.
    site_filters : str, optional
        Site filters analysis version.
    
    Returns
    -------
    df : pandas.DataFrame
help(ag3.snp_allele_frequencies)
Help on method snp_allele_frequencies in module malariagen_data.ag3:

snp_allele_frequencies(transcript, cohorts, site_mask=None, site_filters='dt_20200416', species_calls=('20200422', 'aim'), sample_sets='v3_wild', drop_invariant=True) method of malariagen_data.ag3.Ag3 instance
    Compute per variant allele frequencies for a gene transcript.
    
    Parameters
    ----------
    transcript : str
        Gene transcript ID (AgamP4.12), e.g., "AGAP004707-RA".
    cohorts : dict
        Dictionary to map cohort IDs to sample queries, e.g.,
        {"bf_2012_col": "country == 'Burkina Faso' and year == 2012 and species == 'coluzzii'"}
    site_mask : {"gamb_colu_arab", "gamb_colu", "arab"}
        Site filters mask to apply.
    site_filters : str, optional
        Site filters analysis version.
    species_calls : (str, str), optional
        Include species calls in metadata.
    sample_sets : str or list of str, optional
        Can be a sample set identifier (e.g., "AG1000G-AO") or a list of sample set
        identifiers (e.g., ["AG1000G-BF-A", "AG1000G-BF-B"]) or a release identifier (e.g.,
        "v3") or a list of release identifiers.
    drop_invariant : bool, optional
        If True, variants with no alternate allele calls in any cohorts are dropped from
        the result.
    
    Returns
    -------
    df_snps : pandas.DataFrame