Single nucleotide polymorphisms (SNPs)

This notebook looks in more detail at how single nucleotide polymorphisms (SNPs) are represented in the Ag1000G phase 3 (Ag3) data release. We’ll also look at the different site filters that we’ve created to help select high quality SNPs for robust population genetic analyses. As always, this notebook is executable, click the launch icon () at the top of the page and select a cloud computing service if you’d like to try running it for yourself.


Install some packages. (Skip this step if running on MyBinder, these will be already installed.)

!pip install -q numpy pandas dask matplotlib seaborn
!pip install -q malariagen_data
!pip install -q scikit-allel  # this will take about a minute or so to compile 

Import the packages we’ll need.

from collections import Counter
from functools import lru_cache
import numpy as np
import pandas as pd
import dask
import dask.array as da
# silence some dask warnings
dask.config.set(**{'array.slicing.split_large_chunks': False})
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import allel
from dask.diagnostics.progress import ProgressBar
import malariagen_data
%matplotlib inline

Set up access to the Ag3 data.

ag3 = malariagen_data.Ag3("gs://vo_agam_release/")

Preface: navigating the reference genome

In case you’re not already familiar with the Anopheles gambiae genome, here’s a quick primer on how it’s structured and how we refer to specific sites (positions) within the genome.

For Ag1000G analyses we use the AgamP4 reference genome. This genome is comprised of five main sequences:

contigs = "2R", "2L", "3R", "3L", "X"

Each of these identifiers (e.g., “2R”) refers to a sequence of nucleotides, i.e., a sequence of the letters “A”, “C”, “T” and “G”. Let’s see how long these sequences are.

for contig in contigs:
    seq = ag3.genome_sequence(contig)
    print(f"{contig}: {len(seq):,} bp")
2R: 61,545,105 bp
2L: 49,364,325 bp
3R: 53,200,684 bp
3L: 41,963,435 bp
X: 24,393,108 bp

Let’s take a peek at the first 100 letters in one of the sequences (2R).

seq = ag3.genome_sequence("2R")

Here “bp” is short for base pair, which is the unit of length of nucleotide sequences. I.e., if a sequence is 10 bp then it is 10 nucleotides (letters) long.

In the case of Anopheles gambiae, these sequences correspond to chromosome arms. “2R” and “2L” are the two parts (arms) of Chromosome 2. “3R” and “3L” are the arms of Chromosome 3. “X” is the X chromosome. The X chromosome is a sex chromosome, present in two copies in females and one copy in males. Chromosomes 2 and 3 are autosomes, meaning that both males and females carry two copies.

To refer to a specific site (position) within a sequence, we typically use 1-based coordinates. E.g., “2R:10” refers to the tenth nucleotide (letter) in the 2R sequence. We can access this nucleotide via Python, but when indexing in Python we need to remember to use zero-based indices. E.g., access the nucleotide at position 10 on chromosome arm 2R:


This nucleotide is a “T” (thymine).

Note that there are also some “gaps” in the reference genome, where we don’t know what the nucleotide is. These are represented in the sequence with the letter “N”. Let’s take a look at how common these are, for chromosome arm 2R:

def count_nucleotides(contig):
    seq = ag3.genome_sequence(contig).compute()
    return Counter(seq).most_common()
[(b'T', 14379951),
 (b'A', 14377315),
 (b'C', 12197820),
 (b'G', 12183243),
 (b't', 2077134),
 (b'a', 2060370),
 (b'g', 1430905),
 (b'c', 1425715),
 (b'n', 1412612),
 (b'N', 40)]

There are two things to notice here. First, there are both upper case and lower case letters. For most purposes you can ignore this, e.g., both “T” and “t” mean the same thing. Second, there are 1,412,652 nucleotides which are unknown (either “N” or “n”).

To gain a bit more intuition for how the genome is structured and the relative sizes of the different sequences, let’s make a plot of the five chromosome arms, summarising the relative abundance of the different nucleotides.

def plot_sequence_composition(seq_id, window_size=100_000, ax=None):
    # load reference sequence
    seq = ag3.genome_sequence(seq_id).compute()
    if ax is None:
        # make the figure size relative to largest contig
        figw = 11 * len(seq) / len(ag3.genome_sequence("2R"))
        fig, ax = plt.subplots(figsize=(figw, 1))
    # convert to upper-case
    seq = np.char.upper(seq)

    # locate nucleotides
    is_a = seq == b'A'
    is_c = seq == b'C'
    is_g = seq == b'G'
    is_t = seq == b'T'
    is_n = seq == b'N'
    # check there's nothing unexpected
    is_other = ~is_a & ~is_c & ~is_g & ~is_t & ~is_n
    assert np.sum(is_other) == 0

    # construct windows
    bins = np.arange(0, len(seq), window_size)

    # count nucleotides in each window
    h_a, _ = np.histogram(np.nonzero(is_a)[0] + 1, bins=bins)
    h_c, _ = np.histogram(np.nonzero(is_c)[0] + 1, bins=bins)
    h_g, _ = np.histogram(np.nonzero(is_g)[0] + 1, bins=bins)
    h_t, _ = np.histogram(np.nonzero(is_t)[0] + 1, bins=bins)
    h_n, _ = np.histogram(np.nonzero(is_n)[0] + 1, bins=bins)

    # plot frequence of nucleotides within each bin
    left = bins[:-1]
    bottom = 0
    width = np.diff(bins)
    palette = sns.color_palette('colorblind')
    colors = [palette[i] for i in [2, 0, 3, 8]] + ['k']
    for h, c, l in zip([h_a, h_t, h_g, h_c, h_n], colors, 'ATGCN'):, h, width=width, bottom=bottom, color=c, align='edge', label=l)
        bottom += h
    # tidy up plot
    ax.set_xlim(0, len(seq))
    ax.set_yticklabels(['0%', '100%'])
    # convert X axis to Mbp
    ax.xaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: x//1e6))
    ax.set_xlabel("Position (Mbp)")
    # add centromere/telomere annotation
    if seq_id in {'2L', '3L'}:
        ltxt = "centromere"
        rtxt = "telomere"
        ltxt = "telomere"
        rtxt = "centromere"
    ax.annotate(ltxt, xy=(0, 1), xycoords="axes fraction", 
                xytext=(0, 2), textcoords='offset points', va='bottom', ha='left')
    ax.annotate(rtxt, xy=(1, 1), xycoords="axes fraction", 
                xytext=(0, 2), textcoords='offset points', va='bottom', ha='right')
    # add legend
    ax.legend(loc='center left', bbox_to_anchor=(1, .5), 
              prop=dict(family='monospace'), ncol=2,

A few points to note about the plots above:

  1. All of the chromosome arms are made up of A, C, T and G in roughly equal proportion, although A and T are a little more common than C and G.

  2. We annotated each chromosome arm with “centromere” and “telomere”. You probably don’t need to know what these mean, but it is worth knowing that the sequences at these two different ends of the chromosome tend to be quite different, especially towards the centromere.

  3. Gaps (N) are relatively uncommon, except towards the centromere of all chromosome arms where they are more common. This is typically because sequences are more repetitive towards the centromeres and so assembling the genome is more difficult.

  4. We made the size of each figure proportional to the length of the sequence. This was to help give a feeling for the relative size of the different chromosome arms.

  5. We converted the X-axis coordinates to “Mbp” which means mega-base pairs, i.e., millions of letters.

Now we’re acquainted with the reference genome, let’s look at nucleotide variation.

SNP positions and alleles

When studying genetic variation in natural populations, there are two fundamental questions:

  1. Where in the genome do we observe differences (i.e., nucleotide variation) between individual mosquitoes.

  2. Where in the genome do we observe no differences (i.e., nucleotide conservation) between individuals.

Sometimes this second question is overlooked, but both questions are important.

In Ag1000G phase 3 we have taken a slightly different approach to calling single nucleotide polymorphisms (SNPs - single letter differences) than in previous project phases. Because we are interested in both of the above questions, we have genotyped all individual mosquitoes at all genomic sites where the reference genotype nucleotide is known (i.e., not an “N”), regardless of whether any variation is present. This is sometimes known as an “all sites” analysis.

In order to simplify SNP calling across a large number of individuals, we also declared the set of all possible alternate (variant) SNP alleles ahead of time. Because we are analysing SNPs and we are not considering any other types of variation at this stage, then at each site in the genome there is one reference allele (i.e., the nucleotide in the reference genome) and three possible alternate alleles. E.g., at position 10 on chromosome arm 2R, the reference allele is “T”, and so there are three possible alternate alleles: “A”, “C” and “G”.

Let’s look at how these data are represented for chromosome arm 2R:

pos, ref, alt = ag3.snp_sites("2R")

The pos array here holds the 1-based positions (coordinates) where the SNPs were called. E.g., here’s the first 10 SNP positions:

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=int32)

Note that this is simply the positions 1-10. However, these positions do not continue increasing by 1 all the way through the array, i.e., there are gaps, because of gaps in the reference genome. We can quickly visualise that, by plotting the distance between adjacent sites:

def plot_gaps():

    p = pos.compute()
    d = np.diff(p)
    loc_gap = d > 1
    x = p[1:][loc_gap]
    y = d[loc_gap]
    fig, ax = plt.subplots(figsize=(11, 2))
    ax.plot(x, y, color='black', linewidth=.5, marker='o', linestyle=' ', markersize=2, mfc='none')
    # convert X axis to Mbp
    ax.xaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: x//1e6))
    ax.set_xlabel("Position (Mbp)")
    ax.set_ylabel('Gap size (bp)')

Note also that the length of the pos array is less than the length of the reference genome sequence:

len(seq) - len(pos)

This difference is the number of unknown nucleotides (“N”s) in the reference genome that we computed earlier. This is another way of confirming that we have called SNP genotypes at all genomic sites where the reference genome nucleotide is known.

The ref array holds the reference alleles, i.e., the reference genome nucleotide for each SNP site. E.g., here’s the reference alleles at the first 10 SNP sites:

array([b'C', b'T', b'C', b'T', b'A', b'A', b'A', b'C', b'A', b'T'],

We can confirm that this corresponds to the nucleotides in the reference sequence by using the SNP positions to index the sequence (remembering to convert to zero-based coordinates):

seq[pos[:10] - 1].compute()
array([b'C', b'T', b'c', b't', b'a', b'a', b'a', b'c', b'a', b't'],

The alt array holds the predefined alternate alleles at each site. E.g., here are the alternate alleles for the first 10 SNP sites:

array([[b'A', b'T', b'G'],
       [b'A', b'C', b'G'],
       [b'A', b'T', b'G'],
       [b'A', b'C', b'G'],
       [b'C', b'T', b'G'],
       [b'C', b'T', b'G'],
       [b'C', b'T', b'G'],
       [b'A', b'T', b'G'],
       [b'C', b'T', b'G'],
       [b'A', b'C', b'G']], dtype='|S1')

Each row represents a site, and the columns show the three possible alternate alleles.

The same applies if you are working with the data in VCF format. E.g., here’s the first 10 rows of the VCF file with SNP genotypes for sample AR0001-C (see the download guide for how to get this file), note the three alternate (ALT) alleles at every site:

2R	1	.	C	A,T,G	0.08	.	ExcessHet=3.0103;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=14.72	GT:AD:DP:GQ:PL	0/0:5,0,0,0:5:6:0,6,52,6,52,52,6,52,52,52
2R	2	.	T	A,C,G	0.08	.	ExcessHet=3.0103;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=13.44	GT:AD:DP:GQ:PL	0/0:6,0,0,0:6:6:0,6,53,6,53,53,6,53,53,53
2R	3	.	C	A,T,G	0.08	.	ExcessHet=3.0103;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=12.44	GT:AD:DP:GQ:PL	0/0:7,0,0,0:7:6:0,6,53,6,53,53,6,53,53,53
2R	4	.	T	A,C,G	0.08	.	ExcessHet=3.0103;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=12.44	GT:AD:DP:GQ:PL	0/0:7,0,0,0:7:6:0,6,53,6,53,53,6,53,53,53
2R	5	.	A	C,T,G	0.08	.	ExcessHet=3.0103;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=22.81	GT:AD:DP:GQ:PL	0/0:9,0,0,0:9:6:0,6,53,6,53,53,6,53,53,53
2R	6	.	A	C,T,G	0.02	.	ExcessHet=3.0103;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=28.78	GT:AD:DP:GQ:PL	0/0:10,0,0,0:10:12:0,12,113,12,113,113,12,113,113,113
2R	7	.	A	C,T,G	0.02	.	ExcessHet=3.0103;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=27.68	GT:AD:DP:GQ:PL	0/0:11,0,0,0:11:12:0,12,118,12,118,118,12,118,118,118
2R	8	.	C	A,T,G	0.02	.	ExcessHet=3.0103;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=27.68	GT:AD:DP:GQ:PL	0/0:11,0,0,0:11:12:0,12,130,12,130,130,12,130,130,130
2R	9	.	A	C,T,G	0.01	.	ExcessHet=3.0103;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=31.66	GT:AD:DP:GQ:PL	0/0:12,0,0,0:12:15:0,15,156,15,156,156,15,156,156,156
2R	10	.	T	A,C,G	0.01	.	ExcessHet=3.0103;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=34.67	GT:AD:DP:GQ:PL	0/0:13,0,0,0:13:18:0,18,181,18,181,181,18,181,181,181

The most important thing to note about the SNP data in Ag3 is that not all of these alternate alleles will have been observed in the mosquitoes we sequenced.

To determine which alternate alleles are observed in Ag3, we need to look at the SNP genotypes. Let’s take a look at how SNP genotypes are represented, then return to the question of which alternate alleles have been observed.


Here a SNP “genotype” means the pair of nucleotides (alleles) carried by an individual mosquito at a specific site in the genome. Note that this is a pair of alleles because mosquitoes are diploid like us, meaning that they carry two copies of their genome, one inherited from each parent (except for males on the X chromosome, which inherit one copy from their mother only).

Rather than storing the actual nucleotides for each genotype, it is conventional to use a numerical encoding of genotypes, where 0 stands for the reference allele, 1 stands for the first alternate allele, 2 stands for the second alternate allele, 3 stands for the third alternate allele, etc.

E.g., at site 2R:10 the reference allele is “T”. If an individual mosquito carries a homozygous reference genotype (both genome copies have a “T”), we represent this in Python as the pair of integers (0, 0). In a VCF file this is represented as the string “0/0”.

At 2R:10 we have arbitrarily declared that “A” is the first alternate allele. If an individual carries a heterozygous genotype, where one genome copy has “T” and the other has “A”, this is represented as the pair (0, 1). The ordering of this pair has no meaning.

Similarly, at 2R:10 we have arbitrarily declared that “C” is the second alternate allele and “G” is the third alternate allele. So, a genotype homozygous for the “C” allele would be represented as (2, 2), a genotype homozygous for “G” would be (3, 3) and a genotype heterozygous for the “C” and “G” alleles would be (2, 3).

There’s one more thing to know. Sometimes a genotype is called as missing, meaning that there were no sequence reads for the given individual at a given site. In Python we represent this as a pair of negative numbers (-1, -1). In VCF this is represented as the string “./.”.

Here are those example genotypes again, shown as a table with the different representations:

Reference allele Alternate alleles Genotype Genotype (Python) Genotype (VCF) Description
T A,C,G T/T (0, 0) 0/0 homozygous reference
T/A (0, 1) 0/1 heterozygous
A/A (1, 1) 1/1 homozygous alternate
C/C (2, 2) 2/2 homozygous alternate
G/G (3, 3) 3/3 homozygous alternate
C/G (2, 3) 2/3 heterozygous
missing (-1, -1) ./. missing

To see this in action, let’s access the SNP genotypes for chromosome arm 2R for all wild samples in Ag3.

gt = ag3.snp_genotypes(contig="2R", sample_sets="v3")
(60132453, 3081, 2)

The gt array has three dimensions. The first dimension corresponds to the genomic sites we’ve analysed (in this case, there are 60,132,453 sites on chromosome arm 2R). The second dimension (3,081) is the number of individual mosquitoes (a.k.a. samples) sequenced. The third dimension (2) represents the two alleles in each individual genotype.

So, e.g., let’s look up the genotype call at the 10th site in the 1st sample:

gt[9, 0].compute()
array([0, 0], dtype=int8)

This is a homozygous reference genotype.

Let’s try another, the 10th site and the 2,393rd sample.

gt[9, 2392].compute()
array([0, 2], dtype=int8)

This is a heterozygous genotype, where the reference allele and the second alternate allele are both present.

Let’s look at one more genotype, the 10th site and the 249th sample.

gt[9, 248].compute()
array([-1, -1], dtype=int8)

This is a missing genotype call.

When working with genotype arrays it can be useful to use the scikit-allel package. Let’s wrap the gt variable, which is a general purpose dask array, with the scikit-allel GenotypeDaskArray class, which provides some useful functionality for analysing genotype data.

gt = allel.GenotypeDaskArray(gt)
<GenotypeDaskArray shape=(60132453, 3081, 2) dtype=int8>

This wrapper class provided a display showing a preview of genotype calls for the first and last sites and samples, using the VCF-style representation of genotype calls (e.g., “0/0” is homozygous reference, etc.).

Once we have access to the genotype calls, we can run a variety of different computations. A common computation needed for many analyses is an allele count, which we’ll look at next.

Allele counts

An allele count involves scanning across all the genotypes at a given site and counting the number of times each allele is observed. I.e., how many 0s, 1s, 2s and 3s are observed. Let’s do an allele count for the first 10 sites on chromosome arm 2R.

ac = gt[:10].count_alleles(max_allele=3).compute()
<AlleleCountsArray shape=(10, 4) dtype=int32>
05871 2 1 0
15930 0 5 1
25981 7 2 0
36033 1 2 0
46050 0 0 0
56070 1 0 3
66080 4 0 0
76073 2 6 15
86118 0 1 1
96120 1 4 1

In the ac array above, each row represents a site, and the four columns represent the four alleles. The values are the number of times each allele was observed at each site. So, now we can see which alleles have been observed in the Ag3 data.

E.g., for the 10th site, the reference allele (0) was observed 6120 times, the first alternate allele (1) was observed once, the second alternate allele (2) was observed four times, and the third alternate allele (3) was observed once. So, this is a site where in fact all four possible alleles (nucleotides) were observed in Ag3 samples. This is an example of a multiallelic site.

Let’s look at some other sites.

At the fifth site, only the reference allele was observed. This is an example of an invariant site. I.e., there is no variation there, all samples were homozygous for the reference allele or had a missing genotype call.

At the seventh site, two alleles were observed, the reference allele and the first alternate allele. This is an example of a biallelic site. Note that biallelic sites might involve any pair of alleles, not always 0 and 1.

Let’s use this allele counts array to compute some properties of these 10 sites.

# how many alleles were observed
array([3, 3, 3, 3, 1, 3, 2, 4, 3, 4])
# whether any alternate alleles were observed
array([ True,  True,  True,  True, False,  True,  True,  True,  True,
# whether more than one allele was observed
array([ True,  True,  True,  True, False,  True,  True,  True,  True,
# whether exactly two alleles were observed
array([False, False, False, False, False, False,  True, False, False,
# total number of alleles called
an_called = ac.sum(axis=1)
array([5874, 5936, 5990, 6036, 6050, 6074, 6084, 6096, 6120, 6126])
# total number of missing allele calls
an_missing = (gt.shape[1] * 2) - an_called
array([288, 226, 172, 126, 112,  88,  78,  66,  42,  36])

These arrays can be useful when selecting which sites to use for a given analysis. For example, you might want to select only segregating sites, with at most a certain number of missing allele calls.

Now we’ve seen how SNP positions, alleles and genotypes are represented, let’s look at the site filters we’ve created for Ag3 to help select sites for robust population genetic analyses.

Site filters

When analysing genome variation, there are some sites in the genome where it is difficult to reliably ascertain nucleotide variation and call genotypes. This can be for a variety of reasons. For example, some genome regions are very repetitive, and short sequence reads cannot be aligned unambiguously. Another issue can be that some samples have structural variation in some regions of the genome, where different samples have different numbers of copies of a given sequence.

To identify and avoid sites like these, we have created site filters. These are Boolean masks which identify sites where SNP genotyping is reliable, based on the fact that most sequence reads are unambiguously aligned, and there is minimal evidence for structural variation.

Because we have multiple mosquito species included in the Ag3 cohort (An. gambiae, An. coluzzii and An. arabiensis) we have created three different site filters which are appropriate in different circumstances, depending on which samples you’re analysing. Here is a summary of the three site filters:

  • “gamb_colu_arab” - These site filters are appropriate when performing a joint analysis of samples from any of the three species. They are, however, more stringent the other filters, which we’ll explore further below.

  • “gamb_colu” - These site filters are appropriate when analysing only samples that are not An. arabiensis.

  • “arab” - These sige filters are appropriate when analysing only samples that are An. arabiensis.

For each of these filters, there is one Boolean mask per chromosome arm. The Boolean masks are True where a site passes the filter and False otherwise. E.g., here’s the gamb_colu_arab site filter for chromosome arm 2R:

loc_pass = ag3.site_filters(contig="2R", mask="gamb_colu_arab")
Array Chunk
Bytes 60.13 MB 300.00 kB
Shape (60132453,) (300000,)
Count 202 Tasks 201 Chunks
Type bool numpy.ndarray
60132453 1
# examine the first ten values
array([False, False, False, False, False, False, False, False, False,
# how many sites in total pass the filter?

To get some intuition for what these site filters look like and how they differ from each other, let’s build a summary table showing how many sites pass in each chromosome arm. This may take a minute or two to run.

def count_filter_pass(contig, mask):
    loc_pass = ag3.site_filters(contig=contig, mask=mask)
    n_pass = loc_pass.sum().compute()
    return n_pass
def tabulate_filter_counts():

    rows = []
    for contig in contigs:
        contig_length = len(ag3.genome_sequence(contig))
        n_sites = len(ag3.snp_sites(contig=contig, field="POS"))
        row = [contig, contig_length, n_sites]
        for mask in "gamb_colu_arab", "gamb_colu", "arab":
            n_pass = count_filter_pass(contig=contig, mask=mask)
            pc_pass = "{:.1%}".format(n_pass / n_sites)
    df = pd.DataFrame.from_records(
        columns=["contig", "contig length", "no. sites", "pass gamb_colu_arab", "pass gamb_colu", "pass arab"]
    return df

df_filter_counts = tabulate_filter_counts()
contig contig length no. sites pass gamb_colu_arab pass gamb_colu pass arab
0 2R 61545105 60132453 67.5% 73.9% 73.6%
1 2L 49364325 48525747 67.0% 74.2% 72.6%
2 3R 53200684 52226568 63.9% 71.2% 69.8%
3 3L 41963435 40758473 63.5% 70.4% 69.7%
4 X 24393108 23385349 48.0% 70.0% 53.3%

There’s a couple of things to note about these numbers:

  1. The percentage of sites passing the gamb_colu_arab filter is always lower than either the gamb_colu or arab filters. This is because the gamb_colu_arab filter is the intersection of the two other filters. I.e., a site passes the gamb_colu_arab filter if it passes both the gamb_colu and the arab filters.

  2. The percentage of sites passing the gamb_colu and arab filters is similar on all chromosome arms, except for the X chromosome. This is because of a large inversion on the X chromosome known as Xag which is a fixed difference between An. arabiensis and the other two species, and where there is a higher level of divergence in An. arabiensis relative to the reference genome we’ve used.

To gain a bit more intuition for these filters, let’s plot the fraction of genome sites that pass the site filters in moving windows over each of the chromosome arms.

def plot_site_filters(contig, window_size=200_000, ax=None):

    # setup figure and axes
    if ax is None:
        contig_length = len(ag3.genome_sequence(contig))
        figw = 11 * contig_length / len(ag3.genome_sequence("2R"))
        fig, ax = plt.subplots(figsize=(figw, 2))
        sns.despine(ax=ax, offset=5)

    # setup X variable
    pos = ag3.snp_sites(contig=contig, field="POS").compute()
    x = allel.moving_statistic(pos, statistic=np.mean, size=window_size)

    # setup Y variables and plot
    for mask in "gamb_colu", "arab", "gamb_colu_arab":
        loc_pass = ag3.site_filters(contig=contig, mask=mask).compute()
        y = allel.moving_statistic(loc_pass, statistic=np.mean, size=window_size)
        ax.plot(x, y * 100, label=mask)

    # tidy plot
    ax.set_xlim(0, contig_length)
    ax.set_ylim(0, 100)
    # convert X axis to Mbp
    ax.xaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: x//1e6))
    ax.set_xlabel("Position (Mbp)")
    ax.set_ylabel('% pass')
    ax.legend(loc='upper left', bbox_to_anchor=(1, 1), title="Filter",

The methods in the malariagen_data.Ag3 class allow you to access data on sites or genotypes with one of these filters already applied. E.g., let’s access the SNP positions with the gamb_colu_arab filter applied.

pos = ag3.snp_sites(contig="2R", site_mask="gamb_colu_arab", field="POS")
Array Chunk
Bytes 162.25 MB 1.82 MB
Shape (40561667,) (455384,)
Count 231 Tasks 115 Chunks
Type int32 numpy.ndarray
40561667 1

Notice the length of this array is 40,561,667 which we calculated earlier as the number of sites passing the gamb_colu_arab filter on chromosome arm 2R.

Similarly, here are the genotypes with the gamb_colu_arab site filter applied.

gt = ag3.snp_genotypes(contig="2R", sample_sets="v3", site_mask="gamb_colu_arab")
Array Chunk
Bytes 249.94 GB 26.25 MB
Shape (40561667, 3081, 2) (262530, 50, 2)
Count 44650 Tasks 14874 Chunks
Type int8 numpy.ndarray
2 3081 40561667

Locating SNPs

Locating a single SNP

For some analyses you may have a particular genomic site you are investigating, and want to know what variation occurs at that site in the Ag3 data resource. To locate that site, a convenient approach is to use an index such as the scikit-allel SortedIndex class.

For example, let’s locate data for the site 2L:2422652.

pos, ref, alt = ag3.snp_sites(contig="2L")
pos = allel.SortedIndex(pos)
<SortedIndex shape=(48525747,) dtype=int32>
# locate the position of interest
loc_site = pos.locate_key(2422652)
# check the position is correct
# access the reference alleles
array(b'A', dtype='|S1')
# access the alternate alleles
array([b'C', b'T', b'G'], dtype='|S1')
# access the genotypes
gt = ag3.snp_genotypes(contig="2L", sample_sets="v3")
gt_site = gt[loc_site].compute()
array([[2, 2],
       [2, 2],
       [2, 2],
       [0, 0],
       [0, 2],
       [0, 2]], dtype=int8)

From the output above we can see there are some samples with a (2, 2) genotype, which correspond to the T/T genotype, and some samples with a (0, 2) which corresponds to the A/T genotype.

Locating SNPs in a genome region

For other analyses you may want to extract data for a specific genome region, such as a gene. Again a good approach is to use an index. E.g., let’s locate data for the Vgsc gene, which spans the region 2L:2358158-2431617.

loc_region = pos.locate_range(2358158, 2431617)
slice(2083135, 2152802, None)
# obtain SNP positions within selected region
pos_region = pos[loc_region]
<SortedIndex shape=(69667,) dtype=int32>
# reference alleles within selected region
ref_region = ref[loc_region].compute()
array([b'A', b'T', b'G', ..., b'T', b'G', b'A'], dtype='|S1')
# alternate alleles within selected region
alt_region = alt[loc_region].compute()
array([[b'C', b'T', b'G'],
       [b'A', b'C', b'G'],
       [b'A', b'C', b'T'],
       [b'A', b'C', b'G'],
       [b'A', b'C', b'T'],
       [b'C', b'T', b'G']], dtype='|S1')
# genotypes within selected region
gt_region = gt[loc_region]
# wrap for convenience
gt_region = allel.GenotypeDaskArray(gt_region)
<GenotypeDaskArray shape=(69667, 3081, 2) dtype=int8>
# allele counts
ac_region = gt_region.count_alleles(max_allele=3).compute()
<AlleleCountsArray shape=(69667, 4) dtype=int32>
06162 0 0 0
16162 0 0 0
26161 0 1 0
696646162 0 0 0
696656162 0 0 0
696666162 0 0 0
# how many segregating SNPs within our region of interest?

Locating samples

In many cases you may want to access genotypes for an individual mosquito sample, or a group of samples. This usually requires linking the sample metadata to the genotypes. Let’s look at how to do that. First, load sample metadata.

df_samples = ag3.sample_metadata(sample_sets="v3")
sample_id partner_sample_id contributor country location year month latitude longitude sex_call sample_set release aim_fraction_colu aim_fraction_arab species_gambcolu_arabiensis species_gambiae_coluzzii species
0 AR0047-C LUA047 Joao Pinto Angola Luanda 2009 4 -8.884 13.302 F AG1000G-AO v3 0.945 0.001 gamb_colu coluzzii coluzzii
1 AR0049-C LUA049 Joao Pinto Angola Luanda 2009 4 -8.884 13.302 F AG1000G-AO v3 0.933 0.001 gamb_colu coluzzii coluzzii
2 AR0051-C LUA051 Joao Pinto Angola Luanda 2009 4 -8.884 13.302 F AG1000G-AO v3 0.937 0.002 gamb_colu coluzzii coluzzii
3 AR0061-C LUA061 Joao Pinto Angola Luanda 2009 4 -8.884 13.302 F AG1000G-AO v3 0.938 0.002 gamb_colu coluzzii coluzzii
4 AR0078-C LUA078 Joao Pinto Angola Luanda 2009 4 -8.884 13.302 F AG1000G-AO v3 0.926 0.001 gamb_colu coluzzii coluzzii
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3076 AD0494-C 80-2-o-16 Martin Donnelly Lab Cross LSTM -1 -1 53.409 -2.969 F AG1000G-X v3 NaN NaN NaN NaN NaN
3077 AD0495-C 80-2-o-17 Martin Donnelly Lab Cross LSTM -1 -1 53.409 -2.969 M AG1000G-X v3 NaN NaN NaN NaN NaN
3078 AD0496-C 80-2-o-18 Martin Donnelly Lab Cross LSTM -1 -1 53.409 -2.969 M AG1000G-X v3 NaN NaN NaN NaN NaN
3079 AD0497-C 80-2-o-19 Martin Donnelly Lab Cross LSTM -1 -1 53.409 -2.969 F AG1000G-X v3 NaN NaN NaN NaN NaN
3080 AD0498-C 80-2-o-20 Martin Donnelly Lab Cross LSTM -1 -1 53.409 -2.969 M AG1000G-X v3 NaN NaN NaN NaN NaN

3081 rows × 17 columns

Now access genotypes for whatever chromosome arm you are analysing, e.g., 2L.

gt = ag3.snp_genotypes(contig="2L", sample_sets="v3")
Array Chunk
Bytes 299.02 GB 30.00 MB
Shape (48525747, 3081, 2) (300000, 50, 2)
Count 24004 Tasks 11988 Chunks
Type int8 numpy.ndarray
2 3081 48525747

The key point to know here is that the number of rows in the sample metadata dataframe and the number of columns in the genotype array are the same.

len(df_samples) == gt.shape[1]

Also, the order of samples in the sample metadata is the same as the order of the columns in the genotype array. We can use this correspondance to look up samples and access their genotypes.

Locating a single sample

Let’s extract both sample metadata and genotypes for a single sample, e.g., sample AB0110-C.

loc_sample = df_samples.query("sample_id == 'AB0110-C'").index[0]
sample_id                          AB0110-C
partner_sample_id                     BF9-2
contributor                     Austin Burt
country                        Burkina Faso
location                               Bana
year                                   2012
month                                     7
latitude                             11.233
longitude                            -4.472
sex_call                                  F
sample_set                     AG1000G-BF-A
release                                  v3
aim_fraction_colu                     0.974
aim_fraction_arab                     0.002
species_gambcolu_arabiensis       gamb_colu
species_gambiae_coluzzii           coluzzii
species                            coluzzii
Name: 100, dtype: object
gt_sample = gt[:, loc_sample]
Array Chunk
Bytes 97.05 MB 600.00 kB
Shape (48525747, 2) (300000, 2)
Count 24166 Tasks 162 Chunks
Type int8 numpy.ndarray
2 48525747
# compute some genotype counts for our sample of interest
gt_sample = allel.GenotypeVector(gt_sample.compute())

Locating multiple samples

We will also often want to locate a group of samples based on their metadata, e.g., all samples from a given country, collected in a certain year, of a certain species; and then analyse their genotypes. Let’s look at an example.

# select An. gambiae samples from Burkina Faso collected in 2012
query = "(country == 'Burkina Faso') and (year == 2012) and (species == 'gambiae')"
loc_samples = df_samples.query(query).index.values
array([ 81,  82,  91,  97,  98, 106, 107, 108, 109, 114, 115, 116, 117,
       118, 121, 122, 123, 124, 130, 131, 132, 133, 134, 135, 136, 137,
       138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150,
       151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 177, 178,
       179, 180, 181, 182, 183, 185, 186, 187, 188, 191, 197, 198, 204,
       207, 210, 211, 212, 214, 215, 217, 218, 220, 223, 229, 230, 231,
       232, 233, 237, 238, 241, 242, 245, 246, 247, 248, 249, 250, 251,
       252, 254, 255, 257, 258, 260, 261])
# examing their metadata
df_samples_selected = df_samples.iloc[loc_samples]
sample_id partner_sample_id contributor country location year month latitude longitude sex_call sample_set release aim_fraction_colu aim_fraction_arab species_gambcolu_arabiensis species_gambiae_coluzzii species
81 AB0085-Cx BF2-4 Austin Burt Burkina Faso Pala 2012 7 11.150 -4.235 F AG1000G-BF-A v3 0.024 0.002 gamb_colu gambiae gambiae
82 AB0086-Cx BF2-6 Austin Burt Burkina Faso Pala 2012 7 11.150 -4.235 F AG1000G-BF-A v3 0.038 0.002 gamb_colu gambiae gambiae
91 AB0096-C BF4-4 Austin Burt Burkina Faso Bana 2012 7 11.233 -4.472 F AG1000G-BF-A v3 0.032 0.002 gamb_colu gambiae gambiae
97 AB0103-C BF6-7 Austin Burt Burkina Faso Bana 2012 7 11.233 -4.472 F AG1000G-BF-A v3 0.040 0.002 gamb_colu gambiae gambiae
98 AB0104-Cx BF6-8 Austin Burt Burkina Faso Bana 2012 7 11.233 -4.472 F AG1000G-BF-A v3 0.026 0.002 gamb_colu gambiae gambiae
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
255 AB0278-C BF12-29 Austin Burt Burkina Faso Pala 2012 7 11.150 -4.235 F AG1000G-BF-A v3 0.017 0.002 gamb_colu gambiae gambiae
257 AB0280-Cx BF12-31 Austin Burt Burkina Faso Pala 2012 7 11.150 -4.235 F AG1000G-BF-A v3 0.065 0.003 gamb_colu gambiae gambiae
258 AB0281-Cx BF12-32 Austin Burt Burkina Faso Pala 2012 7 11.150 -4.235 F AG1000G-BF-A v3 0.029 0.002 gamb_colu gambiae gambiae
260 AB0283-C BF10-12 Austin Burt Burkina Faso Pala 2012 7 11.150 -4.235 F AG1000G-BF-A v3 0.028 0.002 gamb_colu gambiae gambiae
261 AB0284-C BF10-13 Austin Burt Burkina Faso Pala 2012 7 11.150 -4.235 F AG1000G-BF-A v3 0.024 0.002 gamb_colu gambiae gambiae

98 rows × 17 columns

# select genotypes
gt_samples = gt[:, loc_samples]
Array Chunk
Bytes 9.51 GB 21.00 MB
Shape (48525747, 98, 2) (300000, 35, 2)
Count 24652 Tasks 648 Chunks
Type int8 numpy.ndarray
2 98 48525747

Note above that we now have a genotype array with 98 columns, which corresponds to the 98 samples we selected with our sample metadata query.

Selecting sites and samples

We can also combine site selections with sample selections, e.g.:

# obtain genotype at 2L:2422652 in sample AB0110-C
gt[loc_site, loc_sample].compute()
array([0, 2], dtype=int8)
# access genotypes in the Vgsc gene in An. gambiae samples from Burkina Faso collected in 2012
gt_sel = gt[loc_region, loc_samples]
Array Chunk
Bytes 13.65 MB 3.70 MB
Shape (69667, 98, 2) (52802, 35, 2)
Count 24160 Tasks 8 Chunks
Type int8 numpy.ndarray
2 98 69667
# how many segregating sites now?
ac = allel.GenotypeDaskArray(gt_sel).count_alleles(max_allele=3).compute()

Differences from previous data releases

In case you have worked with previous data releases from the Ag1000G project, there are a couple of important differences to be aware of. These are touched on in the text above, but let’s expand a little here.

Genotyping at all sites

In Ag3 we have genotyped all samples at all genomic sites where the reference genome is not “N”, and those genotypes are included in the released data. This includes genotypes at sites where no variation from the reference genome was observed. This is different from previous releases, where we only released genotypes at sites where some variation was observed.

The reason why we made this change is because several important use cases for these data involve identifying genomic sites which are highly conserved, i.e., where samples do not vary at all. Because we now report genotypes at all sites, it is possible to identify these highly conserved sites with more confidence.

Predefined alternate alleles

In Ag3 we predefined the set of possible alternate SNP alleles at all sites in the genome, before we ran our SNP genotyping pipeline on the samples sequenced. This means that our data includes three alternate alleles at every site in the genome. Not all of these alleles will have been observed in the data, and determining which alleles are observed requires examination of the genotypes. This is different from previous releases, where alternate alleles were declared during the SNP calling process, and only observed alternate alleles were included in the data release.

The reason why we have made this change is because it simplifies the SNP genotyping pipeline. Because we fix the set of alleles ahead of time, all samples can be genotyped independently, and merging the resulting data is straightforward. This also makes it easier to integrate with new data in the future, because it allows merging Ag3 data with data from future samples where alleles may be observed that were not observed in the Ag3 cohort.

Site filters

In Ag3 we have defined site filters, which are Boolean masks that apply to all sites in the genome, and which identify sites where SNP genotyping is reliable.These site filters replace the separate variant filters and genome accessibility masks defined in previous project phases.

The reason why we took this approach is that it is simpler. All sites are treated the same, regardless of whether variation is observed or not. Instead of having to work two different things (variant filters and genome accessibility masks) there is just a single site filter to work with.

Note that some software tools such as scikit-allel may require you to provide a genome accessibility mask, which is a Boolean array of the same length as the chromosomal sequence you’re analysing. In case you need this, the Ag3 class provides a convenience method to obtain this from the site filters. E.g.:

is_accessible = ag3.is_accessible(contig="2R", site_mask="gamb_colu_arab")

This is an array the same length as the chromosomal sequence, and is True where the reference genome is not an “N” and the site filter passes.

seq = ag3.genome_sequence(contig="2R").compute()
len(seq) == len(is_accessible)
loc_pass = ag3.site_filters(contig="2R", mask="gamb_colu_arab").compute()
pos = ag3.snp_sites(contig="2R", field="POS").compute()
np.all(loc_pass == is_accessible[pos - 1])

Further reading

Hopefully the examples above have helped to give some sense of how to access and work with SNP data from Ag3. If you have any comments or questions, please get in touch.

This notebook is part of the Ag1000G phase 3 user guide. See the menu at the left for more documentation and example analyses.