Single nucleotide polymorphisms (SNPs)
Contents
Single nucleotide polymorphisms (SNPs)¶
This notebook looks in more detail at how single nucleotide polymorphisms (SNPs) are represented in the Ag1000G phase 3 (Ag3.0
) data release. We’ll also look at the different site filters that we’ve created to help select high quality SNPs for robust population genetic analyses. As always, this notebook is executable, click the launch icon () at the top of the page and select a cloud computing service if you’d like to try running it for yourself.
Setup¶
Install some packages. You can skip this step if running on MyBinder, these will be already installed.
!pip install -qU malariagen_data seaborn
Import the packages we’ll need.
from collections import Counter
from functools import lru_cache
import numpy as np
import pandas as pd
import dask
import dask.array as da
# silence some dask warnings
dask.config.set(**{'array.slicing.split_large_chunks': True})
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import allel
from dask.diagnostics.progress import ProgressBar
import malariagen_data
Set up access to the Ag3 data.
ag3 = malariagen_data.Ag3()
ag3
MalariaGEN Ag3 data resource API | |
---|---|
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact data@malariagen.net. See also the Ag3 API docs. | |
Storage URL | gs://vo_agam_release/ |
Data releases available | 3.0 |
Cohorts analysis | 20211101 |
Species analysis | aim_20200422 |
Site filters analysis | dt_20200416 |
Software version | 2.2.0 |
SNP positions and alleles¶
When studying genetic variation in natural populations, there are two fundamental questions:
Where in the genome do we observe differences (i.e., nucleotide variation) between individual mosquitoes.
Where in the genome do we observe no differences (i.e., nucleotide conservation) between individuals.
Sometimes this second question is overlooked, but both questions are important.
In Ag1000G phase 3 we have taken a slightly different approach to calling single nucleotide polymorphisms (SNPs - single letter differences) than in previous project phases. Because we are interested in both of the above questions, we have genotyped all individual mosquitoes at all genomic sites where the reference genotype nucleotide is known (i.e., not an “N”), regardless of whether any variation is present. This is sometimes known as an “all sites” analysis.
In order to simplify SNP calling across a large number of individuals, we also declared the set of all possible alternate (variant) SNP alleles ahead of time. Because we are analysing SNPs and we are not considering any other types of variation at this stage, then at each site in the genome there is one reference allele (i.e., the nucleotide in the reference genome) and three possible alternate alleles. E.g., at position 10 on chromosome arm 2R, the reference allele is “T”, and so there are three possible alternate alleles: “A”, “C” and “G”.
Let’s look at how these data are represented for chromosome arm 2R:
ds_snps = ag3.snp_calls(region="2R", sample_sets="3.0")
ds_snps
<xarray.Dataset> Dimensions: (alleles: 4, ploidy: 2, samples: 3081, variants: 60132453) Coordinates: variant_position (variants) int32 dask.array<chunksize=(524288,), meta=np.ndarray> variant_contig (variants) uint8 dask.array<chunksize=(524288,), meta=np.ndarray> sample_id (samples) <U24 dask.array<chunksize=(81,), meta=np.ndarray> Dimensions without coordinates: alleles, ploidy, samples, variants Data variables: variant_allele (variants, alleles) |S1 dask.array<chunksize=(524288, 1), meta=np.ndarray> variant_filter_pass_gamb_colu_arab (variants) bool dask.array<chunksize=(300000,), meta=np.ndarray> variant_filter_pass_gamb_colu (variants) bool dask.array<chunksize=(300000,), meta=np.ndarray> variant_filter_pass_arab (variants) bool dask.array<chunksize=(300000,), meta=np.ndarray> call_genotype (variants, samples, ploidy) int8 dask.array<chunksize=(300000, 50, 2), meta=np.ndarray> call_GQ (variants, samples) int16 dask.array<chunksize=(300000, 50), meta=np.ndarray> call_MQ (variants, samples) int16 dask.array<chunksize=(300000, 50), meta=np.ndarray> call_AD (variants, samples, alleles) int16 dask.array<chunksize=(300000, 50, 4), meta=np.ndarray> call_genotype_mask (variants, samples, ploidy) bool dask.array<chunksize=(300000, 50, 2), meta=np.ndarray> Attributes: contigs: ('2R', '2L', '3R', '3L', 'X')
- alleles: 4
- ploidy: 2
- samples: 3081
- variants: 60132453
- variant_position(variants)int32dask.array<chunksize=(524288,), meta=np.ndarray>
Array Chunk Bytes 240.53 MB 2.10 MB Shape (60132453,) (524288,) Count 116 Tasks 115 Chunks Type int32 numpy.ndarray - variant_contig(variants)uint8dask.array<chunksize=(524288,), meta=np.ndarray>
Array Chunk Bytes 60.13 MB 524.29 kB Shape (60132453,) (524288,) Count 115 Tasks 115 Chunks Type uint8 numpy.ndarray - sample_id(samples)<U24dask.array<chunksize=(81,), meta=np.ndarray>
Array Chunk Bytes 295.78 kB 29.09 kB Shape (3081,) (303,) Count 112 Tasks 28 Chunks Type numpy.ndarray
- variant_allele(variants, alleles)|S1dask.array<chunksize=(524288, 1), meta=np.ndarray>
Array Chunk Bytes 240.53 MB 1.57 MB Shape (60132453, 4) (524288, 3) Count 577 Tasks 230 Chunks Type |S1 numpy.ndarray - variant_filter_pass_gamb_colu_arab(variants)booldask.array<chunksize=(300000,), meta=np.ndarray>
Array Chunk Bytes 60.13 MB 300.00 kB Shape (60132453,) (300000,) Count 202 Tasks 201 Chunks Type bool numpy.ndarray - variant_filter_pass_gamb_colu(variants)booldask.array<chunksize=(300000,), meta=np.ndarray>
Array Chunk Bytes 60.13 MB 300.00 kB Shape (60132453,) (300000,) Count 202 Tasks 201 Chunks Type bool numpy.ndarray - variant_filter_pass_arab(variants)booldask.array<chunksize=(300000,), meta=np.ndarray>
Array Chunk Bytes 60.13 MB 300.00 kB Shape (60132453,) (300000,) Count 202 Tasks 201 Chunks Type bool numpy.ndarray - call_genotype(variants, samples, ploidy)int8dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>
Array Chunk Bytes 370.54 GB 30.00 MB Shape (60132453, 3081, 2) (300000, 50, 2) Count 29776 Tasks 14874 Chunks Type int8 numpy.ndarray - call_GQ(variants, samples)int16dask.array<chunksize=(300000, 50), meta=np.ndarray>
Array Chunk Bytes 370.54 GB 30.00 MB Shape (60132453, 3081) (300000, 50) Count 29776 Tasks 14874 Chunks Type int16 numpy.ndarray - call_MQ(variants, samples)int16dask.array<chunksize=(300000, 50), meta=np.ndarray>
Array Chunk Bytes 370.54 GB 30.00 MB Shape (60132453, 3081) (300000, 50) Count 29776 Tasks 14874 Chunks Type int16 numpy.ndarray - call_AD(variants, samples, alleles)int16dask.array<chunksize=(300000, 50, 4), meta=np.ndarray>
Array Chunk Bytes 1.48 TB 120.00 MB Shape (60132453, 3081, 4) (300000, 50, 4) Count 29776 Tasks 14874 Chunks Type int16 numpy.ndarray - call_genotype_mask(variants, samples, ploidy)booldask.array<chunksize=(300000, 50, 2), meta=np.ndarray>
Array Chunk Bytes 370.54 GB 30.00 MB Shape (60132453, 3081, 2) (300000, 50, 2) Count 44650 Tasks 14874 Chunks Type bool numpy.ndarray
- contigs :
- ('2R', '2L', '3R', '3L', 'X')
The “variant_position” array here holds the 1-based positions (coordinates) where the SNPs were called. E.g., here’s the first 10 SNP positions:
pos = ds_snps["variant_position"].data
pos[:10].compute()
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=int32)
Note that this is simply the positions 1-10. However, these positions do not continue increasing by 1 all the way through the array, i.e., there are gaps, because of gaps in the reference genome. We can quickly visualise that, by plotting the distance between adjacent sites:
def plot_gaps(contig):
ds_snps = ag3.snp_calls(region=contig)
pos = ds_snps["variant_position"].data.compute()
d = np.diff(pos)
loc_gap = d > 1
x = pos[1:][loc_gap]
y = d[loc_gap]
fig, ax = plt.subplots(figsize=(7, 2))
ax.plot(x, y, color='black', linewidth=.5, marker='o', linestyle=' ', markersize=2, mfc='none')
ax.set_title("2R")
# convert X axis to Mbp
ax.xaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: x//1e6))
ax.set_xlabel("Position (Mbp)")
ax.set_ylabel('Gap size (bp)')
plot_gaps(contig="2R")

Note also that the length of the pos
array is less than the length of the reference genome sequence:
len(seq) - len(pos)
1412652
This difference is the number of unknown nucleotides (“N”s) in the reference genome that we computed earlier. This is another way of confirming that we have called SNP genotypes at all genomic sites where the reference genome nucleotide is known.
The “variant_alleles” array holds the reference and alternate alleles, i.e., the possible nucleotides for each SNP site. E.g., are the alleles at the first 10 SNP sites:
alleles = ds_snps["variant_allele"].data
alleles[:10].compute()
array([[b'C', b'A', b'T', b'G'],
[b'T', b'A', b'C', b'G'],
[b'C', b'A', b'T', b'G'],
[b'T', b'A', b'C', b'G'],
[b'A', b'C', b'T', b'G'],
[b'A', b'C', b'T', b'G'],
[b'A', b'C', b'T', b'G'],
[b'C', b'A', b'T', b'G'],
[b'A', b'C', b'T', b'G'],
[b'T', b'A', b'C', b'G']], dtype='|S1')
The first column contains the reference alleles. We can confirm that this corresponds to the nucleotides in the reference sequence by using the SNP positions to index the sequence (remembering to convert to zero-based coordinates):
seq[pos[:10].compute() - 1]
array([b'C', b'T', b'c', b't', b'a', b'a', b'a', b'c', b'a', b't'],
dtype='|S1')
Each row in the “variant_allele” array represents a site, and the columns show the possible alleles.
The same applies if you are working with the data in VCF format. E.g., here’s the first 10 rows of the VCF file with SNP genotypes for sample AR0001-C (see the download guide for how to get this file), note the three alternate (ALT) alleles at every site:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AR0001-C
2R 1 . C A,T,G 0.08 . ExcessHet=3.0103;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=14.72 GT:AD:DP:GQ:PL 0/0:5,0,0,0:5:6:0,6,52,6,52,52,6,52,52,52
2R 2 . T A,C,G 0.08 . ExcessHet=3.0103;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=13.44 GT:AD:DP:GQ:PL 0/0:6,0,0,0:6:6:0,6,53,6,53,53,6,53,53,53
2R 3 . C A,T,G 0.08 . ExcessHet=3.0103;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=12.44 GT:AD:DP:GQ:PL 0/0:7,0,0,0:7:6:0,6,53,6,53,53,6,53,53,53
2R 4 . T A,C,G 0.08 . ExcessHet=3.0103;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=12.44 GT:AD:DP:GQ:PL 0/0:7,0,0,0:7:6:0,6,53,6,53,53,6,53,53,53
2R 5 . A C,T,G 0.08 . ExcessHet=3.0103;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=22.81 GT:AD:DP:GQ:PL 0/0:9,0,0,0:9:6:0,6,53,6,53,53,6,53,53,53
2R 6 . A C,T,G 0.02 . ExcessHet=3.0103;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=28.78 GT:AD:DP:GQ:PL 0/0:10,0,0,0:10:12:0,12,113,12,113,113,12,113,113,113
2R 7 . A C,T,G 0.02 . ExcessHet=3.0103;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=27.68 GT:AD:DP:GQ:PL 0/0:11,0,0,0:11:12:0,12,118,12,118,118,12,118,118,118
2R 8 . C A,T,G 0.02 . ExcessHet=3.0103;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=27.68 GT:AD:DP:GQ:PL 0/0:11,0,0,0:11:12:0,12,130,12,130,130,12,130,130,130
2R 9 . A C,T,G 0.01 . ExcessHet=3.0103;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=31.66 GT:AD:DP:GQ:PL 0/0:12,0,0,0:12:15:0,15,156,15,156,156,15,156,156,156
2R 10 . T A,C,G 0.01 . ExcessHet=3.0103;MLEAC=0,0,0;MLEAF=0.00,0.00,0.00;MQ=34.67 GT:AD:DP:GQ:PL 0/0:13,0,0,0:13:18:0,18,181,18,181,181,18,181,181,181
The most important thing to note about the SNP data in Ag3 is that not all of these alternate alleles will have been observed in the mosquitoes we sequenced.
To determine which alternate alleles are observed in Ag3, we need to look at the SNP genotypes. Let’s take a look at how SNP genotypes are represented, then return to the question of which alternate alleles have been observed.
Genotypes¶
Here a SNP “genotype” means the pair of nucleotides (alleles) carried by an individual mosquito at a specific site in the genome. Note that this is a pair of alleles because mosquitoes are diploid like us, meaning that they carry two copies of their genome, one inherited from each parent (except for males on the X chromosome, which inherit one copy from their mother only).
Rather than storing the actual nucleotides for each genotype, it is conventional to use a numerical encoding of genotypes, where 0 stands for the reference allele, 1 stands for the first alternate allele, 2 stands for the second alternate allele, 3 stands for the third alternate allele, etc.
E.g., at site 2R:10 the reference allele is “T”. If an individual mosquito carries a homozygous reference genotype (both genome copies have a “T”), we represent this in Python as the pair of integers (0, 0)
. In a VCF file this is represented as the string “0/0”.
At 2R:10 we have arbitrarily declared that “A” is the first alternate allele. If an individual carries a heterozygous genotype, where one genome copy has “T” and the other has “A”, this is represented as the pair (0, 1)
. The ordering of this pair has no meaning.
Similarly, at 2R:10 we have arbitrarily declared that “C” is the second alternate allele and “G” is the third alternate allele. So, a genotype homozygous for the “C” allele would be represented as (2, 2)
, a genotype homozygous for “G” would be (3, 3)
and a genotype heterozygous for the “C” and “G” alleles would be (2, 3)
.
There’s one more thing to know. Sometimes a genotype is called as missing, meaning that there were no sequence reads for the given individual at a given site. In Python we represent this as a pair of negative numbers (-1, -1)
. In VCF this is represented as the string “./.”.
Here are those example genotypes again, shown as a table with the different representations:
Reference allele | Alternate alleles | Genotype | Genotype (Python) | Genotype (VCF) | Description |
---|---|---|---|---|---|
T | A,C,G | T/T | (0, 0) |
0/0 | homozygous reference |
T/A | (0, 1) |
0/1 | heterozygous | ||
A/A | (1, 1) |
1/1 | homozygous alternate | ||
C/C | (2, 2) |
2/2 | homozygous alternate | ||
G/G | (3, 3) |
3/3 | homozygous alternate | ||
C/G | (2, 3) |
2/3 | heterozygous | ||
missing | (-1, -1) |
./. | missing |
To see this in action, let’s access the SNP genotypes for chromosome arm 2R for all wild samples in Ag3.
ds_snps = ag3.snp_calls(region="2R", sample_sets="3.0")
gt = ds_snps["call_genotype"].data
gt.shape
(60132453, 3081, 2)
The gt
array has three dimensions. The first dimension corresponds to the genomic sites we’ve analysed (in this case, there are 60,132,453 sites on chromosome arm 2R). The second dimension (3,081) is the number of individual mosquitoes (a.k.a. samples) sequenced. The third dimension (2) represents the two alleles in each individual genotype.
So, e.g., let’s look up the genotype call at the 10th site in the 1st sample:
gt[9, 0].compute()
array([0, 0], dtype=int8)
This is a homozygous reference genotype.
Let’s try another, the 10th site and the 2,393rd sample.
gt[9, 2392].compute()
array([0, 2], dtype=int8)
This is a heterozygous genotype, where the reference allele and the second alternate allele are both present.
Let’s look at one more genotype, the 10th site and the 249th sample.
gt[9, 248].compute()
array([-1, -1], dtype=int8)
This is a missing genotype call.
When working with genotype arrays it can be useful to use the scikit-allel package. Let’s wrap the gt
variable, which is a general purpose dask array, with the scikit-allel GenotypeDaskArray
class, which provides some useful functionality for analysing genotype data.
gt = allel.GenotypeDaskArray(ds_snps["call_genotype"].data)
gt
0 | 1 | 2 | 3 | 4 | ... | 3076 | 3077 | 3078 | 3079 | 3080 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0/0 | ./. | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
1 | 0/0 | ./. | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
2 | 0/0 | ./. | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
... | ... | |||||||||||
60132450 | ./. | ./. | ./. | ./. | ./. | ... | ./. | ./. | ./. | ./. | ./. | |
60132451 | ./. | ./. | ./. | ./. | ./. | ... | ./. | ./. | ./. | ./. | ./. | |
60132452 | ./. | ./. | ./. | ./. | ./. | ... | ./. | ./. | ./. | ./. | ./. |
This wrapper class provided a display showing a preview of genotype calls for the first and last sites and samples, using the VCF-style representation of genotype calls (e.g., “0/0” is homozygous reference, etc.).
Once we have access to the genotype calls, we can run a variety of different computations. A common computation needed for many analyses is an allele count, which we’ll look at next.
Allele counts¶
An allele count involves scanning across all the genotypes at a given site and counting the number of times each allele is observed. I.e., how many 0s, 1s, 2s and 3s are observed. Let’s do an allele count for the first 10 sites on chromosome arm 2R.
ac = gt[:10].count_alleles(max_allele=3).compute()
ac.displayall()
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 5871 | 2 | 1 | 0 |
1 | 5930 | 0 | 5 | 1 |
2 | 5981 | 7 | 2 | 0 |
3 | 6033 | 1 | 2 | 0 |
4 | 6050 | 0 | 0 | 0 |
5 | 6070 | 1 | 0 | 3 |
6 | 6080 | 4 | 0 | 0 |
7 | 6073 | 2 | 6 | 15 |
8 | 6118 | 0 | 1 | 1 |
9 | 6120 | 1 | 4 | 1 |
In the ac
array above, each row represents a site, and the four columns represent the four alleles. The values are the number of times each allele was observed at each site. So, now we can see which alleles have been observed in the Ag3 data.
E.g., for the 10th site, the reference allele (0
) was observed 6120 times, the first alternate allele (1
) was observed once, the second alternate allele (2
) was observed four times, and the third alternate allele (3
) was observed once. So, this is a site where in fact all four possible alleles (nucleotides) were observed in Ag3 samples. This is an example of a multiallelic site.
Let’s look at some other sites.
At the fifth site, only the reference allele was observed. This is an example of an invariant site. I.e., there is no variation there, all samples were homozygous for the reference allele or had a missing genotype call.
At the seventh site, two alleles were observed, the reference allele and the first alternate allele. This is an example of a biallelic site. Note that biallelic sites might involve any pair of alleles, not always 0
and 1
.
Let’s use this allele counts array to compute some properties of these 10 sites.
# how many alleles were observed
ac.allelism()
array([3, 3, 3, 3, 1, 3, 2, 4, 3, 4])
# whether any alternate alleles were observed
ac.is_variant()
array([ True, True, True, True, False, True, True, True, True,
True])
# whether more than one allele was observed
ac.is_segregating()
array([ True, True, True, True, False, True, True, True, True,
True])
# whether exactly two alleles were observed
ac.is_biallelic()
array([False, False, False, False, False, False, True, False, False,
False])
# total number of alleles called
an_called = ac.sum(axis=1)
an_called
array([5874, 5936, 5990, 6036, 6050, 6074, 6084, 6096, 6120, 6126])
# total number of missing allele calls
an_missing = (gt.shape[1] * 2) - an_called
an_missing
array([288, 226, 172, 126, 112, 88, 78, 66, 42, 36])
These arrays can be useful when selecting which sites to use for a given analysis. For example, you might want to select only segregating sites, with at most a certain number of missing allele calls.
Now we’ve seen how SNP positions, alleles and genotypes are represented, let’s look at the site filters we’ve created for Ag3 to help select sites for robust population genetic analyses.
Site filters¶
When analysing genome variation, there are some sites in the genome where it is difficult to reliably ascertain nucleotide variation and call genotypes. This can be for a variety of reasons. For example, some genome regions are very repetitive, and short sequence reads cannot be aligned unambiguously. Another issue can be that some samples have structural variation in some regions of the genome, where different samples have different numbers of copies of a given sequence.
To identify and avoid sites like these, we have created site filters. These are Boolean masks which identify sites where SNP genotyping is reliable, based on the fact that most sequence reads are unambiguously aligned, and there is minimal evidence for structural variation.
Because we have multiple mosquito species included in the Ag3 cohort (An. gambiae, An. coluzzii and An. arabiensis) we have created three different site filters which are appropriate in different circumstances, depending on which samples you’re analysing. Here is a summary of the three site filters:
“gamb_colu_arab” - These site filters are appropriate when performing a joint analysis of samples from any of the three species. They are, however, more stringent the other filters, which we’ll explore further below.
“gamb_colu” - These site filters are appropriate when analysing only samples that are not An. arabiensis.
“arab” - These sige filters are appropriate when analysing only samples that are An. arabiensis.
For each of these filters, there is one Boolean mask per chromosome arm. The Boolean masks are True where a site passes the filter and False otherwise. E.g., here’s the gamb_colu_arab
site filter for chromosome arm 2R:
ds_snps = ag3.snp_calls(region="2R")
loc_pass = ds_snps["variant_filter_pass_gamb_colu_arab"].data
loc_pass
|
# examine the first ten values
loc_pass[:10].compute()
array([False, False, False, False, False, False, False, False, False,
False])
# how many sites in total pass the filter?
loc_pass.sum().compute()
40561667
To get some intuition for what these site filters look like and how they differ from each other, let’s build a summary table showing how many sites pass in each chromosome arm. This may take a minute or two to run.
def tabulate_filter_counts():
rows = []
for contig in ag3.contigs:
contig_length = len(ag3.genome_sequence(contig))
ds_snps = ag3.snp_calls(region=contig)
n_sites = ds_snps.dims["variants"]
row = [contig, contig_length, n_sites]
for mask in "gamb_colu_arab", "gamb_colu", "arab":
loc_pass = ds_snps[f"variant_filter_pass_{mask}"].data
n_pass = loc_pass.sum().compute()
pc_pass = "{:.1%}".format(n_pass / n_sites)
row.append(pc_pass)
rows.append(row)
df = pd.DataFrame.from_records(
rows,
columns=["contig", "contig length", "no. sites", "pass gamb_colu_arab", "pass gamb_colu", "pass arab"]
)
return df
df_filter_counts = tabulate_filter_counts()
df_filter_counts
contig | contig length | no. sites | pass gamb_colu_arab | pass gamb_colu | pass arab | |
---|---|---|---|---|---|---|
0 | 2R | 61545105 | 60132453 | 67.5% | 73.9% | 73.6% |
1 | 2L | 49364325 | 48525747 | 67.0% | 74.2% | 72.6% |
2 | 3R | 53200684 | 52226568 | 63.9% | 71.2% | 69.8% |
3 | 3L | 41963435 | 40758473 | 63.5% | 70.4% | 69.7% |
4 | X | 24393108 | 23385349 | 48.0% | 70.0% | 53.3% |
There’s a couple of things to note about these numbers:
The percentage of sites passing the
gamb_colu_arab
filter is always lower than either thegamb_colu
orarab
filters. This is because thegamb_colu_arab
filter is the intersection of the two other filters. I.e., a site passes thegamb_colu_arab
filter if it passes both thegamb_colu
and thearab
filters.The percentage of sites passing the
gamb_colu
andarab
filters is similar on all chromosome arms, except for the X chromosome. This is because of a large inversion on the X chromosome known as Xag which is a fixed difference between An. arabiensis and the other two species, and where there is a higher level of divergence in An. arabiensis relative to the reference genome we’ve used.
To gain a bit more intuition for these filters, let’s plot the fraction of genome sites that pass the site filters in moving windows over each of the chromosome arms.
def plot_site_filters(contig, window_size=200_000, ax=None):
# setup figure and axes
if ax is None:
contig_length = len(ag3.genome_sequence(contig))
figw = 7 * contig_length / len(ag3.genome_sequence("2R"))
fig, ax = plt.subplots(figsize=(figw, 2))
sns.despine(ax=ax, offset=5)
# setup X variable
ds_snps = ag3.snp_calls(region=contig)
pos = ds_snps["variant_position"].values
x = allel.moving_statistic(pos, statistic=np.mean, size=window_size)
# setup Y variables and plot
for mask in "gamb_colu", "arab", "gamb_colu_arab":
loc_pass = ds_snps[f"variant_filter_pass_{mask}"].values
y = allel.moving_statistic(loc_pass, statistic=np.mean, size=window_size)
ax.plot(x, y * 100, label=mask)
# tidy plot
ax.set_xlim(0, contig_length)
ax.set_ylim(0, 100)
ax.set_title(contig)
# convert X axis to Mbp
ax.xaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x, p: x//1e6))
ax.set_xlabel("Position (Mbp)")
ax.set_ylabel('% pass')
ax.legend(loc='upper left', bbox_to_anchor=(1, 1), title="Filter",
frameon=False)
plot_site_filters(contig="2R")

plot_site_filters(contig="2L")

plot_site_filters(contig="3R")

plot_site_filters(contig="3L")

plot_site_filters(contig="X")

The methods in the malariagen_data.Ag3
class allow you to access data on sites or genotypes with one of these filters already applied. E.g., let’s access the SNP calls with the gamb_colu_arab
filter applied.
ds_snps_filtered = ag3.snp_calls(region="2R", site_mask="gamb_colu_arab", sample_sets="3.0")
ds_snps_filtered
<xarray.Dataset> Dimensions: (alleles: 4, ploidy: 2, samples: 3081, variants: 40561667) Coordinates: variant_position (variants) int32 dask.array<chunksize=(436987,), meta=np.ndarray> variant_contig (variants) uint8 dask.array<chunksize=(436987,), meta=np.ndarray> sample_id (samples) <U24 dask.array<chunksize=(81,), meta=np.ndarray> Dimensions without coordinates: alleles, ploidy, samples, variants Data variables: variant_allele (variants, alleles) |S1 dask.array<chunksize=(436987, 1), meta=np.ndarray> variant_filter_pass_gamb_colu_arab (variants) bool dask.array<chunksize=(246844,), meta=np.ndarray> variant_filter_pass_gamb_colu (variants) bool dask.array<chunksize=(246844,), meta=np.ndarray> variant_filter_pass_arab (variants) bool dask.array<chunksize=(246844,), meta=np.ndarray> call_genotype (variants, samples, ploidy) int8 dask.array<chunksize=(246844, 50, 2), meta=np.ndarray> call_GQ (variants, samples) int16 dask.array<chunksize=(246844, 50), meta=np.ndarray> call_MQ (variants, samples) int16 dask.array<chunksize=(246844, 50), meta=np.ndarray> call_AD (variants, samples, alleles) int16 dask.array<chunksize=(246844, 50, 4), meta=np.ndarray> call_genotype_mask (variants, samples, ploidy) bool dask.array<chunksize=(246844, 50, 2), meta=np.ndarray> Attributes: contigs: ('2R', '2L', '3R', '3L', 'X')
- alleles: 4
- ploidy: 2
- samples: 3081
- variants: 40561667
- variant_position(variants)int32dask.array<chunksize=(436987,), meta=np.ndarray>
Array Chunk Bytes 162.25 MB 1.82 MB Shape (40561667,) (455384,) Count 891 Tasks 115 Chunks Type int32 numpy.ndarray - variant_contig(variants)uint8dask.array<chunksize=(436987,), meta=np.ndarray>
Array Chunk Bytes 40.56 MB 455.38 kB Shape (40561667,) (455384,) Count 890 Tasks 115 Chunks Type uint8 numpy.ndarray - sample_id(samples)<U24dask.array<chunksize=(81,), meta=np.ndarray>
Array Chunk Bytes 295.78 kB 29.09 kB Shape (3081,) (303,) Count 112 Tasks 28 Chunks Type numpy.ndarray
- variant_allele(variants, alleles)|S1dask.array<chunksize=(436987, 1), meta=np.ndarray>
Array Chunk Bytes 162.25 MB 1.37 MB Shape (40561667, 4) (455384, 3) Count 1352 Tasks 230 Chunks Type |S1 numpy.ndarray - variant_filter_pass_gamb_colu_arab(variants)booldask.array<chunksize=(246844,), meta=np.ndarray>
Array Chunk Bytes 40.56 MB 262.53 kB Shape (40561667,) (262530,) Count 604 Tasks 201 Chunks Type bool numpy.ndarray - variant_filter_pass_gamb_colu(variants)booldask.array<chunksize=(246844,), meta=np.ndarray>
Array Chunk Bytes 40.56 MB 262.53 kB Shape (40561667,) (262530,) Count 806 Tasks 201 Chunks Type bool numpy.ndarray - variant_filter_pass_arab(variants)booldask.array<chunksize=(246844,), meta=np.ndarray>
Array Chunk Bytes 40.56 MB 262.53 kB Shape (40561667,) (262530,) Count 806 Tasks 201 Chunks Type bool numpy.ndarray - call_genotype(variants, samples, ploidy)int8dask.array<chunksize=(246844, 50, 2), meta=np.ndarray>
Array Chunk Bytes 249.94 GB 26.25 MB Shape (40561667, 3081, 2) (262530, 50, 2) Count 44852 Tasks 14874 Chunks Type int8 numpy.ndarray - call_GQ(variants, samples)int16dask.array<chunksize=(246844, 50), meta=np.ndarray>
Array Chunk Bytes 249.94 GB 26.25 MB Shape (40561667, 3081) (262530, 50) Count 44852 Tasks 14874 Chunks Type int16 numpy.ndarray - call_MQ(variants, samples)int16dask.array<chunksize=(246844, 50), meta=np.ndarray>
Array Chunk Bytes 249.94 GB 26.25 MB Shape (40561667, 3081) (262530, 50) Count 44852 Tasks 14874 Chunks Type int16 numpy.ndarray - call_AD(variants, samples, alleles)int16dask.array<chunksize=(246844, 50, 4), meta=np.ndarray>
Array Chunk Bytes 999.76 GB 105.01 MB Shape (40561667, 3081, 4) (262530, 50, 4) Count 44852 Tasks 14874 Chunks Type int16 numpy.ndarray - call_genotype_mask(variants, samples, ploidy)booldask.array<chunksize=(246844, 50, 2), meta=np.ndarray>
Array Chunk Bytes 249.94 GB 26.25 MB Shape (40561667, 3081, 2) (262530, 50, 2) Count 59726 Tasks 14874 Chunks Type bool numpy.ndarray
- contigs :
- ('2R', '2L', '3R', '3L', 'X')
Notice the size of the “variants” dimension is 40,561,667, which we calculated earlier as the number of sites passing the gamb_colu_arab
filter on chromosome arm 2R.
Similarly, here are the genotypes with the gamb_colu_arab
site filter applied.
gt_filtered = ds_snps_filtered["call_genotype"].data
gt_filtered
|
Locating SNPs¶
Locating a single SNP¶
For some analyses you may have a particular genomic site you are investigating, and want to know what variation occurs at that site in the Ag3 data resource. To locate that site, a convenient approach is to use an index.
For example, let’s locate data for the site 2L:2422652.
ds_snps = ag3.snp_calls(region="2L", sample_sets="3.0")
ds_snps = ds_snps.set_index(variants="variant_position", samples="sample_id")
ds_snps
<xarray.Dataset> Dimensions: (alleles: 4, ploidy: 2, samples: 3081, variants: 48525747) Coordinates: variant_contig (variants) uint8 dask.array<chunksize=(524288,), meta=np.ndarray> * variants (variants) int64 1 2 ... 49364325 * samples (samples) object 'AR0047-C' ... 'AD04... Dimensions without coordinates: alleles, ploidy Data variables: variant_allele (variants, alleles) |S1 dask.array<chunksize=(524288, 1), meta=np.ndarray> variant_filter_pass_gamb_colu_arab (variants) bool dask.array<chunksize=(300000,), meta=np.ndarray> variant_filter_pass_gamb_colu (variants) bool dask.array<chunksize=(300000,), meta=np.ndarray> variant_filter_pass_arab (variants) bool dask.array<chunksize=(300000,), meta=np.ndarray> call_genotype (variants, samples, ploidy) int8 dask.array<chunksize=(300000, 50, 2), meta=np.ndarray> call_GQ (variants, samples) int16 dask.array<chunksize=(300000, 50), meta=np.ndarray> call_MQ (variants, samples) int16 dask.array<chunksize=(300000, 50), meta=np.ndarray> call_AD (variants, samples, alleles) int16 dask.array<chunksize=(300000, 50, 4), meta=np.ndarray> call_genotype_mask (variants, samples, ploidy) bool dask.array<chunksize=(300000, 50, 2), meta=np.ndarray> Attributes: contigs: ('2R', '2L', '3R', '3L', 'X')
- alleles: 4
- ploidy: 2
- samples: 3081
- variants: 48525747
- variant_contig(variants)uint8dask.array<chunksize=(524288,), meta=np.ndarray>
Array Chunk Bytes 48.53 MB 524.29 kB Shape (48525747,) (524288,) Count 93 Tasks 93 Chunks Type uint8 numpy.ndarray - variants(variants)int641 2 3 ... 49364324 49364325
array([ 1, 2, 3, ..., 49364323, 49364324, 49364325])
- samples(samples)object'AR0047-C' ... 'AD0498-C'
array(['AR0047-C', 'AR0049-C', 'AR0051-C', ..., 'AD0496-C', 'AD0497-C', 'AD0498-C'], dtype=object)
- variant_allele(variants, alleles)|S1dask.array<chunksize=(524288, 1), meta=np.ndarray>
Array Chunk Bytes 194.10 MB 1.57 MB Shape (48525747, 4) (524288, 3) Count 467 Tasks 186 Chunks Type |S1 numpy.ndarray - variant_filter_pass_gamb_colu_arab(variants)booldask.array<chunksize=(300000,), meta=np.ndarray>
Array Chunk Bytes 48.53 MB 300.00 kB Shape (48525747,) (300000,) Count 163 Tasks 162 Chunks Type bool numpy.ndarray - variant_filter_pass_gamb_colu(variants)booldask.array<chunksize=(300000,), meta=np.ndarray>
Array Chunk Bytes 48.53 MB 300.00 kB Shape (48525747,) (300000,) Count 163 Tasks 162 Chunks Type bool numpy.ndarray - variant_filter_pass_arab(variants)booldask.array<chunksize=(300000,), meta=np.ndarray>
Array Chunk Bytes 48.53 MB 300.00 kB Shape (48525747,) (300000,) Count 163 Tasks 162 Chunks Type bool numpy.ndarray - call_genotype(variants, samples, ploidy)int8dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>
Array Chunk Bytes 299.02 GB 30.00 MB Shape (48525747, 3081, 2) (300000, 50, 2) Count 24004 Tasks 11988 Chunks Type int8 numpy.ndarray - call_GQ(variants, samples)int16dask.array<chunksize=(300000, 50), meta=np.ndarray>
Array Chunk Bytes 299.02 GB 30.00 MB Shape (48525747, 3081) (300000, 50) Count 24004 Tasks 11988 Chunks Type int16 numpy.ndarray - call_MQ(variants, samples)int16dask.array<chunksize=(300000, 50), meta=np.ndarray>
Array Chunk Bytes 299.02 GB 30.00 MB Shape (48525747, 3081) (300000, 50) Count 24004 Tasks 11988 Chunks Type int16 numpy.ndarray - call_AD(variants, samples, alleles)int16dask.array<chunksize=(300000, 50, 4), meta=np.ndarray>
Array Chunk Bytes 1.20 TB 120.00 MB Shape (48525747, 3081, 4) (300000, 50, 4) Count 24004 Tasks 11988 Chunks Type int16 numpy.ndarray - call_genotype_mask(variants, samples, ploidy)booldask.array<chunksize=(300000, 50, 2), meta=np.ndarray>
Array Chunk Bytes 299.02 GB 30.00 MB Shape (48525747, 3081, 2) (300000, 50, 2) Count 35992 Tasks 11988 Chunks Type bool numpy.ndarray
- contigs :
- ('2R', '2L', '3R', '3L', 'X')
# access the SNP data at that site
ds_site = ds_snps.sel(variants=2422652)
ds_site
<xarray.Dataset> Dimensions: (alleles: 4, ploidy: 2, samples: 3081) Coordinates: variant_contig uint64 dask.array<chunksize=(), meta=np.ndarray> variants int64 2422652 * samples (samples) object 'AR0047-C' ... 'AD04... Dimensions without coordinates: alleles, ploidy Data variables: variant_allele (alleles) |S1 dask.array<chunksize=(1,), meta=np.ndarray> variant_filter_pass_gamb_colu_arab int64 dask.array<chunksize=(), meta=np.ndarray> variant_filter_pass_gamb_colu int64 dask.array<chunksize=(), meta=np.ndarray> variant_filter_pass_arab int64 dask.array<chunksize=(), meta=np.ndarray> call_genotype (samples, ploidy) int8 dask.array<chunksize=(50, 2), meta=np.ndarray> call_GQ (samples) int16 dask.array<chunksize=(50,), meta=np.ndarray> call_MQ (samples) int16 dask.array<chunksize=(50,), meta=np.ndarray> call_AD (samples, alleles) int16 dask.array<chunksize=(50, 4), meta=np.ndarray> call_genotype_mask (samples, ploidy) bool dask.array<chunksize=(50, 2), meta=np.ndarray> Attributes: contigs: ('2R', '2L', '3R', '3L', 'X')
- alleles: 4
- ploidy: 2
- samples: 3081
- variant_contig()uint64dask.array<chunksize=(), meta=np.ndarray>
Array Chunk Bytes 8 B 8 B Shape () () Count 94 Tasks 1 Chunks Type uint64 numpy.ndarray - variants()int642422652
array(2422652)
- samples(samples)object'AR0047-C' ... 'AD0498-C'
array(['AR0047-C', 'AR0049-C', 'AR0051-C', ..., 'AD0496-C', 'AD0497-C', 'AD0498-C'], dtype=object)
- variant_allele(alleles)|S1dask.array<chunksize=(1,), meta=np.ndarray>
Array Chunk Bytes 4 B 3 B Shape (4,) (3,) Count 469 Tasks 2 Chunks Type |S1 numpy.ndarray - variant_filter_pass_gamb_colu_arab()int64dask.array<chunksize=(), meta=np.ndarray>
Array Chunk Bytes 8 B 8 B Shape () () Count 164 Tasks 1 Chunks Type int64 numpy.ndarray - variant_filter_pass_gamb_colu()int64dask.array<chunksize=(), meta=np.ndarray>
Array Chunk Bytes 8 B 8 B Shape () () Count 164 Tasks 1 Chunks Type int64 numpy.ndarray - variant_filter_pass_arab()int64dask.array<chunksize=(), meta=np.ndarray>
Array Chunk Bytes 8 B 8 B Shape () () Count 164 Tasks 1 Chunks Type int64 numpy.ndarray - call_genotype(samples, ploidy)int8dask.array<chunksize=(50, 2), meta=np.ndarray>
Array Chunk Bytes 6.16 kB 100 B Shape (3081, 2) (50, 2) Count 24078 Tasks 74 Chunks Type int8 numpy.ndarray - call_GQ(samples)int16dask.array<chunksize=(50,), meta=np.ndarray>
Array Chunk Bytes 6.16 kB 100 B Shape (3081,) (50,) Count 24078 Tasks 74 Chunks Type int16 numpy.ndarray - call_MQ(samples)int16dask.array<chunksize=(50,), meta=np.ndarray>
Array Chunk Bytes 6.16 kB 100 B Shape (3081,) (50,) Count 24078 Tasks 74 Chunks Type int16 numpy.ndarray - call_AD(samples, alleles)int16dask.array<chunksize=(50, 4), meta=np.ndarray>
Array Chunk Bytes 24.65 kB 400 B Shape (3081, 4) (50, 4) Count 24078 Tasks 74 Chunks Type int16 numpy.ndarray - call_genotype_mask(samples, ploidy)booldask.array<chunksize=(50, 2), meta=np.ndarray>
Array Chunk Bytes 6.16 kB 100 B Shape (3081, 2) (50, 2) Count 36066 Tasks 74 Chunks Type bool numpy.ndarray
- contigs :
- ('2R', '2L', '3R', '3L', 'X')
# access the alternate alleles
alleles_site = ds_site["variant_allele"].values
alleles_site
array([b'A', b'C', b'T', b'G'], dtype='|S1')
# access the genotypes
gt_site = ds_site["call_genotype"].values
gt_site
array([[2, 2],
[2, 2],
[2, 2],
...,
[0, 0],
[0, 2],
[0, 2]], dtype=int8)
From the output above we can see there are some samples with a (2, 2)
genotype, which correspond to the T/T genotype, and some samples with a (0, 0)
which corresponds to the A/A genotype.
Locating SNPs in a genome region¶
For other analyses you may want to extract data for a specific genome region, such as a gene. Again a good approach is to use an index. E.g., let’s locate data for the Vgsc gene, which spans the region 2L:2358158-2431617.
ds_region = ds_snps.sel(variants=slice(2358158, 2431617))
ds_region
<xarray.Dataset> Dimensions: (alleles: 4, ploidy: 2, samples: 3081, variants: 69667) Coordinates: variant_contig (variants) uint8 dask.array<chunksize=(14017,), meta=np.ndarray> * variants (variants) int64 2358158 ... 2431617 * samples (samples) object 'AR0047-C' ... 'AD04... Dimensions without coordinates: alleles, ploidy Data variables: variant_allele (variants, alleles) |S1 dask.array<chunksize=(14017, 1), meta=np.ndarray> variant_filter_pass_gamb_colu_arab (variants) bool dask.array<chunksize=(16865,), meta=np.ndarray> variant_filter_pass_gamb_colu (variants) bool dask.array<chunksize=(16865,), meta=np.ndarray> variant_filter_pass_arab (variants) bool dask.array<chunksize=(16865,), meta=np.ndarray> call_genotype (variants, samples, ploidy) int8 dask.array<chunksize=(16865, 50, 2), meta=np.ndarray> call_GQ (variants, samples) int16 dask.array<chunksize=(16865, 50), meta=np.ndarray> call_MQ (variants, samples) int16 dask.array<chunksize=(16865, 50), meta=np.ndarray> call_AD (variants, samples, alleles) int16 dask.array<chunksize=(16865, 50, 4), meta=np.ndarray> call_genotype_mask (variants, samples, ploidy) bool dask.array<chunksize=(16865, 50, 2), meta=np.ndarray> Attributes: contigs: ('2R', '2L', '3R', '3L', 'X')
- alleles: 4
- ploidy: 2
- samples: 3081
- variants: 69667
- variant_contig(variants)uint8dask.array<chunksize=(14017,), meta=np.ndarray>
Array Chunk Bytes 69.67 kB 55.65 kB Shape (69667,) (55650,) Count 95 Tasks 2 Chunks Type uint8 numpy.ndarray - variants(variants)int642358158 2358159 ... 2431616 2431617
array([2358158, 2358159, 2358160, ..., 2431615, 2431616, 2431617])
- samples(samples)object'AR0047-C' ... 'AD0498-C'
array(['AR0047-C', 'AR0049-C', 'AR0051-C', ..., 'AD0496-C', 'AD0497-C', 'AD0498-C'], dtype=object)
- variant_allele(variants, alleles)|S1dask.array<chunksize=(14017, 1), meta=np.ndarray>
Array Chunk Bytes 278.67 kB 166.95 kB Shape (69667, 4) (55650, 3) Count 471 Tasks 4 Chunks Type |S1 numpy.ndarray - variant_filter_pass_gamb_colu_arab(variants)booldask.array<chunksize=(16865,), meta=np.ndarray>
Array Chunk Bytes 69.67 kB 52.80 kB Shape (69667,) (52802,) Count 165 Tasks 2 Chunks Type bool numpy.ndarray - variant_filter_pass_gamb_colu(variants)booldask.array<chunksize=(16865,), meta=np.ndarray>
Array Chunk Bytes 69.67 kB 52.80 kB Shape (69667,) (52802,) Count 165 Tasks 2 Chunks Type bool numpy.ndarray - variant_filter_pass_arab(variants)booldask.array<chunksize=(16865,), meta=np.ndarray>
Array Chunk Bytes 69.67 kB 52.80 kB Shape (69667,) (52802,) Count 165 Tasks 2 Chunks Type bool numpy.ndarray - call_genotype(variants, samples, ploidy)int8dask.array<chunksize=(16865, 50, 2), meta=np.ndarray>
Array Chunk Bytes 429.29 MB 5.28 MB Shape (69667, 3081, 2) (52802, 50, 2) Count 24152 Tasks 148 Chunks Type int8 numpy.ndarray - call_GQ(variants, samples)int16dask.array<chunksize=(16865, 50), meta=np.ndarray>
Array Chunk Bytes 429.29 MB 5.28 MB Shape (69667, 3081) (52802, 50) Count 24152 Tasks 148 Chunks Type int16 numpy.ndarray - call_MQ(variants, samples)int16dask.array<chunksize=(16865, 50), meta=np.ndarray>
Array Chunk Bytes 429.29 MB 5.28 MB Shape (69667, 3081) (52802, 50) Count 24152 Tasks 148 Chunks Type int16 numpy.ndarray - call_AD(variants, samples, alleles)int16dask.array<chunksize=(16865, 50, 4), meta=np.ndarray>
Array Chunk Bytes 1.72 GB 21.12 MB Shape (69667, 3081, 4) (52802, 50, 4) Count 24152 Tasks 148 Chunks Type int16 numpy.ndarray - call_genotype_mask(variants, samples, ploidy)booldask.array<chunksize=(16865, 50, 2), meta=np.ndarray>
Array Chunk Bytes 429.29 MB 5.28 MB Shape (69667, 3081, 2) (52802, 50, 2) Count 36140 Tasks 148 Chunks Type bool numpy.ndarray
- contigs :
- ('2R', '2L', '3R', '3L', 'X')
# obtain SNP positions within selected region
pos_region = ds_region["variants"].values
pos_region
array([2358158, 2358159, 2358160, ..., 2431615, 2431616, 2431617])
# alleles within selected region
alleles_region = ds_region["variant_allele"].values
alleles_region
array([[b'A', b'C', b'T', b'G'],
[b'T', b'A', b'C', b'G'],
[b'G', b'A', b'C', b'T'],
...,
[b'T', b'A', b'C', b'G'],
[b'G', b'A', b'C', b'T'],
[b'A', b'C', b'T', b'G']], dtype='|S1')
# genotypes within selected region
gt_region = allel.GenotypeDaskArray(ds_region["call_genotype"].data)
gt_region
0 | 1 | 2 | 3 | 4 | ... | 3076 | 3077 | 3078 | 3079 | 3080 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
1 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
2 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
... | ... | |||||||||||
69664 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
69665 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | |
69666 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
# allele counts
ac_region = gt_region.count_alleles(max_allele=3).compute()
ac_region
0 | 1 | 2 | 3 | ||
---|---|---|---|---|---|
0 | 6162 | 0 | 0 | 0 | |
1 | 6162 | 0 | 0 | 0 | |
2 | 6161 | 0 | 1 | 0 | |
... | ... | ||||
69664 | 6162 | 0 | 0 | 0 | |
69665 | 6162 | 0 | 0 | 0 | |
69666 | 6162 | 0 | 0 | 0 |
# how many segregating SNPs within our region of interest?
ac_region.is_segregating().sum()
35754
Locating samples¶
In many cases you may want to access genotypes for an individual mosquito sample, or a group of samples. This usually requires linking the sample metadata to the genotypes. Let’s look at how to do that. First, load sample metadata.
df_samples = ag3.sample_metadata(sample_sets="3.0").set_index("sample_id")
df_samples
partner_sample_id | contributor | country | location | year | month | latitude | longitude | sex_call | sample_set | ... | aim_species | country_iso | admin1_name | admin1_iso | admin2_name | taxon | cohort_admin1_year | cohort_admin1_month | cohort_admin2_year | cohort_admin2_month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
sample_id | |||||||||||||||||||||
AR0047-C | LUA047 | Joao Pinto | Angola | Luanda | 2009 | 4 | -8.884 | 13.302 | F | AG1000G-AO | ... | coluzzii | AGO | Luanda | AO-LUA | Luanda | coluzzii | AO-LUA_colu_2009 | AO-LUA_colu_2009_04 | AO-LUA_Luanda_colu_2009 | AO-LUA_Luanda_colu_2009_04 |
AR0049-C | LUA049 | Joao Pinto | Angola | Luanda | 2009 | 4 | -8.884 | 13.302 | F | AG1000G-AO | ... | coluzzii | AGO | Luanda | AO-LUA | Luanda | coluzzii | AO-LUA_colu_2009 | AO-LUA_colu_2009_04 | AO-LUA_Luanda_colu_2009 | AO-LUA_Luanda_colu_2009_04 |
AR0051-C | LUA051 | Joao Pinto | Angola | Luanda | 2009 | 4 | -8.884 | 13.302 | F | AG1000G-AO | ... | coluzzii | AGO | Luanda | AO-LUA | Luanda | coluzzii | AO-LUA_colu_2009 | AO-LUA_colu_2009_04 | AO-LUA_Luanda_colu_2009 | AO-LUA_Luanda_colu_2009_04 |
AR0061-C | LUA061 | Joao Pinto | Angola | Luanda | 2009 | 4 | -8.884 | 13.302 | F | AG1000G-AO | ... | coluzzii | AGO | Luanda | AO-LUA | Luanda | coluzzii | AO-LUA_colu_2009 | AO-LUA_colu_2009_04 | AO-LUA_Luanda_colu_2009 | AO-LUA_Luanda_colu_2009_04 |
AR0078-C | LUA078 | Joao Pinto | Angola | Luanda | 2009 | 4 | -8.884 | 13.302 | F | AG1000G-AO | ... | coluzzii | AGO | Luanda | AO-LUA | Luanda | coluzzii | AO-LUA_colu_2009 | AO-LUA_colu_2009_04 | AO-LUA_Luanda_colu_2009 | AO-LUA_Luanda_colu_2009_04 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
AD0494-C | 80-2-o-16 | Martin Donnelly | Lab Cross | LSTM | -1 | -1 | 53.409 | -2.969 | F | AG1000G-X | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
AD0495-C | 80-2-o-17 | Martin Donnelly | Lab Cross | LSTM | -1 | -1 | 53.409 | -2.969 | M | AG1000G-X | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
AD0496-C | 80-2-o-18 | Martin Donnelly | Lab Cross | LSTM | -1 | -1 | 53.409 | -2.969 | M | AG1000G-X | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
AD0497-C | 80-2-o-19 | Martin Donnelly | Lab Cross | LSTM | -1 | -1 | 53.409 | -2.969 | F | AG1000G-X | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
AD0498-C | 80-2-o-20 | Martin Donnelly | Lab Cross | LSTM | -1 | -1 | 53.409 | -2.969 | M | AG1000G-X | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3081 rows × 25 columns
Now access SNP calls for whatever chromosome arm you are analysing, e.g., 2L.
ds_snps = ag3.snp_calls(region="2L", sample_sets="3.0")
ds_snps = ds_snps.set_index(variants="variant_position", samples="sample_id")
ds_snps
<xarray.Dataset> Dimensions: (alleles: 4, ploidy: 2, samples: 3081, variants: 48525747) Coordinates: variant_contig (variants) uint8 dask.array<chunksize=(524288,), meta=np.ndarray> * variants (variants) int64 1 2 ... 49364325 * samples (samples) object 'AR0047-C' ... 'AD04... Dimensions without coordinates: alleles, ploidy Data variables: variant_allele (variants, alleles) |S1 dask.array<chunksize=(524288, 1), meta=np.ndarray> variant_filter_pass_gamb_colu_arab (variants) bool dask.array<chunksize=(300000,), meta=np.ndarray> variant_filter_pass_gamb_colu (variants) bool dask.array<chunksize=(300000,), meta=np.ndarray> variant_filter_pass_arab (variants) bool dask.array<chunksize=(300000,), meta=np.ndarray> call_genotype (variants, samples, ploidy) int8 dask.array<chunksize=(300000, 50, 2), meta=np.ndarray> call_GQ (variants, samples) int16 dask.array<chunksize=(300000, 50), meta=np.ndarray> call_MQ (variants, samples) int16 dask.array<chunksize=(300000, 50), meta=np.ndarray> call_AD (variants, samples, alleles) int16 dask.array<chunksize=(300000, 50, 4), meta=np.ndarray> call_genotype_mask (variants, samples, ploidy) bool dask.array<chunksize=(300000, 50, 2), meta=np.ndarray> Attributes: contigs: ('2R', '2L', '3R', '3L', 'X')
- alleles: 4
- ploidy: 2
- samples: 3081
- variants: 48525747
- variant_contig(variants)uint8dask.array<chunksize=(524288,), meta=np.ndarray>
Array Chunk Bytes 48.53 MB 524.29 kB Shape (48525747,) (524288,) Count 93 Tasks 93 Chunks Type uint8 numpy.ndarray - variants(variants)int641 2 3 ... 49364324 49364325
array([ 1, 2, 3, ..., 49364323, 49364324, 49364325])
- samples(samples)object'AR0047-C' ... 'AD0498-C'
array(['AR0047-C', 'AR0049-C', 'AR0051-C', ..., 'AD0496-C', 'AD0497-C', 'AD0498-C'], dtype=object)
- variant_allele(variants, alleles)|S1dask.array<chunksize=(524288, 1), meta=np.ndarray>
Array Chunk Bytes 194.10 MB 1.57 MB Shape (48525747, 4) (524288, 3) Count 467 Tasks 186 Chunks Type |S1 numpy.ndarray - variant_filter_pass_gamb_colu_arab(variants)booldask.array<chunksize=(300000,), meta=np.ndarray>
Array Chunk Bytes 48.53 MB 300.00 kB Shape (48525747,) (300000,) Count 163 Tasks 162 Chunks Type bool numpy.ndarray - variant_filter_pass_gamb_colu(variants)booldask.array<chunksize=(300000,), meta=np.ndarray>
Array Chunk Bytes 48.53 MB 300.00 kB Shape (48525747,) (300000,) Count 163 Tasks 162 Chunks Type bool numpy.ndarray - variant_filter_pass_arab(variants)booldask.array<chunksize=(300000,), meta=np.ndarray>
Array Chunk Bytes 48.53 MB 300.00 kB Shape (48525747,) (300000,) Count 163 Tasks 162 Chunks Type bool numpy.ndarray - call_genotype(variants, samples, ploidy)int8dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>
Array Chunk Bytes 299.02 GB 30.00 MB Shape (48525747, 3081, 2) (300000, 50, 2) Count 24004 Tasks 11988 Chunks Type int8 numpy.ndarray - call_GQ(variants, samples)int16dask.array<chunksize=(300000, 50), meta=np.ndarray>
Array Chunk Bytes 299.02 GB 30.00 MB Shape (48525747, 3081) (300000, 50) Count 24004 Tasks 11988 Chunks Type int16 numpy.ndarray - call_MQ(variants, samples)int16dask.array<chunksize=(300000, 50), meta=np.ndarray>
Array Chunk Bytes 299.02 GB 30.00 MB Shape (48525747, 3081) (300000, 50) Count 24004 Tasks 11988 Chunks Type int16 numpy.ndarray - call_AD(variants, samples, alleles)int16dask.array<chunksize=(300000, 50, 4), meta=np.ndarray>
Array Chunk Bytes 1.20 TB 120.00 MB Shape (48525747, 3081, 4) (300000, 50, 4) Count 24004 Tasks 11988 Chunks Type int16 numpy.ndarray - call_genotype_mask(variants, samples, ploidy)booldask.array<chunksize=(300000, 50, 2), meta=np.ndarray>
Array Chunk Bytes 299.02 GB 30.00 MB Shape (48525747, 3081, 2) (300000, 50, 2) Count 35992 Tasks 11988 Chunks Type bool numpy.ndarray
- contigs :
- ('2R', '2L', '3R', '3L', 'X')
The key point to know here is that the number of rows in the sample metadata dataframe and the size of the “samples” dimension in the SNP calls dataset are the same.
len(df_samples) == ds_snps.dims["samples"]
True
Also, the order of samples in the sample metadata is the same as the order of the columns in the genotype array. We can use this correspondance to look up samples and access their genotypes.
Locating a single sample¶
Let’s extract both sample metadata and genotypes for a single sample, e.g., sample AB0110-C.
s = df_samples.loc['AB0110-C']
s
partner_sample_id BF9-2
contributor Austin Burt
country Burkina Faso
location Bana
year 2012
month 7
latitude 11.233
longitude -4.472
sex_call F
sample_set AG1000G-BF-A
release 3.0
aim_species_fraction_colu 0.974
aim_species_fraction_arab 0.002
aim_species_gambcolu_arabiensis gamb_colu
aim_species_gambiae_coluzzii coluzzii
aim_species coluzzii
country_iso BFA
admin1_name Hauts-Bassins
admin1_iso BF-09
admin2_name Houet
taxon coluzzii
cohort_admin1_year BF-09_colu_2012
cohort_admin1_month BF-09_colu_2012_07
cohort_admin2_year BF-09_Houet_colu_2012
cohort_admin2_month BF-09_Houet_colu_2012_07
Name: AB0110-C, dtype: object
ds_sample = ds_snps.sel(samples='AB0110-C')
ds_sample
<xarray.Dataset> Dimensions: (alleles: 4, ploidy: 2, variants: 48525747) Coordinates: variant_contig (variants) uint8 dask.array<chunksize=(524288,), meta=np.ndarray> * variants (variants) int64 1 2 ... 49364325 samples <U8 'AB0110-C' Dimensions without coordinates: alleles, ploidy Data variables: variant_allele (variants, alleles) |S1 dask.array<chunksize=(524288, 1), meta=np.ndarray> variant_filter_pass_gamb_colu_arab (variants) bool dask.array<chunksize=(300000,), meta=np.ndarray> variant_filter_pass_gamb_colu (variants) bool dask.array<chunksize=(300000,), meta=np.ndarray> variant_filter_pass_arab (variants) bool dask.array<chunksize=(300000,), meta=np.ndarray> call_genotype (variants, ploidy) int8 dask.array<chunksize=(300000, 2), meta=np.ndarray> call_GQ (variants) int16 dask.array<chunksize=(300000,), meta=np.ndarray> call_MQ (variants) int16 dask.array<chunksize=(300000,), meta=np.ndarray> call_AD (variants, alleles) int16 dask.array<chunksize=(300000, 4), meta=np.ndarray> call_genotype_mask (variants, ploidy) bool dask.array<chunksize=(300000, 2), meta=np.ndarray> Attributes: contigs: ('2R', '2L', '3R', '3L', 'X')
- alleles: 4
- ploidy: 2
- variants: 48525747
- variant_contig(variants)uint8dask.array<chunksize=(524288,), meta=np.ndarray>
Array Chunk Bytes 48.53 MB 524.29 kB Shape (48525747,) (524288,) Count 93 Tasks 93 Chunks Type uint8 numpy.ndarray - variants(variants)int641 2 3 ... 49364324 49364325
array([ 1, 2, 3, ..., 49364323, 49364324, 49364325])
- samples()<U8'AB0110-C'
array('AB0110-C', dtype='<U8')
- variant_allele(variants, alleles)|S1dask.array<chunksize=(524288, 1), meta=np.ndarray>
Array Chunk Bytes 194.10 MB 1.57 MB Shape (48525747, 4) (524288, 3) Count 467 Tasks 186 Chunks Type |S1 numpy.ndarray - variant_filter_pass_gamb_colu_arab(variants)booldask.array<chunksize=(300000,), meta=np.ndarray>
Array Chunk Bytes 48.53 MB 300.00 kB Shape (48525747,) (300000,) Count 163 Tasks 162 Chunks Type bool numpy.ndarray - variant_filter_pass_gamb_colu(variants)booldask.array<chunksize=(300000,), meta=np.ndarray>
Array Chunk Bytes 48.53 MB 300.00 kB Shape (48525747,) (300000,) Count 163 Tasks 162 Chunks Type bool numpy.ndarray - variant_filter_pass_arab(variants)booldask.array<chunksize=(300000,), meta=np.ndarray>
Array Chunk Bytes 48.53 MB 300.00 kB Shape (48525747,) (300000,) Count 163 Tasks 162 Chunks Type bool numpy.ndarray - call_genotype(variants, ploidy)int8dask.array<chunksize=(300000, 2), meta=np.ndarray>
Array Chunk Bytes 97.05 MB 600.00 kB Shape (48525747, 2) (300000, 2) Count 24166 Tasks 162 Chunks Type int8 numpy.ndarray - call_GQ(variants)int16dask.array<chunksize=(300000,), meta=np.ndarray>
Array Chunk Bytes 97.05 MB 600.00 kB Shape (48525747,) (300000,) Count 24166 Tasks 162 Chunks Type int16 numpy.ndarray - call_MQ(variants)int16dask.array<chunksize=(300000,), meta=np.ndarray>
Array Chunk Bytes 97.05 MB 600.00 kB Shape (48525747,) (300000,) Count 24166 Tasks 162 Chunks Type int16 numpy.ndarray - call_AD(variants, alleles)int16dask.array<chunksize=(300000, 4), meta=np.ndarray>
Array Chunk Bytes 388.21 MB 2.40 MB Shape (48525747, 4) (300000, 4) Count 24166 Tasks 162 Chunks Type int16 numpy.ndarray - call_genotype_mask(variants, ploidy)booldask.array<chunksize=(300000, 2), meta=np.ndarray>