Ag3 cloud data access

This page provides information about how to access data from Anopheles gambiae 1000 Genomes project (Ag1000G) phase 3 via Google Cloud. This includes sample metadata and single nucleotide polymorphism (SNP) calls.

This notebook illustrates how to read data directly from the cloud, without having to first download any data locally. This notebook can be run from any computer, but will work best when run from a compute node within Google Cloud, because it will be physically closer to the data and so data transfer is faster. For example, this notebook can be run via MyBinder or Google Colab which are free interactive computing service running in the cloud.

To launch this notebook in the cloud and run it for yourself, click the launch icon () at the top of the page and select one of the cloud computing services available.

Data hosting

All data required for this notebook is hosted on Google Cloud Storage (GCS). Data are hosted in the vo_agam_release bucket, which is a multi-region bucket located in the United States. All data hosted in GCS are publicly accessible and do not require any authentication to access.

Setup

Running this notebook requires some Python packages to be installed. These packages can be installed via pip or conda. E.g.:

!pip install -q malariagen_data

To make accessing these data more convenient, we’ve created the malariagen_data Python package, which is available from PyPI. This is experimental so please let us know if you find any bugs or have any suggestions.

Now import the packages we’ll need to use here.

import numpy as np
import dask
import dask.array as da
from dask.diagnostics.progress import ProgressBar
# silence some warnings
dask.config.set(**{'array.slicing.split_large_chunks': False})
import allel
import malariagen_data

Ag3 data access from Google Cloud is set up with the following code:

ag3 = malariagen_data.Ag3("gs://vo_agam_release/")

Sample sets

Data in this release are organised into 28 sample sets. Each of these sample sets corresponds to a set of mosquito specimens contributed by a collaborating study. Depending on your objectives, you may want to access data from only specific sample sets, or all sample sets.

To see which sample sets are available, load the sample set manifest into a pandas dataframe:

df_sample_sets = ag3.sample_sets()
df_sample_sets
sample_set sample_count release
0 AG1000G-AO 81 v3
1 AG1000G-BF-A 181 v3
2 AG1000G-BF-B 102 v3
3 AG1000G-BF-C 13 v3
4 AG1000G-CD 76 v3
5 AG1000G-CF 73 v3
6 AG1000G-CI 80 v3
7 AG1000G-CM-A 303 v3
8 AG1000G-CM-B 97 v3
9 AG1000G-CM-C 44 v3
10 AG1000G-FR 23 v3
11 AG1000G-GA-A 69 v3
12 AG1000G-GH 100 v3
13 AG1000G-GM-A 74 v3
14 AG1000G-GM-B 31 v3
15 AG1000G-GM-C 174 v3
16 AG1000G-GN-A 45 v3
17 AG1000G-GN-B 185 v3
18 AG1000G-GQ 10 v3
19 AG1000G-GW 101 v3
20 AG1000G-KE 86 v3
21 AG1000G-ML-A 60 v3
22 AG1000G-ML-B 71 v3
23 AG1000G-MW 41 v3
24 AG1000G-MZ 74 v3
25 AG1000G-TZ 300 v3
26 AG1000G-UG 290 v3
27 AG1000G-X 297 v3

For more information about these sample sets, see the section on sample sets in the introduction to Ag1000G phase 3.

Sample metadata

Data about the samples that were sequenced to generate this data resource are available, including the time and place of collection, the gender of the specimen, and our call regarding the species of the specimen. These are organised by sample set.

E.g., load sample metadata for the AG1000G-BF-A and AG1000G-BF-B sample sets into a pandas dataframe:

df_samples = ag3.sample_metadata(sample_sets=["AG1000G-BF-A", "AG1000G-BF-B"])
df_samples
sample_id partner_sample_id contributor country location year month latitude longitude sex_call sample_set release aim_fraction_colu aim_fraction_arab species_gambcolu_arabiensis species_gambiae_coluzzii species
0 AB0085-Cx BF2-4 Austin Burt Burkina Faso Pala 2012 7 11.150 -4.235 F AG1000G-BF-A v3 0.024 0.002 gamb_colu gambiae gambiae
1 AB0086-Cx BF2-6 Austin Burt Burkina Faso Pala 2012 7 11.150 -4.235 F AG1000G-BF-A v3 0.038 0.002 gamb_colu gambiae gambiae
2 AB0087-C BF3-3 Austin Burt Burkina Faso Bana 2012 7 11.233 -4.472 F AG1000G-BF-A v3 0.982 0.002 gamb_colu coluzzii coluzzii
3 AB0088-C BF3-5 Austin Burt Burkina Faso Bana 2012 7 11.233 -4.472 F AG1000G-BF-A v3 0.990 0.002 gamb_colu coluzzii coluzzii
4 AB0089-Cx BF3-8 Austin Burt Burkina Faso Bana 2012 7 11.233 -4.472 F AG1000G-BF-A v3 0.975 0.002 gamb_colu coluzzii coluzzii
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
278 AB0533-C BF13-18 Austin Burt Burkina Faso Souroukoudinga 2014 7 11.235 -4.535 F AG1000G-BF-B v3 0.021 0.002 gamb_colu gambiae gambiae
279 AB0536-C BF13-31 Austin Burt Burkina Faso Souroukoudinga 2014 7 11.235 -4.535 F AG1000G-BF-B v3 0.025 0.002 gamb_colu gambiae gambiae
280 AB0537-C BF13-32 Austin Burt Burkina Faso Souroukoudinga 2014 7 11.235 -4.535 F AG1000G-BF-B v3 0.029 0.002 gamb_colu gambiae gambiae
281 AB0538-C BF13-33 Austin Burt Burkina Faso Souroukoudinga 2014 7 11.235 -4.535 F AG1000G-BF-B v3 0.018 0.002 gamb_colu gambiae gambiae
282 AB0408-C BF14-20 Austin Burt Burkina Faso Bana 2014 7 11.233 -4.472 F AG1000G-BF-B v3 0.986 0.002 gamb_colu coluzzii coluzzii

283 rows × 17 columns

The sample_id column gives the sample identifier used throughout all Ag1000G analyses.

The country, location, latitude and longitude columns give the location where the specimen was collected.

The year and month columns give the approximate date when the specimen was collected.

The sex_call column gives the gender as determined from the sequence data.

To load metadata for all wild-caught samples, you can use the shortcut “v3_wild”, e.g.:

df_samples = ag3.sample_metadata(sample_sets="v3_wild")
df_samples
sample_id partner_sample_id contributor country location year month latitude longitude sex_call sample_set release aim_fraction_colu aim_fraction_arab species_gambcolu_arabiensis species_gambiae_coluzzii species
0 AR0047-C LUA047 Joao Pinto Angola Luanda 2009 4 -8.884 13.302 F AG1000G-AO v3 0.945 0.001 gamb_colu coluzzii coluzzii
1 AR0049-C LUA049 Joao Pinto Angola Luanda 2009 4 -8.884 13.302 F AG1000G-AO v3 0.933 0.001 gamb_colu coluzzii coluzzii
2 AR0051-C LUA051 Joao Pinto Angola Luanda 2009 4 -8.884 13.302 F AG1000G-AO v3 0.937 0.002 gamb_colu coluzzii coluzzii
3 AR0061-C LUA061 Joao Pinto Angola Luanda 2009 4 -8.884 13.302 F AG1000G-AO v3 0.938 0.002 gamb_colu coluzzii coluzzii
4 AR0078-C LUA078 Joao Pinto Angola Luanda 2009 4 -8.884 13.302 F AG1000G-AO v3 0.926 0.001 gamb_colu coluzzii coluzzii
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2779 AC0295-C K92 Martin Donnelly Uganda Kihihi 2012 11 -0.751 29.701 F AG1000G-UG v3 0.026 0.002 gamb_colu gambiae gambiae
2780 AC0296-C K93 Martin Donnelly Uganda Kihihi 2012 11 -0.751 29.701 F AG1000G-UG v3 0.029 0.003 gamb_colu gambiae gambiae
2781 AC0297-C K94 Martin Donnelly Uganda Kihihi 2012 11 -0.751 29.701 F AG1000G-UG v3 0.026 0.002 gamb_colu gambiae gambiae
2782 AC0298-C K95 Martin Donnelly Uganda Kihihi 2012 11 -0.751 29.701 F AG1000G-UG v3 0.029 0.002 gamb_colu gambiae gambiae
2783 AC0299-C K96 Martin Donnelly Uganda Kihihi 2012 11 -0.751 29.701 F AG1000G-UG v3 0.029 0.002 gamb_colu gambiae gambiae

2784 rows × 17 columns

Pandas can be used to explore and query the sample metadata in various ways. E.g., here is a summary of the numbers of samples by species:

df_samples.groupby("species").size()
species
arabiensis                          368
coluzzii                            675
gambiae                            1571
intermediate_arabiensis_gambiae       1
intermediate_gambiae_coluzzii       169
dtype: int64

Note that samples within a sample set may belong to different species. For convenience, we have made a species call for all samples in Ag1000G phase 3, using the genomic data. Calling species is not always straightforward, and we have used two different methods for species calling, ancestry informative markers (AIM) and principal components analysis (PCA). When loading the sample metadata, the AIM species calls are included by default. The results of these two different methods generally agree, although there are some populations where results are different, particularly in Guinea-Bissau, The Gambia, Kenya and Tanzania. If you have any questions about how to interpret these species calls, please get in touch.

SNP sites and alleles

We have called SNP genotypes in all samples at all positions in the genome where the reference allele is not “N”. Data on this set of genomic positions and alleles for a given chromosome arm (e.g., 3R) can be accessed as dask arrays as follows:

pos, ref, alt = ag3.snp_sites(contig="3R")
pos
Array Chunk
Bytes 208.91 MB 134.22 MB
Shape (52226568,) (33554432,)
Count 2 Tasks 2 Chunks
Type int32 numpy.ndarray
52226568 1
ref
Array Chunk
Bytes 52.23 MB 52.23 MB
Shape (52226568,) (52226568,)
Count 1 Tasks 1 Chunks
Type |S1 numpy.ndarray
52226568 1
alt
Array Chunk
Bytes 156.68 MB 133.69 MB
Shape (52226568, 3) (44564480, 3)
Count 2 Tasks 2 Chunks
Type |S1 numpy.ndarray
3 52226568

Data can be loaded into memory as numpy arrays as shown in the following examples.

# read first 10 SNP positions into a numpy array
p = pos[:10].compute()
p
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=int32)
# read first 10 SNP reference alleles
r = ref[:10].compute()
r
array([b'C', b'C', b'T', b'C', b'T', b'A', b'C', b'G', b'T', b'T'],
      dtype='|S1')
# read first 10 SNP alternate alleles
a = alt[:10].compute()
a
array([[b'A', b'T', b'G'],
       [b'A', b'T', b'G'],
       [b'A', b'C', b'G'],
       [b'A', b'T', b'G'],
       [b'A', b'C', b'G'],
       [b'C', b'T', b'G'],
       [b'A', b'T', b'G'],
       [b'A', b'C', b'T'],
       [b'A', b'C', b'G'],
       [b'A', b'C', b'G']], dtype='|S1')

Note that we have chosen to genotype all samples at all sites in the genome, assuming all possible SNP alleles. Not all of these alternate alleles will actually have been observed in the Ag3 samples. To determine which sites and alleles are segregating, an allele count can be performed over the samples you are interested in. See the example below.

Site filters

SNP calling is not always reliable, and we have created some site filters to allow excluding low quality SNPs.

Because different species may have different genome accessibility issues, we have created three separate site filters:

  • The “gamb_colu” filter is design for working only with An. gambiae and/or An. coluzzii samples.

  • The “arab” filter is designed for working with An. arabiensis samples.

  • The “gamb_colu_arab” filter is suitable for when analysing samples of any species together.

Each set of site filters provides a “filter_pass” Boolean mask for each chromosome arm, where True indicates that the site passed the filter and is accessible to high quality SNP calling.

The site filters data can be accessed as dask arrays as shown in the examples below.

# access gamb_colu_arab site filters for chromosome arm 3R as a dask array
filter_pass = ag3.site_filters(contig="3R", mask="gamb_colu_arab")
filter_pass
Array Chunk
Bytes 52.23 MB 52.23 MB
Shape (52226568,) (52226568,)
Count 1 Tasks 1 Chunks
Type bool numpy.ndarray
52226568 1
# read filter values for first 10 SNPs (True means the site passes filters)
f = filter_pass[:10].compute()
f
array([False, False, False, False, False, False, False, False, False,
       False])

SNP genotypes

SNP genotypes for individual samples are available. Genotypes are stored as a three-dimensional array, where the first dimension corresponds to genomic positions, the second dimension is samples, and the third dimension is ploidy (2). Values coded as integers, where -1 represents a missing value, 0 represents the reference allele, and 1, 2, and 3 represent alternate alleles.

SNP genotypes can be accessed as dask arrays as shown below.

gt = ag3.snp_genotypes(contig="3R", sample_sets="v3_wild")
gt
Array Chunk
Bytes 290.80 GB 122.40 MB
Shape (52226568, 2784, 2) (600000, 102, 2)
Count 16774 Tasks 5032 Chunks
Type int8 numpy.ndarray
2 2784 52226568

Note that the columns of this array (second dimension) match the rows in the sample metadata, if the same sample sets were loaded. I.e.:

df_samples = ag3.sample_metadata(sample_sets="v3_wild")
gt = ag3.snp_genotypes(contig="3R", sample_sets="v3_wild")
len(df_samples) == gt.shape[1]
True

You can use this correspondance to apply further subsetting operations to the genotypes by querying the sample metadata. E.g.:

loc_gambiae = df_samples.eval("species == 'gambiae'").values
print(f"found {np.count_nonzero(loc_gambiae)} gambiae samples")
gt_gambiae = da.compress(loc_gambiae, gt, axis=1)
gt_gambiae
found 1571 gambiae samples
Array Chunk
Bytes 164.10 GB 120.00 MB
Shape (52226568, 1571, 2) (600000, 100, 2)
Count 21262 Tasks 4488 Chunks
Type int8 numpy.ndarray
2 1571 52226568

Data can be read into memory as numpy arrays, e.g., read genotypes for the first 5 SNPs and the first 3 samples:

g = gt[:5, :3, :].compute()
g
array([[[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]]], dtype=int8)

If you want to work with the genotype calls, you may find it convenient to use scikit-allel. E.g., the code below sets up a genotype array.

# use the scikit-allel wrapper class for genotype calls
gt = ag3.snp_genotypes(contig="3R", sample_sets=["AG1000G-BF-A", "AG1000G-BF-B"])
gt = allel.GenotypeDaskArray(gt)
gt
<GenotypeDaskArray shape=(52226568, 283, 2) dtype=int8>
01234...278279280281282
00/00/00/00/00/0...0/00/00/00/00/0
10/00/00/00/00/0...0/00/00/00/00/0
20/00/00/00/00/0...0/00/00/00/00/0
......
52226565./../../../../....0/0./../../../.
52226566./../../../../...../../../../../.
52226567./../../../../...../../../../../.

SNP calls (xarray dataset)

If you are familiar with xarray, you may find it more convenient to access an xarray dataset which contains all of the data relating to SNP calls, including the SNP positions, alleles, site filters, and genotypes.

ds_snps = ag3.snp_calls(contig="3R", sample_sets="v3_wild")
ds_snps
<xarray.Dataset>
Dimensions:                             (alleles: 4, ploidy: 2, samples: 2784, variants: 52226568)
Coordinates:
    variant_position                    (variants) int32 dask.array<chunksize=(33554432,), meta=np.ndarray>
    variant_contig                      (variants) uint8 dask.array<chunksize=(33554432,), meta=np.ndarray>
    sample_id                           (samples) |S24 dask.array<chunksize=(81,), meta=np.ndarray>
Dimensions without coordinates: alleles, ploidy, samples, variants
Data variables:
    variant_allele                      (variants, alleles) |S1 dask.array<chunksize=(44564480, 1), meta=np.ndarray>
    variant_filter_pass_gamb_colu_arab  (variants) bool dask.array<chunksize=(52226568,), meta=np.ndarray>
    variant_filter_pass_gamb_colu       (variants) bool dask.array<chunksize=(52226568,), meta=np.ndarray>
    variant_filter_pass_arab            (variants) bool dask.array<chunksize=(52226568,), meta=np.ndarray>
    call_genotype                       (variants, samples, ploidy) int8 dask.array<chunksize=(600000, 81, 2), meta=np.ndarray>
    call_GQ                             (variants, samples) int16 dask.array<chunksize=(600000, 81), meta=np.ndarray>
    call_MQ                             (variants, samples) int16 dask.array<chunksize=(600000, 81), meta=np.ndarray>
    call_AD                             (variants, samples, alleles) int16 dask.array<chunksize=(300000, 50, 4), meta=np.ndarray>
    call_genotype_mask                  (variants, samples, ploidy) bool dask.array<chunksize=(600000, 81, 2), meta=np.ndarray>
Attributes:
    contigs:  ('2R', '2L', '3R', '3L', 'X')

The arrays within this dataset can be accessed as shown below.

pos = ds_snps['variant_position'].data
pos
Array Chunk
Bytes 208.91 MB 134.22 MB
Shape (52226568,) (33554432,)
Count 2 Tasks 2 Chunks
Type int32 numpy.ndarray
52226568 1
gt = allel.GenotypeDaskArray(ds_snps['call_genotype'].data)
gt
<GenotypeDaskArray shape=(52226568, 2784, 2) dtype=int8>
01234...27792780278127822783
00/00/00/00/00/0...0/00/00/00/00/0
10/00/00/00/00/0...0/00/00/00/00/0
20/00/00/00/00/0...0/00/00/00/00/0
......
52226565./../../../.0/0..../../.0/0./.0/0
52226566./../../../.0/0..../../.0/0./.0/0
52226567./../../../.0/0..../../.0/0./../.

Copy number variation (CNV) data

Data on copy number variation within the Ag3 cohort are available as three separate data types:

  • HMM – Genome-wide inferences of copy number state within each individual mosquito in 300 bp non-overlapping windows.

  • Coverage calls – Genome-wide copy number variant calls, derived from the HMM outputs by analysing contiguous regions of elevated copy number state then clustering of variants across individuals based on breakpoint proximity.

  • Discordant read calls – Copy number variant calls at selected insecticide resistance loci, based on analysis of read alignments at CNV breakpoints.

For more information on the methods used to generate these data, see the variant-calling methods page.

The malariagen_data Python package provides some convenience functions for accessing these data, illustrated below.

CNV HMM

Access HMM data via the cnv_hmm() method, which returns an xarray dataset. E.g.:

ds_hmm = ag3.cnv_hmm(contig="2R", sample_sets="AG1000G-TZ")
ds_hmm
<xarray.Dataset>
Dimensions:           (samples: 300, variants: 205151)
Coordinates:
    variant_position  (variants) int32 dask.array<chunksize=(205151,), meta=np.ndarray>
    variant_end       (variants) int32 dask.array<chunksize=(205151,), meta=np.ndarray>
    variant_contig    (variants) uint8 dask.array<chunksize=(205151,), meta=np.ndarray>
    sample_id         (samples) object dask.array<chunksize=(300,), meta=np.ndarray>
Dimensions without coordinates: samples, variants
Data variables:
    call_CN           (variants, samples) int8 dask.array<chunksize=(205151, 300), meta=np.ndarray>
    call_RawCov       (variants, samples) int32 dask.array<chunksize=(131072, 128), meta=np.ndarray>
    call_NormCov      (variants, samples) float32 dask.array<chunksize=(131072, 128), meta=np.ndarray>
Attributes:
    contigs:  ('2R', '2L', '3R', '3L', 'X')

Here “variants” are the 300 bp windows in which coverage was calculated and the HMM fitted. Window start positions are given by the variant_position array and ends are given by variant_end.

pos = ds_hmm['variant_position'].values
pos
array([       1,      301,      601, ..., 61544401, 61544701, 61545001],
      dtype=int32)
end = ds_hmm['variant_end'].values
end
array([     300,      600,      900, ..., 61544700, 61545000, 61545105],
      dtype=int32)

Copy number state is given by the call_CN array, where rows are windows and columns are individual samples.

On the autosomes (2R, 2L, 3R, 3L) normal diploid copy number is 2. Values greater than 2 mean amplification, less then 2 mean deletion.

On the X chromosome, normal copy number is 2 in females and 1 in males.

For all chromosomes, -1 means missing, i.e., the window was not included.

Rows are variants (windows), columns are individual samples.

cn = ds_hmm['call_CN'].values
cn
array([[ 5,  2,  2, ...,  6,  4,  2],
       [ 5,  2,  2, ...,  6,  4,  2],
       [ 5, 11,  8, ...,  6,  4,  2],
       ...,
       [-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1]], dtype=int8)

The gene_cnv() method can be used to summarise the HMM data and compute modal coverage by gene, for all genes in a chromosome arm. E.g.:

ds_gene_cnv = ag3.gene_cnv(contig="2R", sample_sets="AG1000G-TZ")
ds_gene_cnv
<xarray.Dataset>
Dimensions:        (genes: 3668, samples: 300)
Coordinates:
    gene_id        (genes) object 'AGAP001096' 'AGAP013094' ... 'AGAP004676'
    gene_start     (genes) int64 3426 13754 18496 ... 61303692 61372543 61478226
    gene_end       (genes) int64 9985 15149 19579 ... 61306248 61418536 61478802
    sample_id      (samples) object 'BL0046-C' 'BL0047-C' ... 'BL0250-CW'
Dimensions without coordinates: genes, samples
Data variables:
    gene_windows   (genes) int64 23 6 5 8 33 63 19 29 ... 168 546 57 80 10 154 3
    gene_name      (genes) object nan nan nan nan nan ... nan nan 'GPRMAC2' nan
    gene_strand    (genes) object '+' '+' '+' '+' '-' ... '+' '+' '-' '+' '+'
    CN_mode        (genes, samples) int8 2 2 2 2 2 2 2 2 2 ... 3 3 3 4 4 3 4 2 1
    CN_mode_count  (genes, samples) int64 23 23 23 23 23 23 23 ... 2 2 3 3 3 3 1

The gene_cnv_frequencies() method can be used to compute CNV frequencies by gene and cohort, e.g.:

cohorts = {
    'tz_gamb': "sample_set == 'AG1000G-TZ' and species == 'gambiae'",
    'tz_arab': "sample_set == 'AG1000G-TZ' and species == 'arabiensis'",
}
df_gene_cnv_freqs = ag3.gene_cnv_frequencies(contig='2R', cohorts=cohorts)
df_gene_cnv_freqs
contig start end strand Name description tz_gamb_amp tz_gamb_del tz_arab_amp tz_arab_del
ID
AGAP001096 2R 3426 9985 + NaN NaN 0.014706 0.058824 0.000000 0.004444
AGAP013094 2R 13754 15149 + NaN Elongation of very long chain fatty acids prot... 0.000000 0.044118 0.004444 0.004444
AGAP001097 2R 18496 19579 + NaN Elongation of very long chain fatty acids prot... 0.000000 0.044118 0.000000 0.004444
AGAP001098 2R 19797 21866 + NaN DAMOX [Source:VB Community Annotation] 0.000000 0.044118 0.000000 0.004444
AGAP001099 2R 21882 31394 - NaN Polybromo-1 [Source:VB Community Annotation] 0.000000 0.044118 0.004444 0.004444
... ... ... ... ... ... ... ... ... ... ...
AGAP004671 2R 61176240 61192895 + NaN nuclear pore complex protein Nup160 [Source:VB... 0.014706 0.029412 0.004444 0.000000
AGAP004672 2R 61230781 61254533 + NaN NaN 0.000000 0.044118 0.008889 0.004444
AGAP004674 2R 61303692 61306248 - NaN Phenoloxidase inhibitor protein [Source:VB Com... 0.000000 0.073529 0.000000 0.080000
AGAP004675 2R 61372543 61418536 + GPRMAC2 putative muscarinic acetylcholine receptor 2 [... 0.000000 0.044118 0.000000 0.004444
AGAP004676 2R 61478226 61478802 + NaN NaN 0.676471 0.058824 0.000000 1.000000

3668 rows × 10 columns

CNV coverage calls

Coverage calls can be accessed via the cnv_coverage_calls() method, which returns an xarray dataset. Note that an analysis parameter must be given, which can be either “arab” for An. arabiensis samples or “gamb_colu” for non-An. arabiensis. E.g.:

ds_cnv = ag3.cnv_coverage_calls(contig='2R', analysis='arab', sample_set='AG1000G-TZ')
ds_cnv
<xarray.Dataset>
Dimensions:              (samples: 221, variants: 9459)
Coordinates:
    variant_position     (variants) int32 dask.array<chunksize=(9459,), meta=np.ndarray>
    variant_end          (variants) int32 dask.array<chunksize=(9459,), meta=np.ndarray>
    variant_contig       (variants) uint8 dask.array<chunksize=(9459,), meta=np.ndarray>
    variant_id           (variants) object dask.array<chunksize=(9459,), meta=np.ndarray>
    sample_id            (samples) object dask.array<chunksize=(221,), meta=np.ndarray>
Dimensions without coordinates: samples, variants
Data variables:
    variant_CIPOS        (variants) int32 dask.array<chunksize=(9459,), meta=np.ndarray>
    variant_CIEND        (variants) int32 dask.array<chunksize=(9459,), meta=np.ndarray>
    variant_filter_pass  (variants) bool dask.array<chunksize=(9459,), meta=np.ndarray>
    call_genotype        (variants, samples) int8 dask.array<chunksize=(9459, 221), meta=np.ndarray>
Attributes:
    contigs:  ('2R', '2L', '3R', '3L', 'X')

Here “variants” are copy number variants, inferred from the HMM results. CNV start positions are given by the variant_position array and ends are given by variant_end.

pos = ds_cnv['variant_position'].values
pos
array([       1,        1,      601, ..., 61514401, 61515901, 61533301],
      dtype=int32)
end = ds_cnv['variant_end'].values
end
array([    1500,     2100,     2100, ..., 61517400, 61517700, 61535700],
      dtype=int32)

CNV genotypes are given by the call_genotype array, coded as 0 for absence and 1 for presence of the CNV allele. Rows are CNV alleles and columns are individual samples.

gt = ds_cnv['call_genotype'].values
gt
array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 1, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 1, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int8)

Note that not all samples will have been included in the coverage calls, as some are excluded due to high coverage variance. Use the sample_id array to access the samples included in a given dataset, e.g.:

samples = ds_cnv['sample_id'].values
samples[:5]
array(['BL0047-C', 'BL0048-C', 'BL0049-C', 'BL0050-C', 'BL0055-C'],
      dtype=object)

CNV discordant read calls

Access the discordant read calls via the cnv_discordant_read_calls() method. E.g.:

ds_cnv_dr = ag3.cnv_discordant_read_calls(contig="2R", sample_sets="AG1000G-TZ")
ds_cnv_dr
<xarray.Dataset>
Dimensions:                        (samples: 300, variants: 40)
Coordinates:
    variant_position               (variants) int32 dask.array<chunksize=(40,), meta=np.ndarray>
    variant_end                    (variants) int32 dask.array<chunksize=(40,), meta=np.ndarray>
    variant_id                     (variants) object dask.array<chunksize=(40,), meta=np.ndarray>
    variant_contig                 (variants) uint8 dask.array<chunksize=(40,), meta=np.ndarray>
    sample_id                      (samples) object dask.array<chunksize=(300,), meta=np.ndarray>
Dimensions without coordinates: samples, variants
Data variables:
    variant_Region                 (variants) object dask.array<chunksize=(40,), meta=np.ndarray>
    variant_StartBreakpointMethod  (variants) int32 dask.array<chunksize=(40,), meta=np.ndarray>
    variant_EndBreakpointMethod    (variants) int32 dask.array<chunksize=(40,), meta=np.ndarray>
    call_genotype                  (variants, samples) int8 dask.array<chunksize=(40, 300), meta=np.ndarray>
    sample_coverage_variance       (samples) float32 dask.array<chunksize=(300,), meta=np.ndarray>
    sample_is_high_variance        (samples) bool dask.array<chunksize=(300,), meta=np.ndarray>
Attributes:
    contigs:  ('2R', '2L', '3R', '3L', 'X')

CNV start positions are given by the variant_position array and ends are given by variant_end. Note that some starts and/or ends are not known, and missing values are coded as -1.

The identifier of the CNV alleles is given by the variant_id array. The region where CNV alleles were analysed is given by the variant_Region array.

Note that for each region there is a special variant named “Dup0” which simply records whether any of the genes in a given cluster have increased coverage based on the modal HMM copy number state. I.e., if a sample has a “Dup0” variant but no other variant in a given region, then it has some form of copy number variation but we couldn’t find any discordant reads to uniquely identify the CNV allele it carries.

CNV genotypes are given by the call_genotype array, coded as 0 for absence and 1 for presence of the CNV allele. Rows are CNV alleles and columns are individual samples.

vid = ds_cnv_dr['variant_id'].values
vid
array(['Ace1_Dup0', 'Cyp6aap_Dup0', 'Cyp6aap_Dup1', 'Cyp6aap_Dup23',
       'Ace1_Dup1', 'Ace1_Dup2', 'Ace1_Del1', 'Ace1_Del4', 'Ace1_Del3',
       'Ace1_Del2', 'Cyp6aap_Dup15', 'Cyp6aap_Dup13', 'Cyp6aap_Dup20',
       'Cyp6aap_Dup14', 'Cyp6aap_Dup12', 'Cyp6aap_Dup21', 'Cyp6aap_Dup19',
       'Cyp6aap_Dup8', 'Cyp6aap_Dup22', 'Cyp6aap_Dup17', 'Cyp6aap_Dup28',
       'Cyp6aap_Dup10', 'Cyp6aap_Dup7', 'Cyp6aap_Dup6', 'Cyp6aap_Dup4',
       'Cyp6aap_Dup30', 'Cyp6aap_Dup9', 'Cyp6aap_Dup3', 'Cyp6aap_Dup18',
       'Cyp6aap_Dup26', 'Cyp6aap_Dup1a', 'Cyp6aap_Dup1b', 'Cyp6aap_Dup25',
       'Cyp6aap_Dup5', 'Cyp6aap_Dup16', 'Cyp6aap_Dup24', 'Cyp6aap_Dup11',
       'Cyp6aap_Dup2', 'Cyp6aap_Dup29', 'Cyp6aap_Dup27'], dtype=object)
reg = ds_cnv_dr['variant_Region'].values
reg
array(['Ace1', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Ace1',
       'Ace1', 'Ace1', 'Ace1', 'Ace1', 'Ace1', 'Cyp6aa/Cyp6p',
       'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p',
       'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p',
       'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p',
       'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p',
       'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p',
       'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p',
       'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p',
       'Cyp6aa/Cyp6p'], dtype=object)
pos = ds_cnv_dr['variant_position'].values
pos
array([      -1,       -1,       -1,       -1,  3436926,  3447950,
        3501850,  3512200,  3535850,  3539300, 28465673, 28472728,
       28473600, 28473874, 28474576, 28475128, 28475490, 28475996,
       28477220, 28477540, 28477710, 28477889, 28478057, 28478272,
       28478925, 28478987, 28479181, 28479407, 28479548, 28480166,
       28480189, 28480193, 28480335, 28480372, 28480547, 28480585,
       28487546, 28493547, 28494017, 28496700], dtype=int32)
end = ds_cnv_dr['variant_end'].values
end
array([      -1,       -1,       -1, 28497740,  3639836,  3518900,
        3599050,  3616290,  3619000,  3573750, 28555300, 28522671,
       28795255, 28563596, 28520016, 28483473, 28556726, 28485005,
       28484338, 28485380, 28496953, 28491215, 28486036, 28484157,
       28483069, 28485033, 28491431, 28483372, 28494597, 28483253,
       28483475, 28483675, 28483384, 28484518, 28483236, 28483442,
       28518123, 28497279, 28496505, 28499200], dtype=int32)
gt = ds_cnv_dr['call_genotype'].values
gt
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int8)
samples = ds_cnv_dr['sample_id'].values
samples[:5]
array(['BL0046-C', 'BL0047-C', 'BL0048-C', 'BL0049-C', 'BL0050-C'],
      dtype=object)

Example computation

Here’s an example computation to count the number of segregating SNPs on chromosome arm 3R that also pass gamb_colu_arab site filters. This may take a minute or two, because it is scanning genotype calls at millions of SNPs in hundreds of samples.

# choose chromosome arm
contig = "3R"

# choose site filter mask
mask = "gamb_colu_arab"

# choose sample sets
sample_sets = ["AG1000G-BF-A", "AG1000G-BF-B"]

# locate pass sites
loc_pass = ag3.site_filters(contig=contig, mask=mask).compute()

# perform an allele count over genotypes
gt = ag3.snp_genotypes(contig=contig, sample_sets=sample_sets)
gt = allel.GenotypeDaskArray(gt)
ac = gt.count_alleles(max_allele=3)

# locate segregating sites
loc_seg = ac.is_segregating()

# count segregating and pass sites
n_pass_seg = da.count_nonzero(loc_pass & loc_seg)

# run the computation
with ProgressBar():
    n_pass_seg = n_pass_seg.compute()

n_pass_seg
[########################################] | 100% Completed |  1min 41.2s
11226723

For more in-depth worked examples of various analyses, see the Ag3 analysis examples page.

Running larger computations

Please note that free cloud computing services such as Google Colab and MyBinder provide only limited computing resources. Thus although these services are able to efficiently read Ag3 data stored on Google Cloud, you may find that you run out of memory, or computations take a long time running on a single core. If you would like any suggestions regarding how to set up more powerful compute resources in the cloud, please feel free to get in touch via the malariagen/vector-data GitHub discussion board.

Feedback and suggestions

If there are particular analyses you would like to run, or if you have other suggestions for useful documentation we could add to this site, we would love to know, please get in touch via the malariagen/vector-data GitHub discussion board.

API docs

Here are the docstrings for the functions in the malariagen_data package that we used above.

help(ag3.sample_sets)
Help on method sample_sets in module malariagen_data.ag3:

sample_sets(release='v3') method of malariagen_data.ag3.Ag3 instance
    Access the manifest of sample sets.
    
    Parameters
    ----------
    release : str
        Release identifier. Give "v3" to access the Ag1000G phase 3 data release.
    
    Returns
    -------
    df : pandas.DataFrame
help(ag3.sample_metadata)
Help on method sample_metadata in module malariagen_data.ag3:

sample_metadata(sample_sets='v3_wild', species_calls=('20200422', 'aim')) method of malariagen_data.ag3.Ag3 instance
    Access sample metadata for one or more sample sets.
    
    Parameters
    ----------
    sample_sets : str or list of str
        Can be a sample set identifier (e.g., "AG1000G-AO") or a list of sample set
        identifiers (e.g., ["AG1000G-BF-A", "AG1000G-BF-B"]) or a release identifier (e.g.,
        "v3") or a list of release identifiers.
    species_calls : (str, str), optional
        Include species calls in metadata.
    
    Returns
    -------
    df : pandas.DataFrame
help(ag3.snp_sites)
Help on method snp_sites in module malariagen_data.ag3:

snp_sites(contig, field=None, site_mask=None, site_filters='dt_20200416', inline_array=True, chunks='auto') method of malariagen_data.ag3.Ag3 instance
    Access SNP site data (positions and alleles).
    
    Parameters
    ----------
    contig : str
        Chromosome arm, e.g., "3R".
    field : {"POS", "REF", "ALT"}, optional
        Array to access. If not provided, all three arrays POS, REF, ALT will be returned as a
        tuple.
    site_mask : {"gamb_colu_arab", "gamb_colu", "arab"}
        Site filters mask to apply.
    site_filters : str
        Site filters analysis version.
    inline_array : bool, optional
        Passed through to dask.array.from_array().
    chunks : str, optional
        If 'auto' let dask decide chunk size. If 'native' use native zarr chunks.
        Also can be a target size, e.g., '200 MiB'.
    
    Returns
    -------
    d : dask.array.Array or tuple of dask.array.Array
help(ag3.site_filters)
Help on method site_filters in module malariagen_data.ag3:

site_filters(contig, mask, field='filter_pass', analysis='dt_20200416', inline_array=True, chunks='auto') method of malariagen_data.ag3.Ag3 instance
    Access SNP site filters.
    
    Parameters
    ----------
    contig : str
        Chromosome arm, e.g., "3R".
    mask : {"gamb_colu_arab", "gamb_colu", "arab"}
        Mask to use.
    field : str, optional
        Array to access.
    analysis : str, optional
        Site filters analysis version.
    inline_array : bool, optional
        Passed through to dask.from_array().
    chunks : str, optional
        If 'auto' let dask decide chunk size. If 'native' use native zarr chunks.
        Also can be a target size, e.g., '200 MiB'.
    
    Returns
    -------
    d : dask.array.Array
help(ag3.snp_genotypes)
Help on method snp_genotypes in module malariagen_data.ag3:

snp_genotypes(contig, sample_sets='v3_wild', field='GT', site_mask=None, site_filters='dt_20200416', inline_array=True, chunks='auto') method of malariagen_data.ag3.Ag3 instance
    Access SNP genotypes and associated data.
    
    Parameters
    ----------
    contig : str
        Chromosome arm, e.g., "3R".
    sample_sets : str or list of str
        Can be a sample set identifier (e.g., "AG1000G-AO") or a list of sample set
        identifiers (e.g., ["AG1000G-BF-A", "AG1000G-BF-B"]) or a release identifier (e.g.,
        "v3") or a list of release identifiers.
    field : {"GT", "GQ", "AD", "MQ"}
        Array to access.
    site_mask : {"gamb_colu_arab", "gamb_colu", "arab"}
        Site filters mask to apply.
    site_filters : str, optional
        Site filters analysis version.
    inline_array : bool, optional
        Passed through to dask.array.from_array().
    chunks : str, optional
        If 'auto' let dask decide chunk size. If 'native' use native zarr chunks.
        Also can be a target size, e.g., '200 MiB'.
    
    Returns
    -------
    d : dask.array.Array
help(ag3.snp_calls)
Help on method snp_calls in module malariagen_data.ag3:

snp_calls(contig, sample_sets='v3_wild', site_mask=None, site_filters='dt_20200416', inline_array=True, chunks='auto') method of malariagen_data.ag3.Ag3 instance
    Access SNP sites, site filters and genotype calls.
    
    Parameters
    ----------
    contig : str
        Chromosome arm, e.g., "3R".
    sample_sets : str or list of str
        Can be a sample set identifier (e.g., "AG1000G-AO") or a list of sample set
        identifiers (e.g., ["AG1000G-BF-A", "AG1000G-BF-B"]) or a release identifier (e.g.,
        "v3") or a list of release identifiers.
    site_mask : {"gamb_colu_arab", "gamb_colu", "arab"}
        Site filters mask to apply.
    site_filters : str
        Site filters analysis version.
    inline_array : bool, optional
        Passed through to dask.array.from_array().
    chunks : str, optional
        If 'auto' let dask decide chunk size. If 'native' use native zarr chunks.
        Also can be a target size, e.g., '200 MiB'.
    
    Returns
    -------
    ds : xarray.Dataset
help(ag3.cnv_hmm)
Help on method cnv_hmm in module malariagen_data.ag3:

cnv_hmm(contig, sample_sets='v3_wild', inline_array=True, chunks='auto') method of malariagen_data.ag3.Ag3 instance
    Access CNV HMM data.
    
    Parameters
    ----------
    contig : str
        Chromosome arm, e.g., "3R".
    sample_sets : str or list of str
        Can be a sample set identifier (e.g., "AG1000G-AO") or a list of sample set
        identifiers (e.g., ["AG1000G-BF-A", "AG1000G-BF-B"]) or a release identifier (e.g.,
        "v3") or a list of release identifiers.
    inline_array : bool, optional
        Passed through to dask.array.from_array().
    chunks : str, optional
        If 'auto' let dask decide chunk size. If 'native' use native zarr chunks.
        Also can be a target size, e.g., '200 MiB'.
    
    Returns
    -------
    ds : xarray.Dataset
help(ag3.gene_cnv)
Help on method gene_cnv in module malariagen_data.ag3:

gene_cnv(contig, sample_sets='v3_wild') method of malariagen_data.ag3.Ag3 instance
    Compute modal copy number by gene, from HMM data.
    
    Parameters
    ----------
    contig : str
        Chromosome arm, e.g., "3R".
    sample_sets : str or list of str
        Can be a sample set identifier (e.g., "AG1000G-AO") or a list of sample set
        identifiers (e.g., ["AG1000G-BF-A", "AG1000G-BF-B"]) or a release identifier (e.g.,
        "v3") or a list of release identifiers.
    
    Returns
    -------
    ds : xarray.Dataset
help(ag3.gene_cnv_frequencies)
Help on method gene_cnv_frequencies in module malariagen_data.ag3:

gene_cnv_frequencies(contig, cohorts=None, sample_sets='v3_wild') method of malariagen_data.ag3.Ag3 instance
    Compute modal copy number by gene, then compute the frequency of
    amplifications and deletions in one or more cohorts, from HMM data.
    
    Parameters
    ----------
    contig : str
        Chromosome arm, e.g., "3R".
    cohorts : dict
        Dictionary to map cohort IDs to sample queries, e.g.,
        {"bf_2012_col": "country == 'Burkina Faso' and year == 2012 and species == 'coluzzii'"}
    sample_sets : str or list of str
        Can be a sample set identifier (e.g., "AG1000G-AO") or a list of sample set
        identifiers (e.g., ["AG1000G-BF-A", "AG1000G-BF-B"]) or a release identifier (e.g.,
        "v3") or a list of release identifiers.
    
    Returns
    -------
    df : pandas.DataFrame
help(ag3.cnv_coverage_calls)
Help on method cnv_coverage_calls in module malariagen_data.ag3:

cnv_coverage_calls(contig, sample_set, analysis, inline_array=True, chunks='auto') method of malariagen_data.ag3.Ag3 instance
    Access CNV HMM data.
    
    Parameters
    ----------
    contig : str
        Chromosome arm, e.g., "3R".
    sample_set : str
        Sample set identifier.
    analysis : {'gamb_colu', 'arab', 'crosses'}
        Name of CNV analysis.
    inline_array : bool, optional
        Passed through to dask.array.from_array().
    chunks : str, optional
        If 'auto' let dask decide chunk size. If 'native' use native zarr chunks.
        Also can be a target size, e.g., '200 MiB'.
    
    Returns
    -------
    ds : xarray.Dataset
help(ag3.cnv_discordant_read_calls)
Help on method cnv_discordant_read_calls in module malariagen_data.ag3:

cnv_discordant_read_calls(contig, sample_sets='v3_wild', inline_array=True, chunks='auto') method of malariagen_data.ag3.Ag3 instance
    Access CNV discordant read calls data.
    
    Parameters
    ----------
    contig : str
        Chromosome arm, e.g., "3R".
    sample_sets : str or list of str
        Can be a sample set identifier (e.g., "AG1000G-AO") or a list of sample set
        identifiers (e.g., ["AG1000G-BF-A", "AG1000G-BF-B"]) or a release identifier (e.g.,
        "v3") or a list of release identifiers.
    inline_array : bool, optional
        Passed through to dask.array.from_array().
    chunks : str, optional
        If 'auto' let dask decide chunk size. If 'native' use native zarr chunks.
        Also can be a target size, e.g., '200 MiB'.
    
    Returns
    -------
    ds : xarray.Dataset