Ag3 cloud data access

Ag3 cloud data access#

This page provides information about how to access data from Anopheles gambiae 1000 Genomes project (Ag1000G) phase 3 via Google Cloud. This includes sample metadata and single nucleotide polymorphism (SNP) calls.

This notebook illustrates how to read data directly from the cloud, without having to first download any data locally. This notebook can be run from any computer, but will work best when run from a compute node within Google Cloud, because it will be physically closer to the data and so data transfer is faster. For example, this notebook can be run via Google Colab which are free interactive computing service running in the cloud.

To launch this notebook in the cloud and run it for yourself, click the launch icon () at the top of the page and select one of the cloud computing services available.

Data hosting#

All data required for this notebook is hosted on Google Cloud Storage (GCS). Data are hosted in the vo_agam_release_master_us_central1 bucket, which is a single-region bucket located in the United States. All data hosted in GCS are publicly accessible and do not require any authentication to access.

Setup#

Running this notebook requires some Python packages to be installed:

!pip install -q malariagen_data

To make accessing these data more convenient, we’ve created the malariagen_data Python package. This is experimental so please let us know if you find any bugs or have any suggestions. See the Ag3 API docs for documentation of all functions available from this package.

Import other packages we’ll need to use here.

import numpy as np
import dask
import dask.array as da
from dask.diagnostics.progress import ProgressBar
# silence some warnings
dask.config.set(**{'array.slicing.split_large_chunks': False})
import allel
import malariagen_data

Ag3 data access from Google Cloud is set up with the following code:

ag3 = malariagen_data.Ag3()
ag3

MalariaGEN Ag3 data resource API
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact data@malariagen.net. See also the Ag3 API docs.
Storage URL	gs://vo_agam_release/
Data releases available	3.0
Cohorts analysis	20211101
Species analysis	aim_20200422
Site filters analysis	dt_20200416
Software version	3.0.0

Sample sets#

Data are organised into different releases. As an example, data in Ag3.0 are organised into 28 sample sets. Each of these sample sets corresponds to a set of mosquito specimens contributed by a collaborating study. Depending on your objectives, you may want to access data from only specific sample sets, or all sample sets.

To see which sample sets are available, load the sample set manifest into a pandas dataframe:

df_sample_sets = ag3.sample_sets(release="3.0")
df_sample_sets

	sample_set	sample_count	release
0	AG1000G-AO	81	3.0
1	AG1000G-BF-A	181	3.0
2	AG1000G-BF-B	102	3.0
3	AG1000G-BF-C	13	3.0
4	AG1000G-CD	76	3.0
5	AG1000G-CF	73	3.0
6	AG1000G-CI	80	3.0
7	AG1000G-CM-A	303	3.0
8	AG1000G-CM-B	97	3.0
9	AG1000G-CM-C	44	3.0
10	AG1000G-FR	23	3.0
11	AG1000G-GA-A	69	3.0
12	AG1000G-GH	100	3.0
13	AG1000G-GM-A	74	3.0
14	AG1000G-GM-B	31	3.0
15	AG1000G-GM-C	174	3.0
16	AG1000G-GN-A	45	3.0
17	AG1000G-GN-B	185	3.0
18	AG1000G-GQ	10	3.0
19	AG1000G-GW	101	3.0
20	AG1000G-KE	86	3.0
21	AG1000G-ML-A	60	3.0
22	AG1000G-ML-B	71	3.0
23	AG1000G-MW	41	3.0
24	AG1000G-MZ	74	3.0
25	AG1000G-TZ	300	3.0
26	AG1000G-UG	290	3.0
27	AG1000G-X	297	3.0

For more information about these sample sets, see the section on sample sets in the introduction to Ag1000G phase 3.

Sample metadata#

Data about the samples that were sequenced to generate this data resource are available, including the time and place of collection, the gender of the specimen, and our call regarding the species of the specimen. These are organised by sample set.

E.g., load sample metadata for all samples in the Ag3.0 release into a pandas DataFrame:

df_samples = ag3.sample_metadata(sample_sets="3.0")
df_samples

	sample_id	partner_sample_id	contributor	country	location	year	month	latitude	longitude	sex_call	...	aim_species	country_iso	admin1_name	admin1_iso	admin2_name	taxon	cohort_admin1_year	cohort_admin1_month	cohort_admin2_year	cohort_admin2_month
0	AR0047-C	LUA047	Joao Pinto	Angola	Luanda	2009	4	-8.884	13.302	F	...	coluzzii	AGO	Luanda	AO-LUA	Luanda	coluzzii	AO-LUA_colu_2009	AO-LUA_colu_2009_04	AO-LUA_Luanda_colu_2009	AO-LUA_Luanda_colu_2009_04
1	AR0049-C	LUA049	Joao Pinto	Angola	Luanda	2009	4	-8.884	13.302	F	...	coluzzii	AGO	Luanda	AO-LUA	Luanda	coluzzii	AO-LUA_colu_2009	AO-LUA_colu_2009_04	AO-LUA_Luanda_colu_2009	AO-LUA_Luanda_colu_2009_04
2	AR0051-C	LUA051	Joao Pinto	Angola	Luanda	2009	4	-8.884	13.302	F	...	coluzzii	AGO	Luanda	AO-LUA	Luanda	coluzzii	AO-LUA_colu_2009	AO-LUA_colu_2009_04	AO-LUA_Luanda_colu_2009	AO-LUA_Luanda_colu_2009_04
3	AR0061-C	LUA061	Joao Pinto	Angola	Luanda	2009	4	-8.884	13.302	F	...	coluzzii	AGO	Luanda	AO-LUA	Luanda	coluzzii	AO-LUA_colu_2009	AO-LUA_colu_2009_04	AO-LUA_Luanda_colu_2009	AO-LUA_Luanda_colu_2009_04
4	AR0078-C	LUA078	Joao Pinto	Angola	Luanda	2009	4	-8.884	13.302	F	...	coluzzii	AGO	Luanda	AO-LUA	Luanda	coluzzii	AO-LUA_colu_2009	AO-LUA_colu_2009_04	AO-LUA_Luanda_colu_2009	AO-LUA_Luanda_colu_2009_04
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3076	AD0494-C	80-2-o-16	Martin Donnelly	Lab Cross	LSTM	-1	-1	53.409	-2.969	F	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3077	AD0495-C	80-2-o-17	Martin Donnelly	Lab Cross	LSTM	-1	-1	53.409	-2.969	M	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3078	AD0496-C	80-2-o-18	Martin Donnelly	Lab Cross	LSTM	-1	-1	53.409	-2.969	M	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3079	AD0497-C	80-2-o-19	Martin Donnelly	Lab Cross	LSTM	-1	-1	53.409	-2.969	F	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3080	AD0498-C	80-2-o-20	Martin Donnelly	Lab Cross	LSTM	-1	-1	53.409	-2.969	M	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

3081 rows × 26 columns

The sample_id column gives the sample identifier used throughout all Ag1000G analyses.

The country, location, latitude and longitude columns give the location where the specimen was collected.

The year and month columns give the approximate date when the specimen was collected.

The sex_call column gives the gender as determined from the sequence data.

All samples were derived from wild-caught mosquito specimens, with the exception of samples in AG1000G-X which were derived from lab crosses. For most analyses you may want to exclude the crosses and work with wild-caught samples only.

Pandas can be used to explore and query the sample metadata in various ways. E.g., here is a summary of the numbers of samples by species:

df_samples.groupby("taxon").size()

taxon
arabiensis                          368
coluzzii                            507
gambiae                            1470
gcx1                                170
gcx2                                199
gcx3                                 65
intermediate_arabiensis_gambiae       1
intermediate_gambiae_coluzzii         4
dtype: int64

Note that samples within a sample set may belong to different species. For convenience, we have made a species call for all samples in Ag1000G phase 3, using the genomic data. If you have any questions about how to interpret these species calls, please get in touch.

SNP calls#

Data on SNP calls, including the SNP positions, alleles, site filters, and genotypes, can be accessed as an xarray Dataset.

E.g., access SNP calls for chromosome arm 3R for all samples in Ag3.0.

ds_snps = ag3.snp_calls(region="3R", sample_sets="3.0")
ds_snps

<xarray.Dataset>
Dimensions:                             (alleles: 4, ploidy: 2, samples: 3081, variants: 52226568)
Coordinates:
    variant_position                    (variants) int32 dask.array<chunksize=(524288,), meta=np.ndarray>
    variant_contig                      (variants) uint8 dask.array<chunksize=(524288,), meta=np.ndarray>
    sample_id                           (samples) <U24 dask.array<chunksize=(81,), meta=np.ndarray>
Dimensions without coordinates: alleles, ploidy, samples, variants
Data variables:
    variant_allele                      (variants, alleles) |S1 dask.array<chunksize=(524288, 1), meta=np.ndarray>
    variant_filter_pass_gamb_colu_arab  (variants) bool dask.array<chunksize=(300000,), meta=np.ndarray>
    variant_filter_pass_gamb_colu       (variants) bool dask.array<chunksize=(300000,), meta=np.ndarray>
    variant_filter_pass_arab            (variants) bool dask.array<chunksize=(300000,), meta=np.ndarray>
    call_genotype                       (variants, samples, ploidy) int8 dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>
    call_GQ                             (variants, samples) int16 dask.array<chunksize=(300000, 50), meta=np.ndarray>
    call_MQ                             (variants, samples) int16 dask.array<chunksize=(300000, 50), meta=np.ndarray>
    call_AD                             (variants, samples, alleles) int16 dask.array<chunksize=(300000, 50, 4), meta=np.ndarray>
    call_genotype_mask                  (variants, samples, ploidy) bool dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>
Attributes:
    contigs:  ('2R', '2L', '3R', '3L', 'X')

xarray.Dataset

Dimensions:
- alleles: 4
- ploidy: 2
- samples: 3081
- variants: 52226568

Coordinates: (3)

variant_position

(variants)

int32

dask.array<chunksize=(524288,), meta=np.ndarray>

	Array	Chunk
Bytes	208.91 MB	2.10 MB
Shape	(52226568,)	(524288,)
Count	101 Tasks	100 Chunks
Type	int32	numpy.ndarray

variant_contig

(variants)

uint8

dask.array<chunksize=(524288,), meta=np.ndarray>

	Array	Chunk
Bytes	52.23 MB	524.29 kB
Shape	(52226568,)	(524288,)
Count	100 Tasks	100 Chunks
Type	uint8	numpy.ndarray

sample_id

(samples)

<U24

dask.array<chunksize=(81,), meta=np.ndarray>

	Array	Chunk
Bytes	295.78 kB	29.09 kB
Shape	(3081,)	(303,)
Count	112 Tasks	28 Chunks
Type		numpy.ndarray

Data variables: (9)

variant_allele

(variants, alleles)

|S1

dask.array<chunksize=(524288, 1), meta=np.ndarray>

	Array	Chunk
Bytes	208.91 MB	1.57 MB
Shape	(52226568, 4)	(524288, 3)
Count	502 Tasks	200 Chunks
Type	\|S1	numpy.ndarray

variant_filter_pass_gamb_colu_arab

(variants)

bool

dask.array<chunksize=(300000,), meta=np.ndarray>

	Array	Chunk
Bytes	52.23 MB	300.00 kB
Shape	(52226568,)	(300000,)
Count	176 Tasks	175 Chunks
Type	bool	numpy.ndarray

variant_filter_pass_gamb_colu

(variants)

bool

dask.array<chunksize=(300000,), meta=np.ndarray>

	Array	Chunk
Bytes	52.23 MB	300.00 kB
Shape	(52226568,)	(300000,)
Count	176 Tasks	175 Chunks
Type	bool	numpy.ndarray

variant_filter_pass_arab

(variants)

bool

dask.array<chunksize=(300000,), meta=np.ndarray>

	Array	Chunk
Bytes	52.23 MB	300.00 kB
Shape	(52226568,)	(300000,)
Count	176 Tasks	175 Chunks
Type	bool	numpy.ndarray

call_genotype

(variants, samples, ploidy)

int8

dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>

	Array	Chunk
Bytes	321.82 GB	30.00 MB
Shape	(52226568, 3081, 2)	(300000, 50, 2)
Count	25928 Tasks	12950 Chunks
Type	int8	numpy.ndarray

call_GQ

(variants, samples)

int16

dask.array<chunksize=(300000, 50), meta=np.ndarray>

	Array	Chunk
Bytes	321.82 GB	30.00 MB
Shape	(52226568, 3081)	(300000, 50)
Count	25928 Tasks	12950 Chunks
Type	int16	numpy.ndarray

call_MQ

(variants, samples)

int16

dask.array<chunksize=(300000, 50), meta=np.ndarray>

	Array	Chunk
Bytes	321.82 GB	30.00 MB
Shape	(52226568, 3081)	(300000, 50)
Count	25928 Tasks	12950 Chunks
Type	int16	numpy.ndarray

call_AD

(variants, samples, alleles)

int16

dask.array<chunksize=(300000, 50, 4), meta=np.ndarray>

	Array	Chunk
Bytes	1.29 TB	120.00 MB
Shape	(52226568, 3081, 4)	(300000, 50, 4)
Count	25928 Tasks	12950 Chunks
Type	int16	numpy.ndarray

call_genotype_mask

(variants, samples, ploidy)

bool

dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>

	Array	Chunk
Bytes	321.82 GB	30.00 MB
Shape	(52226568, 3081, 2)	(300000, 50, 2)
Count	38878 Tasks	12950 Chunks
Type	bool	numpy.ndarray

Attributes: (1)
contigs :
('2R', '2L', '3R', '3L', 'X')

The arrays within this dataset are backed by Dask arrays, and can be accessed as shown below.

SNP sites and alleles#

We have called SNP genotypes in all samples at all positions in the genome where the reference allele is not “N”. Data on this set of genomic positions and alleles for a given chromosome arm (e.g., 3R) can be accessed as Dask arrays as follows.

pos = ds_snps["variant_position"].data
pos

	Array	Chunk
Bytes	208.91 MB	2.10 MB
Shape	(52226568,)	(524288,)
Count	101 Tasks	100 Chunks
Type	int32	numpy.ndarray

alleles = ds_snps["variant_allele"].data
alleles

	Array	Chunk
Bytes	208.91 MB	1.57 MB
Shape	(52226568, 4)	(524288, 3)
Count	502 Tasks	200 Chunks
Type	\|S1	numpy.ndarray

Data can be loaded into memory as a NumPy array as shown in the following examples.

# read first 10 SNP positions into a numpy array
p = pos[:10].compute()
p

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=int32)

# read first 10 SNP alleles into a numpy array
a = alleles[:10].compute()
a

array([[b'C', b'A', b'T', b'G'],
       [b'C', b'A', b'T', b'G'],
       [b'T', b'A', b'C', b'G'],
       [b'C', b'A', b'T', b'G'],
       [b'T', b'A', b'C', b'G'],
       [b'A', b'C', b'T', b'G'],
       [b'C', b'A', b'T', b'G'],
       [b'G', b'A', b'C', b'T'],
       [b'T', b'A', b'C', b'G'],
       [b'T', b'A', b'C', b'G']], dtype='|S1')

Here the first column contains the reference alleles, and the remaining columns contain the alternate alleles.

Note that a byte string data type is used here for efficiency. E.g., the Python code b'T' represents a byte string containing the letter “T”, which here stands for the nucleotide thymine.

Note that we have chosen to genotype all samples at all sites in the genome, assuming all possible SNP alleles. Not all of these alternate alleles will actually have been observed in the Ag3 samples. To determine which sites and alleles are segregating, an allele count can be performed over the samples you are interested in. See the example below.

Site filters#

SNP calling is not always reliable, and we have created some site filters to allow excluding low quality SNPs.

Because different species may have different genome accessibility issues, we have created three separate site filters:

The “gamb_colu” filter is design for working only with An. gambiae and/or An. coluzzii samples.
The “arab” filter is designed for working with An. arabiensis samples.
The “gamb_colu_arab” filter is suitable for when analysing samples of any species together.

Each set of site filters provides a “filter_pass” Boolean mask for each chromosome arm, where True indicates that the site passed the filter and is accessible to high quality SNP calling.

The site filters data can be accessed as dask arrays as shown in the examples below.

# access gamb_colu_arab site filters for chromosome arm 3R as a dask array
filter_pass = ds_snps['variant_filter_pass_gamb_colu_arab'].data
filter_pass

	Array	Chunk
Bytes	52.23 MB	300.00 kB
Shape	(52226568,)	(300000,)
Count	176 Tasks	175 Chunks
Type	bool	numpy.ndarray

# read filter values for first 10 SNPs (True means the site passes filters)
f = filter_pass[:10].compute()
f

array([False, False, False, False, False, False, False, False, False,
       False])

SNP genotypes#

SNP genotypes for individual samples are available. Genotypes are stored as a three-dimensional array, where the first dimension corresponds to genomic positions, the second dimension is samples, and the third dimension is ploidy (2). Values coded as integers, where -1 represents a missing value, 0 represents the reference allele, and 1, 2, and 3 represent alternate alleles.

SNP genotypes can be accessed as dask arrays as shown below.

gt = ds_snps["call_genotype"].data
gt

	Array	Chunk
Bytes	321.82 GB	30.00 MB
Shape	(52226568, 3081, 2)	(300000, 50, 2)
Count	25928 Tasks	12950 Chunks
Type	int8	numpy.ndarray

Note that the columns of this array (second dimension) match the rows in the sample metadata, if the same sample sets were loaded. I.e.:

df_samples = ag3.sample_metadata(sample_sets="3.0")
gt = ds_snps["call_genotype"].data
len(df_samples) == gt.shape[1]

True

You can use this correspondance to apply further subsetting operations to the genotypes by querying the sample metadata. E.g.:

loc_gambiae = df_samples.eval("taxon == 'gambiae'").values
print(f"found {np.count_nonzero(loc_gambiae)} gambiae samples")

found 1470 gambiae samples

ds_snps_gambiae = ds_snps.isel(samples=loc_gambiae)
ds_snps_gambiae

<xarray.Dataset>
Dimensions:                             (alleles: 4, ploidy: 2, samples: 1470, variants: 52226568)
Coordinates:
    variant_position                    (variants) int32 dask.array<chunksize=(524288,), meta=np.ndarray>
    variant_contig                      (variants) uint8 dask.array<chunksize=(524288,), meta=np.ndarray>
    sample_id                           (samples) <U24 dask.array<chunksize=(98,), meta=np.ndarray>
Dimensions without coordinates: alleles, ploidy, samples, variants
Data variables:
    variant_allele                      (variants, alleles) |S1 dask.array<chunksize=(524288, 1), meta=np.ndarray>
    variant_filter_pass_gamb_colu_arab  (variants) bool dask.array<chunksize=(300000,), meta=np.ndarray>
    variant_filter_pass_gamb_colu       (variants) bool dask.array<chunksize=(300000,), meta=np.ndarray>
    variant_filter_pass_arab            (variants) bool dask.array<chunksize=(300000,), meta=np.ndarray>
    call_genotype                       (variants, samples, ploidy) int8 dask.array<chunksize=(300000, 19, 2), meta=np.ndarray>
    call_GQ                             (variants, samples) int16 dask.array<chunksize=(300000, 19), meta=np.ndarray>
    call_MQ                             (variants, samples) int16 dask.array<chunksize=(300000, 19), meta=np.ndarray>
    call_AD                             (variants, samples, alleles) int16 dask.array<chunksize=(300000, 19, 4), meta=np.ndarray>
    call_genotype_mask                  (variants, samples, ploidy) bool dask.array<chunksize=(300000, 19, 2), meta=np.ndarray>
Attributes:
    contigs:  ('2R', '2L', '3R', '3L', 'X')

xarray.Dataset

Dimensions:
- alleles: 4
- ploidy: 2
- samples: 1470
- variants: 52226568

Coordinates: (3)

variant_position

(variants)

int32

dask.array<chunksize=(524288,), meta=np.ndarray>

	Array	Chunk
Bytes	208.91 MB	2.10 MB
Shape	(52226568,)	(524288,)
Count	101 Tasks	100 Chunks
Type	int32	numpy.ndarray

variant_contig

(variants)

uint8

dask.array<chunksize=(524288,), meta=np.ndarray>

	Array	Chunk
Bytes	52.23 MB	524.29 kB
Shape	(52226568,)	(524288,)
Count	100 Tasks	100 Chunks
Type	uint8	numpy.ndarray

sample_id

(samples)

<U24

dask.array<chunksize=(98,), meta=np.ndarray>

	Array	Chunk
Bytes	141.12 kB	29.09 kB
Shape	(1470,)	(303,)
Count	134 Tasks	22 Chunks
Type		numpy.ndarray

Data variables: (9)

variant_allele

(variants, alleles)

|S1

dask.array<chunksize=(524288, 1), meta=np.ndarray>

	Array	Chunk
Bytes	208.91 MB	1.57 MB
Shape	(52226568, 4)	(524288, 3)
Count	502 Tasks	200 Chunks
Type	\|S1	numpy.ndarray

variant_filter_pass_gamb_colu_arab

(variants)

bool

dask.array<chunksize=(300000,), meta=np.ndarray>

	Array	Chunk
Bytes	52.23 MB	300.00 kB
Shape	(52226568,)	(300000,)
Count	176 Tasks	175 Chunks
Type	bool	numpy.ndarray

variant_filter_pass_gamb_colu

(variants)

bool

dask.array<chunksize=(300000,), meta=np.ndarray>

	Array	Chunk
Bytes	52.23 MB	300.00 kB
Shape	(52226568,)	(300000,)
Count	176 Tasks	175 Chunks
Type	bool	numpy.ndarray

variant_filter_pass_arab

(variants)

bool

dask.array<chunksize=(300000,), meta=np.ndarray>

	Array	Chunk
Bytes	52.23 MB	300.00 kB
Shape	(52226568,)	(300000,)
Count	176 Tasks	175 Chunks
Type	bool	numpy.ndarray

call_genotype

(variants, samples, ploidy)

int8

dask.array<chunksize=(300000, 19, 2), meta=np.ndarray>

	Array	Chunk
Bytes	153.55 GB	30.00 MB
Shape	(52226568, 1470, 2)	(300000, 50, 2)
Count	34853 Tasks	8925 Chunks
Type	int8	numpy.ndarray

call_GQ

(variants, samples)

int16

dask.array<chunksize=(300000, 19), meta=np.ndarray>

	Array	Chunk
Bytes	153.55 GB	30.00 MB
Shape	(52226568, 1470)	(300000, 50)
Count	34853 Tasks	8925 Chunks
Type	int16	numpy.ndarray

call_MQ

(variants, samples)

int16

dask.array<chunksize=(300000, 19), meta=np.ndarray>

	Array	Chunk
Bytes	153.55 GB	30.00 MB
Shape	(52226568, 1470)	(300000, 50)
Count	34853 Tasks	8925 Chunks
Type	int16	numpy.ndarray

call_AD

(variants, samples, alleles)

int16

dask.array<chunksize=(300000, 19, 4), meta=np.ndarray>

	Array	Chunk
Bytes	614.18 GB	120.00 MB
Shape	(52226568, 1470, 4)	(300000, 50, 4)
Count	34853 Tasks	8925 Chunks
Type	int16	numpy.ndarray

call_genotype_mask

(variants, samples, ploidy)

bool

dask.array<chunksize=(300000, 19, 2), meta=np.ndarray>

	Array	Chunk
Bytes	153.55 GB	30.00 MB
Shape	(52226568, 1470, 2)	(300000, 50, 2)
Count	47803 Tasks	8925 Chunks
Type	bool	numpy.ndarray

Attributes: (1)
contigs :
('2R', '2L', '3R', '3L', 'X')

Data can be read into memory as numpy arrays, e.g., read genotypes for the first 5 SNPs and the first 3 samples:

g = gt[:5, :3, :].compute()
g

array([[[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]]], dtype=int8)

If you want to work with the genotype calls, you may find it convenient to use scikit-allel. E.g., the code below sets up a genotype array.

# use the scikit-allel wrapper class for genotype calls
gt = allel.GenotypeDaskArray(ds_snps["call_genotype"].data)
gt

	0	1	2	3	4	...	3076	3077	3078	3079	3080
0	0/0	0/0	0/0	0/0	0/0	...	0/0	0/0	0/0	0/0	0/0
1	0/0	0/0	0/0	0/0	0/0	...	0/0	0/0	0/0	0/0	0/0
2	0/0	0/0	0/0	0/0	0/0	...	0/0	0/0	0/0	0/0	0/0
...	...
52226565	./.	./.	./.	./.	0/0	...	./.	./.	./.	./.	./.
52226566	./.	./.	./.	./.	0/0	...	./.	./.	./.	./.	./.
52226567	./.	./.	./.	./.	0/0	...	./.	./.	./.	./.	./.

Copy number variation (CNV) data#

Data on copy number variation within the Ag3 cohort are available as three separate data types:

HMM – Genome-wide inferences of copy number state within each individual mosquito in 300 bp non-overlapping windows.
Coverage calls – Genome-wide copy number variant calls, derived from the HMM outputs by analysing contiguous regions of elevated copy number state then clustering of variants across individuals based on breakpoint proximity.
Discordant read calls – Copy number variant calls at selected insecticide resistance loci, based on analysis of read alignments at CNV breakpoints.

For more information on the methods used to generate these data, see the variant-calling methods page.

The malariagen_data Python package provides some convenience functions for accessing these data, illustrated below.

CNV HMM#

Access HMM data via the cnv_hmm() method, which returns an xarray dataset. E.g.:

ds_hmm = ag3.cnv_hmm(region="2R", sample_sets="3.0")
ds_hmm

<xarray.Dataset>
Dimensions:                   (samples: 3081, variants: 205151)
Coordinates:
    variant_position          (variants) int32 dask.array<chunksize=(65536,), meta=np.ndarray>
    variant_end               (variants) int32 dask.array<chunksize=(65536,), meta=np.ndarray>
    variant_contig            (variants) uint8 dask.array<chunksize=(205151,), meta=np.ndarray>
    sample_id                 (samples) object dask.array<chunksize=(81,), meta=np.ndarray>
Dimensions without coordinates: samples, variants
Data variables:
    call_CN                   (variants, samples) int8 dask.array<chunksize=(65536, 64), meta=np.ndarray>
    call_RawCov               (variants, samples) int32 dask.array<chunksize=(65536, 64), meta=np.ndarray>
    call_NormCov              (variants, samples) float32 dask.array<chunksize=(65536, 64), meta=np.ndarray>
    sample_coverage_variance  (samples) float32 dask.array<chunksize=(81,), meta=np.ndarray>
    sample_is_high_variance   (samples) bool dask.array<chunksize=(81,), meta=np.ndarray>
Attributes:
    contigs:  ('2R', '2L', '3R', '3L', 'X')

xarray.Dataset

Dimensions:
- samples: 3081
- variants: 205151

Coordinates: (4)

variant_position

(variants)

int32

dask.array<chunksize=(65536,), meta=np.ndarray>

	Array	Chunk
Bytes	820.60 kB	262.14 kB
Shape	(205151,)	(65536,)
Count	5 Tasks	4 Chunks
Type	int32	numpy.ndarray

variant_end

(variants)

int32

dask.array<chunksize=(65536,), meta=np.ndarray>

	Array	Chunk
Bytes	820.60 kB	262.14 kB
Shape	(205151,)	(65536,)
Count	5 Tasks	4 Chunks
Type	int32	numpy.ndarray

variant_contig

(variants)

uint8

dask.array<chunksize=(205151,), meta=np.ndarray>

	Array	Chunk
Bytes	205.15 kB	205.15 kB
Shape	(205151,)	(205151,)
Count	1 Tasks	1 Chunks
Type	uint8	numpy.ndarray

sample_id

(samples)

object

dask.array<chunksize=(81,), meta=np.ndarray>

	Array	Chunk
Bytes	24.65 kB	2.42 kB
Shape	(3081,)	(303,)
Count	84 Tasks	28 Chunks
Type	object	numpy.ndarray

Data variables: (5)

call_CN

(variants, samples)

int8

dask.array<chunksize=(65536, 64), meta=np.ndarray>

	Array	Chunk
Bytes	632.07 MB	4.19 MB
Shape	(205151, 3081)	(65536, 64)
Count	532 Tasks	252 Chunks
Type	int8	numpy.ndarray

call_RawCov

(variants, samples)

int32

dask.array<chunksize=(65536, 64), meta=np.ndarray>

	Array	Chunk
Bytes	2.53 GB	16.78 MB
Shape	(205151, 3081)	(65536, 64)
Count	532 Tasks	252 Chunks
Type	int32	numpy.ndarray

call_NormCov

(variants, samples)

float32

dask.array<chunksize=(65536, 64), meta=np.ndarray>

	Array	Chunk
Bytes	2.53 GB	16.78 MB
Shape	(205151, 3081)	(65536, 64)
Count	532 Tasks	252 Chunks
Type	float32	numpy.ndarray

sample_coverage_variance

(samples)

float32

dask.array<chunksize=(81,), meta=np.ndarray>

	Array	Chunk
Bytes	12.32 kB	1.21 kB
Shape	(3081,)	(303,)
Count	84 Tasks	28 Chunks
Type	float32	numpy.ndarray

sample_is_high_variance

(samples)

bool

dask.array<chunksize=(81,), meta=np.ndarray>

	Array	Chunk
Bytes	3.08 kB	303 B
Shape	(3081,)	(303,)
Count	84 Tasks	28 Chunks
Type	bool	numpy.ndarray

Attributes: (1)
contigs :
('2R', '2L', '3R', '3L', 'X')

Here “variants” are the 300 bp windows in which coverage was calculated and the HMM fitted. Window start positions are given by the variant_position array and ends are given by variant_end.

pos = ds_hmm['variant_position'].values
pos

array([       1,      301,      601, ..., 61544401, 61544701, 61545001],
      dtype=int32)

end = ds_hmm['variant_end'].values
end

array([     300,      600,      900, ..., 61544700, 61545000, 61545105],
      dtype=int32)

Copy number state is given by the call_CN array, where rows are windows and columns are individual samples.

On the autosomes (2R, 2L, 3R, 3L) normal diploid copy number is 2. Values greater than 2 mean amplification, less then 2 mean deletion.

On the X chromosome, normal copy number is 2 in females and 1 in males.

For all chromosomes, -1 means missing, i.e., the window was not included.

Rows are variants (windows), columns are individual samples.

cn = ds_hmm['call_CN'].values
cn

array([[ 5,  4,  4, ..., 11,  5, 12],
       [ 5,  4,  4, ...,  4,  5,  4],
       [ 5,  4,  4, ...,  4,  5,  4],
       ...,
       [-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1]], dtype=int8)

The gene_cnv() method can be used to summarise the HMM data and compute modal coverage by gene, for all genes in a chromosome arm. E.g.:

The gene_cnv_frequencies() method can be used to compute CNV frequencies by gene and cohort, e.g.:

df_gene_cnv_freqs = ag3.gene_cnv_frequencies(
    region="2R", 
    sample_sets="3.0",
    sample_query="country == 'Burkina Faso'",
    cohorts="admin1_year",
)
df_gene_cnv_freqs

			gene_strand	gene_description	contig	start	end	frq_BF-07_gamb_2004	frq_BF-09_colu_2012	frq_BF-09_colu_2014	frq_BF-09_gamb_2012	frq_BF-09_gamb_2014	max_af	windows	label
gene_id	gene_name	cnv_type
AGAP001096	NaN	amp	+	NaN	2R	3426	9985	0.000000	0.0125	0.000000	0.000000	0.000000	0.012500	23	AGAP001096 amp
AGAP001096	NaN	del	+	NaN	2R	3426	9985	0.076923	0.1125	0.113208	0.030928	0.000000	0.113208	23	AGAP001096 del
AGAP013094	NaN	del	+	Elongation of very long chain fatty acids prot...	2R	13754	15149	0.000000	0.0375	0.018868	0.000000	0.000000	0.037500	6	AGAP013094 del
AGAP001097	NaN	del	+	Elongation of very long chain fatty acids prot...	2R	18496	19579	0.000000	0.0125	0.018868	0.000000	0.000000	0.018868	5	AGAP001097 del
AGAP001098	NaN	del	+	DAMOX [Source:VB Community Annotation]	2R	19797	21866	0.000000	0.0125	0.018868	0.000000	0.000000	0.018868	8	AGAP001098 del
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
AGAP004674	NaN	amp	-	Phenoloxidase inhibitor protein [Source:VB Com...	2R	61303692	61306248	0.000000	0.0125	0.000000	0.000000	0.000000	0.012500	10	AGAP004674 amp
AGAP004674	NaN	del	-	Phenoloxidase inhibitor protein [Source:VB Com...	2R	61303692	61306248	0.000000	0.0000	0.018868	0.000000	0.000000	0.018868	10	AGAP004674 del
AGAP004675	GPRMAC2	amp	+	putative muscarinic acetylcholine receptor 2 [...	2R	61372543	61418536	0.000000	0.0125	0.000000	0.000000	0.000000	0.012500	154	AGAP004675 (GPRMAC2) amp
AGAP004676	NaN	amp	+	NaN	2R	61478226	61478802	0.615385	0.0625	0.018868	0.618557	0.717391	0.717391	3	AGAP004676 amp
AGAP004676	NaN	del	+	NaN	2R	61478226	61478802	0.000000	0.4625	0.566038	0.041237	0.000000	0.566038	3	AGAP004676 del

1736 rows × 13 columns

CNV coverage calls#

Coverage calls can be accessed via the cnv_coverage_calls() method, which returns an xarray dataset. Note that an analysis parameter must be given, which can be either “arab” for An. arabiensis samples or “gamb_colu” for non-An. arabiensis.

N.B., coverage calls can only be accessed on sample set at a time, because the CNV alleles do not necessarily match between sample sets. E.g.:

ds_cnv = ag3.cnv_coverage_calls(
    region='2R', 
    analysis='arab', 
    sample_set='AG1000G-TZ'
)
ds_cnv

<xarray.Dataset>
Dimensions:              (samples: 221, variants: 9459)
Coordinates:
    variant_position     (variants) int32 dask.array<chunksize=(9459,), meta=np.ndarray>
    variant_end          (variants) int32 dask.array<chunksize=(9459,), meta=np.ndarray>
    variant_contig       (variants) uint8 dask.array<chunksize=(9459,), meta=np.ndarray>
    variant_id           (variants) object dask.array<chunksize=(9459,), meta=np.ndarray>
    sample_id            (samples) object dask.array<chunksize=(221,), meta=np.ndarray>
Dimensions without coordinates: samples, variants
Data variables:
    variant_CIPOS        (variants) int32 dask.array<chunksize=(9459,), meta=np.ndarray>
    variant_CIEND        (variants) int32 dask.array<chunksize=(9459,), meta=np.ndarray>
    variant_filter_pass  (variants) bool dask.array<chunksize=(9459,), meta=np.ndarray>
    call_genotype        (variants, samples) int8 dask.array<chunksize=(9459, 64), meta=np.ndarray>
Attributes:
    contigs:  ('2R', '2L', '3R', '3L', 'X')

xarray.Dataset

Dimensions:
- samples: 221
- variants: 9459

Coordinates: (5)

variant_position

(variants)

int32

dask.array<chunksize=(9459,), meta=np.ndarray>

	Array	Chunk
Bytes	37.84 kB	37.84 kB
Shape	(9459,)	(9459,)
Count	2 Tasks	1 Chunks
Type	int32	numpy.ndarray

variant_end

(variants)

int32

dask.array<chunksize=(9459,), meta=np.ndarray>

	Array	Chunk
Bytes	37.84 kB	37.84 kB
Shape	(9459,)	(9459,)
Count	2 Tasks	1 Chunks
Type	int32	numpy.ndarray

variant_contig

(variants)

uint8

dask.array<chunksize=(9459,), meta=np.ndarray>

	Array	Chunk
Bytes	9.46 kB	9.46 kB
Shape	(9459,)	(9459,)
Count	1 Tasks	1 Chunks
Type	uint8	numpy.ndarray

variant_id

(variants)

object

dask.array<chunksize=(9459,), meta=np.ndarray>

	Array	Chunk
Bytes	75.67 kB	75.67 kB
Shape	(9459,)	(9459,)
Count	2 Tasks	1 Chunks
Type	object	numpy.ndarray

sample_id

(samples)

object

dask.array<chunksize=(221,), meta=np.ndarray>

	Array	Chunk
Bytes	1.77 kB	1.77 kB
Shape	(221,)	(221,)
Count	2 Tasks	1 Chunks
Type	object	numpy.ndarray

Data variables: (4)

variant_CIPOS

(variants)

int32

dask.array<chunksize=(9459,), meta=np.ndarray>

	Array	Chunk
Bytes	37.84 kB	37.84 kB
Shape	(9459,)	(9459,)
Count	2 Tasks	1 Chunks
Type	int32	numpy.ndarray

variant_CIEND

(variants)

int32

dask.array<chunksize=(9459,), meta=np.ndarray>

	Array	Chunk
Bytes	37.84 kB	37.84 kB
Shape	(9459,)	(9459,)
Count	2 Tasks	1 Chunks
Type	int32	numpy.ndarray

variant_filter_pass

(variants)

bool

dask.array<chunksize=(9459,), meta=np.ndarray>

	Array	Chunk
Bytes	9.46 kB	9.46 kB
Shape	(9459,)	(9459,)
Count	2 Tasks	1 Chunks
Type	bool	numpy.ndarray

call_genotype

(variants, samples)

int8

dask.array<chunksize=(9459, 64), meta=np.ndarray>

	Array	Chunk
Bytes	2.09 MB	605.38 kB
Shape	(9459, 221)	(9459, 64)
Count	5 Tasks	4 Chunks
Type	int8	numpy.ndarray

Attributes: (1)
contigs :
('2R', '2L', '3R', '3L', 'X')

Here “variants” are copy number variants, inferred from the HMM results. CNV start positions are given by the variant_position array and ends are given by variant_end.

pos = ds_cnv['variant_position'].values
pos

array([       1,        1,      601, ..., 61514401, 61515901, 61533301],
      dtype=int32)

end = ds_cnv['variant_end'].values
end

array([    1500,     2100,     2100, ..., 61517400, 61517700, 61535700],
      dtype=int32)

CNV genotypes are given by the call_genotype array, coded as 0 for absence and 1 for presence of the CNV allele. Rows are CNV alleles and columns are individual samples.

gt = ds_cnv['call_genotype'].values
gt

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 1, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 1, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int8)

Note that not all samples will have been included in the coverage calls, as some are excluded due to high coverage variance. Use the sample_id array to access the samples included in a given dataset, e.g.:

samples = ds_cnv['sample_id'].values
samples[:5]

array(['BL0047-C', 'BL0048-C', 'BL0049-C', 'BL0050-C', 'BL0055-C'],
      dtype=object)

CNV discordant read calls#

Access the discordant read calls via the cnv_discordant_read_calls() method. E.g.:

ds_cnv_dr = ag3.cnv_discordant_read_calls(contig="2R", sample_sets="3.0")
ds_cnv_dr

<xarray.Dataset>
Dimensions:                        (samples: 3081, variants: 40)
Coordinates:
    variant_position               (variants) int32 dask.array<chunksize=(40,), meta=np.ndarray>
    variant_end                    (variants) int32 dask.array<chunksize=(40,), meta=np.ndarray>
    variant_id                     (variants) object dask.array<chunksize=(40,), meta=np.ndarray>
    variant_contig                 (variants) uint8 dask.array<chunksize=(40,), meta=np.ndarray>
    sample_id                      (samples) object dask.array<chunksize=(81,), meta=np.ndarray>
Dimensions without coordinates: samples, variants
Data variables:
    variant_Region                 (variants) object dask.array<chunksize=(40,), meta=np.ndarray>
    variant_StartBreakpointMethod  (variants) int32 dask.array<chunksize=(40,), meta=np.ndarray>
    variant_EndBreakpointMethod    (variants) int32 dask.array<chunksize=(40,), meta=np.ndarray>
    call_genotype                  (variants, samples) int8 dask.array<chunksize=(40, 64), meta=np.ndarray>
    sample_coverage_variance       (samples) float32 dask.array<chunksize=(81,), meta=np.ndarray>
    sample_is_high_variance        (samples) bool dask.array<chunksize=(81,), meta=np.ndarray>
Attributes:
    contigs:  ('2R', '2L', '3R', '3L', 'X')

xarray.Dataset

Dimensions:
- samples: 3081
- variants: 40

Coordinates: (5)

variant_position

(variants)

int32

dask.array<chunksize=(40,), meta=np.ndarray>

	Array	Chunk
Bytes	160 B	160 B
Shape	(40,)	(40,)
Count	2 Tasks	1 Chunks
Type	int32	numpy.ndarray

variant_end

(variants)

int32

dask.array<chunksize=(40,), meta=np.ndarray>

	Array	Chunk
Bytes	160 B	160 B
Shape	(40,)	(40,)
Count	2 Tasks	1 Chunks
Type	int32	numpy.ndarray

variant_id

(variants)

object

dask.array<chunksize=(40,), meta=np.ndarray>

	Array	Chunk
Bytes	320 B	320 B
Shape	(40,)	(40,)
Count	2 Tasks	1 Chunks
Type	object	numpy.ndarray

variant_contig
(variants)
uint8
dask.array<chunksize=(40,), meta=np.ndarray>

Array Chunk

Bytes 40 B 40 B

Shape (40,) (40,)

Count 1 Tasks 1 Chunks

Type uint8 numpy.ndarray

sample_id

(samples)

object

dask.array<chunksize=(81,), meta=np.ndarray>

	Array	Chunk
Bytes	24.65 kB	2.42 kB
Shape	(3081,)	(303,)
Count	84 Tasks	28 Chunks
Type	object	numpy.ndarray

Data variables: (6)

variant_Region

(variants)

object

dask.array<chunksize=(40,), meta=np.ndarray>

	Array	Chunk
Bytes	320 B	320 B
Shape	(40,)	(40,)
Count	2 Tasks	1 Chunks
Type	object	numpy.ndarray

variant_StartBreakpointMethod

(variants)

int32

dask.array<chunksize=(40,), meta=np.ndarray>

	Array	Chunk
Bytes	160 B	160 B
Shape	(40,)	(40,)
Count	2 Tasks	1 Chunks
Type	int32	numpy.ndarray

variant_EndBreakpointMethod

(variants)

int32

dask.array<chunksize=(40,), meta=np.ndarray>

	Array	Chunk
Bytes	160 B	160 B
Shape	(40,)	(40,)
Count	2 Tasks	1 Chunks
Type	int32	numpy.ndarray

call_genotype

(variants, samples)

int8

dask.array<chunksize=(40, 64), meta=np.ndarray>

	Array	Chunk
Bytes	123.24 kB	2.56 kB
Shape	(40, 3081)	(40, 64)
Count	154 Tasks	63 Chunks
Type	int8	numpy.ndarray

sample_coverage_variance

(samples)

float32

dask.array<chunksize=(81,), meta=np.ndarray>

	Array	Chunk
Bytes	12.32 kB	1.21 kB
Shape	(3081,)	(303,)
Count	84 Tasks	28 Chunks
Type	float32	numpy.ndarray

sample_is_high_variance

(samples)

bool

dask.array<chunksize=(81,), meta=np.ndarray>

	Array	Chunk
Bytes	3.08 kB	303 B
Shape	(3081,)	(303,)
Count	84 Tasks	28 Chunks
Type	bool	numpy.ndarray

Attributes: (1)
contigs :
('2R', '2L', '3R', '3L', 'X')

CNV start positions are given by the variant_position array and ends are given by variant_end. Note that some starts and/or ends are not known, and missing values are coded as -1.

The identifier of the CNV alleles is given by the variant_id array. The region where CNV alleles were analysed is given by the variant_Region array.

Note that for each region there is a special variant named “Dup0” which simply records whether any of the genes in a given cluster have increased coverage based on the modal HMM copy number state. I.e., if a sample has a “Dup0” variant but no other variant in a given region, then it has some form of copy number variation but we couldn’t find any discordant reads to uniquely identify the CNV allele it carries.

CNV genotypes are given by the call_genotype array, coded as 0 for absence and 1 for presence of the CNV allele. Rows are CNV alleles and columns are individual samples.

vid = ds_cnv_dr['variant_id'].values
vid

array(['Ace1_Dup0', 'Ace1_Dup1', 'Ace1_Dup2', 'Ace1_Del1', 'Ace1_Del2',
       'Ace1_Del3', 'Ace1_Del4', 'Cyp6aap_Dup0', 'Cyp6aap_Dup1',
       'Cyp6aap_Dup1a', 'Cyp6aap_Dup1b', 'Cyp6aap_Dup2', 'Cyp6aap_Dup3',
       'Cyp6aap_Dup4', 'Cyp6aap_Dup5', 'Cyp6aap_Dup6', 'Cyp6aap_Dup7',
       'Cyp6aap_Dup8', 'Cyp6aap_Dup9', 'Cyp6aap_Dup10', 'Cyp6aap_Dup11',
       'Cyp6aap_Dup12', 'Cyp6aap_Dup13', 'Cyp6aap_Dup14', 'Cyp6aap_Dup15',
       'Cyp6aap_Dup16', 'Cyp6aap_Dup17', 'Cyp6aap_Dup18', 'Cyp6aap_Dup19',
       'Cyp6aap_Dup20', 'Cyp6aap_Dup21', 'Cyp6aap_Dup22', 'Cyp6aap_Dup23',
       'Cyp6aap_Dup24', 'Cyp6aap_Dup25', 'Cyp6aap_Dup26', 'Cyp6aap_Dup27',
       'Cyp6aap_Dup28', 'Cyp6aap_Dup29', 'Cyp6aap_Dup30'], dtype=object)

reg = ds_cnv_dr['variant_Region'].values
reg

array(['Ace1', 'Ace1', 'Ace1', 'Ace1', 'Ace1', 'Ace1', 'Ace1',
       'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p',
       'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p',
       'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p',
       'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p',
       'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p',
       'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p',
       'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p',
       'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p', 'Cyp6aa/Cyp6p',
       'Cyp6aa/Cyp6p'], dtype=object)

pos = ds_cnv_dr['variant_position'].values
pos

array([       0,  3436926,  3447950,  3501850,  3539300,  3535850,
        3512200,        0,        0, 28480189, 28480193, 28493547,
       28479407, 28478925, 28480372, 28478272, 28478057, 28475996,
       28479181, 28477889, 28487546, 28474576, 28472728, 28473874,
       28465673, 28480547, 28477540, 28479548, 28475490, 28473600,
       28475128, 28477220,        0, 28480585, 28480335, 28480166,
       28496700, 28477710, 28494017, 28478987], dtype=int32)

end = ds_cnv_dr['variant_end'].values
end

array([      -1,  3639836,  3518900,  3599050,  3573750,  3619000,
        3616290,       -1,       -1, 28483475, 28483675, 28497279,
       28483372, 28483069, 28484518, 28484157, 28486036, 28485005,
       28491431, 28491215, 28518123, 28520016, 28522671, 28563596,
       28555300, 28483236, 28485380, 28494597, 28556726, 28795255,
       28483473, 28484338, 28497740, 28483442, 28483384, 28483253,
       28499200, 28496953, 28496505, 28485033], dtype=int32)

gt = ds_cnv_dr['call_genotype'].values
gt

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int8)

samples = ds_cnv_dr['sample_id'].values
samples[:5]

array(['AR0047-C', 'AR0049-C', 'AR0051-C', 'AR0061-C', 'AR0078-C'],
      dtype=object)

Haplotypes#

The Ag3 data resource also includes haplotype reference panels, which were obtained by phasing the SNP calls. Phasing involves resolving the genotypes within each individual into a pair of haplotypes providing information about the two different DNA sequences inherited, one from each parent. Haplotypes provide greater power for a range of population genetic analyses, such as genome-wide scans for signals of recent selection, or analysis of adaptive gene flow between populations.

To allow for a range of different analyses, three different haplotype reference panels were constructed, each using a different subset of samples and applying different site filters:

“gamb_colu_arab” - This haplotype reference panel includes all wild-caught samples, and phases biallelic SNPs passing the “gamb_colu_arab” site filters.
“gamb_colu” - This haplotype reference panel includes wild-caught samples assigned as either An. gambiae, An. coluzzii or intermediate An. gambiae/An. coluzzii via the AIM species calling method, and phases biallelic SNPs passing the “gamb_colu” site filters.
“arab” - This haplotype reference panel includes wild-caught samples assigned as An. arabiensis via the AIM species calling method, and phases biallelic SNPs passing the “arab” site filters.

Data can be accessed in the cloud via the haplotypes() method. E.g., access haplotypes from the “gamb_colu_arab” analysis for all available samples, for chromosome arm 2R. This method returns an xarray Dataset.

ds_haps = ag3.haplotypes(region="2R", analysis="gamb_colu_arab", sample_sets="3.0")
ds_haps

<xarray.Dataset>
Dimensions:           (alleles: 2, ploidy: 2, samples: 2784, variants: 16514919)
Coordinates:
    variant_position  (variants) int32 dask.array<chunksize=(262144,), meta=np.ndarray>
    variant_contig    (variants) uint8 dask.array<chunksize=(16514919,), meta=np.ndarray>
    sample_id         (samples) object dask.array<chunksize=(81,), meta=np.ndarray>
Dimensions without coordinates: alleles, ploidy, samples, variants
Data variables:
    variant_allele    (variants, alleles) |S1 dask.array<chunksize=(262144, 1), meta=np.ndarray>
    call_genotype     (variants, samples, ploidy) int8 dask.array<chunksize=(262144, 64, 2), meta=np.ndarray>
Attributes:
    contigs:  ('2R', '2L', '3R', '3L', 'X')

xarray.Dataset

Dimensions:
- alleles: 2
- ploidy: 2
- samples: 2784
- variants: 16514919

Coordinates: (3)

variant_position

(variants)

int32

dask.array<chunksize=(262144,), meta=np.ndarray>

	Array	Chunk
Bytes	66.06 MB	1.05 MB
Shape	(16514919,)	(262144,)
Count	64 Tasks	63 Chunks
Type	int32	numpy.ndarray

variant_contig

(variants)

uint8

dask.array<chunksize=(16514919,), meta=np.ndarray>

	Array	Chunk
Bytes	16.51 MB	16.51 MB
Shape	(16514919,)	(16514919,)
Count	1 Tasks	1 Chunks
Type	uint8	numpy.ndarray

sample_id

(samples)

object

dask.array<chunksize=(81,), meta=np.ndarray>

	Array	Chunk
Bytes	22.27 kB	2.42 kB
Shape	(2784,)	(303,)
Count	81 Tasks	27 Chunks
Type	object	numpy.ndarray

Data variables: (2)

variant_allele

(variants, alleles)

|S1

dask.array<chunksize=(262144, 1), meta=np.ndarray>

	Array	Chunk
Bytes	33.03 MB	262.14 kB
Shape	(16514919, 2)	(262144, 1)
Count	380 Tasks	126 Chunks
Type	\|S1	numpy.ndarray

call_genotype

(variants, samples, ploidy)

int8

dask.array<chunksize=(262144, 64, 2), meta=np.ndarray>

	Array	Chunk
Bytes	91.96 GB	33.55 MB
Shape	(16514919, 2784, 2)	(262144, 64, 2)
Count	7335 Tasks	3654 Chunks
Type	int8	numpy.ndarray

Attributes: (1)
contigs :
('2R', '2L', '3R', '3L', 'X')

Here we have haplotype data for 2,784 samples at 16,514,919 SNPs.

The SNP positions and alleles can be accessed as follows.

# access haplotype SNP positions
pos = ds_haps["variant_position"].data  # dask array
pos

	Array	Chunk
Bytes	66.06 MB	1.05 MB
Shape	(16514919,)	(262144,)
Count	64 Tasks	63 Chunks
Type	int32	numpy.ndarray

# load positions of first 10 haplotype SNPs
pos[:10].compute()  # numpy array

array([135, 168, 170, 274, 289, 294, 295, 298, 299, 301], dtype=int32)

# access haplotype SNP alleles
alleles = ds_haps["variant_allele"].data  # dask array
alleles

	Array	Chunk
Bytes	33.03 MB	262.14 kB
Shape	(16514919, 2)	(262144, 1)
Count	380 Tasks	126 Chunks
Type	\|S1	numpy.ndarray

# load alleles of first 10 haplotype SNPs - note all are biallelic
alleles[:10].compute()  # numpy array

array([[b'A', b'T'],
       [b'A', b'G'],
       [b'A', b'T'],
       [b'G', b'A'],
       [b'A', b'T'],
       [b'T', b'A'],
       [b'T', b'C'],
       [b'C', b'T'],
       [b'T', b'C'],
       [b'G', b'T']], dtype='|S1')

The phased genotypes can be accessed as follows.

# access genotypes
gt = ds_haps["call_genotype"].data  # dask array
gt

	Array	Chunk
Bytes	91.96 GB	33.55 MB
Shape	(16514919, 2784, 2)	(262144, 64, 2)
Count	7335 Tasks	3654 Chunks
Type	int8	numpy.ndarray

# wrap as scikit-allel genotype array
gt = allel.GenotypeDaskArray(ds_haps["call_genotype"].data)
gt

	0	1	2	3	4	...	2779	2780	2781	2782	2783
0	0/0	0/0	0/0	0/0	0/0	...	0/0	0/0	0/0	0/0	0/0
1	0/0	0/0	0/0	0/0	0/0	...	0/0	0/0	0/0	0/0	0/0
2	0/0	0/0	0/0	0/0	0/0	...	0/0	0/0	0/0	0/0	0/0
...	...
16514916	0/0	0/0	0/0	0/0	0/0	...	0/0	0/0	0/0	0/0	0/0
16514917	0/0	0/0	0/0	0/0	0/0	...	0/0	0/0	0/0	0/0	0/0
16514918	0/0	0/0	0/0	0/0	0/0	...	0/0	0/0	0/0	0/0	0/0

# convert to scikit-allel haplotype array - useful for some analyses
ht = gt.to_haplotypes()
ht

	0	1	2	3	4	...	5563	5564	5565	5566	5567
0	0	0	0	0	0	...	0	0	0	0	0
1	0	0	0	0	0	...	0	0	0	0	0
2	0	0	0	0	0	...	0	0	0	0	0
...	...
16514916	0	0	0	0	0	...	0	0	0	0	0
16514917	0	0	0	0	0	...	0	0	0	0	0
16514918	0	0	0	0	0	...	0	0	0	0	0

Note that in the haplotype array above, each row is a SNP and each column is a haplotype. There were \(2784\) samples in this analysis, and so we have \(5568\) (\(2\times2784\)) haplotypes.

Using sample metadata to subset haplotypes#

For some analyses, you’ll want to subset the haplotypes, e.g., by location and species. In order to perform subsetting, you need to obtain sample metadata, and align it with the haplotype data. This ensures that every row in the sample metadata dataframe corresponds to every column in the phased genotypes array. E.g.:

# load sample metadata
df_samples = ag3.sample_metadata(sample_sets="3.0")
df_samples.head()

	sample_id	partner_sample_id	contributor	country	location	year	month	latitude	longitude	sex_call	...	aim_species	country_iso	admin1_name	admin1_iso	admin2_name	taxon	cohort_admin1_year	cohort_admin1_month	cohort_admin2_year	cohort_admin2_month
0	AR0047-C	LUA047	Joao Pinto	Angola	Luanda	2009	4	-8.884	13.302	F	...	coluzzii	AGO	Luanda	AO-LUA	Luanda	coluzzii	AO-LUA_colu_2009	AO-LUA_colu_2009_04	AO-LUA_Luanda_colu_2009	AO-LUA_Luanda_colu_2009_04
1	AR0049-C	LUA049	Joao Pinto	Angola	Luanda	2009	4	-8.884	13.302	F	...	coluzzii	AGO	Luanda	AO-LUA	Luanda	coluzzii	AO-LUA_colu_2009	AO-LUA_colu_2009_04	AO-LUA_Luanda_colu_2009	AO-LUA_Luanda_colu_2009_04
2	AR0051-C	LUA051	Joao Pinto	Angola	Luanda	2009	4	-8.884	13.302	F	...	coluzzii	AGO	Luanda	AO-LUA	Luanda	coluzzii	AO-LUA_colu_2009	AO-LUA_colu_2009_04	AO-LUA_Luanda_colu_2009	AO-LUA_Luanda_colu_2009_04
3	AR0061-C	LUA061	Joao Pinto	Angola	Luanda	2009	4	-8.884	13.302	F	...	coluzzii	AGO	Luanda	AO-LUA	Luanda	coluzzii	AO-LUA_colu_2009	AO-LUA_colu_2009_04	AO-LUA_Luanda_colu_2009	AO-LUA_Luanda_colu_2009_04
4	AR0078-C	LUA078	Joao Pinto	Angola	Luanda	2009	4	-8.884	13.302	F	...	coluzzii	AGO	Luanda	AO-LUA	Luanda	coluzzii	AO-LUA_colu_2009	AO-LUA_colu_2009_04	AO-LUA_Luanda_colu_2009	AO-LUA_Luanda_colu_2009_04

5 rows × 26 columns

# load IDs of phased samples
samples_phased = ds_haps['sample_id'].values
samples_phased[:5]

array(['AR0001-C', 'AR0002-C', 'AR0004-C', 'AR0006-C', 'AR0007-C'],
      dtype=object)

# align sample metadata to haplotypes
df_samples_phased = df_samples.set_index("sample_id").loc[samples_phased].reset_index()
df_samples_phased.head()

	sample_id	partner_sample_id	contributor	country	location	year	month	latitude	longitude	sex_call	...	aim_species	country_iso	admin1_name	admin1_iso	admin2_name	taxon	cohort_admin1_year	cohort_admin1_month	cohort_admin2_year	cohort_admin2_month
0	AR0001-C	LUA001	Joao Pinto	Angola	Luanda	2009	4	-8.884	13.302	F	...	coluzzii	AGO	Luanda	AO-LUA	Luanda	coluzzii	AO-LUA_colu_2009	AO-LUA_colu_2009_04	AO-LUA_Luanda_colu_2009	AO-LUA_Luanda_colu_2009_04
1	AR0002-C	LUA002	Joao Pinto	Angola	Luanda	2009	4	-8.884	13.302	M	...	coluzzii	AGO	Luanda	AO-LUA	Luanda	coluzzii	AO-LUA_colu_2009	AO-LUA_colu_2009_04	AO-LUA_Luanda_colu_2009	AO-LUA_Luanda_colu_2009_04
2	AR0004-C	LUA004	Joao Pinto	Angola	Luanda	2009	4	-8.884	13.302	M	...	coluzzii	AGO	Luanda	AO-LUA	Luanda	coluzzii	AO-LUA_colu_2009	AO-LUA_colu_2009_04	AO-LUA_Luanda_colu_2009	AO-LUA_Luanda_colu_2009_04
3	AR0006-C	LUA006	Joao Pinto	Angola	Luanda	2009	4	-8.884	13.302	M	...	coluzzii	AGO	Luanda	AO-LUA	Luanda	coluzzii	AO-LUA_colu_2009	AO-LUA_colu_2009_04	AO-LUA_Luanda_colu_2009	AO-LUA_Luanda_colu_2009_04
4	AR0007-C	LUA007	Joao Pinto	Angola	Luanda	2009	4	-8.884	13.302	F	...	coluzzii	AGO	Luanda	AO-LUA	Luanda	coluzzii	AO-LUA_colu_2009	AO-LUA_colu_2009_04	AO-LUA_Luanda_colu_2009	AO-LUA_Luanda_colu_2009_04

5 rows × 26 columns

# now define some cohort of interest and locate samples
cohort_query = "country == 'Burkina Faso' and taxon == 'coluzzii' and year == 2012"
loc_cohort_samples = df_samples_phased.query(cohort_query).index.values
loc_cohort_samples

array([ 83,  84,  85,  86,  87,  88,  89,  90,  92,  93,  94,  95,  96,
        99, 100, 101, 102, 103, 104, 105, 110, 111, 112, 113, 120, 125,
       126, 127, 128, 129, 162, 163, 164, 165, 166, 167, 168, 169, 170,
       171, 172, 173, 174, 175, 176, 184, 189, 190, 192, 193, 194, 195,
       196, 199, 200, 201, 202, 203, 205, 206, 208, 209, 213, 216, 219,
       221, 222, 224, 225, 226, 227, 228, 234, 235, 236, 239, 240, 243,
       244, 253, 256, 259])

# subset haplotypes to cohort
ds_haps_cohort = ds_haps.isel(samples=loc_cohort_samples)
ds_haps_cohort

<xarray.Dataset>
Dimensions:           (alleles: 2, ploidy: 2, samples: 82, variants: 16514919)
Coordinates:
    variant_position  (variants) int32 dask.array<chunksize=(262144,), meta=np.ndarray>
    variant_contig    (variants) uint8 dask.array<chunksize=(16514919,), meta=np.ndarray>
    sample_id         (samples) object dask.array<chunksize=(82,), meta=np.ndarray>
Dimensions without coordinates: alleles, ploidy, samples, variants
Data variables:
    variant_allele    (variants, alleles) |S1 dask.array<chunksize=(262144, 1), meta=np.ndarray>
    call_genotype     (variants, samples, ploidy) int8 dask.array<chunksize=(262144, 30, 2), meta=np.ndarray>
Attributes:
    contigs:  ('2R', '2L', '3R', '3L', 'X')

xarray.Dataset

Dimensions:
- alleles: 2
- ploidy: 2
- samples: 82
- variants: 16514919

Coordinates: (3)

variant_position

(variants)

int32

dask.array<chunksize=(262144,), meta=np.ndarray>

	Array	Chunk
Bytes	66.06 MB	1.05 MB
Shape	(16514919,)	(262144,)
Count	64 Tasks	63 Chunks
Type	int32	numpy.ndarray

variant_contig

(variants)

uint8

dask.array<chunksize=(16514919,), meta=np.ndarray>

	Array	Chunk
Bytes	16.51 MB	16.51 MB
Shape	(16514919,)	(16514919,)
Count	1 Tasks	1 Chunks
Type	uint8	numpy.ndarray

sample_id

(samples)

object

dask.array<chunksize=(82,), meta=np.ndarray>

	Array	Chunk
Bytes	656 B	656 B
Shape	(82,)	(82,)
Count	82 Tasks	1 Chunks
Type	object	numpy.ndarray

Data variables: (2)

variant_allele

(variants, alleles)

|S1

dask.array<chunksize=(262144, 1), meta=np.ndarray>

	Array	Chunk
Bytes	33.03 MB	262.14 kB
Shape	(16514919, 2)	(262144, 1)
Count	380 Tasks	126 Chunks
Type	\|S1	numpy.ndarray

call_genotype

(variants, samples, ploidy)

int8

dask.array<chunksize=(262144, 30, 2), meta=np.ndarray>

	Array	Chunk
Bytes	2.71 GB	16.25 MB
Shape	(16514919, 82, 2)	(262144, 31, 2)
Count	7524 Tasks	189 Chunks
Type	int8	numpy.ndarray

Attributes: (1)
contigs :
('2R', '2L', '3R', '3L', 'X')

# now access subsetted haplotypes
gt_cohort = allel.GenotypeDaskArray(ds_haps_cohort['call_genotype'].data)
ht_cohort = gt_cohort.to_haplotypes()
ht_cohort

	0	1	2	3	4	...	159	160	161	162	163
0	0	0	0	0	0	...	0	0	0	0	0
1	0	0	0	0	0	...	0	0	0	0	0
2	0	0	0	0	0	...	0	0	0	0	0
...	...
16514916	0	0	0	0	0	...	0	0	0	0	0
16514917	0	0	0	0	0	...	0	0	0	0	0
16514918	0	0	0	0	0	...	0	0	0	0	0

Note there are \(82\) samples in the cohort and thus \(164\) (\(2\times82\)) haplotypes.

Example computation#

Here’s an example computation to count the number of segregating SNPs on chromosome arm 3R that also pass gamb_colu_arab site filters. This may take a minute or two, because it is scanning genotype calls at millions of SNPs in hundreds of samples.

# choose chromosome arm
region = "3R"

# choose site filter mask
mask = "gamb_colu_arab"

# choose sample sets
sample_sets = ["AG1000G-BF-A", "AG1000G-BF-B"]

# access SNP calls
ds_snps = ag3.snp_calls(region=region, sample_sets=sample_sets)

# locate pass sites
loc_pass = ds_snps[f"variant_filter_pass_{mask}"].values

# perform an allele count over genotypes
gt = allel.GenotypeDaskArray(ds_snps["call_genotype"].data)
with ProgressBar():
    ac = gt.count_alleles(max_allele=3).compute()

# locate segregating sites
loc_seg = ac.is_segregating()

# count segregating and pass sites
n_pass_seg = np.count_nonzero(loc_pass & loc_seg)

n_pass_seg

[########################################] | 100% Completed |  2min 39.4s

11226723

For more in-depth worked examples of various analyses, see the Ag3 analysis page.

Running larger computations#

Please note that free cloud computing services such as Google Colab and MyBinder provide only limited computing resources. Thus although these services are able to efficiently read Ag3 data stored on Google Cloud, you may find that you run out of memory, or computations take a long time running on a single core. If you would like any suggestions regarding how to set up more powerful compute resources in the cloud, please feel free to get in touch via the malariagen/vector-data GitHub discussion board.

Feedback and suggestions#

If there are particular analyses you would like to run, or if you have other suggestions for useful documentation we could add to this site, we would love to know, please get in touch via the malariagen/vector-data GitHub discussion board.

Ag3 cloud data access

Contents

Ag3 cloud data access#

Data hosting#

Setup#

Sample sets#

Sample metadata#

SNP calls#

SNP sites and alleles#

Site filters#

SNP genotypes#

Copy number variation (CNV) data#

CNV HMM#

CNV coverage calls#

CNV discordant read calls#

Haplotypes#

Using sample metadata to subset haplotypes#

Example computation#

Running larger computations#

Feedback and suggestions#