Amin1.0 cloud data access

Amin1.0 cloud data access#

This page provides information about how to access data from the Amin1.0 resource via Google Cloud. This includes sample metadata and single nucleotide polymorphism (SNP) calls.

This notebook illustrates how to read data directly from the cloud, without having to first download any data locally. This notebook can be run from any computer, but will work best when run from a compute node within Google Cloud, because it will be physically closer to the data and so data transfer is faster. For example, this notebook can be run via Google Colab which is free interactive computing service running in the cloud.

To launch this notebook in the cloud and run it for yourself, click the launch icon () at the top of the page and select one of the cloud computing services available.

Data hosting#

All data required for this notebook is hosted on Google Cloud Storage (GCS). Data are hosted in the vo_amin_release bucket, which is a multi-region bucket located in the United States. All data hosted in GCS are publicly accessible and do not require any authentication to access.

Setup#

Running this notebook requires some Python packages to be installed. These packages can be installed via pip or conda. E.g.:

!pip install -q malariagen_data

To make accessing these data more convenient, we’ve created the malariagen_data Python package, which is available from PyPI. This is experimental so please let us know if you find any bugs or have any suggestions.

Now import the packages we’ll need to use here.

import malariagen_data
import numpy as np
import dask.array as da
from dask.diagnostics.progress import ProgressBar
import dask
dask.config.set(**{'array.slicing.split_large_chunks': False})
import allel

Data access from Google Cloud is set up with the following code:

amin1 = malariagen_data.Amin1()

Sample metadata#

Data about the samples that were sequenced to generate this data resource are available, including the time and place of collection, the gender of the specimen, and our call regarding the species of the specimen. These are organised by sample set.

Load sample metadata into a pandas DataFrame:

df_samples = amin1.sample_metadata()
df_samples

	sample_id	original_sample_id	sanger_sample_id	partner_sample_id	contributor	country	location	year	month	latitude	longitude	season	PCA_cohort	cohort	subsampled_cohort
0	VBS09378-4248STDY7308980	VBS09378	4248STDY7308980	CB-2-00264	Brandy St. Laurent	Cambodia	Preah Kleang	2016	3	13.667	104.982	Feb-Apr (late dry)	A	PV	NaN
1	VBS09382-4248STDY7308981	VBS09382	4248STDY7308981	CB-2-00258	Brandy St. Laurent	Cambodia	Preah Kleang	2016	3	13.667	104.982	Feb-Apr (late dry)	A	PV	NaN
2	VBS09397-4248STDY7308982	VBS09397	4248STDY7308982	CB-2-00384	Brandy St. Laurent	Cambodia	Preah Kleang	2016	3	13.667	104.982	Feb-Apr (late dry)	A	PV	PV
3	VBS09460-4248STDY7308986	VBS09460	4248STDY7308986	CB-2-02960	Brandy St. Laurent	Cambodia	Preah Kleang	2016	6	13.667	104.982	May-Jul (early wet)	A	PV	NaN
4	VBS09466-4248STDY7308989	VBS09466	4248STDY7308989	CB-2-04070	Brandy St. Laurent	Cambodia	Preah Kleang	2016	11	13.667	104.982	Nov-Jan (early dry)	A	PV	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
297	VBS16624-4248STDY7918667	VBS16624	4248STDY7918667	KV-32-01591	Brandy St. Laurent	Cambodia	Sayas	2014	6	13.548	107.025	May-Jul (early wet)	C	RK2	RK2
298	VBS16625-4248STDY7918668	VBS16625	4248STDY7918668	KV-32-01499	Brandy St. Laurent	Cambodia	Sayas	2014	6	13.548	107.025	May-Jul (early wet)	C	RK2	RK2
299	VBS16626-4248STDY7918669	VBS16626	4248STDY7918669	KV-32-01465	Brandy St. Laurent	Cambodia	Sayas	2014	6	13.548	107.025	May-Jul (early wet)	B	RK1	RK1
300	VBS16628-4248STDY7918670	VBS16628	4248STDY7918670	KV-32-01454	Brandy St. Laurent	Cambodia	Sayas	2014	6	13.548	107.025	May-Jul (early wet)	C	RK2	RK2
301	VBS16630-4248STDY7918671	VBS16630	4248STDY7918671	KV-31-01949	Brandy St. Laurent	Cambodia	Chamkar San	2014	4	13.595	106.995	Feb-Apr (late dry)	B	RK1	RK1

302 rows × 15 columns

The sample_id column gives the sample identifier used throughout all analyses.

The country, location, latitude and longitude columns give the location where the specimen was collected.

The year and month columns give the approximate date when the specimen was collected.

The cohort column gives an assignment of individual mosquitoes to populations based on location of sampling and genetic population structure.

Pandas can be used to explore and query the sample metadata in various ways. E.g., here is a summary of the numbers of samples by species:

df_summary = df_samples.pivot_table(
    index=["longitude", "latitude", "location"], 
    columns=["year"],
    values="sample_id", 
    aggfunc=len,
    fill_value=0
)
df_summary.style.set_caption("Number of mosquito specimens by collection site and year.")

Number of mosquito specimens by collection site and year.
		year	2010	2011	2014	2015	2016
longitude	latitude	location
102.735	12.155	Thmar Da	26	15	0	0	0
104.92	13.77	Chean Mok	0	0	66	9	0
104.982	13.667	Preah Kleang	0	0	47	9	36
106.995	13.595	Chamkar San	0	0	40	11	0
107.025	13.548	Sayas	0	0	39	4	0

Reference genome#

Sequence data in this study were aligned to the MINIMUS1 reference genome. This reference genome contains 678 contigs in total, but many contigs are small and not suitable for population genetic analyses. We therefore have included only SNP calls for the 40 largest contigs. The set of contigs used for SNP calling can be accessed as follows:

amin1.contigs

('KB663610',
 'KB663611',
 'KB663622',
 'KB663633',
 'KB663644',
 'KB663655',
 'KB663666',
 'KB663677',
 'KB663688',
 'KB663699',
 'KB663710',
 'KB663721',
 'KB663722',
 'KB663733',
 'KB663744',
 'KB663755',
 'KB663766',
 'KB663777',
 'KB663788',
 'KB663799',
 'KB663810',
 'KB663821',
 'KB663832',
 'KB663833',
 'KB663844',
 'KB663855',
 'KB663866',
 'KB663877',
 'KB663888',
 'KB663899',
 'KB663910',
 'KB663921',
 'KB663932',
 'KB663943',
 'KB663955',
 'KB664054',
 'KB664165',
 'KB664255',
 'KB664266',
 'KB664277')

For convenience, the reference genome sequence for any contig can be loaded as a NumPy array, e.g.:

# load the reference sequence for a single contig as a numpy array
seq = amin1.genome_sequence(region="KB663610").compute()
seq

array([b'T', b'T', b'C', ..., b'C', b'A', b'C'], dtype='|S1')

# length of contig
len(seq)

31626230

# number of called bases in contig
np.sum((seq != b'N') & (seq != b'n'))

30141520

SNP calls#

We have called SNP genotypes in all samples at all positions in the genome where the reference allele is not “N”. Data on the SNP positions, alleles, site filters and genotype calls for a given contig can be accessed as an xarray Dataset. E.g., access SNP calls for contig KB663610:

ds_snps = amin1.snp_calls(region="KB663610", site_mask=False)
ds_snps

<xarray.Dataset>
Dimensions:              (alleles: 4, ploidy: 2, samples: 302, variants: 30141520)
Coordinates:
    variant_position     (variants) int32 dask.array<chunksize=(524288,), meta=np.ndarray>
    variant_contig       (variants) uint8 dask.array<chunksize=(524288,), meta=np.ndarray>
    sample_id            (samples) |S24 dask.array<chunksize=(302,), meta=np.ndarray>
Dimensions without coordinates: alleles, ploidy, samples, variants
Data variables:
    variant_allele       (variants, alleles) |S1 dask.array<chunksize=(524288, 1), meta=np.ndarray>
    variant_filter_pass  (variants) bool dask.array<chunksize=(941923,), meta=np.ndarray>
    call_genotype        (variants, samples, ploidy) int8 dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>
    call_GQ              (variants, samples) int8 dask.array<chunksize=(300000, 50), meta=np.ndarray>
    call_MQ              (variants, samples) float32 dask.array<chunksize=(300000, 50), meta=np.ndarray>
    call_AD              (variants, samples, alleles) int16 dask.array<chunksize=(300000, 50, 4), meta=np.ndarray>
    call_genotype_mask   (variants, samples, ploidy) bool dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>
Attributes:
    contigs:  ('KB663610', 'KB663611', 'KB663622', 'KB663633', 'KB663644', 'K...

xarray.Dataset

Dimensions:
- alleles: 4
- ploidy: 2
- samples: 302
- variants: 30141520

Coordinates: (3)

variant_position

(variants)

int32

dask.array<chunksize=(524288,), meta=np.ndarray>

	Array	Chunk
Bytes	120.57 MB	2.10 MB
Shape	(30141520,)	(524288,)
Count	59 Tasks	58 Chunks
Type	int32	numpy.ndarray

variant_contig

(variants)

uint8

dask.array<chunksize=(524288,), meta=np.ndarray>

	Array	Chunk
Bytes	30.14 MB	524.29 kB
Shape	(30141520,)	(524288,)
Count	58 Tasks	58 Chunks
Type	uint8	numpy.ndarray

sample_id

(samples)

|S24

dask.array<chunksize=(302,), meta=np.ndarray>

	Array	Chunk
Bytes	7.25 kB	7.25 kB
Shape	(302,)	(302,)
Count	2 Tasks	1 Chunks
Type	\|S24	numpy.ndarray

Data variables: (7)

variant_allele

(variants, alleles)

|S1

dask.array<chunksize=(524288, 1), meta=np.ndarray>

	Array	Chunk
Bytes	120.57 MB	1.57 MB
Shape	(30141520, 4)	(524288, 3)
Count	292 Tasks	116 Chunks
Type	\|S1	numpy.ndarray

variant_filter_pass

(variants)

bool

dask.array<chunksize=(941923,), meta=np.ndarray>

	Array	Chunk
Bytes	30.14 MB	941.92 kB
Shape	(30141520,)	(941923,)
Count	33 Tasks	32 Chunks
Type	bool	numpy.ndarray

call_genotype

(variants, samples, ploidy)

int8

dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>

	Array	Chunk
Bytes	18.21 GB	30.00 MB
Shape	(30141520, 302, 2)	(300000, 50, 2)
Count	708 Tasks	707 Chunks
Type	int8	numpy.ndarray

call_GQ

(variants, samples)

int8

dask.array<chunksize=(300000, 50), meta=np.ndarray>

	Array	Chunk
Bytes	9.10 GB	15.00 MB
Shape	(30141520, 302)	(300000, 50)
Count	708 Tasks	707 Chunks
Type	int8	numpy.ndarray

call_MQ

(variants, samples)

float32

dask.array<chunksize=(300000, 50), meta=np.ndarray>

	Array	Chunk
Bytes	36.41 GB	60.00 MB
Shape	(30141520, 302)	(300000, 50)
Count	708 Tasks	707 Chunks
Type	float32	numpy.ndarray

call_AD

(variants, samples, alleles)

int16

dask.array<chunksize=(300000, 50, 4), meta=np.ndarray>

	Array	Chunk
Bytes	72.82 GB	120.00 MB
Shape	(30141520, 302, 4)	(300000, 50, 4)
Count	708 Tasks	707 Chunks
Type	int16	numpy.ndarray

call_genotype_mask

(variants, samples, ploidy)

bool

dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>

	Array	Chunk
Bytes	18.21 GB	30.00 MB
Shape	(30141520, 302, 2)	(300000, 50, 2)
Count	1415 Tasks	707 Chunks
Type	bool	numpy.ndarray

Attributes: (1)
contigs :
('KB663610', 'KB663611', 'KB663622', 'KB663633', 'KB663644', 'KB663655', 'KB663666', 'KB663677', 'KB663688', 'KB663699', 'KB663710', 'KB663721', 'KB663722', 'KB663733', 'KB663744', 'KB663755', 'KB663766', 'KB663777', 'KB663788', 'KB663799', 'KB663810', 'KB663821', 'KB663832', 'KB663833', 'KB663844', 'KB663855', 'KB663866', 'KB663877', 'KB663888', 'KB663899', 'KB663910', 'KB663921', 'KB663932', 'KB663943', 'KB663955', 'KB664054', 'KB664165', 'KB664255', 'KB664266', 'KB664277')

The arrays within this dataset are backed by Dask arrays, and can be accessed as shown below.

SNP positions and alleles#

# SNP positions (1-based)
pos = ds_snps['variant_position'].data
pos

	Array	Chunk
Bytes	120.57 MB	2.10 MB
Shape	(30141520,)	(524288,)
Count	59 Tasks	58 Chunks
Type	int32	numpy.ndarray

# read first 10 SNP positions into a numpy array
p = pos[:10].compute()
p

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=int32)

# SNP alleles (first column is the reference allele)
alleles = ds_snps['variant_allele'].data
alleles

	Array	Chunk
Bytes	120.57 MB	1.57 MB
Shape	(30141520, 4)	(524288, 3)
Count	292 Tasks	116 Chunks
Type	\|S1	numpy.ndarray

# read first 10 SNP alleles into a numpy array
a = alleles[:10, :].compute()
a

array([[b'T', b'C', b'A', b'G'],
       [b'T', b'C', b'A', b'G'],
       [b'C', b'A', b'G', b'T'],
       [b'G', b'C', b'A', b'T'],
       [b'T', b'C', b'A', b'G'],
       [b'C', b'A', b'G', b'T'],
       [b'A', b'C', b'G', b'T'],
       [b'T', b'C', b'A', b'G'],
       [b'T', b'C', b'A', b'G'],
       [b'G', b'C', b'A', b'T']], dtype='|S1')

Site filters#

SNP calling is not always reliable, and we have created site filters to allow excluding low quality SNPs. For each contig, a “filter_pass” Boolean mask is available, where True indicates that the site passed the filter and is accessible to high quality SNP calling.

# site filters
filter_pass = ds_snps['variant_filter_pass'].data
filter_pass

	Array	Chunk
Bytes	30.14 MB	941.92 kB
Shape	(30141520,)	(941923,)
Count	33 Tasks	32 Chunks
Type	bool	numpy.ndarray

# load site filters for first 10 SNPs
f = filter_pass[:10].compute()
f

array([False, False, False, False, False, False, False, False, False,
       False])

Note that we have chosen to genotype all samples at all sites in the genome, assuming all possible SNP alleles. Not all of these alternate alleles will actually have been observed in the samples. To determine which sites and alleles are segregating, an allele count can be performed over the samples you are interested in. See the example below.

SNP genotypes#

SNP genotypes for individual samples are available. Genotypes are stored as a three-dimensional array, where the first dimension corresponds to genomic positions, the second dimension is samples, and the third dimension is ploidy (2). Values coded as integers, where -1 represents a missing value, 0 represents the reference allele, and 1, 2, and 3 represent alternate alleles.

SNP genotypes can be accessed as dask arrays as shown in the example below.

gt = ds_snps['call_genotype'].data
gt

	Array	Chunk
Bytes	18.21 GB	30.00 MB
Shape	(30141520, 302, 2)	(300000, 50, 2)
Count	708 Tasks	707 Chunks
Type	int8	numpy.ndarray

Note that the columns of this array (second dimension) match the rows in the sample metadata, if the same sample sets were loaded. I.e.:

len(df_samples) == gt.shape[1]

True

You can use this correspondance to apply further subsetting operations to the genotypes by querying the sample metadata. E.g.:

df_samples.cohort.unique()

array(['PV', nan, 'RK2', 'RK1', 'TD'], dtype=object)

# select samples from the Thmar Da cohort
loc_cohort = df_samples.eval("cohort == 'TD'").values
print(f"found {np.count_nonzero(loc_cohort)} samples")
gt_cohort = da.compress(loc_cohort, gt, axis=1)
gt_cohort

found 41 samples

	Array	Chunk
Bytes	2.47 GB	20.40 MB
Shape	(30141520, 41, 2)	(300000, 34, 2)
Count	910 Tasks	202 Chunks
Type	int8	numpy.ndarray

Data can be read into memory as numpy arrays, e.g., read genotypes for the first 5 SNPs and the first 3 samples:

g = gt[:5, :3, :].compute()
g

array([[[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]]], dtype=int8)

If you want to work with the genotype calls, you may find it convenient to use scikit-allel. E.g., the code below sets up a genotype array.

# use the scikit-allel wrapper class for genotype calls
gtd = allel.GenotypeDaskArray(ds_snps['call_genotype'].data)
gtd

	0	1	2	3	4	...	297	298	299	300	301
0	0/0	0/0	0/0	0/0	0/0	...	0/0	0/0	0/0	0/0	0/0
1	0/0	0/0	0/0	0/0	0/0	...	0/0	0/0	0/0	0/0	0/0
2	0/0	0/0	0/0	0/0	0/0	...	0/0	0/0	0/0	0/0	0/0
...	...
30141517	./.	./.	./.	./.	./.	...	./.	./.	./.	./.	./.
30141518	./.	./.	./.	./.	./.	...	./.	./.	./.	./.	./.
30141519	./.	./.	./.	./.	./.	...	./.	./.	./.	./.	./.

Example computations#

Below are some examples of simple computations that can be run with these data.

Counting sites passing filters#

For each of the contigs for which SNP calling was performed, count the number of sites and the number passing site filters.

for contig in amin1.contigs:
    ds_snps = amin1.snp_calls(region=contig)
    filter_pass = ds_snps['variant_filter_pass'].data
    n_sites = ds_snps.dims['variants']
    n_pass = filter_pass.sum().compute()
    print(f"{contig}: {n_pass:,} out of {n_sites:,} ({n_pass/n_sites:.0%}) sites pass filters")

KB663610: 22,857,027 out of 30,141,520 (76%) sites pass filters
KB663611: 3,737,822 out of 5,453,172 (69%) sites pass filters
KB663622: 4,388,762 out of 5,461,660 (80%) sites pass filters
KB663633: 4,215,569 out of 5,450,627 (77%) sites pass filters
KB663644: 4,221,032 out of 5,413,357 (78%) sites pass filters
KB663655: 2,916,639 out of 3,744,733 (78%) sites pass filters
KB663666: 2,501,655 out of 3,719,057 (67%) sites pass filters
KB663677: 2,586,150 out of 3,624,725 (71%) sites pass filters
KB663688: 2,742,023 out of 3,544,647 (77%) sites pass filters
KB663699: 1,534,228 out of 1,873,197 (82%) sites pass filters
KB663710: 1,213,646 out of 1,685,057 (72%) sites pass filters
KB663721: 15,452,041 out of 20,762,843 (74%) sites pass filters
KB663722: 1,209,792 out of 1,659,122 (73%) sites pass filters
KB663733: 1,062,674 out of 1,537,061 (69%) sites pass filters
KB663744: 1,240,722 out of 1,464,294 (85%) sites pass filters
KB663755: 1,099,411 out of 1,252,480 (88%) sites pass filters
KB663766: 1,178,051 out of 1,313,725 (90%) sites pass filters
KB663777: 1,012,298 out of 1,306,825 (77%) sites pass filters
KB663788: 1,125,292 out of 1,246,996 (90%) sites pass filters
KB663799: 966,257 out of 1,234,260 (78%) sites pass filters
KB663810: 800,823 out of 928,103 (86%) sites pass filters
KB663821: 757,864 out of 875,898 (87%) sites pass filters
KB663832: 11,641,571 out of 16,000,346 (73%) sites pass filters
KB663833: 395,479 out of 554,130 (71%) sites pass filters
KB663844: 482,803 out of 602,452 (80%) sites pass filters
KB663855: 356,534 out of 428,439 (83%) sites pass filters
KB663866: 437,007 out of 472,977 (92%) sites pass filters
KB663877: 172,685 out of 273,855 (63%) sites pass filters
KB663888: 171,639 out of 221,033 (78%) sites pass filters
KB663899: 29,212 out of 87,710 (33%) sites pass filters
KB663910: 105,536 out of 150,741 (70%) sites pass filters
KB663921: 29,136 out of 87,604 (33%) sites pass filters
KB663932: 100,838 out of 133,376 (76%) sites pass filters
KB663943: 10,913,259 out of 14,689,009 (74%) sites pass filters
KB663955: 24,703 out of 55,494 (45%) sites pass filters
KB664054: 8,476,484 out of 11,726,578 (72%) sites pass filters
KB664165: 7,711,730 out of 9,974,421 (77%) sites pass filters
KB664255: 6,529,746 out of 9,221,028 (71%) sites pass filters
KB664266: 6,594,776 out of 8,849,743 (75%) sites pass filters
KB664277: 5,168,159 out of 6,068,897 (85%) sites pass filters

Counting segregating sites#

Count the number of segregating SNPs that also pass site filters, for a single contig.

# choose contig
region = "KB663610"

# access SNP data
ds_snps = amin1.snp_calls(region=region)
ds_snps

<xarray.Dataset>
Dimensions:              (alleles: 4, ploidy: 2, samples: 302, variants: 30141520)
Coordinates:
    variant_position     (variants) int32 dask.array<chunksize=(524288,), meta=np.ndarray>
    variant_contig       (variants) uint8 dask.array<chunksize=(524288,), meta=np.ndarray>
    sample_id            (samples) |S24 dask.array<chunksize=(302,), meta=np.ndarray>
Dimensions without coordinates: alleles, ploidy, samples, variants
Data variables:
    variant_allele       (variants, alleles) |S1 dask.array<chunksize=(524288, 1), meta=np.ndarray>
    variant_filter_pass  (variants) bool dask.array<chunksize=(941923,), meta=np.ndarray>
    call_genotype        (variants, samples, ploidy) int8 dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>
    call_GQ              (variants, samples) int8 dask.array<chunksize=(300000, 50), meta=np.ndarray>
    call_MQ              (variants, samples) float32 dask.array<chunksize=(300000, 50), meta=np.ndarray>
    call_AD              (variants, samples, alleles) int16 dask.array<chunksize=(300000, 50, 4), meta=np.ndarray>
    call_genotype_mask   (variants, samples, ploidy) bool dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>
Attributes:
    contigs:  ('KB663610', 'KB663611', 'KB663622', 'KB663633', 'KB663644', 'K...

xarray.Dataset

Dimensions:
- alleles: 4
- ploidy: 2
- samples: 302
- variants: 30141520

Coordinates: (3)

variant_position

(variants)

int32

dask.array<chunksize=(524288,), meta=np.ndarray>

	Array	Chunk
Bytes	120.57 MB	2.10 MB
Shape	(30141520,)	(524288,)
Count	59 Tasks	58 Chunks
Type	int32	numpy.ndarray

variant_contig

(variants)

uint8

dask.array<chunksize=(524288,), meta=np.ndarray>

	Array	Chunk
Bytes	30.14 MB	524.29 kB
Shape	(30141520,)	(524288,)
Count	58 Tasks	58 Chunks
Type	uint8	numpy.ndarray

sample_id

(samples)

|S24

dask.array<chunksize=(302,), meta=np.ndarray>

	Array	Chunk
Bytes	7.25 kB	7.25 kB
Shape	(302,)	(302,)
Count	2 Tasks	1 Chunks
Type	\|S24	numpy.ndarray

Data variables: (7)

variant_allele

(variants, alleles)

|S1

dask.array<chunksize=(524288, 1), meta=np.ndarray>

	Array	Chunk
Bytes	120.57 MB	1.57 MB
Shape	(30141520, 4)	(524288, 3)
Count	292 Tasks	116 Chunks
Type	\|S1	numpy.ndarray

variant_filter_pass

(variants)

bool

dask.array<chunksize=(941923,), meta=np.ndarray>

	Array	Chunk
Bytes	30.14 MB	941.92 kB
Shape	(30141520,)	(941923,)
Count	33 Tasks	32 Chunks
Type	bool	numpy.ndarray

call_genotype

(variants, samples, ploidy)

int8

dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>

	Array	Chunk
Bytes	18.21 GB	30.00 MB
Shape	(30141520, 302, 2)	(300000, 50, 2)
Count	708 Tasks	707 Chunks
Type	int8	numpy.ndarray

call_GQ

(variants, samples)

int8

dask.array<chunksize=(300000, 50), meta=np.ndarray>

	Array	Chunk
Bytes	9.10 GB	15.00 MB
Shape	(30141520, 302)	(300000, 50)
Count	708 Tasks	707 Chunks
Type	int8	numpy.ndarray

call_MQ

(variants, samples)

float32

dask.array<chunksize=(300000, 50), meta=np.ndarray>

	Array	Chunk
Bytes	36.41 GB	60.00 MB
Shape	(30141520, 302)	(300000, 50)
Count	708 Tasks	707 Chunks
Type	float32	numpy.ndarray

call_AD

(variants, samples, alleles)

int16

dask.array<chunksize=(300000, 50, 4), meta=np.ndarray>

	Array	Chunk
Bytes	72.82 GB	120.00 MB
Shape	(30141520, 302, 4)	(300000, 50, 4)
Count	708 Tasks	707 Chunks
Type	int16	numpy.ndarray

call_genotype_mask

(variants, samples, ploidy)

bool

dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>

	Array	Chunk
Bytes	18.21 GB	30.00 MB
Shape	(30141520, 302, 2)	(300000, 50, 2)
Count	1415 Tasks	707 Chunks
Type	bool	numpy.ndarray

Attributes: (1)
contigs :
('KB663610', 'KB663611', 'KB663622', 'KB663633', 'KB663644', 'KB663655', 'KB663666', 'KB663677', 'KB663688', 'KB663699', 'KB663710', 'KB663721', 'KB663722', 'KB663733', 'KB663744', 'KB663755', 'KB663766', 'KB663777', 'KB663788', 'KB663799', 'KB663810', 'KB663821', 'KB663832', 'KB663833', 'KB663844', 'KB663855', 'KB663866', 'KB663877', 'KB663888', 'KB663899', 'KB663910', 'KB663921', 'KB663932', 'KB663943', 'KB663955', 'KB664054', 'KB664165', 'KB664255', 'KB664266', 'KB664277')

# locate pass sites
loc_pass = ds_snps['variant_filter_pass'].values
loc_pass

array([False, False, False, ..., False, False, False])

# access genotypes
gt = allel.GenotypeDaskArray(ds_snps['call_genotype'].data)
gt

	0	1	2	3	4	...	297	298	299	300	301
0	0/0	0/0	0/0	0/0	0/0	...	0/0	0/0	0/0	0/0	0/0
1	0/0	0/0	0/0	0/0	0/0	...	0/0	0/0	0/0	0/0	0/0
2	0/0	0/0	0/0	0/0	0/0	...	0/0	0/0	0/0	0/0	0/0
...	...
30141517	./.	./.	./.	./.	./.	...	./.	./.	./.	./.	./.
30141518	./.	./.	./.	./.	./.	...	./.	./.	./.	./.	./.
30141519	./.	./.	./.	./.	./.	...	./.	./.	./.	./.	./.

# perform an allele count over genotypes
with ProgressBar():
    ac = gt.count_alleles(max_allele=3).compute()
ac

[########################################] | 100% Completed |  4min 58.5s

	0	1	2	3
0	604	0	0	0
1	604	0	0	0
2	604	0	0	0
...	...
30141517	0	0	0	8
30141518	6	0	0	0
30141519	4	0	0	0

# locate pass sites
loc_pass = ds_snps['variant_filter_pass'].values
ac_pass = ac[loc_pass]
ac_pass

	0	1	2	3
0	604	0	0	0
1	604	0	0	0
2	604	0	0	0
...	...
22857024	604	0	0	0
22857025	604	0	0	0
22857026	604	0	0	0

# count segregating sites
n_pass_seg = np.count_nonzero(ac_pass.is_segregating())
print(f"No. segregating sites: {n_pass_seg:,}")

No. segregating sites: 6,918,663

allelism_count = np.bincount(ac_pass.allelism())
print(f"No. invariant sites: {allelism_count[1]:,}")
print(f"No. biallelic SNPs: {allelism_count[2]:,}")
print(f"No. triallelic SNPs: {allelism_count[3]:,}")
print(f"No. quadriallelic SNPs: {allelism_count[4]:,}")

No. invariant sites: 15,938,364
No. biallelic SNPs: 5,922,306
No. triallelic SNPs: 938,414
No. quadriallelic SNPs: 57,943

Feedback and suggestions#

If there are particular analyses you would like to run, or if you have other suggestions for useful documentation we could add to this site, we would love to know, please get in touch via the malariagen/vector-data GitHub discussion board.

Amin1.0 cloud data access

Contents

Amin1.0 cloud data access#

Data hosting#

Setup#

Sample metadata#

Reference genome#

SNP calls#

SNP positions and alleles#

Site filters#

SNP genotypes#

Example computations#

Counting sites passing filters#

Counting segregating sites#

Feedback and suggestions#