Amin1.0 cloud data access#
This page provides information about how to access data from the Amin1.0 resource via Google Cloud. This includes sample metadata and single nucleotide polymorphism (SNP) calls.
This notebook illustrates how to read data directly from the cloud, without having to first download any data locally. This notebook can be run from any computer, but will work best when run from a compute node within Google Cloud, because the compute will be physically closer to the data and so data transfer will be faster. For example, this notebook can be run via Google Colab, which is a free interactive computing service running in the cloud.
To launch this notebook in the cloud and run it for yourself, click the launch icon at the top of the page and select one of the cloud computing services available.
Data hosting#
All data required for this notebook are hosted on Google Cloud Storage (GCS), in the vo_amin_release bucket, which is a multi-region bucket located in the United States. All data hosted in GCS are publicly accessible and do not require any authentication to access.
Setup#
Running this notebook requires some Python packages to be installed. These packages can be installed via pip or conda. E.g.:
!pip install -q malariagen_data
To make accessing these data more convenient, we’ve created the malariagen_data Python package, which is available from PyPI. This package is experimental, so please let us know if you find any bugs or have any suggestions.
Now import the packages we’ll need to use here.
import malariagen_data
import numpy as np
import dask.array as da
from dask.diagnostics.progress import ProgressBar
import dask
# keep large chunks produced by slicing operations rather than splitting them
# (avoids a dask performance warning)
dask.config.set(**{'array.slicing.split_large_chunks': False})
import allel
Data access from Google Cloud is set up with the following code:
amin1 = malariagen_data.Amin1()
Sample metadata#
Data about the samples that were sequenced to generate this data resource are available, including the time and place of collection, the sex of the specimen, and our call regarding the species of the specimen. These are organised by sample set.
Load sample metadata into a pandas DataFrame:
df_samples = amin1.sample_metadata()
df_samples
| | sample_id | original_sample_id | sanger_sample_id | partner_sample_id | contributor | country | location | year | month | latitude | longitude | season | PCA_cohort | cohort | subsampled_cohort |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | VBS09378-4248STDY7308980 | VBS09378 | 4248STDY7308980 | CB-2-00264 | Brandy St. Laurent | Cambodia | Preah Kleang | 2016 | 3 | 13.667 | 104.982 | Feb-Apr (late dry) | A | PV | NaN |
| 1 | VBS09382-4248STDY7308981 | VBS09382 | 4248STDY7308981 | CB-2-00258 | Brandy St. Laurent | Cambodia | Preah Kleang | 2016 | 3 | 13.667 | 104.982 | Feb-Apr (late dry) | A | PV | NaN |
| 2 | VBS09397-4248STDY7308982 | VBS09397 | 4248STDY7308982 | CB-2-00384 | Brandy St. Laurent | Cambodia | Preah Kleang | 2016 | 3 | 13.667 | 104.982 | Feb-Apr (late dry) | A | PV | PV |
| 3 | VBS09460-4248STDY7308986 | VBS09460 | 4248STDY7308986 | CB-2-02960 | Brandy St. Laurent | Cambodia | Preah Kleang | 2016 | 6 | 13.667 | 104.982 | May-Jul (early wet) | A | PV | NaN |
| 4 | VBS09466-4248STDY7308989 | VBS09466 | 4248STDY7308989 | CB-2-04070 | Brandy St. Laurent | Cambodia | Preah Kleang | 2016 | 11 | 13.667 | 104.982 | Nov-Jan (early dry) | A | PV | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 297 | VBS16624-4248STDY7918667 | VBS16624 | 4248STDY7918667 | KV-32-01591 | Brandy St. Laurent | Cambodia | Sayas | 2014 | 6 | 13.548 | 107.025 | May-Jul (early wet) | C | RK2 | RK2 |
| 298 | VBS16625-4248STDY7918668 | VBS16625 | 4248STDY7918668 | KV-32-01499 | Brandy St. Laurent | Cambodia | Sayas | 2014 | 6 | 13.548 | 107.025 | May-Jul (early wet) | C | RK2 | RK2 |
| 299 | VBS16626-4248STDY7918669 | VBS16626 | 4248STDY7918669 | KV-32-01465 | Brandy St. Laurent | Cambodia | Sayas | 2014 | 6 | 13.548 | 107.025 | May-Jul (early wet) | B | RK1 | RK1 |
| 300 | VBS16628-4248STDY7918670 | VBS16628 | 4248STDY7918670 | KV-32-01454 | Brandy St. Laurent | Cambodia | Sayas | 2014 | 6 | 13.548 | 107.025 | May-Jul (early wet) | C | RK2 | RK2 |
| 301 | VBS16630-4248STDY7918671 | VBS16630 | 4248STDY7918671 | KV-31-01949 | Brandy St. Laurent | Cambodia | Chamkar San | 2014 | 4 | 13.595 | 106.995 | Feb-Apr (late dry) | B | RK1 | RK1 |
302 rows × 15 columns
The sample_id column gives the sample identifier used throughout all analyses.
The country, location, latitude and longitude columns give the location where the specimen was collected.
The year and month columns give the approximate date when the specimen was collected.
The cohort column gives an assignment of individual mosquitoes to populations based on location of sampling and genetic population structure.
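For example, the number of samples assigned to each cohort can be checked with a quick query on the df_samples DataFrame loaded above (a minimal sketch; dropna=False also counts samples without a cohort assignment):

```python
# count samples per cohort, including samples with no cohort assignment
df_samples["cohort"].value_counts(dropna=False)
```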
Pandas can be used to explore and query the sample metadata in various ways. E.g., here is a summary of the number of samples by collection site and year:
df_summary = df_samples.pivot_table(
index=["longitude", "latitude", "location"],
columns=["year"],
values="sample_id",
aggfunc=len,
fill_value=0
)
df_summary.style.set_caption("Number of mosquito specimens by collection site and year.")
Number of mosquito specimens by collection site and year.

| longitude | latitude | location | 2010 | 2011 | 2014 | 2015 | 2016 |
|---|---|---|---|---|---|---|---|
| 102.735 | 12.155 | Thmar Da | 26 | 15 | 0 | 0 | 0 |
| 104.92 | 13.77 | Chean Mok | 0 | 0 | 66 | 9 | 0 |
| 104.982 | 13.667 | Preah Kleang | 0 | 0 | 47 | 9 | 36 |
| 106.995 | 13.595 | Chamkar San | 0 | 0 | 40 | 11 | 0 |
| 107.025 | 13.548 | Sayas | 0 | 0 | 39 | 4 | 0 |
Reference genome#
Sequence data in this study were aligned to the MINIMUS1 reference genome. This reference genome contains 678 contigs in total, but many contigs are small and not suitable for population genetic analyses. We therefore have included only SNP calls for the 40 largest contigs. The set of contigs used for SNP calling can be accessed as follows:
amin1.contigs
('KB663610',
'KB663611',
'KB663622',
'KB663633',
'KB663644',
'KB663655',
'KB663666',
'KB663677',
'KB663688',
'KB663699',
'KB663710',
'KB663721',
'KB663722',
'KB663733',
'KB663744',
'KB663755',
'KB663766',
'KB663777',
'KB663788',
'KB663799',
'KB663810',
'KB663821',
'KB663832',
'KB663833',
'KB663844',
'KB663855',
'KB663866',
'KB663877',
'KB663888',
'KB663899',
'KB663910',
'KB663921',
'KB663932',
'KB663943',
'KB663955',
'KB664054',
'KB664165',
'KB664255',
'KB664266',
'KB664277')
For convenience, the reference genome sequence for any contig can be loaded as a NumPy array, e.g.:
# load the reference sequence for a single contig as a numpy array
seq = amin1.genome_sequence(region="KB663610").compute()
seq
array([b'T', b'T', b'C', ..., b'C', b'A', b'C'], dtype='|S1')
# length of contig
len(seq)
31626230
# number of called bases in contig
np.sum((seq != b'N') & (seq != b'n'))
30141520
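If you want the assembled length of every contig included in the resource, the sequence lengths can be read from the array shapes without downloading any sequence data, because genome_sequence() returns a lazily evaluated dask array (a minimal sketch, using only the API shown above):

```python
# total assembled length of the 40 contigs used for SNP calling,
# obtained from array shapes without loading sequence data
total_length = sum(
    amin1.genome_sequence(region=contig).shape[0]
    for contig in amin1.contigs
)
print(f"Total length of SNP-called contigs: {total_length:,} bp")
```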
SNP calls#
We have called SNP genotypes in all samples at all positions in the genome where the reference allele is not “N”. Data on the SNP positions, alleles, site filters and genotype calls for a given contig can be accessed as an xarray Dataset. E.g., access SNP calls for contig KB663610:
ds_snps = amin1.snp_calls(region="KB663610", site_mask=False)
ds_snps
<xarray.Dataset>
Dimensions:              (alleles: 4, ploidy: 2, samples: 302, variants: 30141520)
Coordinates:
    variant_position     (variants) int32 dask.array<chunksize=(524288,), meta=np.ndarray>
    variant_contig       (variants) uint8 dask.array<chunksize=(524288,), meta=np.ndarray>
    sample_id            (samples) |S24 dask.array<chunksize=(302,), meta=np.ndarray>
Dimensions without coordinates: alleles, ploidy, samples, variants
Data variables:
    variant_allele       (variants, alleles) |S1 dask.array<chunksize=(524288, 1), meta=np.ndarray>
    variant_filter_pass  (variants) bool dask.array<chunksize=(941923,), meta=np.ndarray>
    call_genotype        (variants, samples, ploidy) int8 dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>
    call_GQ              (variants, samples) int8 dask.array<chunksize=(300000, 50), meta=np.ndarray>
    call_MQ              (variants, samples) float32 dask.array<chunksize=(300000, 50), meta=np.ndarray>
    call_AD              (variants, samples, alleles) int16 dask.array<chunksize=(300000, 50, 4), meta=np.ndarray>
    call_genotype_mask   (variants, samples, ploidy) bool dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>
Attributes:
    contigs:  ('KB663610', 'KB663611', 'KB663622', 'KB663633', 'KB663644', 'K...
The arrays within this dataset are backed by Dask arrays, and can be accessed as shown below.
SNP positions and alleles#
# SNP positions (1-based)
pos = ds_snps['variant_position'].data
pos
# read first 10 SNP positions into a numpy array
p = pos[:10].compute()
p
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=int32)
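Positions are only reported where the reference base is called, so a given genomic coordinate cannot simply be converted to an array index by subtracting one. Instead, you can search the sorted positions, e.g. for an arbitrary coordinate such as 1,000,000 (a sketch; the positions for this contig are first loaded into memory, roughly 120 MB):

```python
# load SNP positions for this contig into memory and find the array
# index of genomic coordinate 1,000,000 (arbitrary example); if that
# coordinate is an "N" site, this gives the next called position
pos_np = pos.compute()
idx = int(np.searchsorted(pos_np, 1_000_000))
idx, int(pos_np[idx])
```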
# SNP alleles (first column is the reference allele)
alleles = ds_snps['variant_allele'].data
alleles
# read first 10 SNP alleles into a numpy array
a = alleles[:10, :].compute()
a
array([[b'T', b'C', b'A', b'G'],
[b'T', b'C', b'A', b'G'],
[b'C', b'A', b'G', b'T'],
[b'G', b'C', b'A', b'T'],
[b'T', b'C', b'A', b'G'],
[b'C', b'A', b'G', b'T'],
[b'A', b'C', b'G', b'T'],
[b'T', b'C', b'A', b'G'],
[b'T', b'C', b'A', b'G'],
[b'G', b'C', b'A', b'T']], dtype='|S1')
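As a quick cross-check, the first allele column should match the reference genome sequence at the corresponding positions. Here is a minimal sketch using the seq array loaded earlier for the same contig and the first 10 SNPs (note the comparison is case-sensitive):

```python
# compare the reference allele column against the reference sequence
# at the corresponding (1-based) positions for the first 10 SNPs
ref_bases = seq[p - 1]
np.all(ref_bases == a[:, 0])
```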
Site filters#
SNP calling is not always reliable, and we have created site filters to allow excluding low quality SNPs. For each contig, a “filter_pass” Boolean mask is available, where True indicates that the site passed the filter and is accessible to high quality SNP calling.
# site filters
filter_pass = ds_snps['variant_filter_pass'].data
filter_pass
# load site filters for first 10 SNPs
f = filter_pass[:10].compute()
f
array([False, False, False, False, False, False, False, False, False,
False])
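As a quick summary, the overall proportion of sites on this contig passing the site filters can be computed with a single dask reduction (a small sketch; a genome-wide breakdown is given in the example computations below):

```python
# fraction of sites on this contig passing the site filters;
# only the summary statistic is transferred, not the full mask
float(filter_pass.mean().compute())
```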
Note that we have chosen to genotype all samples at all sites in the genome, assuming all possible SNP alleles. Not all of these alternate alleles will actually have been observed in the samples. To determine which sites and alleles are segregating, an allele count can be performed over the samples you are interested in. See the example below.
SNP genotypes#
SNP genotypes for individual samples are available. Genotypes are stored as a three-dimensional array, where the first dimension corresponds to genomic positions, the second dimension is samples, and the third dimension is ploidy (2). Values are coded as integers, where -1 represents a missing value, 0 represents the reference allele, and 1, 2, and 3 represent alternate alleles.
SNP genotypes can be accessed as dask arrays as shown in the example below.
gt = ds_snps['call_genotype'].data
gt
Note that the columns of this array (second dimension) match the rows in the sample metadata, if the same sample sets were loaded. I.e.:
len(df_samples) == gt.shape[1]
True
You can use this correspondence to apply further subsetting operations to the genotypes by querying the sample metadata. E.g.:
df_samples.cohort.unique()
array(['PV', nan, 'RK2', 'RK1', 'TD'], dtype=object)
# select samples from the Thmar Da cohort
loc_cohort = df_samples.eval("cohort == 'TD'").values
print(f"found {np.count_nonzero(loc_cohort)} samples")
gt_cohort = da.compress(loc_cohort, gt, axis=1)
gt_cohort
found 41 samples
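Computations can then be restricted to that cohort. E.g., here is a minimal sketch counting the fraction of missing genotype calls among the selected samples, over an arbitrary slice of the first 100,000 sites to keep the computation small:

```python
# fraction of missing genotype calls (-1) in the Thmar Da cohort,
# over the first 100,000 sites only (arbitrary slice for illustration)
gt_cohort_head = gt_cohort[:100_000].compute()
float(np.mean(gt_cohort_head < 0))
```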
Data can be read into memory as numpy arrays, e.g., read genotypes for the first 5 SNPs and the first 3 samples:
g = gt[:5, :3, :].compute()
g
array([[[0, 0],
[0, 0],
[0, 0]],
[[0, 0],
[0, 0],
[0, 0]],
[[0, 0],
[0, 0],
[0, 0]],
[[0, 0],
[0, 0],
[0, 0]],
[[0, 0],
[0, 0],
[0, 0]]], dtype=int8)
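As a small illustration of the integer coding described above, these calls can be translated back into nucleotides using the alleles array loaded earlier for the same contig (a sketch; note that missing calls, coded -1, would need to be masked before doing this):

```python
# map genotype codes (0 = reference, 1-3 = alternates) to nucleotides
# for the first 5 SNPs and first 3 samples
a5 = alleles[:5].compute()
a5[np.arange(5)[:, None, None], g]
```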
If you want to work with the genotype calls, you may find it convenient to use scikit-allel. E.g., the code below sets up a genotype array.
# use the scikit-allel wrapper class for genotype calls
gtd = allel.GenotypeDaskArray(ds_snps['call_genotype'].data)
gtd
| | 0 | 1 | 2 | 3 | 4 | ... | 297 | 298 | 299 | 300 | 301 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| 1 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| 2 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 30141517 | ./. | ./. | ./. | ./. | ./. | ... | ./. | ./. | ./. | ./. | ./. |
| 30141518 | ./. | ./. | ./. | ./. | ./. | ... | ./. | ./. | ./. | ./. | ./. |
| 30141519 | ./. | ./. | ./. | ./. | ./. | ... | ./. | ./. | ./. | ./. | ./. |
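Methods on this wrapper are evaluated lazily via dask. E.g., here is a minimal sketch counting missing genotype calls per sample, restricted to an arbitrary slice of the first 1,000,000 sites so that it computes quickly:

```python
# number of missing genotype calls per sample over the first
# 1,000,000 sites (arbitrary slice); show the first 10 samples
n_missing = gtd[:1_000_000].is_missing().sum(axis=0).compute()
n_missing[:10]
```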
Example computations#
Below are some examples of simple computations that can be run with these data.
Counting sites passing filters#
For each of the contigs for which SNP calling was performed, count the number of sites and the number passing site filters.
for contig in amin1.contigs:
ds_snps = amin1.snp_calls(region=contig)
filter_pass = ds_snps['variant_filter_pass'].data
n_sites = ds_snps.dims['variants']
n_pass = filter_pass.sum().compute()
print(f"{contig}: {n_pass:,} out of {n_sites:,} ({n_pass/n_sites:.0%}) sites pass filters")
KB663610: 22,857,027 out of 30,141,520 (76%) sites pass filters
KB663611: 3,737,822 out of 5,453,172 (69%) sites pass filters
KB663622: 4,388,762 out of 5,461,660 (80%) sites pass filters
KB663633: 4,215,569 out of 5,450,627 (77%) sites pass filters
KB663644: 4,221,032 out of 5,413,357 (78%) sites pass filters
KB663655: 2,916,639 out of 3,744,733 (78%) sites pass filters
KB663666: 2,501,655 out of 3,719,057 (67%) sites pass filters
KB663677: 2,586,150 out of 3,624,725 (71%) sites pass filters
KB663688: 2,742,023 out of 3,544,647 (77%) sites pass filters
KB663699: 1,534,228 out of 1,873,197 (82%) sites pass filters
KB663710: 1,213,646 out of 1,685,057 (72%) sites pass filters
KB663721: 15,452,041 out of 20,762,843 (74%) sites pass filters
KB663722: 1,209,792 out of 1,659,122 (73%) sites pass filters
KB663733: 1,062,674 out of 1,537,061 (69%) sites pass filters
KB663744: 1,240,722 out of 1,464,294 (85%) sites pass filters
KB663755: 1,099,411 out of 1,252,480 (88%) sites pass filters
KB663766: 1,178,051 out of 1,313,725 (90%) sites pass filters
KB663777: 1,012,298 out of 1,306,825 (77%) sites pass filters
KB663788: 1,125,292 out of 1,246,996 (90%) sites pass filters
KB663799: 966,257 out of 1,234,260 (78%) sites pass filters
KB663810: 800,823 out of 928,103 (86%) sites pass filters
KB663821: 757,864 out of 875,898 (87%) sites pass filters
KB663832: 11,641,571 out of 16,000,346 (73%) sites pass filters
KB663833: 395,479 out of 554,130 (71%) sites pass filters
KB663844: 482,803 out of 602,452 (80%) sites pass filters
KB663855: 356,534 out of 428,439 (83%) sites pass filters
KB663866: 437,007 out of 472,977 (92%) sites pass filters
KB663877: 172,685 out of 273,855 (63%) sites pass filters
KB663888: 171,639 out of 221,033 (78%) sites pass filters
KB663899: 29,212 out of 87,710 (33%) sites pass filters
KB663910: 105,536 out of 150,741 (70%) sites pass filters
KB663921: 29,136 out of 87,604 (33%) sites pass filters
KB663932: 100,838 out of 133,376 (76%) sites pass filters
KB663943: 10,913,259 out of 14,689,009 (74%) sites pass filters
KB663955: 24,703 out of 55,494 (45%) sites pass filters
KB664054: 8,476,484 out of 11,726,578 (72%) sites pass filters
KB664165: 7,711,730 out of 9,974,421 (77%) sites pass filters
KB664255: 6,529,746 out of 9,221,028 (71%) sites pass filters
KB664266: 6,594,776 out of 8,849,743 (75%) sites pass filters
KB664277: 5,168,159 out of 6,068,897 (85%) sites pass filters
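If you would rather have this summary as a table for further analysis or plotting, the per-contig counts can be collected into a pandas DataFrame instead of being printed. This is a sketch that repeats the same computation as the loop above:

```python
import pandas as pd

# collect per-contig site counts into a DataFrame
records = []
for contig in amin1.contigs:
    ds = amin1.snp_calls(region=contig)
    n_sites = ds.dims["variants"]
    n_pass = int(ds["variant_filter_pass"].data.sum().compute())
    records.append({"contig": contig, "n_sites": n_sites, "n_pass": n_pass})
df_filter_pass = pd.DataFrame.from_records(records)
df_filter_pass["frac_pass"] = df_filter_pass["n_pass"] / df_filter_pass["n_sites"]
df_filter_pass.head()
```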
Counting segregating sites#
Count the number of segregating SNPs that also pass site filters, for a single contig.
# choose contig
region = "KB663610"
# access SNP data
ds_snps = amin1.snp_calls(region=region)
ds_snps
<xarray.Dataset>
Dimensions:              (alleles: 4, ploidy: 2, samples: 302, variants: 30141520)
Coordinates:
    variant_position     (variants) int32 dask.array<chunksize=(524288,), meta=np.ndarray>
    variant_contig       (variants) uint8 dask.array<chunksize=(524288,), meta=np.ndarray>
    sample_id            (samples) |S24 dask.array<chunksize=(302,), meta=np.ndarray>
Dimensions without coordinates: alleles, ploidy, samples, variants
Data variables:
    variant_allele       (variants, alleles) |S1 dask.array<chunksize=(524288, 1), meta=np.ndarray>
    variant_filter_pass  (variants) bool dask.array<chunksize=(941923,), meta=np.ndarray>
    call_genotype        (variants, samples, ploidy) int8 dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>
    call_GQ              (variants, samples) int8 dask.array<chunksize=(300000, 50), meta=np.ndarray>
    call_MQ              (variants, samples) float32 dask.array<chunksize=(300000, 50), meta=np.ndarray>
    call_AD              (variants, samples, alleles) int16 dask.array<chunksize=(300000, 50, 4), meta=np.ndarray>
    call_genotype_mask   (variants, samples, ploidy) bool dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>
Attributes:
    contigs:  ('KB663610', 'KB663611', 'KB663622', 'KB663633', 'KB663644', 'K...
# locate pass sites
loc_pass = ds_snps['variant_filter_pass'].values
loc_pass
array([False, False, False, ..., False, False, False])
# access genotypes
gt = allel.GenotypeDaskArray(ds_snps['call_genotype'].data)
gt
| | 0 | 1 | 2 | 3 | 4 | ... | 297 | 298 | 299 | 300 | 301 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| 1 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| 2 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | ... | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 30141517 | ./. | ./. | ./. | ./. | ./. | ... | ./. | ./. | ./. | ./. | ./. |
| 30141518 | ./. | ./. | ./. | ./. | ./. | ... | ./. | ./. | ./. | ./. | ./. |
| 30141519 | ./. | ./. | ./. | ./. | ./. | ... | ./. | ./. | ./. | ./. | ./. |
# perform an allele count over genotypes
with ProgressBar():
ac = gt.count_alleles(max_allele=3).compute()
ac
[########################################] | 100% Completed | 4min 58.5s
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 604 | 0 | 0 | 0 |
| 1 | 604 | 0 | 0 | 0 |
| 2 | 604 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... |
| 30141517 | 0 | 0 | 0 | 8 |
| 30141518 | 6 | 0 | 0 | 0 |
| 30141519 | 4 | 0 | 0 | 0 |
# subset the allele counts to sites passing the site filters
ac_pass = ac[loc_pass]
ac_pass
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 604 | 0 | 0 | 0 |
| 1 | 604 | 0 | 0 | 0 |
| 2 | 604 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... |
| 22857024 | 604 | 0 | 0 | 0 |
| 22857025 | 604 | 0 | 0 | 0 |
| 22857026 | 604 | 0 | 0 | 0 |
# count segregating sites
n_pass_seg = np.count_nonzero(ac_pass.is_segregating())
print(f"No. segregating sites: {n_pass_seg:,}")
No. segregating sites: 6,918,663
# count the number of distinct alleles observed at each passing site
allelism_count = np.bincount(ac_pass.allelism())
print(f"No. invariant sites: {allelism_count[1]:,}")
print(f"No. biallelic SNPs: {allelism_count[2]:,}")
print(f"No. triallelic SNPs: {allelism_count[3]:,}")
print(f"No. quadriallelic SNPs: {allelism_count[4]:,}")
No. invariant sites: 15,938,364
No. biallelic SNPs: 5,922,306
No. triallelic SNPs: 938,414
No. quadriallelic SNPs: 57,943
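As a quick consistency check, segregating sites are exactly those where 2, 3 or 4 alleles are observed, so these counts should sum to the number of segregating sites reported above:

```python
# biallelic + triallelic + quadriallelic should equal the number
# of segregating sites counted above
int(allelism_count[2:].sum()) == n_pass_seg
```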
Feedback and suggestions#
If there are particular analyses you would like to run, or if you have other suggestions for useful documentation we could add to this site, we would love to know. Please get in touch via the malariagen/vector-data GitHub discussion board.