Amin1.0 cloud data access#

This page provides information about how to access data from the Amin1.0 resource via Google Cloud. This includes sample metadata and single nucleotide polymorphism (SNP) calls.

This notebook illustrates how to read data directly from the cloud, without having to first download any data locally. This notebook can be run from any computer, but will work best when run from a compute node within Google Cloud, because it will be physically closer to the data and so data transfers will be faster. For example, this notebook can be run via Google Colab, which is a free interactive computing service running in the cloud.

To launch this notebook in the cloud and run it for yourself, click the launch icon at the top of the page and select one of the available cloud computing services.

Data hosting#

All data required for this notebook are hosted on Google Cloud Storage (GCS), in the vo_amin_release bucket, which is a multi-region bucket located in the United States. All data hosted in GCS are publicly accessible and do not require any authentication to access.

Setup#

Running this notebook requires some Python packages, which can be installed via pip or conda. E.g.:

!pip install -q malariagen_data

To make accessing these data more convenient, we’ve created the malariagen_data Python package, which is available from PyPI. This is experimental so please let us know if you find any bugs or have any suggestions.

Now import the packages we’ll need to use here.

import malariagen_data
import numpy as np
import dask.array as da
from dask.diagnostics.progress import ProgressBar
import dask
dask.config.set(**{'array.slicing.split_large_chunks': False})
import allel

Data access from Google Cloud is set up with the following code:

amin1 = malariagen_data.Amin1()

Sample metadata#

Data about the samples that were sequenced to generate this data resource are available, including the time and place of collection, the sex of the specimen, and our call regarding the species of the specimen. These are organised by sample set.

Load sample metadata into a pandas DataFrame:

df_samples = amin1.sample_metadata()
df_samples
sample_id original_sample_id sanger_sample_id partner_sample_id contributor country location year month latitude longitude season PCA_cohort cohort subsampled_cohort
0 VBS09378-4248STDY7308980 VBS09378 4248STDY7308980 CB-2-00264 Brandy St. Laurent Cambodia Preah Kleang 2016 3 13.667 104.982 Feb-Apr (late dry) A PV NaN
1 VBS09382-4248STDY7308981 VBS09382 4248STDY7308981 CB-2-00258 Brandy St. Laurent Cambodia Preah Kleang 2016 3 13.667 104.982 Feb-Apr (late dry) A PV NaN
2 VBS09397-4248STDY7308982 VBS09397 4248STDY7308982 CB-2-00384 Brandy St. Laurent Cambodia Preah Kleang 2016 3 13.667 104.982 Feb-Apr (late dry) A PV PV
3 VBS09460-4248STDY7308986 VBS09460 4248STDY7308986 CB-2-02960 Brandy St. Laurent Cambodia Preah Kleang 2016 6 13.667 104.982 May-Jul (early wet) A PV NaN
4 VBS09466-4248STDY7308989 VBS09466 4248STDY7308989 CB-2-04070 Brandy St. Laurent Cambodia Preah Kleang 2016 11 13.667 104.982 Nov-Jan (early dry) A PV NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
297 VBS16624-4248STDY7918667 VBS16624 4248STDY7918667 KV-32-01591 Brandy St. Laurent Cambodia Sayas 2014 6 13.548 107.025 May-Jul (early wet) C RK2 RK2
298 VBS16625-4248STDY7918668 VBS16625 4248STDY7918668 KV-32-01499 Brandy St. Laurent Cambodia Sayas 2014 6 13.548 107.025 May-Jul (early wet) C RK2 RK2
299 VBS16626-4248STDY7918669 VBS16626 4248STDY7918669 KV-32-01465 Brandy St. Laurent Cambodia Sayas 2014 6 13.548 107.025 May-Jul (early wet) B RK1 RK1
300 VBS16628-4248STDY7918670 VBS16628 4248STDY7918670 KV-32-01454 Brandy St. Laurent Cambodia Sayas 2014 6 13.548 107.025 May-Jul (early wet) C RK2 RK2
301 VBS16630-4248STDY7918671 VBS16630 4248STDY7918671 KV-31-01949 Brandy St. Laurent Cambodia Chamkar San 2014 4 13.595 106.995 Feb-Apr (late dry) B RK1 RK1

302 rows × 15 columns

The sample_id column gives the sample identifier used throughout all analyses.

The country, location, latitude and longitude columns give the location where the specimen was collected.

The year and month columns give the approximate date when the specimen was collected.

The cohort column gives an assignment of individual mosquitoes to populations based on location of sampling and genetic population structure.
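Cohort assignments like these can be tallied with standard pandas operations. As a minimal sketch (using a toy stand-in frame with the cohort labels shown above, not the real metadata, which comes from amin1.sample_metadata()):

```python
import pandas as pd

# toy stand-in for df_samples, using cohort labels from the table above;
# None mimics samples with no cohort assignment (NaN in the real frame)
df = pd.DataFrame({
    "sample_id": ["s1", "s2", "s3", "s4", "s5"],
    "cohort": ["PV", "PV", "RK1", "TD", None],
})

# count samples per cohort, keeping missing assignments visible
counts = df["cohort"].value_counts(dropna=False)
print(counts)
```

The same pattern applied to the real df_samples gives the number of sequenced mosquitoes per analysis cohort.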

Pandas can be used to explore and query the sample metadata in various ways. E.g., here is a summary of the number of samples by collection site and year:

df_summary = df_samples.pivot_table(
    index=["longitude", "latitude", "location"], 
    columns=["year"],
    values="sample_id", 
    aggfunc=len,
    fill_value=0
)
df_summary.style.set_caption("Number of mosquito specimens by collection site and year.")
Number of mosquito specimens by collection site and year.
year                                2010  2011  2014  2015  2016
longitude latitude location
102.735   12.155   Thmar Da           26    15     0     0     0
104.92    13.77    Chean Mok           0     0    66     9     0
104.982   13.667   Preah Kleang        0     0    47     9    36
106.995   13.595   Chamkar San         0     0    40    11     0
107.025   13.548   Sayas               0     0    39     4     0

Reference genome#

Sequence data in this study were aligned to the MINIMUS1 reference genome. This reference genome contains 678 contigs in total, but many of these are small and not suitable for population genetic analyses, so we have included SNP calls only for the 40 largest contigs. The set of contigs used for SNP calling can be accessed as follows:

amin1.contigs
('KB663610',
 'KB663611',
 'KB663622',
 'KB663633',
 'KB663644',
 'KB663655',
 'KB663666',
 'KB663677',
 'KB663688',
 'KB663699',
 'KB663710',
 'KB663721',
 'KB663722',
 'KB663733',
 'KB663744',
 'KB663755',
 'KB663766',
 'KB663777',
 'KB663788',
 'KB663799',
 'KB663810',
 'KB663821',
 'KB663832',
 'KB663833',
 'KB663844',
 'KB663855',
 'KB663866',
 'KB663877',
 'KB663888',
 'KB663899',
 'KB663910',
 'KB663921',
 'KB663932',
 'KB663943',
 'KB663955',
 'KB664054',
 'KB664165',
 'KB664255',
 'KB664266',
 'KB664277')

For convenience, the reference genome sequence for any contig can be loaded as a NumPy array, e.g.:

# load the reference sequence for a single contig as a numpy array
seq = amin1.genome_sequence(region="KB663610").compute()
seq
array([b'T', b'T', b'C', ..., b'C', b'A', b'C'], dtype='|S1')
# length of contig
len(seq)
31626230
# number of called bases in contig
np.sum((seq != b'N') & (seq != b'n'))
30141520
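Because the sequence comes back as a NumPy array of single-byte strings, base composition statistics reduce to simple array operations. A minimal sketch on a toy array with the same |S1 dtype (not the real sequence), computing the GC fraction of called bases:

```python
import numpy as np

# toy sequence with the same |S1 dtype returned by genome_sequence()
toy = np.array([b"T", b"G", b"C", b"N", b"a", b"c", b"G"], dtype="|S1")

# normalise case, then count G/C among called (non-N) bases
upper = np.char.upper(toy)
called = upper != b"N"
gc = np.isin(upper, [b"G", b"C"])
gc_fraction = gc.sum() / called.sum()
print(f"GC fraction: {gc_fraction:.2f}")
```

The same expressions work unchanged on the full `seq` array loaded above.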

SNP calls#

We have called SNP genotypes in all samples at all positions in the genome where the reference allele is not “N”. Data on the SNP positions, alleles, site filters and genotype calls for a given contig can be accessed as an xarray Dataset. E.g., access SNP calls for contig KB663610:

ds_snps = amin1.snp_calls(region="KB663610", site_mask=False)
ds_snps
<xarray.Dataset>
Dimensions:              (alleles: 4, ploidy: 2, samples: 302, variants: 30141520)
Coordinates:
    variant_position     (variants) int32 dask.array<chunksize=(524288,), meta=np.ndarray>
    variant_contig       (variants) uint8 dask.array<chunksize=(524288,), meta=np.ndarray>
    sample_id            (samples) |S24 dask.array<chunksize=(302,), meta=np.ndarray>
Dimensions without coordinates: alleles, ploidy, samples, variants
Data variables:
    variant_allele       (variants, alleles) |S1 dask.array<chunksize=(524288, 1), meta=np.ndarray>
    variant_filter_pass  (variants) bool dask.array<chunksize=(941923,), meta=np.ndarray>
    call_genotype        (variants, samples, ploidy) int8 dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>
    call_GQ              (variants, samples) int8 dask.array<chunksize=(300000, 50), meta=np.ndarray>
    call_MQ              (variants, samples) float32 dask.array<chunksize=(300000, 50), meta=np.ndarray>
    call_AD              (variants, samples, alleles) int16 dask.array<chunksize=(300000, 50, 4), meta=np.ndarray>
    call_genotype_mask   (variants, samples, ploidy) bool dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>
Attributes:
    contigs:  ('KB663610', 'KB663611', 'KB663622', 'KB663633', 'KB663644', 'K...

The arrays within this dataset are backed by Dask arrays, and can be accessed as shown below.

SNP positions and alleles#

# SNP positions (1-based)
pos = ds_snps['variant_position'].data
pos
        Array        Chunk
Bytes   120.57 MB    2.10 MB
Shape   (30141520,)  (524288,)
Count   59 Tasks     58 Chunks
Type    int32        numpy.ndarray
# read first 10 SNP positions into a numpy array
p = pos[:10].compute()
p
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=int32)
# SNP alleles (first column is the reference allele)
alleles = ds_snps['variant_allele'].data
alleles
        Array          Chunk
Bytes   120.57 MB      1.57 MB
Shape   (30141520, 4)  (524288, 3)
Count   292 Tasks      116 Chunks
Type    |S1            numpy.ndarray
# read first 10 SNP alleles into a numpy array
a = alleles[:10, :].compute()
a
array([[b'T', b'C', b'A', b'G'],
       [b'T', b'C', b'A', b'G'],
       [b'C', b'A', b'G', b'T'],
       [b'G', b'C', b'A', b'T'],
       [b'T', b'C', b'A', b'G'],
       [b'C', b'A', b'G', b'T'],
       [b'A', b'C', b'G', b'T'],
       [b'T', b'C', b'A', b'G'],
       [b'T', b'C', b'A', b'G'],
       [b'G', b'C', b'A', b'T']], dtype='|S1')

Site filters#

SNP calling is not always reliable, and we have created site filters to allow low-quality SNPs to be excluded. For each contig, a “filter_pass” Boolean mask is available, where True indicates that the site passed the filter and is accessible to high-quality SNP calling.

# site filters
filter_pass = ds_snps['variant_filter_pass'].data
filter_pass
        Array        Chunk
Bytes   30.14 MB     941.92 kB
Shape   (30141520,)  (941923,)
Count   33 Tasks     32 Chunks
Type    bool         numpy.ndarray
# load site filters for first 10 SNPs
f = filter_pass[:10].compute()
f
array([False, False, False, False, False, False, False, False, False,
       False])
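The filter mask can be applied to any other per-site array by Boolean indexing. A minimal sketch with toy NumPy arrays (the real `pos` and `filter_pass` are large Dask arrays, but the indexing works the same way):

```python
import numpy as np

# toy stand-ins for the per-site arrays shown above
pos = np.array([1, 2, 3, 4, 5], dtype=np.int32)
filter_pass = np.array([False, True, True, False, True])

# keep only positions at sites passing the filter
pos_pass = pos[filter_pass]
print(pos_pass)  # [2 3 5]
```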

Note that we have chosen to genotype all samples at all sites in the genome, assuming all possible SNP alleles. Not all of these alternate alleles will actually have been observed in the samples. To determine which sites and alleles are segregating, an allele count can be performed over the samples you are interested in. See the example below.
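The idea behind an allele count is simple: tally how many times each allele index is observed across all sample genotypes at each site, ignoring missing calls; a site is segregating if more than one allele is observed. A minimal pure-NumPy sketch on toy genotypes (the real data use scikit-allel's count_alleles, shown later in this notebook):

```python
import numpy as np

# toy genotypes: (variants, samples, ploidy), -1 = missing
gt = np.array([
    [[0, 0], [0, 1], [0, 0]],    # ref + one alt 1 -> segregating
    [[0, 0], [0, 0], [0, 0]],    # all ref -> invariant
    [[-1, -1], [2, 2], [0, 2]],  # alleles 0 and 2 -> segregating
], dtype=np.int8)

n_variants = gt.shape[0]
max_allele = 3
ac = np.zeros((n_variants, max_allele + 1), dtype=int)
for i in range(n_variants):
    calls = gt[i].ravel()
    calls = calls[calls >= 0]  # drop missing calls
    ac[i] = np.bincount(calls, minlength=max_allele + 1)

# a site is segregating if more than one allele is observed
is_seg = (ac > 0).sum(axis=1) > 1
print(is_seg)  # [ True False  True]
```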

SNP genotypes#

SNP genotypes for individual samples are available. Genotypes are stored as a three-dimensional array, where the first dimension corresponds to genomic positions, the second to samples, and the third to ploidy (2). Values are coded as integers, where -1 represents a missing value, 0 represents the reference allele, and 1, 2 and 3 represent alternate alleles.
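To make the encoding concrete, here is a minimal sketch (toy values, not real data) showing how the allele indices in a genotype call map back to the bases in the alleles array:

```python
import numpy as np

# toy alleles for one site (first column = reference) and one genotype call
alleles = np.array([b"T", b"C", b"A", b"G"], dtype="|S1")
call = np.array([0, 2], dtype=np.int8)  # heterozygous ref / alt 2

# map allele indices to bases; an index of -1 would mean a missing call
bases = alleles[call]
print(bases)  # [b'T' b'A']

# missingness check
missing = np.any(call < 0)
print(missing)  # False
```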

SNP genotypes can be accessed as dask arrays as shown in the example below.

gt = ds_snps['call_genotype'].data
gt
        Array               Chunk
Bytes   18.21 GB            30.00 MB
Shape   (30141520, 302, 2)  (300000, 50, 2)
Count   708 Tasks           707 Chunks
Type    int8                numpy.ndarray

Note that the columns of this array (second dimension) match the rows in the sample metadata, provided the same sample sets were loaded. I.e.:

len(df_samples) == gt.shape[1]
True

You can use this correspondence to apply further subsetting operations to the genotypes by querying the sample metadata. E.g.:

df_samples.cohort.unique()
array(['PV', nan, 'RK2', 'RK1', 'TD'], dtype=object)
# select samples from the Thmar Da cohort
loc_cohort = df_samples.eval("cohort == 'TD'").values
print(f"found {np.count_nonzero(loc_cohort)} samples")
gt_cohort = da.compress(loc_cohort, gt, axis=1)
gt_cohort
found 41 samples
        Array              Chunk
Bytes   2.47 GB            20.40 MB
Shape   (30141520, 41, 2)  (300000, 34, 2)
Count   910 Tasks          202 Chunks
Type    int8               numpy.ndarray

Data can be read into memory as numpy arrays, e.g., read genotypes for the first 5 SNPs and the first 3 samples:

g = gt[:5, :3, :].compute()
g
array([[[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]]], dtype=int8)

If you want to work with the genotype calls, you may find it convenient to use scikit-allel. E.g., the code below sets up a genotype array.

# use the scikit-allel wrapper class for genotype calls
gtd = allel.GenotypeDaskArray(ds_snps['call_genotype'].data)
gtd
<GenotypeDaskArray shape=(30141520, 302, 2) dtype=int8>
             0    1    2    3    4  ...  297  298  299  300  301
0          0/0  0/0  0/0  0/0  0/0  ...  0/0  0/0  0/0  0/0  0/0
1          0/0  0/0  0/0  0/0  0/0  ...  0/0  0/0  0/0  0/0  0/0
2          0/0  0/0  0/0  0/0  0/0  ...  0/0  0/0  0/0  0/0  0/0
...
30141517   ./.  ./.  ./.  ./.  ./.  ...  ./.  ./.  ./.  ./.  ./.
30141518   ./.  ./.  ./.  ./.  ./.  ...  ./.  ./.  ./.  ./.  ./.
30141519   ./.  ./.  ./.  ./.  ./.  ...  ./.  ./.  ./.  ./.  ./.

Example computations#

Below are some examples of simple computations that can be run with these data.

Counting sites passing filters#

For each of the contigs for which SNP calling was performed, count the number of sites and the number passing site filters.

for contig in amin1.contigs:
    ds_snps = amin1.snp_calls(region=contig)
    filter_pass = ds_snps['variant_filter_pass'].data
    n_sites = ds_snps.dims['variants']
    n_pass = filter_pass.sum().compute()
    print(f"{contig}: {n_pass:,} out of {n_sites:,} ({n_pass/n_sites:.0%}) sites pass filters")
KB663610: 22,857,027 out of 30,141,520 (76%) sites pass filters
KB663611: 3,737,822 out of 5,453,172 (69%) sites pass filters
KB663622: 4,388,762 out of 5,461,660 (80%) sites pass filters
KB663633: 4,215,569 out of 5,450,627 (77%) sites pass filters
KB663644: 4,221,032 out of 5,413,357 (78%) sites pass filters
KB663655: 2,916,639 out of 3,744,733 (78%) sites pass filters
KB663666: 2,501,655 out of 3,719,057 (67%) sites pass filters
KB663677: 2,586,150 out of 3,624,725 (71%) sites pass filters
KB663688: 2,742,023 out of 3,544,647 (77%) sites pass filters
KB663699: 1,534,228 out of 1,873,197 (82%) sites pass filters
KB663710: 1,213,646 out of 1,685,057 (72%) sites pass filters
KB663721: 15,452,041 out of 20,762,843 (74%) sites pass filters
KB663722: 1,209,792 out of 1,659,122 (73%) sites pass filters
KB663733: 1,062,674 out of 1,537,061 (69%) sites pass filters
KB663744: 1,240,722 out of 1,464,294 (85%) sites pass filters
KB663755: 1,099,411 out of 1,252,480 (88%) sites pass filters
KB663766: 1,178,051 out of 1,313,725 (90%) sites pass filters
KB663777: 1,012,298 out of 1,306,825 (77%) sites pass filters
KB663788: 1,125,292 out of 1,246,996 (90%) sites pass filters
KB663799: 966,257 out of 1,234,260 (78%) sites pass filters
KB663810: 800,823 out of 928,103 (86%) sites pass filters
KB663821: 757,864 out of 875,898 (87%) sites pass filters
KB663832: 11,641,571 out of 16,000,346 (73%) sites pass filters
KB663833: 395,479 out of 554,130 (71%) sites pass filters
KB663844: 482,803 out of 602,452 (80%) sites pass filters
KB663855: 356,534 out of 428,439 (83%) sites pass filters
KB663866: 437,007 out of 472,977 (92%) sites pass filters
KB663877: 172,685 out of 273,855 (63%) sites pass filters
KB663888: 171,639 out of 221,033 (78%) sites pass filters
KB663899: 29,212 out of 87,710 (33%) sites pass filters
KB663910: 105,536 out of 150,741 (70%) sites pass filters
KB663921: 29,136 out of 87,604 (33%) sites pass filters
KB663932: 100,838 out of 133,376 (76%) sites pass filters
KB663943: 10,913,259 out of 14,689,009 (74%) sites pass filters
KB663955: 24,703 out of 55,494 (45%) sites pass filters
KB664054: 8,476,484 out of 11,726,578 (72%) sites pass filters
KB664165: 7,711,730 out of 9,974,421 (77%) sites pass filters
KB664255: 6,529,746 out of 9,221,028 (71%) sites pass filters
KB664266: 6,594,776 out of 8,849,743 (75%) sites pass filters
KB664277: 5,168,159 out of 6,068,897 (85%) sites pass filters

Counting segregating sites#

Count the number of segregating SNPs that also pass site filters, for a single contig.

# choose contig
region = "KB663610"

# access SNP data
ds_snps = amin1.snp_calls(region=region)
ds_snps
<xarray.Dataset>
Dimensions:              (alleles: 4, ploidy: 2, samples: 302, variants: 30141520)
Coordinates:
    variant_position     (variants) int32 dask.array<chunksize=(524288,), meta=np.ndarray>
    variant_contig       (variants) uint8 dask.array<chunksize=(524288,), meta=np.ndarray>
    sample_id            (samples) |S24 dask.array<chunksize=(302,), meta=np.ndarray>
Dimensions without coordinates: alleles, ploidy, samples, variants
Data variables:
    variant_allele       (variants, alleles) |S1 dask.array<chunksize=(524288, 1), meta=np.ndarray>
    variant_filter_pass  (variants) bool dask.array<chunksize=(941923,), meta=np.ndarray>
    call_genotype        (variants, samples, ploidy) int8 dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>
    call_GQ              (variants, samples) int8 dask.array<chunksize=(300000, 50), meta=np.ndarray>
    call_MQ              (variants, samples) float32 dask.array<chunksize=(300000, 50), meta=np.ndarray>
    call_AD              (variants, samples, alleles) int16 dask.array<chunksize=(300000, 50, 4), meta=np.ndarray>
    call_genotype_mask   (variants, samples, ploidy) bool dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>
Attributes:
    contigs:  ('KB663610', 'KB663611', 'KB663622', 'KB663633', 'KB663644', 'K...
# locate pass sites
loc_pass = ds_snps['variant_filter_pass'].values
loc_pass
array([False, False, False, ..., False, False, False])
# access genotypes
gt = allel.GenotypeDaskArray(ds_snps['call_genotype'].data)
gt
<GenotypeDaskArray shape=(30141520, 302, 2) dtype=int8>
             0    1    2    3    4  ...  297  298  299  300  301
0          0/0  0/0  0/0  0/0  0/0  ...  0/0  0/0  0/0  0/0  0/0
1          0/0  0/0  0/0  0/0  0/0  ...  0/0  0/0  0/0  0/0  0/0
2          0/0  0/0  0/0  0/0  0/0  ...  0/0  0/0  0/0  0/0  0/0
...
30141517   ./.  ./.  ./.  ./.  ./.  ...  ./.  ./.  ./.  ./.  ./.
30141518   ./.  ./.  ./.  ./.  ./.  ...  ./.  ./.  ./.  ./.  ./.
30141519   ./.  ./.  ./.  ./.  ./.  ...  ./.  ./.  ./.  ./.  ./.
# perform an allele count over genotypes
with ProgressBar():
    ac = gt.count_alleles(max_allele=3).compute()
ac
[########################################] | 100% Completed |  4min 58.5s
<AlleleCountsArray shape=(30141520, 4) dtype=int32>
             0  1  2  3
0          604  0  0  0
1          604  0  0  0
2          604  0  0  0
...
30141517     0  0  0  8
30141518     6  0  0  0
30141519     4  0  0  0
# select allele counts at sites passing the filters
ac_pass = ac[loc_pass]
ac_pass
<AlleleCountsArray shape=(22857027, 4) dtype=int32>
             0  1  2  3
0          604  0  0  0
1          604  0  0  0
2          604  0  0  0
...
22857024   604  0  0  0
22857025   604  0  0  0
22857026   604  0  0  0
# count segregating sites
n_pass_seg = np.count_nonzero(ac_pass.is_segregating())
print(f"No. segregating sites: {n_pass_seg:,}")
No. segregating sites: 6,918,663
allelism_count = np.bincount(ac_pass.allelism())
print(f"No. invariant sites: {allelism_count[1]:,}")
print(f"No. biallelic SNPs: {allelism_count[2]:,}")
print(f"No. triallelic SNPs: {allelism_count[3]:,}")
print(f"No. quadriallelic SNPs: {allelism_count[4]:,}")
No. invariant sites: 15,938,364
No. biallelic SNPs: 5,922,306
No. triallelic SNPs: 938,414
No. quadriallelic SNPs: 57,943
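The allelism of a site is simply the number of distinct alleles observed there, so the categories above partition the sites passing filters. A minimal pure-NumPy sketch of the same idea on toy allele counts (not the real data):

```python
import numpy as np

# toy allele counts: (variants, alleles)
ac = np.array([
    [604, 0, 0, 0],    # invariant (1 allele observed)
    [600, 4, 0, 0],    # biallelic
    [500, 100, 4, 0],  # triallelic
    [1, 1, 1, 1],      # quadriallelic
])

# allelism = number of alleles observed at each site
allelism = (ac > 0).sum(axis=1)

# tally sites by allelism, as np.bincount does above
counts = np.bincount(allelism, minlength=5)
print(counts[1:])  # number of sites with 1, 2, 3 and 4 alleles
```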

Feedback and suggestions#

If there are particular analyses you would like to run, or if you have other suggestions for useful documentation we could add to this site, we would love to know. Please get in touch via the malariagen/vector-data GitHub discussion board.