Af1 cloud data access#

This notebook provides information about how to download data from the MalariaGEN Vector Observatory Anopheles funestus Genomic Surveillance Project via Google Cloud. This includes sample metadata, raw sequence reads, sequence read alignments, and single nucleotide polymorphism (SNP) calls.

This notebook illustrates how to read data directly from the cloud, without having to first download any data locally. It can be run from any computer, but will work best when run from a compute node within Google Cloud, because the compute is physically closer to the data and so data transfer is faster. For example, this notebook can be run via Google Colab, which is a free interactive computing service running in the cloud.

To launch this notebook in the cloud and run it for yourself, click the launch icon at the top of the page and select one of the cloud computing services available.

Data hosting#

All data required for this notebook is hosted on Google Cloud Storage (GCS). Data are hosted in the vo_afun_release bucket, which is a multi-region bucket located in the United States. All data hosted in GCS are publicly accessible and do not require any authentication to access.

Setup#

Running this notebook requires some Python packages to be installed:

%pip install -q malariagen_data
Note: you may need to restart the kernel to use updated packages.

To make accessing these data more convenient, we’ve created the malariagen_data Python package. This is experimental so please let us know if you find any bugs or have any suggestions. See the Af1 API docs for documentation of all functions available from this package.

Import other packages we’ll need to use here.

import numpy as np
import dask
import dask.array as da
from dask.diagnostics.progress import ProgressBar
# silence some warnings
dask.config.set(**{'array.slicing.split_large_chunks': False})
import allel
import malariagen_data

Af1 data access from Google Cloud is set up with the following code:

af1 = malariagen_data.Af1()
af1
MalariaGEN Af1 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Af1 API docs.
Storage URL gs://vo_afun_release_master_us_central1
Data releases available 1.0
Results cache None
Cohorts analysis 20231215
Site filters analysis dt_20200416
Software version malariagen_data 10.0.0
Client location Iowa, United States (Google Cloud us-central1)

Note: To access the Af1.1, Af1.2 & Af1.3 releases, you need to pass the pre=True flag when creating the client in the code above.

This flag is used for releases to which more data will be added. In the case of Af1.1, Af1.2 & Af1.3, CNV data for the sample sets in these releases will be added at a future date.

Sample sets#

Data are organised into different releases. As an example, data in Af1.0 are organised into 8 sample sets. Each of these sample sets corresponds to a set of mosquito specimens contributed by a collaborating study. Depending on your objectives, you may want to access data from only specific sample sets, or all sample sets.

To see which sample sets are available, load the sample set manifest into a pandas dataframe:

df_sample_sets = af1.sample_sets(release="1.0")
df_sample_sets
sample_set sample_count study_id study_url release
0 1229-VO-GH-DADZIE-VMF00095 36 1229-VO-GH-DADZIE https://www.malariagen.net/network/where-we-wo... 1.0
1 1230-VO-GA-CF-AYALA-VMF00045 50 1230-VO-MULTI-AYALA https://www.malariagen.net/network/where-we-wo... 1.0
2 1231-VO-MULTI-WONDJI-VMF00043 320 1231-VO-MULTI-WONDJI https://www.malariagen.net/network/where-we-wo... 1.0
3 1232-VO-KE-OCHOMO-VMF00044 81 1232-VO-KE-OCHOMO https://www.malariagen.net/network/where-we-wo... 1.0
4 1235-VO-MZ-PAAIJMANS-VMF00094 76 1235-VO-MZ-PAAIJMANS https://www.malariagen.net/network/where-we-wo... 1.0
5 1236-VO-TZ-OKUMU-VMF00090 10 1236-VO-TZ-OKUMU https://www.malariagen.net/network/where-we-wo... 1.0
6 1240-VO-CD-KOEKEMOER-VMF00099 43 1240-VO-MULTI-KOEKEMOER https://www.malariagen.net/network/where-we-wo... 1.0
7 1240-VO-MZ-KOEKEMOER-VMF00101 40 1240-VO-MULTI-KOEKEMOER https://www.malariagen.net/network/where-we-wo... 1.0

For more information about these sample sets, you can read about each sample set from the URLs under the field study_url.

Sample metadata#

Data about the samples that were sequenced to generate this data resource are available, including the time and place of collection, the sex of the specimen, and our call regarding the species of the specimen. These are organised by sample set.

E.g., load sample metadata for all samples in the Af1.0 release into a pandas DataFrame:

df_samples = af1.sample_metadata(sample_sets="1.0")
df_samples
                                     
sample_id partner_sample_id contributor country location year month latitude longitude sex_call ... admin1_name admin1_iso admin2_name taxon cohort_admin1_year cohort_admin1_month cohort_admin1_quarter cohort_admin2_year cohort_admin2_month cohort_admin2_quarter
0 VBS24195 1229-GH-A-GH01 Samuel Dadzie Ghana Dimabi 2017 8 9.420 -1.083 F ... Northern Region GH-NP Tolon funestus GH-NP_fune_2017 GH-NP_fune_2017_08 GH-NP_fune_2017_Q3 GH-NP_Tolon_fune_2017 GH-NP_Tolon_fune_2017_08 GH-NP_Tolon_fune_2017_Q3
1 VBS24196 1229-GH-A-GH02 Samuel Dadzie Ghana Gbullung 2017 7 9.488 -1.009 F ... Northern Region GH-NP Kumbungu funestus GH-NP_fune_2017 GH-NP_fune_2017_07 GH-NP_fune_2017_Q3 GH-NP_Kumbungu_fune_2017 GH-NP_Kumbungu_fune_2017_07 GH-NP_Kumbungu_fune_2017_Q3
2 VBS24197 1229-GH-A-GH03 Samuel Dadzie Ghana Dimabi 2017 7 9.420 -1.083 F ... Northern Region GH-NP Tolon funestus GH-NP_fune_2017 GH-NP_fune_2017_07 GH-NP_fune_2017_Q3 GH-NP_Tolon_fune_2017 GH-NP_Tolon_fune_2017_07 GH-NP_Tolon_fune_2017_Q3
3 VBS24198 1229-GH-A-GH04 Samuel Dadzie Ghana Dimabi 2017 8 9.420 -1.083 F ... Northern Region GH-NP Tolon funestus GH-NP_fune_2017 GH-NP_fune_2017_08 GH-NP_fune_2017_Q3 GH-NP_Tolon_fune_2017 GH-NP_Tolon_fune_2017_08 GH-NP_Tolon_fune_2017_Q3
4 VBS24199 1229-GH-A-GH05 Samuel Dadzie Ghana Gupanarigu 2017 8 9.497 -0.952 F ... Northern Region GH-NP Kumbungu funestus GH-NP_fune_2017 GH-NP_fune_2017_08 GH-NP_fune_2017_Q3 GH-NP_Kumbungu_fune_2017 GH-NP_Kumbungu_fune_2017_08 GH-NP_Kumbungu_fune_2017_Q3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
651 VBS24534 1240-MZ-A-MozF_1314 Lizette Koekemoer Mozambique Motinho 2015 8 -10.851 40.594 F ... Cabo Delgado MZ-P Palma funestus MZ-P_fune_2015 MZ-P_fune_2015_08 MZ-P_fune_2015_Q3 MZ-P_Palma_fune_2015 MZ-P_Palma_fune_2015_08 MZ-P_Palma_fune_2015_Q3
652 VBS24535 1240-MZ-A-MozF_1315 Lizette Koekemoer Mozambique Motinho 2015 8 -10.851 40.594 F ... Cabo Delgado MZ-P Palma funestus MZ-P_fune_2015 MZ-P_fune_2015_08 MZ-P_fune_2015_Q3 MZ-P_Palma_fune_2015 MZ-P_Palma_fune_2015_08 MZ-P_Palma_fune_2015_Q3
653 VBS24536 1240-MZ-A-MozF_1317 Lizette Koekemoer Mozambique Motinho 2015 8 -10.851 40.594 F ... Cabo Delgado MZ-P Palma funestus MZ-P_fune_2015 MZ-P_fune_2015_08 MZ-P_fune_2015_Q3 MZ-P_Palma_fune_2015 MZ-P_Palma_fune_2015_08 MZ-P_Palma_fune_2015_Q3
654 VBS24537 1240-MZ-A-MozF_1319 Lizette Koekemoer Mozambique Motinho 2015 8 -10.851 40.594 F ... Cabo Delgado MZ-P Palma funestus MZ-P_fune_2015 MZ-P_fune_2015_08 MZ-P_fune_2015_Q3 MZ-P_Palma_fune_2015 MZ-P_Palma_fune_2015_08 MZ-P_Palma_fune_2015_Q3
655 VBS24539 1240-MZ-A-MozF_1323 Lizette Koekemoer Mozambique Motinho 2015 8 -10.851 40.594 F ... Cabo Delgado MZ-P Palma funestus MZ-P_fune_2015 MZ-P_fune_2015_08 MZ-P_fune_2015_Q3 MZ-P_Palma_fune_2015 MZ-P_Palma_fune_2015_08 MZ-P_Palma_fune_2015_Q3

656 rows × 26 columns

The sample_id column gives the sample identifier used throughout all Af1 analyses.

The country, location, latitude and longitude columns give the location where the specimen was collected.

The year and month columns give the approximate date when the specimen was collected.

The sex_call column gives the sex of the specimen as determined from the sequence data.

Pandas can be used to explore and query the sample metadata in various ways. E.g., here is a summary of the numbers of samples by species:

df_samples.groupby("taxon").size()
taxon
funestus    656
dtype: int64
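As a further illustration, metadata queries can combine several fields. Here is a minimal sketch using a small toy dataframe as a stand-in for the real metadata (the column names are taken from the table above; the sample IDs and values are invented):

```python
import pandas as pd

# toy stand-in for the sample metadata dataframe
df = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3", "S4"],
    "country": ["Ghana", "Ghana", "Kenya", "Kenya"],
    "year": [2017, 2017, 2017, 2018],
})

# select samples collected in Ghana in 2017
df_ghana_2017 = df.query("country == 'Ghana' and year == 2017")
print(len(df_ghana_2017))  # 2
```

The same query() expression should work unchanged on the real df_samples dataframe.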

SNP calls#

Data on SNP calls, including the SNP positions, alleles, site filters, and genotypes, can be accessed as an xarray Dataset.

E.g., access SNP calls for chromosome 2RL for all samples in Af1.0.

ds_snps = af1.snp_calls(region="2RL", sample_sets="1.0")
ds_snps
                                 
<xarray.Dataset> Size: 1TB
Dimensions:                       (variants: 102882611, alleles: 4,
                                   samples: 656, ploidy: 2)
Coordinates:
    variant_position              (variants) int32 412MB dask.array<chunksize=(524288,), meta=np.ndarray>
    variant_contig                (variants) uint8 103MB dask.array<chunksize=(524288,), meta=np.ndarray>
    sample_id                     (samples) <U36 94kB dask.array<chunksize=(36,), meta=np.ndarray>
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
    variant_allele                (variants, alleles) |S1 412MB dask.array<chunksize=(524288, 1), meta=np.ndarray>
    variant_filter_pass_funestus  (variants) bool 103MB dask.array<chunksize=(300000,), meta=np.ndarray>
    call_genotype                 (variants, samples, ploidy) int8 135GB dask.array<chunksize=(300000, 36, 2), meta=np.ndarray>
    call_GQ                       (variants, samples) int8 67GB dask.array<chunksize=(300000, 36), meta=np.ndarray>
    call_MQ                       (variants, samples) float32 270GB dask.array<chunksize=(300000, 36), meta=np.ndarray>
    call_AD                       (variants, samples, alleles) int16 540GB dask.array<chunksize=(300000, 36, 4), meta=np.ndarray>
    call_genotype_mask            (variants, samples, ploidy) bool 135GB dask.array<chunksize=(300000, 36, 2), meta=np.ndarray>
Attributes:
    contigs:  ('2RL', '3RL', 'X')

The arrays within this dataset are backed by Dask arrays, and can be accessed as shown below.

SNP sites and alleles#

We have called SNP genotypes in all samples at all positions in the genome where the reference allele is not “N”. Data on this set of genomic positions and alleles for a given chromosome (e.g., 2RL) can be accessed as Dask arrays as follows.

pos = ds_snps["variant_position"].data
pos
Array Chunk
Bytes 392.47 MiB 2.00 MiB
Shape (102882611,) (524288,)
Dask graph 197 chunks in 1 graph layer
Data type int32 numpy.ndarray
alleles = ds_snps["variant_allele"].data
alleles
Array Chunk
Bytes 392.47 MiB 1.50 MiB
Shape (102882611, 4) (524288, 3)
Dask graph 394 chunks in 4 graph layers
Data type |S1 numpy.ndarray

Data can be loaded into memory as a NumPy array as shown in the following examples.

# read first 10 SNP positions into a numpy array
p = pos[:10].compute()
p
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=int32)
# read first 10 SNP alleles into a numpy array
a = alleles[:10].compute()
a
array([[b'T', b'A', b'C', b'G'],
       [b'G', b'A', b'C', b'T'],
       [b'G', b'A', b'C', b'T'],
       [b'C', b'A', b'T', b'G'],
       [b'G', b'A', b'C', b'T'],
       [b'T', b'A', b'C', b'G'],
       [b'C', b'A', b'T', b'G'],
       [b'A', b'C', b'T', b'G'],
       [b'C', b'A', b'T', b'G'],
       [b'T', b'A', b'C', b'G']], dtype='|S1')

Here the first column contains the reference alleles, and the remaining columns contain the alternate alleles.

Note that a byte string data type is used here for efficiency. E.g., the Python code b'T' represents a byte string containing the letter “T”, which here stands for the nucleotide thymine.

Note that we have chosen to genotype all samples at all sites in the genome, assuming all possible SNP alleles. Not all of these alternate alleles will actually have been observed in the Af1 samples. To determine which sites and alleles are segregating, an allele count can be performed over the samples you are interested in. See the example below.
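The underlying logic can be sketched with a toy NumPy allele-count array (invented values, for illustration only), where rows are sites and columns are counts of the four possible alleles; a site is segregating when more than one allele has been observed:

```python
import numpy as np

# toy allele counts: rows are sites, columns are alleles (ref, alt1, alt2, alt3)
ac = np.array([
    [10, 0, 0, 0],   # invariant - only the reference allele observed
    [ 6, 4, 0, 0],   # segregating - two alleles observed
    [ 0, 0, 0, 0],   # all genotypes missing
    [ 2, 3, 5, 0],   # segregating - three alleles observed
])

# a site segregates if more than one allele has a non-zero count
is_seg = (ac > 0).sum(axis=1) > 1
print(is_seg.tolist())  # [False, True, False, True]
```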

Site filters#

SNP calling is not always reliable, so we have created site filters to allow low-quality SNPs to be excluded.

Each set of site filters provides a “filter_pass” Boolean mask for each chromosome, where True indicates that the site passed the filter and is accessible to high-quality SNP calling.

The site filters data can be accessed as dask arrays as shown in the examples below.

# access funestus site filters for chromosome 2RL as a dask array
filter_pass = ds_snps['variant_filter_pass_funestus'].data
filter_pass
Array Chunk
Bytes 98.12 MiB 292.97 kiB
Shape (102882611,) (300000,)
Dask graph 343 chunks in 1 graph layer
Data type bool numpy.ndarray
# read filter values for first 10 SNPs (True means the site passes filters)
f = filter_pass[:10].compute()
f
array([False, False, False, False, False, False, False, False, False,
       False])
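A filter mask like this can be summarised with a Boolean reduction over the dask array. Here is a minimal sketch using a small synthetic mask (invented values); the real filter_pass array from above can be substituted directly:

```python
import numpy as np
import dask.array as da

# synthetic stand-in for a site filter mask
filter_pass = da.from_array(np.array([False, True, True, False, True]), chunks=2)

# number and fraction of sites passing the filter
n_pass = filter_pass.sum().compute()
frac_pass = filter_pass.mean().compute()
print(n_pass, frac_pass)  # 3 0.6
```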

Note that site filters can be produced by different filter models. The filters in this example were produced using a decision-tree model; these are the default filters used across functions in the API.

We have also produced a second set of site filters, which result from static cutoffs on the site summary statistics. To access these hard filters, instantiate an API client with the site_filters_analysis parameter as shown below.

af1_sc = malariagen_data.Af1(
    site_filters_analysis="sc_20220908",
)

Now, any function call via this API client will use the hard filters. For example, to access the site filters themselves for chromosome 3RL:

x = af1_sc.site_filters(region="3RL", mask="funestus")
x

SNP genotypes#

SNP genotypes for individual samples are available. Genotypes are stored as a three-dimensional array, where the first dimension corresponds to genomic positions, the second dimension is samples, and the third dimension is ploidy (2). Values are coded as integers, where -1 represents a missing value, 0 represents the reference allele, and 1, 2, and 3 represent alternate alleles.

SNP genotypes can be accessed as dask arrays as shown below.

gt = ds_snps["call_genotype"].data
gt
Array Chunk
Bytes 125.71 GiB 28.61 MiB
Shape (102882611, 656, 2) (300000, 50, 2)
Dask graph 5488 chunks in 9 graph layers
Data type int8 numpy.ndarray

Note that the columns of this array (second dimension) match the rows in the sample metadata, if the same sample sets were loaded. I.e.:

df_samples = af1.sample_metadata(sample_sets="1.0")
gt = ds_snps["call_genotype"].data
len(df_samples) == gt.shape[1]
True

You can use this correspondence to apply further subsetting operations to the genotypes by querying the sample metadata. E.g.:

loc_funestus = df_samples.eval("taxon == 'funestus'").values
print(f"found {np.count_nonzero(loc_funestus)} funestus samples")
found 656 funestus samples
ds_snps_funestus = ds_snps.isel(samples=loc_funestus)
ds_snps_funestus
<xarray.Dataset> Size: 1TB
Dimensions:                       (variants: 102882611, alleles: 4,
                                   samples: 656, ploidy: 2)
Coordinates:
    variant_position              (variants) int32 412MB dask.array<chunksize=(524288,), meta=np.ndarray>
    variant_contig                (variants) uint8 103MB dask.array<chunksize=(524288,), meta=np.ndarray>
    sample_id                     (samples) <U36 94kB dask.array<chunksize=(36,), meta=np.ndarray>
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
    variant_allele                (variants, alleles) |S1 412MB dask.array<chunksize=(524288, 1), meta=np.ndarray>
    variant_filter_pass_funestus  (variants) bool 103MB dask.array<chunksize=(300000,), meta=np.ndarray>
    call_genotype                 (variants, samples, ploidy) int8 135GB dask.array<chunksize=(300000, 36, 2), meta=np.ndarray>
    call_GQ                       (variants, samples) int8 67GB dask.array<chunksize=(300000, 36), meta=np.ndarray>
    call_MQ                       (variants, samples) float32 270GB dask.array<chunksize=(300000, 36), meta=np.ndarray>
    call_AD                       (variants, samples, alleles) int16 540GB dask.array<chunksize=(300000, 36, 4), meta=np.ndarray>
    call_genotype_mask            (variants, samples, ploidy) bool 135GB dask.array<chunksize=(300000, 36, 2), meta=np.ndarray>
Attributes:
    contigs:  ('2RL', '3RL', 'X')

Data can be read into memory as numpy arrays, e.g., read genotypes for the first 5 SNPs and the first 3 samples:

g = gt[:5, :3, :].compute()
g
array([[[-1, -1],
        [-1, -1],
        [-1, -1]],

       [[-1, -1],
        [-1, -1],
        [-1, -1]],

       [[-1, -1],
        [-1, -1],
        [-1, -1]],

       [[-1, -1],
        [-1, -1],
        [-1, -1]],

       [[-1, -1],
        [-1, -1],
        [-1, -1]]], dtype=int8)
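To illustrate the integer coding, here is a small synthetic genotype array of the same shape (variants, samples, ploidy) with invented values, together with simple per-sample summaries:

```python
import numpy as np

# toy genotypes: 3 variants x 2 samples x ploidy 2
g = np.array([
    [[ 0,  0], [ 0, 1]],   # sample 2 heterozygous ref/alt1
    [[-1, -1], [ 1, 1]],   # sample 1 missing, sample 2 homozygous alt1
    [[ 0,  2], [ 0, 0]],   # sample 1 heterozygous ref/alt2
], dtype=np.int8)

# count missing calls and heterozygous calls per sample
is_missing = (g < 0).any(axis=2)
is_het = (g[:, :, 0] != g[:, :, 1]) & ~is_missing
print(is_missing.sum(axis=0))  # [1 0]
print(is_het.sum(axis=0))      # [1 1]
```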

If you want to work with the genotype calls, you may find it convenient to use scikit-allel. E.g., the code below sets up a genotype array.

# use the scikit-allel wrapper class for genotype calls
gt = allel.GenotypeDaskArray(ds_snps["call_genotype"].data)
gt
<GenotypeDaskArray shape=(102882611, 656, 2) dtype=int8>

Copy number variation (CNV) data#

Data on copy number variation within the Af1 cohort are available as two separate data types:

  • HMM – Genome-wide inferences of copy number state within each individual mosquito, in 300 bp non-overlapping windows.

  • Coverage calls – Genome-wide copy number variant calls, derived from the HMM outputs by analysing contiguous regions of elevated copy number state, then clustering variants across individuals based on breakpoint proximity.

For more information on the methods used to generate these data, see the variant-calling methods page.

The malariagen_data Python package provides some convenience functions for accessing these data, illustrated below.

CNV HMM#

Access HMM data via the cnv_hmm() method, which returns an xarray dataset. E.g.:

ds_hmm = af1.cnv_hmm(region="2RL", sample_sets="1.0")
ds_hmm
                                    
<xarray.Dataset> Size: 2GB
Dimensions:                   (variants: 342946, samples: 511)
Coordinates:
    variant_position          (variants) int32 1MB dask.array<chunksize=(65536,), meta=np.ndarray>
    variant_end               (variants) int32 1MB dask.array<chunksize=(65536,), meta=np.ndarray>
    variant_contig            (variants) uint8 343kB dask.array<chunksize=(342946,), meta=np.ndarray>
    sample_id                 (samples) object 4kB dask.array<chunksize=(8,), meta=np.ndarray>
Dimensions without coordinates: variants, samples
Data variables:
    call_CN                   (variants, samples) int8 175MB dask.array<chunksize=(65536, 8), meta=np.ndarray>
    call_RawCov               (variants, samples) int32 701MB dask.array<chunksize=(65536, 8), meta=np.ndarray>
    call_NormCov              (variants, samples) float32 701MB dask.array<chunksize=(65536, 8), meta=np.ndarray>
    sample_coverage_variance  (samples) float32 2kB dask.array<chunksize=(8,), meta=np.ndarray>
    sample_is_high_variance   (samples) bool 511B dask.array<chunksize=(8,), meta=np.ndarray>
Attributes:
    contigs:  ('2RL', '3RL', 'X')

Here “variants” are the 300 bp windows in which coverage was calculated and the HMM fitted. Window start positions are given by the variant_position array and ends are given by variant_end.

pos = ds_hmm['variant_position'].values
pos
array([        1,       301,       601, ..., 102882901, 102883201,
       102883501], dtype=int32)
end = ds_hmm['variant_end'].values
end
array([      300,       600,       900, ..., 102883200, 102883500,
       102883511], dtype=int32)

Copy number state is given by the call_CN array, where rows are windows and columns are individual samples.

On the autosomes (2RL, 3RL), normal diploid copy number is 2. Values greater than 2 indicate amplification, and values less than 2 indicate deletion.

On the X chromosome, normal copy number is 2 in females and 1 in males.

For all chromosomes, -1 means missing, i.e., the window was not included.


cn = ds_hmm['call_CN'].values
cn
array([[-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1],
       ...,
       [-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1],
       [-1, -1, -1, ..., -1, -1, -1]], dtype=int8)
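One common summary of the HMM output is the mean copy number per window, ignoring missing values. A sketch with a small synthetic array of invented values (the real call_CN array can be substituted):

```python
import numpy as np

# toy copy number states: 4 windows x 3 samples, -1 = missing
cn = np.array([
    [ 2,  2,  2],
    [ 2,  4,  3],
    [-1,  2,  2],
    [-1, -1, -1],
])

# mask missing values, then average copy number per window
cn_masked = np.ma.masked_array(cn, mask=(cn < 0))
mean_cn = cn_masked.mean(axis=1)
print(mean_cn)  # masked mean per window; the last window is fully masked
```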

CNV coverage calls#

Coverage calls can be accessed via the cnv_coverage_calls() method, which returns an xarray dataset.

N.B., coverage calls can only be accessed one sample set at a time, because the CNV alleles do not necessarily match between sample sets. E.g.:

ds_cnv = af1.cnv_coverage_calls(
    region='2RL', 
    analysis='funestus', 
    sample_set='1229-VO-GH-DADZIE-VMF00095'
)
ds_cnv
<xarray.Dataset> Size: 1MB
Dimensions:              (variants: 30614, samples: 8)
Coordinates:
    variant_position     (variants) int32 122kB dask.array<chunksize=(30614,), meta=np.ndarray>
    variant_end          (variants) int32 122kB dask.array<chunksize=(30614,), meta=np.ndarray>
    variant_contig       (variants) uint8 31kB dask.array<chunksize=(30614,), meta=np.ndarray>
    variant_id           (variants) object 245kB dask.array<chunksize=(30614,), meta=np.ndarray>
    sample_id            (samples) object 64B dask.array<chunksize=(8,), meta=np.ndarray>
Dimensions without coordinates: variants, samples
Data variables:
    variant_CIPOS        (variants) int32 122kB dask.array<chunksize=(30614,), meta=np.ndarray>
    variant_CIEND        (variants) int32 122kB dask.array<chunksize=(30614,), meta=np.ndarray>
    variant_filter_pass  (variants) bool 31kB dask.array<chunksize=(30614,), meta=np.ndarray>
    call_genotype        (variants, samples) int8 245kB dask.array<chunksize=(30614, 8), meta=np.ndarray>
Attributes:
    contigs:  ('2RL', '3RL', 'X')

Here “variants” are copy number variants, inferred from the HMM results. CNV start positions are given by the variant_position array and ends are given by variant_end.

pos = ds_cnv['variant_position'].values
pos
array([    80101,     80101,     83401, ..., 102870901, 102870901,
       102872101], dtype=int32)
end = ds_cnv['variant_end'].values
end
array([    82200,     83400,     86400, ..., 102872400, 102873600,
       102873600], dtype=int32)

CNV genotypes are given by the call_genotype array, coded as 0 for absence and 1 for presence of the CNV allele. Rows are CNV alleles and columns are individual samples.

gt = ds_cnv['call_genotype'].values
gt
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int8)
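Because CNV genotypes are coded 0 for absence and 1 for presence, the frequency of each CNV allele across samples is just a row mean. A sketch with synthetic data (invented values):

```python
import numpy as np

# toy CNV genotypes: 3 CNV alleles x 4 samples (1 = CNV present)
gt = np.array([
    [0, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
], dtype=np.int8)

# frequency of each CNV allele across samples
freq = gt.mean(axis=1)
print(freq.tolist())  # [0.0, 0.5, 1.0]
```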

Note that not all samples will have been included in the coverage calls, as some are excluded due to high coverage variance. Use the sample_id array to access the samples included in a given dataset, e.g.:

samples = ds_cnv['sample_id'].values
samples[:5]
array(['VBS24196', 'VBS24197', 'VBS24201', 'VBS24213', 'VBS24216'],
      dtype=object)

Haplotypes#

The Af1 data resource also includes haplotype reference panels, which were obtained by phasing the SNP calls. Phasing involves resolving the genotypes within each individual into a pair of haplotypes providing information about the two different DNA sequences inherited, one from each parent. Haplotypes provide greater power for a range of population genetic analyses, such as genome-wide scans for signals of recent selection, or analysis of adaptive gene flow between populations.

Data can be accessed in the cloud via the haplotypes() method. E.g., access haplotypes from the “funestus” analysis for all available samples, for chromosome 2RL. This method returns an xarray Dataset.

ds_haps = af1.haplotypes(region="2RL", analysis="funestus", sample_sets="1.0")
ds_haps
                                  
<xarray.Dataset> Size: 27GB
Dimensions:           (variants: 20633859, alleles: 2, samples: 656, ploidy: 2)
Coordinates:
    variant_position  (variants) int32 83MB dask.array<chunksize=(262144,), meta=np.ndarray>
    variant_contig    (variants) uint8 21MB dask.array<chunksize=(20633859,), meta=np.ndarray>
    sample_id         (samples) object 5kB dask.array<chunksize=(36,), meta=np.ndarray>
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
    variant_allele    (variants, alleles) |S1 41MB dask.array<chunksize=(262144, 1), meta=np.ndarray>
    call_genotype     (variants, samples, ploidy) int8 27GB dask.array<chunksize=(262144, 36, 2), meta=np.ndarray>
Attributes:
    contigs:   ('2RL', '3RL', 'X')
    analysis:  funestus

Here we have haplotype data for 656 samples at 20,633,859 SNPs.

The SNP positions and alleles can be accessed as follows.

# access haplotype SNP positions
pos = ds_haps["variant_position"].data  # dask array
pos
Array Chunk
Bytes 78.71 MiB 1.00 MiB
Shape (20633859,) (262144,)
Dask graph 79 chunks in 1 graph layer
Data type int32 numpy.ndarray
# load positions of first 10 haplotype SNPs
pos[:10].compute()  # numpy array
array([79113, 79116, 79117, 79120, 79128, 79129, 79130, 79133, 79134,
       79136], dtype=int32)
# access haplotype SNP alleles
alleles = ds_haps["variant_allele"].data  # dask array
alleles
Array Chunk
Bytes 39.36 MiB 256.00 kiB
Shape (20633859, 2) (262144, 1)
Dask graph 158 chunks in 5 graph layers
Data type |S1 numpy.ndarray
# load alleles of first 10 haplotype SNPs - note all are biallelic
alleles[:10].compute()  # numpy array
array([[b'T', b'A'],
       [b'A', b'G'],
       [b'T', b'A'],
       [b'T', b'C'],
       [b'A', b'G'],
       [b'A', b'T'],
       [b'A', b'G'],
       [b'G', b'A'],
       [b'A', b'G'],
       [b'A', b'T']], dtype='|S1')

The phased genotypes can be accessed as follows.

# access genotypes
gt = ds_haps["call_genotype"].data  # dask array
gt
Array Chunk
Bytes 25.21 GiB 32.00 MiB
Shape (20633859, 656, 2) (262144, 64, 2)
Dask graph 1106 chunks in 9 graph layers
Data type int8 numpy.ndarray
# wrap as scikit-allel genotype array
gt = allel.GenotypeDaskArray(ds_haps["call_genotype"].data)
gt
<GenotypeDaskArray shape=(20633859, 656, 2) dtype=int8>
# convert to scikit-allel haplotype array - useful for some analyses
ht = gt.to_haplotypes()
ht
<HaplotypeDaskArray shape=(20633859, 1312) dtype=int8>

Note that in the haplotype array above, each row is a SNP and each column is a haplotype. There were \(656\) samples in this analysis, and so we have \(1312\) (\(2\times656\)) haplotypes.
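For diploid data, the to_haplotypes() conversion is essentially a reshape: the ploidy dimension is folded into the samples dimension, turning an array of shape (variants, samples, 2) into (variants, samples × 2). A sketch in plain NumPy with invented values:

```python
import numpy as np

# toy phased genotypes: 2 variants x 3 samples x ploidy 2
gt = np.array([
    [[0, 1], [0, 0], [1, 1]],
    [[1, 0], [0, 1], [0, 0]],
], dtype=np.int8)

# fold ploidy into samples: each column is now one haplotype
ht = gt.reshape(gt.shape[0], -1)
print(ht.shape)  # (2, 6)
print(ht)
```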

Using sample metadata to subset haplotypes#

For some analyses, you’ll want to subset the haplotypes, e.g., by location and species. To perform such subsetting, you need to obtain the sample metadata and align it with the haplotype data, so that each row in the sample metadata dataframe corresponds to the matching column in the phased genotypes array. E.g.:

# load sample metadata
df_samples = af1.sample_metadata(sample_sets="1.0")
df_samples.head()
sample_id partner_sample_id contributor country location year month latitude longitude sex_call ... admin1_name admin1_iso admin2_name taxon cohort_admin1_year cohort_admin1_month cohort_admin1_quarter cohort_admin2_year cohort_admin2_month cohort_admin2_quarter
0 VBS24195 1229-GH-A-GH01 Samuel Dadzie Ghana Dimabi 2017 8 9.420 -1.083 F ... Northern Region GH-NP Tolon funestus GH-NP_fune_2017 GH-NP_fune_2017_08 GH-NP_fune_2017_Q3 GH-NP_Tolon_fune_2017 GH-NP_Tolon_fune_2017_08 GH-NP_Tolon_fune_2017_Q3
1 VBS24196 1229-GH-A-GH02 Samuel Dadzie Ghana Gbullung 2017 7 9.488 -1.009 F ... Northern Region GH-NP Kumbungu funestus GH-NP_fune_2017 GH-NP_fune_2017_07 GH-NP_fune_2017_Q3 GH-NP_Kumbungu_fune_2017 GH-NP_Kumbungu_fune_2017_07 GH-NP_Kumbungu_fune_2017_Q3
2 VBS24197 1229-GH-A-GH03 Samuel Dadzie Ghana Dimabi 2017 7 9.420 -1.083 F ... Northern Region GH-NP Tolon funestus GH-NP_fune_2017 GH-NP_fune_2017_07 GH-NP_fune_2017_Q3 GH-NP_Tolon_fune_2017 GH-NP_Tolon_fune_2017_07 GH-NP_Tolon_fune_2017_Q3
3 VBS24198 1229-GH-A-GH04 Samuel Dadzie Ghana Dimabi 2017 8 9.420 -1.083 F ... Northern Region GH-NP Tolon funestus GH-NP_fune_2017 GH-NP_fune_2017_08 GH-NP_fune_2017_Q3 GH-NP_Tolon_fune_2017 GH-NP_Tolon_fune_2017_08 GH-NP_Tolon_fune_2017_Q3
4 VBS24199 1229-GH-A-GH05 Samuel Dadzie Ghana Gupanarigu 2017 8 9.497 -0.952 F ... Northern Region GH-NP Kumbungu funestus GH-NP_fune_2017 GH-NP_fune_2017_08 GH-NP_fune_2017_Q3 GH-NP_Kumbungu_fune_2017 GH-NP_Kumbungu_fune_2017_08 GH-NP_Kumbungu_fune_2017_Q3

5 rows × 26 columns

# load IDs of phased samples
samples_phased = ds_haps['sample_id'].values
samples_phased[:5]
array(['VBS24195', 'VBS24196', 'VBS24197', 'VBS24198', 'VBS24199'],
      dtype=object)
# align sample metadata to haplotypes
df_samples_phased = df_samples.set_index("sample_id").loc[samples_phased].reset_index()
df_samples_phased.head()
sample_id partner_sample_id contributor country location year month latitude longitude sex_call ... admin1_name admin1_iso admin2_name taxon cohort_admin1_year cohort_admin1_month cohort_admin1_quarter cohort_admin2_year cohort_admin2_month cohort_admin2_quarter
0 VBS24195 1229-GH-A-GH01 Samuel Dadzie Ghana Dimabi 2017 8 9.420 -1.083 F ... Northern Region GH-NP Tolon funestus GH-NP_fune_2017 GH-NP_fune_2017_08 GH-NP_fune_2017_Q3 GH-NP_Tolon_fune_2017 GH-NP_Tolon_fune_2017_08 GH-NP_Tolon_fune_2017_Q3
1 VBS24196 1229-GH-A-GH02 Samuel Dadzie Ghana Gbullung 2017 7 9.488 -1.009 F ... Northern Region GH-NP Kumbungu funestus GH-NP_fune_2017 GH-NP_fune_2017_07 GH-NP_fune_2017_Q3 GH-NP_Kumbungu_fune_2017 GH-NP_Kumbungu_fune_2017_07 GH-NP_Kumbungu_fune_2017_Q3
2 VBS24197 1229-GH-A-GH03 Samuel Dadzie Ghana Dimabi 2017 7 9.420 -1.083 F ... Northern Region GH-NP Tolon funestus GH-NP_fune_2017 GH-NP_fune_2017_07 GH-NP_fune_2017_Q3 GH-NP_Tolon_fune_2017 GH-NP_Tolon_fune_2017_07 GH-NP_Tolon_fune_2017_Q3
3 VBS24198 1229-GH-A-GH04 Samuel Dadzie Ghana Dimabi 2017 8 9.420 -1.083 F ... Northern Region GH-NP Tolon funestus GH-NP_fune_2017 GH-NP_fune_2017_08 GH-NP_fune_2017_Q3 GH-NP_Tolon_fune_2017 GH-NP_Tolon_fune_2017_08 GH-NP_Tolon_fune_2017_Q3
4 VBS24199 1229-GH-A-GH05 Samuel Dadzie Ghana Gupanarigu 2017 8 9.497 -0.952 F ... Northern Region GH-NP Kumbungu funestus GH-NP_fune_2017 GH-NP_fune_2017_08 GH-NP_fune_2017_Q3 GH-NP_Kumbungu_fune_2017 GH-NP_Kumbungu_fune_2017_08 GH-NP_Kumbungu_fune_2017_Q3

5 rows × 26 columns

# now define some cohort of interest and locate samples
cohort_query = "country == 'Ghana' and taxon == 'funestus' and year == 2017"
loc_cohort_samples = df_samples_phased.query(cohort_query).index.values
loc_cohort_samples
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35])
# subset haplotypes to cohort
ds_haps_cohort = ds_haps.isel(samples=loc_cohort_samples)
ds_haps_cohort
<xarray.Dataset> Size: 2GB
Dimensions:           (variants: 20633859, alleles: 2, samples: 36, ploidy: 2)
Coordinates:
    variant_position  (variants) int32 83MB dask.array<chunksize=(262144,), meta=np.ndarray>
    variant_contig    (variants) uint8 21MB dask.array<chunksize=(20633859,), meta=np.ndarray>
    sample_id         (samples) object 288B dask.array<chunksize=(36,), meta=np.ndarray>
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
    variant_allele    (variants, alleles) |S1 41MB dask.array<chunksize=(262144, 1), meta=np.ndarray>
    call_genotype     (variants, samples, ploidy) int8 1GB dask.array<chunksize=(262144, 36, 2), meta=np.ndarray>
Attributes:
    contigs:   ('2RL', '3RL', 'X')
    analysis:  funestus
# now access subsetted haplotypes
gt_cohort = allel.GenotypeDaskArray(ds_haps_cohort['call_genotype'].data)
ht_cohort = gt_cohort.to_haplotypes()
ht_cohort
<HaplotypeDaskArray shape=(20633859, 72) dtype=int8>

Note there are \(36\) samples in the cohort and thus \(72\) (\(2\times36\)) haplotypes.

Example computation#

Here’s an example computation to count the number of segregating SNPs on chromosome 2RL that also pass the funestus site filters. This may take a minute or two, because it scans genotype calls at millions of SNPs.

# choose chromosome
region = "2RL"

# choose site filter mask
mask = "funestus"

# choose sample sets
sample_sets = ["1229-VO-GH-DADZIE-VMF00095"]

# access SNP calls
ds_snps = af1.snp_calls(region=region, sample_sets=sample_sets)

# locate pass sites
loc_pass = ds_snps[f"variant_filter_pass_{mask}"].values

# perform an allele count over genotypes
gt = allel.GenotypeDaskArray(ds_snps["call_genotype"].data)
with ProgressBar():
    ac = gt.count_alleles(max_allele=3).compute()

# locate segregating sites
loc_seg = ac.is_segregating()

# count segregating and pass sites
n_pass_seg = np.count_nonzero(loc_pass & loc_seg)
n_pass_seg
[########################################] | 100% Completed | 30.1s
4779799

Running larger computations#

Please note that free cloud computing services such as Google Colab and MyBinder provide only limited computing resources. Thus, although these services can efficiently read Af1 data stored on Google Cloud, you may find that you run out of memory, or that computations take a long time running on a single core. If you would like any suggestions regarding how to set up more powerful compute resources in the cloud, please feel free to get in touch via the malariagen/vector-data GitHub discussion board.

Feedback and suggestions#

If there are particular analyses you would like to run, or if you have other suggestions for useful documentation we could add to this site, we would love to know, please get in touch via the malariagen/vector-data GitHub discussion board.