Pv4 Data Access¶

This page provides information about how to access data from Plasmodium vivax version 4 (Pv4) project via Google Cloud. This includes sample metadata and single nucleotide polymorphism (SNP) calls for 1,895 samples from 27 countries. This release spans 11 partner studies and 3 external studies.

This notebook illustrates how to read data directly from the cloud, without having to first download any data locally. This notebook can be run from any computer, but will work best when run from a compute node within Google Cloud, because it will be physically closer to the data and so data transfer is faster. For example, this notebook can be run via MyBinder or Google Colab which are free interactive computing service running in the cloud.

To launch this notebook in the cloud and run it for yourself, click the launch icon () at the top of the page and select one of the cloud computing services available.

Setup¶

Running this notebook requires some Python packages to be installed. These packages can be installed via pip or conda. E.g.:

!pip install -q --no-warn-conflicts malariagen_data

To make accessing these data more convenient, we’ve created the malariagen_data Python package, which is available from PyPI. This is experimental so please let us know if you find any bugs or have any suggestions.

Now import the packages we’ll need to use here.

import numpy as np
import dask
import dask.array as da
from dask.diagnostics.progress import ProgressBar
import allel
# silence some warnings
dask.config.set(**{'array.slicing.split_large_chunks': False})
import malariagen_data

To access the Pv4 data stored on google cloud use the following code:

pv4 = malariagen_data.Pv4()

Metadata¶

Data on the samples that were sequenced as part of this resource are available. It includes the time and place of collection, quality metrics, and accesion numbers.

To see all the information available, load sample metadata into a pandas dataframe:

pv4_metadata = pv4.sample_metadata()

pv4_metadata.head()

	Sample	Study	Site	First-level administrative division	Country	Lat	Long	Year	ENA	All samples same individual	Population	% callable	QC pass	Exclusion reason	Is returning traveller
0	BBH-1-125	X0009-PV-ET-LO	Jimma	Ethiopia: Oromia	Ethiopia	7.683331	36.851318	2016	ERR2678989	BBH-1-125	AF	88.52	True	Analysis_set	False
1	BBH_1_132	X0009-PV-ET-LO	Jimma	Ethiopia: Oromia	Ethiopia	7.683331	36.851318	2016	ERR2678991	BBH_1_132	AF	90.20	True	Analysis_set	False
2	BBH_1_137	X0009-PV-ET-LO	Jimma	Ethiopia: Oromia	Ethiopia	7.683331	36.851318	2016	ERR2679003	BBH_1_137	AF	87.09	True	Analysis_set	False
3	BBH_1_153	X0009-PV-ET-LO	Jimma	Ethiopia: Oromia	Ethiopia	7.683331	36.851318	2016	ERR2678992	BBH_1_153	AF	90.60	True	Analysis_set	False
4	BBH_1_162	X0009-PV-ET-LO	Jimma	Ethiopia: Oromia	Ethiopia	7.683331	36.851318	2016	ERR2678993	BBH_1_162	AF	91.67	True	Analysis_set	False

print("The dataset has {} samples and {} fields".format(pv4_metadata.shape[0],pv4_metadata.shape[1]))

The dataset has 1895 samples and 15 fields

We can explore each of the fields:

The Sample column gives the unique sample identifier used throughout all Pv4 analyses.
The Study refers to the partner study which collected the sample.
The Country, First-level administrative division and Site describe the location where the sample was collected from.
The Lat & Long contain the GADM coordinates for each country.
The Year column gives the time of sample collection.
The ENA column gives the run accession(s) for the sequencing read data for each sample.
The All samples same individual column identifies samples set collected from the same individual.
The Population column gives the population to which the sample has been assigned to, based on the degree of genetic similarity to other samples. The possible values are: Africa-Central (AF-C), Africa - East (AF-E), Africa - Northeast (AF-NE), Africa - West (AF-W), Asia - South Asia - East (AS-SA-E), Asia - South Asia - West (AS-SA-W), Asia - Southeast Asia - East (AS-SEA-E), Asia - Southeast Asia - West (AS-SEA-W), Oceania - New Guinea (OC-NG), South America (SA).
The % callable column refers to the % of the genome with coverage of at least 5 reads and less than 10% of reads with mapping quality 0.
The QC pass column defines whether the sample passed (True) or failed (False) QC.
The Exclusion reason describes the reason why the particular sample was excluded from the main analysis.
The Is returning traveller column states if the sample was collected from a person returning from travel. The region assigned in this case was based on the reported country of travel.

The python package Pandas can be used to explore and query the sample metadata in different ways. For example, here is a summary of the numbers of samples grouped by the country they were collected in:

pv4_metadata.groupby("Country").size()

Country
Afghanistan         250
Bangladesh           28
Bhutan                9
Brazil               71
Cambodia            236
China                 5
Colombia            112
El Salvador           2
Ethiopia            203
India                14
Indonesia           282
Iran                 15
Madagascar            4
Malaysia            109
Mauritania            1
Mexico               20
Myanmar               9
Nicaragua             1
North Korea           1
Panama                1
Papua New Guinea     47
Peru                123
Philippines           6
Sri Lanka             2
Sudan                13
Thailand            192
Vietnam             139
dtype: int64

Variant Calls¶

These files contain details of 4,571,056 discovered variant genome positions. These variants were discovered amongst all samples from the release.

945,649 of these variant positions are SNPs, with the remainder being either short insertion/deletions (indels), or a combination of SNPs and indels.

Data on variant calls, including the genomic positions, alleles, and genotypes, can be accessed as an xarray Dataset:

variant_dataset = pv4.variant_calls()
variant_dataset

<xarray.Dataset>
Dimensions:              (variants: 4571056, alleles: 7, samples: 1895, ploidy: 2)
Coordinates:
    variant_position     (variants) int32 dask.array<chunksize=(65536,), meta=np.ndarray>
    variant_chrom        (variants) object dask.array<chunksize=(65536,), meta=np.ndarray>
    sample_id            (samples) object dask.array<chunksize=(1895,), meta=np.ndarray>
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
    variant_allele       (variants, alleles) object dask.array<chunksize=(65536, 1), meta=np.ndarray>
    variant_filter_pass  (variants) bool dask.array<chunksize=(65536,), meta=np.ndarray>
    variant_is_snp       (variants) bool dask.array<chunksize=(65536,), meta=np.ndarray>
    variant_numalt       (variants) int32 dask.array<chunksize=(65536,), meta=np.ndarray>
    variant_CDS          (variants) bool dask.array<chunksize=(65536,), meta=np.ndarray>
    call_genotype        (variants, samples, ploidy) int8 dask.array<chunksize=(65536, 64, 2), meta=np.ndarray>
    call_AD              (variants, samples, alleles) int16 dask.array<chunksize=(65536, 64, 7), meta=np.ndarray>

The default returns a basic set of data most commonly used for data analysis. However, for more complex analysis the full range of variables available in the zarr can be accessed by setting the extended flag to True, as shown below:

extended_variant_dataset = pv4.variant_calls(extended=True)
extended_variant_dataset

<xarray.Dataset>
Dimensions:                                   (variants: 4571056, alleles: 7, samples: 1895, ploidy: 2, genotypes: 3, alt_alleles: 6)
Coordinates:
    variant_position                          (variants) int32 dask.array<chunksize=(65536,), meta=np.ndarray>
    variant_chrom                             (variants) object dask.array<chunksize=(65536,), meta=np.ndarray>
    sample_id                                 (samples) object dask.array<chunksize=(1895,), meta=np.ndarray>
Dimensions without coordinates: variants, alleles, samples, ploidy, genotypes, alt_alleles
Data variables: (12/42)
    variant_allele                            (variants, alleles) object dask.array<chunksize=(65536, 1), meta=np.ndarray>
    variant_filter_pass                       (variants) bool dask.array<chunksize=(65536,), meta=np.ndarray>
    variant_is_snp                            (variants) bool dask.array<chunksize=(65536,), meta=np.ndarray>
    variant_numalt                            (variants) int32 dask.array<chunksize=(65536,), meta=np.ndarray>
    variant_CDS                               (variants) bool dask.array<chunksize=(65536,), meta=np.ndarray>
    call_genotype                             (variants, samples, ploidy) int8 dask.array<chunksize=(65536, 64, 2), meta=np.ndarray>
    ...                                        ...
    variant_SNPEFF_IMPACT                     (variants) object dask.array<chunksize=(65536,), meta=np.ndarray>
    variant_SNPEFF_TRANSCRIPT_ID              (variants) object dask.array<chunksize=(65536,), meta=np.ndarray>
    variant_SOR                               (variants) float32 dask.array<chunksize=(65536,), meta=np.ndarray>
    variant_VQSLOD                            (variants) float32 dask.array<chunksize=(65536,), meta=np.ndarray>
    variant_VariantType                       (variants) object dask.array<chunksize=(65536,), meta=np.ndarray>
    variant_altlen                            (variants, alt_alleles) int32 dask.array<chunksize=(65536, 6), meta=np.ndarray>

Each of the elements in this xarray dataset is a dask array. The individual dask arrays can be accessed as follows, replacing the string with the variable you are looking for:

pos = variant_dataset["variant_position"].data
pos

	Array	Chunk
Bytes	17.44 MiB	256.00 kiB
Shape	(4571056,)	(65536,)
Count	70 Tasks	70 Chunks
Type	int32	numpy.ndarray

Genotypes¶

Genotypes for individual samples are available.

Genotypes are stored as a three-dimensional array, where:

the first dimension corresponds to genomic positions,
the second dimension is samples,
the third dimension is ploidy (2).

Values coded as integers, where -1 represents a missing value, 0 represents the reference allele, and 1, 2, and 3 represent alternate alleles.

Variant genotypes can be accessed as dask arrays as shown below.

gt = variant_dataset["call_genotype"].data
gt

	Array	Chunk
Bytes	16.13 GiB	8.00 MiB
Shape	(4571056, 1895, 2)	(65536, 64, 2)
Count	2100 Tasks	2100 Chunks
Type	int8	numpy.ndarray

Note that the columns of this array (second dimension) match the rows in the sample metadata. You can use this correspondance to apply further subsetting operations to the genotypes by querying the sample metadata. E.g.:

loc_cambodia = pv4_metadata.eval("Country == 'Cambodia'").values
print(f"found {np.count_nonzero(loc_cambodia)} samples from Cambodia")
variant_dataset_cambodia = variant_dataset.isel(samples=loc_cambodia)
variant_dataset_cambodia

found 236 samples from Cambodia

<xarray.Dataset>
Dimensions:              (variants: 4571056, alleles: 7, samples: 236, ploidy: 2)
Coordinates:
    variant_position     (variants) int32 dask.array<chunksize=(65536,), meta=np.ndarray>
    variant_chrom        (variants) object dask.array<chunksize=(65536,), meta=np.ndarray>
    sample_id            (samples) object dask.array<chunksize=(236,), meta=np.ndarray>
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
    variant_allele       (variants, alleles) object dask.array<chunksize=(65536, 1), meta=np.ndarray>
    variant_filter_pass  (variants) bool dask.array<chunksize=(65536,), meta=np.ndarray>
    variant_is_snp       (variants) bool dask.array<chunksize=(65536,), meta=np.ndarray>
    variant_numalt       (variants) int32 dask.array<chunksize=(65536,), meta=np.ndarray>
    variant_CDS          (variants) bool dask.array<chunksize=(65536,), meta=np.ndarray>
    call_genotype        (variants, samples, ploidy) int8 dask.array<chunksize=(65536, 2, 2), meta=np.ndarray>
    call_AD              (variants, samples, alleles) int16 dask.array<chunksize=(65536, 2, 7), meta=np.ndarray>

The data on genomic variants can be loaded into memory as numpy arrays as shown in the following example, where we read genotypes for the first 5 SNPs and the first 3 samples:

g = gt[:5, :3, :].compute()
g

array([[[-1, -1],
        [ 0,  0],
        [-1, -1]],

       [[-1, -1],
        [ 0,  0],
        [-1, -1]],

       [[-1, -1],
        [ 0,  0],
        [-1, -1]],

       [[-1, -1],
        [ 0,  0],
        [-1, -1]],

       [[-1, -1],
        [ 0,  0],
        [-1, -1]]], dtype=int8)

If you want to work with the genotype calls, you may find it convenient to use scikit-allel. E.g., the code below sets up a genotype array using the Cambodian samples subset we created above.

# use the scikit-allel wrapper class for genotype calls
gt = allel.GenotypeDaskArray(variant_dataset_cambodia["call_genotype"].data)
gt

	0	1	2	3	4	...	231	232	233	234	235
0	./.	./.	./.	0/0	0/0	...	./.	0/0	./.	0/0	./.
1	./.	./.	./.	0/0	0/0	...	./.	0/0	./.	0/0	./.
2	./.	./.	./.	0/0	0/0	...	./.	0/0	./.	./.	./.
...	...
4571053	./.	./.	./.	./.	./.	...	./.	./.	./.	./.	./.
4571054	./.	./.	./.	./.	./.	...	./.	./.	./.	./.	./.
4571055	./.	./.	./.	./.	./.	...	./.	./.	./.	./.	./.

Genome Annotations¶

Gene annotations provide information on which regions of the genome contain DNA sequences that encode genes, which are transcribed and spliced into messenger RNA (mRNA) and then translated to make proteins.

For convenience, we’ve added some functionality to the malariagen_data package for loading these gene annotations into a pandas data frame as shown below:

genome_features = pv4.genome_features()
genome_features

	contig	source	type	start	end	score	strand	phase	ID	Parent	Name	alias
0	PvP01_00_v1	chado	contig	1	402823	NaN	+	NaN	Transfer.PvP01_00_1.final	NaN	NaN	NaN
1	PvP01_00_v1	chado	gene	53	669	NaN	+	NaN	PVP01_0000010	NaN	NaN	NaN
2	PvP01_00_v1	chado	mRNA	53	669	NaN	+	NaN	PVP01_0000010.1	PVP01_0000010	NaN	NaN
3	PvP01_00_v1	chado	CDS	53	135	NaN	+	0.0	PVP01_0000010.1:exon:1	PVP01_0000010.1	NaN	NaN
4	PvP01_00_v1	chado	CDS	199	415	NaN	+	1.0	PVP01_0000010.1:exon:2	PVP01_0000010.1	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...
38676	PvP01_MIT_v1	chado	CDS	4782	5912	NaN	+	0.0	PVP01_MIT03400.1:exon:1	PVP01_MIT03400.1	NaN	NaN
38677	PvP01_MIT_v1	chado	polypeptide	4782	5912	NaN	+	NaN	PVP01_MIT03400.1:pep	NaN	NaN	NaN
38678	PvP01_MIT_v1	chado	gene	5914	5985	NaN	+	NaN	PVP01_MIT03500	NaN	NaN	NaN
38679	PvP01_MIT_v1	chado	rRNA	5914	5985	NaN	+	NaN	PVP01_MIT03500.1	PVP01_MIT03500	NaN	NaN
38680	PvP01_MIT_v1	chado	CDS	5914	5985	NaN	+	0.0	PVP01_MIT03500.1:exon:1	PVP01_MIT03500.1	NaN	NaN

38681 rows × 12 columns

The above loads a default set of attributes "ID", "Parent", "Name", "alias". To access all features set attributes to "*".

pv4.genome_features(attributes="*")

	contig	source	type	start	end	score	strand	phase	Dbxref	Derives_from	...	non_cytoplasmic_polypeptide_region	orthologous_to	polypeptide_domain	previous_systematic_id	product	signal_peptide	stop_codon_redefined_as_selenocysteine	synonym	translation	transmembrane_polypeptide_region
0	PvP01_00_v1	chado	contig	1	402823	NaN	+	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	PvP01_00_v1	chado	gene	53	669	NaN	+	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	PvP01_00_v1	chado	mRNA	53	669	NaN	+	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	PvP01_00_v1	chado	CDS	53	135	NaN	+	0.0	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	PvP01_00_v1	chado	CDS	199	415	NaN	+	1.0	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
38676	PvP01_MIT_v1	chado	CDS	4782	5912	NaN	+	0.0	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
38677	PvP01_MIT_v1	chado	polypeptide	4782	5912	NaN	+	NaN	NaN	PVP01_MIT03400.1	...	;query 47-73;,;query 126-134;,;query 192-218;,...	Pknowlesi:PKNH_MIT01900 link=PKNH_MIT01900.1:p...	iprscan;InterPro:IPR036150 :\tCytochrome b/b6,...	NaN	term=cytochrome b;	NaN	NaN	NaN	mnyysinlakahllnypcplninflwnygfllgiiffiqiltgvfl...	;query 24-46;,;query 74-96;,;query 103-125;,;q...
38678	PvP01_MIT_v1	chado	gene	5914	5985	NaN	+	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
38679	PvP01_MIT_v1	chado	rRNA	5914	5985	NaN	+	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
38680	PvP01_MIT_v1	chado	CDS	5914	5985	NaN	+	0.0	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

38681 rows × 34 columns

Or to get a specific set of attributes specify them in a list

pv4.genome_features(attributes=['alias','comment','product'])

	contig	source	type	start	end	score	strand	phase	alias	comment	product
0	PvP01_00_v1	chado	contig	1	402823	NaN	+	NaN	NaN	Archived from fasta_record feature Transfer.Pv...	NaN
1	PvP01_00_v1	chado	gene	53	669	NaN	+	NaN	NaN	NaN	NaN
2	PvP01_00_v1	chado	mRNA	53	669	NaN	+	NaN	NaN	NaN	NaN
3	PvP01_00_v1	chado	CDS	53	135	NaN	+	0.0	NaN	NaN	NaN
4	PvP01_00_v1	chado	CDS	199	415	NaN	+	1.0	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...
38676	PvP01_MIT_v1	chado	CDS	4782	5912	NaN	+	0.0	NaN	NaN	NaN
38677	PvP01_MIT_v1	chado	polypeptide	4782	5912	NaN	+	NaN	NaN	NaN	term=cytochrome b;
38678	PvP01_MIT_v1	chado	gene	5914	5985	NaN	+	NaN	NaN	NaN	NaN
38679	PvP01_MIT_v1	chado	rRNA	5914	5985	NaN	+	NaN	NaN	NaN	NaN
38680	PvP01_MIT_v1	chado	CDS	5914	5985	NaN	+	0.0	NaN	NaN	NaN

38681 rows × 11 columns

Genome Reference¶

We mapped sequence reads for all samples against the PvP01 reference genome.

For convenience, the reference genome sequence can be loaded as a dask array, e.g.:

ref = pv4.genome_sequence()
ref 

	Array	Chunk
Bytes	23.13 MiB	430.00 kiB
Shape	(24250245,)	(440322,)
Count	156 Tasks	78 Chunks
Type	\|S1	numpy.ndarray

This can be loaded as a numpy array using the following

ref.compute()

array([b'a', b'a', b't', ..., b't', b'a', b'a'], dtype='|S1')

The reference can also be subset by contig.

The set of contigs used can be accessed as follows:

pv4.contigs

['PvP01_01_v1',
 'PvP01_02_v1',
 'PvP01_03_v1',
 'PvP01_04_v1',
 'PvP01_05_v1',
 'PvP01_06_v1',
 'PvP01_07_v1',
 'PvP01_08_v1',
 'PvP01_09_v1',
 'PvP01_10_v1',
 'PvP01_11_v1',
 'PvP01_12_v1',
 'PvP01_13_v1',
 'PvP01_14_v1',
 'PvP01_API_v1',
 'PvP01_MIT_v1']

To load a single contig

pv4.genome_sequence(region='PvP01_01_v1')

	Array	Chunk
Bytes	0.97 MiB	249.43 kiB
Shape	(1021664,)	(255416,)
Count	4 Tasks	4 Chunks
Type	\|S1	numpy.ndarray

To load multiple contigs specify them in a list. The data will be concatenated.

pv4.genome_sequence(region=['PvP01_01_v1','PvP01_06_v1','PvP01_10_v1'])

	Array	Chunk
Bytes	3.45 MiB	378.14 kiB
Shape	(3613299,)	(387211,)
Count	24 Tasks	12 Chunks
Type	\|S1	numpy.ndarray

You can also specify a specific region of the contig.

pv4.genome_sequence(region=['PvP01_01_v1','PvP01_06_v1:15-20','PvP01_10_v1:40-50'])

	Array	Chunk
Bytes	0.97 MiB	249.43 kiB
Shape	(1021681,)	(255416,)
Count	20 Tasks	6 Chunks
Type	\|S1	numpy.ndarray

If you know the gene name you would like to access, but aren’t sure what the ID would be you can access this through the annotations. Below is an example for ?.

gene_name = str(genome_features.loc[genome_features.Name == 'ORC2'].ID.values)
print(gene_name)

['PVP01_0105500']

You can then enter this as the region

pv4.genome_sequence(region='PVP01_0105500')

	Array	Chunk
Bytes	2.56 kiB	2.56 kiB
Shape	(2625,)	(2625,)
Count	5 Tasks	1 Chunks
Type	\|S1	numpy.ndarray

MalariaGEN parasite data user guide

Pv4 Data Access

Contents