Pf8 Data Access¶

For a quick overview of the spatial and geographical distribution of the available samples, please visit our Pf8 web-app.

Default Access: no authentication required¶

Plasmodium falciparum version 8 (Pf8) data are stored and openly accessible from the Wellcome Sanger Institute’s S3 storage. No authentication is required to access these data using the malariagen_data Python package’s default settings.

Google Cloud Access: how to authenticate¶

Some Pf8 data (sample metadata, reference genome, genome annotations, genotype information in zarr format, genetic distance matrix) are accessible via Google Cloud Storage. Access to Google Cloud requires users to register for authentication. For instructions on how to authenticate, please see here.

Accessing Pf8 with the malariagen_data Python package¶

This notebook illustrates how to read data directly from the cloud (S3 or Google Cloud), without first downloading any data locally. It can be run from any computer, but will work best when run from a compute node within Google Cloud, because it will be physically closer to the data and so data transfer is faster. For example, this notebook can be run via MyBinder or Google Colab, which are free interactive computing services running in the cloud.

To launch this notebook in the cloud and run it for yourself, click the launch icon (shaped like a rocket) at the top of the page and select one of the cloud computing services available.

Setup¶

Running this notebook requires some Python packages to be installed. These packages can be installed via pip or conda. E.g.:

!pip install -q --no-warn-conflicts malariagen_data

To make accessing these data more convenient, we’ve created the malariagen_data Python package, which is available from PyPI. This is experimental so please let us know if you find any bugs or have any suggestions.

Now import the packages we’ll need to use here.

import numpy as np
import dask
import dask.array as da
from dask.diagnostics.progress import ProgressBar
import allel
# silence some warnings
dask.config.set(**{'array.slicing.split_large_chunks': False})
import malariagen_data

To access the Pf8 data stored on S3 (the default option, which requires no authentication), use the following code:

pf8 = malariagen_data.Pf8()

To access the Pf8 data stored on Google Cloud, use the following code:

pf8_gcs = malariagen_data.Pf8("gs://pf8-release/")

Metadata¶

We will continue using the data accessed from S3.

Metadata are available for the samples that were sequenced as part of this resource. They include the time and place of collection, quality metrics, and accession numbers.

To see all the information available, load sample metadata into a pandas dataframe:

pf8_metadata = pf8.sample_metadata()

pf8_metadata.head()

	Sample	Study	Country	Admin level 1	Country latitude	Country longitude	Admin level 1 latitude	Admin level 1 longitude	Year	ENA	All samples same case	Population	% callable	QC pass	Exclusion reason	Sample type	Sample was in Pf7
0	FP0008-C	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR1081237	FP0008-C	AF-W	82.48	True	Analysis_set	gDNA	True
1	FP0009-C	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR1081238	FP0009-C	AF-W	88.95	True	Analysis_set	gDNA	True
2	FP0010-CW	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR2889621	FP0010-CW	AF-W	87.01	True	Analysis_set	sWGA	True
3	FP0011-CW	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR2889624	FP0011-CW	AF-W	86.95	True	Analysis_set	sWGA	True
4	FP0012-CW	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR2889627	FP0012-CW	AF-W	89.86	True	Analysis_set	sWGA	True

print("The data set has {} samples and {} fields".format(pf8_metadata.shape[0],pf8_metadata.shape[1]))

The data set has 33325 samples and 17 fields

We can explore each of the fields:

The Sample column gives the unique sample identifier used throughout all Pf8 analyses.
The Study column refers to the partner study which collected the sample.
The Country and Admin level 1 columns describe the location where the sample was collected.
The Country latitude, Country longitude, Admin level 1 latitude and Admin level 1 longitude columns contain the GADM coordinates for each country and administrative level 1 division.
The Year column gives the time of sample collection.
The ENA column gives the run accession(s) for the sequencing read data for each sample.
The All samples same case column identifies sets of samples collected from the same individual.
The Population column gives the population to which the sample has been assigned. The possible values are: Africa - West (AF-W), Africa - Central (AF-C), Africa - East (AF-E), Africa - Northeast (AF-NE), Asia - South - East (AS-S-E), Asia - South - Far East (AS-S-FE), Asia - Southeast - West (AS-SE-W), Asia - Southeast - East (AS-SE-E), Oceania - New Guinea (OC-NG), South America (SA).
The % callable column gives the percentage of the genome with coverage of at least 5 reads and with fewer than 10% of reads having mapping quality 0.
The QC pass column defines whether the sample passed (True) or failed (False) QC.
The Exclusion reason column describes why a particular sample was excluded from the main analysis. In the rows shown above, samples that passed QC have the value Analysis_set.
The Sample type column gives details on the DNA preparation method used.
The Sample was in Pf7 column defines whether the sample was included in the previous version of the data release (Pf7) or if it is new to Pf8.
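The % callable definition above can be illustrated on a toy example. This is only a sketch of the arithmetic, assuming per-position read depth and mapping-quality-0 read counts are available as arrays. The names depth and mq0 and all values are hypothetical, and this is not the code used to produce the released metric.

```python
import numpy as np

# Toy per-position values (hypothetical, not real Pf8 data).
depth = np.array([12, 3, 8, 20, 5, 0])  # total reads covering each position
mq0 = np.array([0, 0, 1, 1, 0, 0])      # reads with mapping quality 0

# Fraction of MQ0 reads per position; positions with no reads are set to 0.
mq0_fraction = np.divide(mq0, depth, out=np.zeros(len(depth)), where=depth > 0)

# A position counts as callable if coverage is at least 5 reads
# and fewer than 10% of those reads have mapping quality 0.
callable_mask = (depth >= 5) & (mq0_fraction < 0.10)
pct_callable = 100 * callable_mask.mean()
print(f"{pct_callable:.2f}% callable")
```

In this toy example three of the six positions meet both conditions.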

The Python package pandas can be used to explore and query the sample metadata in different ways. For example, here is a summary of the numbers of samples grouped by the country they were collected in:

pf8_metadata.groupby("Country").size()

	0
Country
Bangladesh	1658
Benin	334
Burkina Faso	58
Cambodia	2282
Cameroon	294
Colombia	167
Côte d'Ivoire	71
Democratic Republic of the Congo	1549
Ethiopia	35
Gabon	59
Gambia	1998
Ghana	6653
Guinea	199
Honduras	8
India	318
Indonesia	133
Kenya	2142
Laos	1994
Madagascar	25
Malawi	681
Mali	2428
Mauritania	104
Mozambique	1348
Myanmar	1268
Nigeria	1303
Papua New Guinea	251
Peru	106
Senegal	155
Sudan	356
Tanzania	1144
Thailand	1157
Uganda	15
Venezuela	2
Vietnam	2700

dtype: int64
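The same grouping can be combined with a filter on the QC pass column. The sketch below uses a small toy dataframe with hypothetical rows in place of pf8_metadata, keeping only the column names described above, so that it runs without any data access.

```python
import pandas as pd

# Toy stand-in for pf8_metadata with only the columns used here (hypothetical rows).
toy_metadata = pd.DataFrame(
    {
        "Sample": ["S1", "S2", "S3", "S4", "S5"],
        "Country": ["Ghana", "Ghana", "Kenya", "Kenya", "Peru"],
        "Population": ["AF-W", "AF-W", "AF-E", "AF-E", "SA"],
        "QC pass": [True, False, True, True, True],
    }
)

# Keep only samples that passed QC, using the boolean column as a mask.
qc_pass = toy_metadata[toy_metadata["QC pass"]]

# Count QC-pass samples per population and country.
counts = qc_pass.groupby(["Population", "Country"]).size()
print(counts)
```

Applied to the real metadata, the same two steps would give per-country counts restricted to the samples that passed QC.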

Variant Calls¶

Two variant callset versions were created for Pf8 from all samples in the release:

a Full callset: Contains details of 12,493,205 discovered variant genome positions. Variants are single nucleotide polymorphisms (SNPs), short insertion/deletions (indels), or a combination of SNPs and indels.
a SNP-only callset: A subset of the full callset, with all indel variants removed, leaving only SNPs. There are 10,821,552 SNPs in this callset.

Data on variant calls, including the genomic positions, alleles, and genotypes, can be accessed as an xarray Dataset:

# Access the full callset
variant_dataset = pf8.variant_calls()
variant_dataset

<xarray.Dataset> Size: 7TB
Dimensions:              (variants: 12493205, alleles: 7, samples: 33325,
                          ploidy: 2)
Coordinates:
    variant_position     (variants) int32 50MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_chrom        (variants) object 100MB dask.array<chunksize=(131072,), meta=np.ndarray>
    sample_id            (samples) object 267kB dask.array<chunksize=(16663,), meta=np.ndarray>
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
    variant_allele       (variants, alleles) object 700MB dask.array<chunksize=(131072, 1), meta=np.ndarray>
    variant_filter_pass  (variants) bool 12MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_is_snp       (variants) bool 12MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_numalt       (variants) int32 50MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_CDS          (variants) bool 12MB dask.array<chunksize=(131072,), meta=np.ndarray>
    call_genotype        (variants, samples, ploidy) int8 833GB dask.array<chunksize=(131072, 100, 2), meta=np.ndarray>
    call_AD              (variants, samples, alleles) int16 6TB dask.array<chunksize=(131072, 100, 7), meta=np.ndarray>

# Access the SNP-only callset
pf8_snp_only  = malariagen_data.Pf8('s3://pf8-release/snp-only/')
variant_dataset_snp_only = pf8_snp_only.variant_calls()
variant_dataset_snp_only

<xarray.Dataset> Size: 6TB
Dimensions:              (variants: 10821552, alleles: 7, samples: 33325,
                          ploidy: 2)
Coordinates:
    variant_position     (variants) int32 43MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_chrom        (variants) object 87MB dask.array<chunksize=(131072,), meta=np.ndarray>
    sample_id            (samples) object 267kB dask.array<chunksize=(16663,), meta=np.ndarray>
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
    variant_allele       (variants, alleles) object 606MB dask.array<chunksize=(131072, 1), meta=np.ndarray>
    variant_filter_pass  (variants) bool 11MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_is_snp       (variants) bool 11MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_numalt       (variants) int32 43MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_CDS          (variants) bool 11MB dask.array<chunksize=(131072,), meta=np.ndarray>
    call_genotype        (variants, samples, ploidy) int8 721GB dask.array<chunksize=(131072, 100, 2), meta=np.ndarray>
    call_AD              (variants, samples, alleles) int16 5TB dask.array<chunksize=(131072, 100, 7), meta=np.ndarray>

By default, variant_calls() returns a basic set of variables that are most commonly used for data analysis. For more complex analyses, the full range of variables available in the zarr can be accessed by setting the extended flag to True, as shown below:

extended_variant_dataset = pf8.variant_calls(extended=True)
extended_variant_dataset

<xarray.Dataset> Size: 31TB
Dimensions:                                        (variants: 12493205,
                                                    alleles: 7, samples: 33325,
                                                    ploidy: 2, genotypes: 3,
                                                    sb_statistics: 4,
                                                    alt_alleles: 6)
Coordinates:
    variant_position                               (variants) int32 50MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_chrom                                  (variants) object 100MB dask.array<chunksize=(131072,), meta=np.ndarray>
    sample_id                                      (samples) object 267kB dask.array<chunksize=(16663,), meta=np.ndarray>
Dimensions without coordinates: variants, alleles, samples, ploidy, genotypes,
                                sb_statistics, alt_alleles
Data variables: (12/90)
    variant_allele                                 (variants, alleles) object 700MB dask.array<chunksize=(131072, 1), meta=np.ndarray>
    variant_filter_pass                            (variants) bool 12MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_is_snp                                 (variants) bool 12MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_numalt                                 (variants) int32 50MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_CDS                                    (variants) bool 12MB dask.array<chunksize=(131072,), meta=np.ndarray>
    call_genotype                                  (variants, samples, ploidy) int8 833GB dask.array<chunksize=(131072, 100, 2), meta=np.ndarray>
    ...                                             ...
    variant_ReadPosRankSum                         (variants) float32 50MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_RegionType                             (variants) object 100MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_SOR                                    (variants) float32 50MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_VQSLOD                                 (variants) float32 50MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_culprit                                (variants) object 100MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_set                                    (variants) object 100MB dask.array<chunksize=(131072,), meta=np.ndarray>

Each of the elements in this xarray dataset is a dask array. The individual dask arrays can be accessed as follows, replacing the string with the variable you are looking for:

pos = variant_dataset["variant_position"].data
pos

	Array	Chunk
Bytes	47.66 MiB	512.00 kiB
Shape	(12493205,)	(131072,)
Dask graph	96 chunks in 1 graph layer
Data type	int32 numpy.ndarray
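Position and chromosome arrays like these are typically used to build a boolean mask that selects the variants in a genomic interval. The sketch below uses small in-memory NumPy arrays with hypothetical values as stand-ins for variant_chrom and variant_position. The interval itself is also arbitrary and chosen only for illustration.

```python
import numpy as np

# Toy stand-ins for variant_chrom and variant_position (hypothetical values).
chrom = np.array(
    ["Pf3D7_01_v3", "Pf3D7_01_v3", "Pf3D7_07_v3", "Pf3D7_07_v3", "Pf3D7_07_v3"]
)
pos = np.array([100, 250, 403000, 403625, 500000], dtype="int32")

# Boolean mask for variants on one contig within an inclusive interval.
in_region = (chrom == "Pf3D7_07_v3") & (pos >= 403000) & (pos <= 404000)

# Integer indices of the selected variants.
selected = np.flatnonzero(in_region)
print(selected)
```

With the real dataset the arrays are lazy dask arrays, so the mask would need to be computed before use. The resulting mask or indices could then be used to subset the variants dimension in the same way that samples are subset later in this notebook.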

Genotypes¶

Genotypes for individual samples are available.

Genotypes are stored as a three-dimensional array, where:

the first dimension corresponds to genomic positions,
the second dimension is samples,
the third dimension is ploidy (2).

Values are coded as integers, where -1 represents a missing value, 0 represents the reference allele, and positive integers (1, 2, 3, and so on) represent alternate alleles.
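To make this encoding concrete, the following sketch classifies calls in a tiny hand-written genotype array with the same three-dimensional layout. The values are hypothetical and the variable names are chosen only for this illustration.

```python
import numpy as np

# Toy genotype array with shape (variants, samples, ploidy); values are hypothetical.
g_toy = np.array(
    [
        [[0, 0], [0, 1], [-1, -1]],
        [[1, 1], [0, 0], [0, 2]],
    ],
    dtype="int8",
)

# A call is missing if any of its alleles is coded as -1.
is_missing = (g_toy < 0).any(axis=2)
# Homozygous reference: both alleles are 0.
is_hom_ref = (g_toy == 0).all(axis=2)
# Heterozygous: two different alleles were called.
is_het = ~is_missing & (g_toy[..., 0] != g_toy[..., 1])
# Homozygous alternate: the same non-reference allele twice.
is_hom_alt = ~is_missing & (g_toy[..., 0] == g_toy[..., 1]) & (g_toy[..., 0] > 0)

print(is_missing.sum(), is_hom_ref.sum(), is_het.sum(), is_hom_alt.sum())
```

Each boolean array has shape (variants, samples), so summing along either axis gives per-variant or per-sample counts.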

Variant genotypes can be accessed as dask arrays as shown below.

gt = variant_dataset["call_genotype"].data
gt

	Array	Chunk
Bytes	775.49 GiB	25.00 MiB
Shape	(12493205, 33325, 2)	(131072, 100, 2)
Dask graph	32064 chunks in 1 graph layer
Data type	int8 numpy.ndarray
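The figures in this summary follow directly from the array shape, the chunk shape and the one-byte int8 data type. The short calculation below reproduces them using only the numbers shown above.

```python
import math

shape = (12_493_205, 33_325, 2)  # variants, samples, ploidy (from the output above)
chunks = (131_072, 100, 2)       # chunk shape (from the output above)
itemsize = 1                     # int8 uses one byte per value

total_bytes = math.prod(shape) * itemsize
chunk_bytes = math.prod(chunks) * itemsize

# Number of chunks is the product of the per-dimension chunk counts, rounded up.
n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

print(f"{total_bytes / 2**30:.2f} GiB, {chunk_bytes / 2**20:.2f} MiB per chunk, {n_chunks} chunks")
# prints: 775.49 GiB, 25.00 MiB per chunk, 32064 chunks
```

Because the full array is far larger than typical memory, it is worth subsetting by variants or samples before calling compute().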

Note that the columns of this array (second dimension) match the rows in the sample metadata. You can use this correspondence to apply further subsetting operations to the genotypes by querying the sample metadata. E.g.:

loc_colombia = pf8_metadata.eval("Country == 'Colombia'").values
print(f"found {np.count_nonzero(loc_colombia)} samples from Colombia")
variant_dataset_colombia = variant_dataset.isel(samples=loc_colombia)
variant_dataset_colombia

found 167 samples from Colombia

<xarray.Dataset> Size: 34GB
Dimensions:              (variants: 12493205, alleles: 7, samples: 167,
                          ploidy: 2)
Coordinates:
    variant_position     (variants) int32 50MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_chrom        (variants) object 100MB dask.array<chunksize=(131072,), meta=np.ndarray>
    sample_id            (samples) object 1kB dask.array<chunksize=(167,), meta=np.ndarray>
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
    variant_allele       (variants, alleles) object 700MB dask.array<chunksize=(131072, 1), meta=np.ndarray>
    variant_filter_pass  (variants) bool 12MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_is_snp       (variants) bool 12MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_numalt       (variants) int32 50MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_CDS          (variants) bool 12MB dask.array<chunksize=(131072,), meta=np.ndarray>
    call_genotype        (variants, samples, ploidy) int8 4GB dask.array<chunksize=(131072, 99, 2), meta=np.ndarray>
    call_AD              (variants, samples, alleles) int16 29GB dask.array<chunksize=(131072, 99, 7), meta=np.ndarray>
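Boolean-mask subsetting of this kind relies on the metadata rows and the genotype columns being in the same order. A cautious workflow might confirm this before building the mask. The sketch below shows such a check on toy data, where sample_id is a hypothetical stand-in for the values of the dataset's sample_id coordinate and toy_metadata for the metadata dataframe.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical values) for the dataset's sample_id values
# and for the sample metadata dataframe.
sample_id = np.array(["S1", "S2", "S3", "S4"])
toy_metadata = pd.DataFrame(
    {
        "Sample": ["S1", "S2", "S3", "S4"],
        "Country": ["Ghana", "Colombia", "Kenya", "Colombia"],
    }
)

# Confirm that metadata rows and genotype columns are in the same order.
same_order = bool((toy_metadata["Sample"].to_numpy() == sample_id).all())

# Build the boolean mask and use it to pick out the matching sample identifiers.
loc_colombia = (toy_metadata["Country"] == "Colombia").to_numpy()
selected_ids = sample_id[loc_colombia]
print(same_order, selected_ids)
```

If the check failed, the metadata could be reindexed by sample identifier before the mask is applied.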

The data on genomic variants can be loaded into memory as numpy arrays as shown in the following example, where we read genotypes for the first 5 variants and the first 3 samples:

g = gt[:5, :3, :].compute()
g

array([[[-1, -1],
        [ 0,  0],
        [-1, -1]],

       [[-1, -1],
        [ 0,  0],
        [-1, -1]],

       [[-1, -1],
        [ 0,  0],
        [-1, -1]],

       [[-1, -1],
        [ 0,  0],
        [-1, -1]],

       [[-1, -1],
        [ 0,  0],
        [-1, -1]]], dtype=int8)

If you want to work with the genotype calls, you may find it convenient to use scikit-allel. E.g., the code below sets up a genotype array using the Colombian samples subset we created above.

# use the scikit-allel wrapper class for genotype calls
gt = allel.GenotypeDaskArray(variant_dataset_colombia["call_genotype"].data)
gt

	0	1	2	3	4	...	162	163	164	165	166
0	0/0	0/0	./.	0/0	0/0	...	./.	./.	./.	./.	./.
1	0/0	0/0	./.	0/0	0/0	...	./.	./.	./.	./.	./.
2	0/0	0/0	./.	0/0	0/0	...	./.	./.	./.	./.	./.
...	...
12493202	./.	./.	./.	./.	./.	...	./.	./.	./.	./.	./.
12493203	./.	./.	./.	./.	./.	...	./.	./.	./.	./.	./.
12493204	./.	./.	./.	./.	./.	...	./.	./.	./.	./.	./.
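A common next step with a genotype array is to count how often each allele is observed at each variant. scikit-allel genotype arrays offer a count_alleles() method for this. The sketch below shows the underlying computation in plain NumPy on a tiny hypothetical array, so it is an illustration of the idea and not scikit-allel's implementation.

```python
import numpy as np

# Toy genotype array with shape (variants, samples, ploidy); values are hypothetical.
g_toy = np.array(
    [
        [[0, 0], [0, 1], [-1, -1]],
        [[1, 1], [0, 0], [0, 2]],
    ],
    dtype="int8",
)

n_alleles = 3  # reference allele plus two alternate alleles in this toy example

# For each allele index, count its occurrences per variant across samples and ploidy.
# Missing calls (-1) never match an allele index, so they are ignored.
allele_counts = np.stack(
    [(g_toy == allele).sum(axis=(1, 2)) for allele in range(n_alleles)], axis=1
)
print(allele_counts)
```

Each row of allele_counts corresponds to one variant and each column to one allele index, starting with the reference allele.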

Genome Annotations¶

Gene annotations provide information on which regions of the genome contain DNA sequences that encode genes, which are transcribed and spliced into messenger RNA (mRNA) and then translated to make proteins.

For convenience, we’ve added some functionality to the malariagen_data package for loading these gene annotations into a pandas data frame as shown below:

genome_features = pf8.genome_features()
genome_features

	contig	source	type	start	end	score	strand	phase	ID	Parent	Name
0	Pf3D7_13_v3	VEuPathDB	protein_coding_gene	624510	626292	NaN	+	NaN	PF3D7_1314600	NaN	LipL1
1	Pf3D7_13_v3	VEuPathDB	mRNA	624510	626292	NaN	+	NaN	PF3D7_1314600.1	PF3D7_1314600	NaN
2	Pf3D7_13_v3	VEuPathDB	exon	624510	626292	NaN	+	NaN	exon_PF3D7_1314600.1-E1	PF3D7_1314600.1	NaN
3	Pf3D7_13_v3	VEuPathDB	CDS	624785	626011	NaN	+	0.0	PF3D7_1314600.1-p1-CDS1	PF3D7_1314600.1	NaN
4	Pf3D7_13_v3	VEuPathDB	five_prime_UTR	624510	624784	NaN	+	NaN	utr_PF3D7_1314600.1_1	PF3D7_1314600.1	NaN
...	...	...	...	...	...	...	...	...	...	...	...
50065	Pf3D7_13_v3	VEuPathDB	five_prime_UTR	957496	958095	NaN	-	NaN	utr_PF3D7_1322600.1_2	PF3D7_1322600.1	NaN
50066	Pf3D7_02_v3	VEuPathDB	protein_coding_gene	701245	702894	NaN	+	NaN	PF3D7_0216900	NaN	NaN
50067	Pf3D7_02_v3	VEuPathDB	mRNA	701245	702894	NaN	+	NaN	PF3D7_0216900.1	PF3D7_0216900	NaN
50068	Pf3D7_02_v3	VEuPathDB	exon	701245	702894	NaN	+	NaN	exon_PF3D7_0216900.1-E1	PF3D7_0216900.1	NaN
50069	Pf3D7_02_v3	VEuPathDB	CDS	701245	702894	NaN	+	0.0	PF3D7_0216900.1-p1-CDS1	PF3D7_0216900.1	NaN

50070 rows × 11 columns

The above loads a default set of attributes "ID", "Parent", "Name", "alias". To load all available attributes, set attributes to "*":

pf8.genome_features(attributes="*")

	contig	source	type	start	end	score	strand	phase	ID	Name	Note	Parent	description	gene_id	protein_source_id
0	Pf3D7_13_v3	VEuPathDB	protein_coding_gene	624510	626292	NaN	+	NaN	PF3D7_1314600	LipL1	NaN	NaN	lipoate-protein ligase 1	NaN	NaN
1	Pf3D7_13_v3	VEuPathDB	mRNA	624510	626292	NaN	+	NaN	PF3D7_1314600.1	NaN	2.3.1.181 (Lipoyl(octanoyl) transferase);2.7.7...	PF3D7_1314600	lipoate-protein ligase 1	NaN	NaN
2	Pf3D7_13_v3	VEuPathDB	exon	624510	626292	NaN	+	NaN	exon_PF3D7_1314600.1-E1	NaN	NaN	PF3D7_1314600.1	NaN	PF3D7_1314600	NaN
3	Pf3D7_13_v3	VEuPathDB	CDS	624785	626011	NaN	+	0.0	PF3D7_1314600.1-p1-CDS1	NaN	NaN	PF3D7_1314600.1	NaN	PF3D7_1314600	PF3D7_1314600.1-p1
4	Pf3D7_13_v3	VEuPathDB	five_prime_UTR	624510	624784	NaN	+	NaN	utr_PF3D7_1314600.1_1	NaN	NaN	PF3D7_1314600.1	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
50065	Pf3D7_13_v3	VEuPathDB	five_prime_UTR	957496	958095	NaN	-	NaN	utr_PF3D7_1322600.1_2	NaN	NaN	PF3D7_1322600.1	NaN	NaN	NaN
50066	Pf3D7_02_v3	VEuPathDB	protein_coding_gene	701245	702894	NaN	+	NaN	PF3D7_0216900	NaN	NaN	NaN	conserved Plasmodium protein, unknown function	NaN	NaN
50067	Pf3D7_02_v3	VEuPathDB	mRNA	701245	702894	NaN	+	NaN	PF3D7_0216900.1	NaN	NaN	PF3D7_0216900	conserved Plasmodium protein, unknown function	NaN	NaN
50068	Pf3D7_02_v3	VEuPathDB	exon	701245	702894	NaN	+	NaN	exon_PF3D7_0216900.1-E1	NaN	NaN	PF3D7_0216900.1	NaN	PF3D7_0216900	NaN
50069	Pf3D7_02_v3	VEuPathDB	CDS	701245	702894	NaN	+	0.0	PF3D7_0216900.1-p1-CDS1	NaN	NaN	PF3D7_0216900.1	NaN	PF3D7_0216900	PF3D7_0216900.1-p1

50070 rows × 15 columns

Alternatively, to load a specific set of attributes, pass them as a list:

pf8.genome_features(attributes=['ID','Note','description'])

	contig	source	type	start	end	score	strand	phase	ID	Note	description
0	Pf3D7_13_v3	VEuPathDB	protein_coding_gene	624510	626292	NaN	+	NaN	PF3D7_1314600	NaN	lipoate-protein ligase 1
1	Pf3D7_13_v3	VEuPathDB	mRNA	624510	626292	NaN	+	NaN	PF3D7_1314600.1	2.3.1.181 (Lipoyl(octanoyl) transferase);2.7.7...	lipoate-protein ligase 1
2	Pf3D7_13_v3	VEuPathDB	exon	624510	626292	NaN	+	NaN	exon_PF3D7_1314600.1-E1	NaN	NaN
3	Pf3D7_13_v3	VEuPathDB	CDS	624785	626011	NaN	+	0.0	PF3D7_1314600.1-p1-CDS1	NaN	NaN
4	Pf3D7_13_v3	VEuPathDB	five_prime_UTR	624510	624784	NaN	+	NaN	utr_PF3D7_1314600.1_1	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...
50065	Pf3D7_13_v3	VEuPathDB	five_prime_UTR	957496	958095	NaN	-	NaN	utr_PF3D7_1322600.1_2	NaN	NaN
50066	Pf3D7_02_v3	VEuPathDB	protein_coding_gene	701245	702894	NaN	+	NaN	PF3D7_0216900	NaN	conserved Plasmodium protein, unknown function
50067	Pf3D7_02_v3	VEuPathDB	mRNA	701245	702894	NaN	+	NaN	PF3D7_0216900.1	NaN	conserved Plasmodium protein, unknown function
50068	Pf3D7_02_v3	VEuPathDB	exon	701245	702894	NaN	+	NaN	exon_PF3D7_0216900.1-E1	NaN	NaN
50069	Pf3D7_02_v3	VEuPathDB	CDS	701245	702894	NaN	+	0.0	PF3D7_0216900.1-p1-CDS1	NaN	NaN

50070 rows × 11 columns
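Once the annotations are in a dataframe, ordinary pandas filtering can be used to work with individual features. As a sketch, the code below rebuilds the five rows shown above for PF3D7_1314600 by hand and computes the total coding sequence length of its transcript. The dataframe here is a small stand-in and not the output of genome_features() itself.

```python
import pandas as pd

# Rows copied from the genome_features output above for PF3D7_1314600.
features = pd.DataFrame(
    {
        "type": ["protein_coding_gene", "mRNA", "exon", "CDS", "five_prime_UTR"],
        "start": [624510, 624510, 624510, 624785, 624510],
        "end": [626292, 626292, 626292, 626011, 624784],
        "Parent": [
            None,
            "PF3D7_1314600",
            "PF3D7_1314600.1",
            "PF3D7_1314600.1",
            "PF3D7_1314600.1",
        ],
    }
)

# Select the CDS features belonging to one transcript.
cds = features[(features["type"] == "CDS") & (features["Parent"] == "PF3D7_1314600.1")]

# GFF coordinates are 1-based and inclusive, so each length is end - start + 1.
cds_length = int((cds["end"] - cds["start"] + 1).sum())
print(cds_length)
```

For this single-exon transcript the coding length is 1227 bases, which is a whole number of codons.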

Genome Reference¶

We mapped sequence reads for all samples against the P. falciparum 3D7 v3 reference genome.

For convenience, the reference genome sequence can be loaded as a dask array, e.g.:

ref = pf8.genome_sequence()
ref

	Array	Chunk
Bytes	22.25 MiB	412.03 kiB
Shape	(23332839,)	(421914,)
Dask graph	63 chunks in 17 graph layers
Data type	\|S1 numpy.ndarray

This can be loaded into memory as a numpy array as follows:

ref.compute()

array([b'T', b'G', b'A', ..., b'A', b'T', b'A'], dtype='|S1')
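An array of single-byte characters like this can be converted to an ordinary string or summarised directly. The sketch below does both on a short hypothetical sequence with the same |S1 data type, and assumes all bases are upper case.

```python
import numpy as np

# Toy sequence with the same dtype as the reference array (hypothetical bases).
seq = np.array([b"T", b"G", b"A", b"C", b"A", b"T", b"A", b"G"], dtype="S1")

# Join the single-byte elements into one Python string.
seq_str = seq.tobytes().decode("ascii")

# Fraction of positions that are G or C (assumes upper-case bases).
gc_fraction = np.isin(seq, [b"G", b"C"]).mean()
print(seq_str, gc_fraction)
```

For the full reference it is usually better to load only the contig or region of interest before converting to a string.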

The reference can also be subset by contig.

The set of contigs used can be accessed as follows:

pf8.contigs

['Pf3D7_01_v3',
 'Pf3D7_02_v3',
 'Pf3D7_03_v3',
 'Pf3D7_04_v3',
 'Pf3D7_05_v3',
 'Pf3D7_06_v3',
 'Pf3D7_07_v3',
 'Pf3D7_08_v3',
 'Pf3D7_09_v3',
 'Pf3D7_10_v3',
 'Pf3D7_11_v3',
 'Pf3D7_12_v3',
 'Pf3D7_13_v3',
 'Pf3D7_14_v3',
 'Pf3D7_API_v3',
 'Pf3D7_MIT_v3']

To load a single contig:

pf8.genome_sequence(region='Pf3D7_01_v3')

	Array	Chunk
Bytes	625.83 kiB	412.03 kiB
Shape	(640851,)	(421914,)
Dask graph	2 chunks in 1 graph layer
Data type	\|S1 numpy.ndarray

To load multiple contigs, specify them in a list. The data will be concatenated.

pf8.genome_sequence(region=['Pf3D7_07_v3','Pf3D7_02_v3','Pf3D7_03_v3'])

	Array	Chunk
Bytes	3.30 MiB	412.03 kiB
Shape	(3460280,)	(421914,)
Dask graph	10 chunks in 4 graph layers
Data type	\|S1 numpy.ndarray

You can also specify a region within a contig, using the form contig:start-end:

pf8.genome_sequence(region=['Pf3D7_07_v3','Pf3D7_02_v3:15-20','Pf3D7_03_v3:40-50'])

	Array	Chunk
Bytes	1.38 MiB	412.03 kiB
Shape	(1445224,)	(421914,)
Dask graph	6 chunks in 6 graph layers
Data type	\|S1 numpy.ndarray
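The region strings above follow a contig or contig:start-end pattern. The helper below is a hypothetical sketch, not part of malariagen_data, that parses such strings and returns the number of bases each one covers, assuming 1-based inclusive coordinates. The only contig length it uses is the one for Pf3D7_01_v3 shown in the single-contig output above.

```python
def region_length(region, contig_lengths):
    """Return the number of bases covered by 'contig' or 'contig:start-end'.

    Assumes start and end are 1-based and inclusive.
    """
    contig, sep, interval = region.partition(":")
    if not sep:
        # No interval given, so the whole contig is used.
        return contig_lengths[contig]
    start, end = (int(x) for x in interval.split("-"))
    return end - start + 1


# Length taken from the single-contig output above.
contig_lengths = {"Pf3D7_01_v3": 640851}

regions = ["Pf3D7_01_v3", "Pf3D7_02_v3:15-20", "Pf3D7_03_v3:40-50"]
sizes = [region_length(r, contig_lengths) for r in regions]
print(sizes)
```

Under this reading the two sub-regions cover 6 and 11 bases, so the concatenated shape (1445224,) shown above would correspond to the whole of Pf3D7_07_v3 plus 17 bases.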

If you know the name of the gene you would like to access but aren’t sure what its ID is, you can look it up in the annotations. Below is an example for CRT.

# look up the gene ID for a given gene name and take the first match
gene_id = genome_features.loc[genome_features.Name == 'CRT', 'ID'].values[0]
print(gene_id)

PF3D7_0709000

You can then enter this as the region:

pf8.genome_sequence(region='PF3D7_0709000')

	Array	Chunk
Bytes	3.86 kiB	3.86 kiB
Shape	(3957,)	(3957,)
Dask graph	1 chunks in 2 graph layers
Data type	\|S1 numpy.ndarray
