Pf7 API

This page provides documentation for functions in the malariagen_data Python package for accessing Plasmodium falciparum data.

Pf7()

malariagen_data.pf7.Pf7(url=None, data_config=None, **kwargs)

Provides access to data from the Pf7 release.

Parameters
url : str, optional

Base path to data. Defaults to the Google Cloud Storage location “gs://pf7_release/”. Specify a local path on your file system if the data have been downloaded.

data_config : str, optional

Path to the config describing the structure of the Pf7 data resource. Defaults to the config included with the malariagen_data package.

**kwargs

Passed through to fsspec when setting up file system access.

Examples

Access data from Google Cloud Storage (default):

>>> import malariagen_data
>>> pf7 = malariagen_data.Pf7()

Access data downloaded to a local file system:

>>> pf7 = malariagen_data.Pf7("/local/path/to/pf7_release/")

sample_metadata()

Pf7.sample_metadata()

Access sample metadata and return as a pandas dataframe.

Returns
df : pandas.DataFrame

A dataframe of sample metadata on the samples that were sequenced as part of this resource. Includes the time and place of collection, quality metrics, and accession numbers. One row per sample.
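As a minimal sketch of working with the returned dataframe, the helper below reports how many samples it contains and which columns are available. The name `summarise_sample_metadata` is hypothetical and not part of malariagen_data; the sketch only assumes that `sample_metadata()` returns a pandas DataFrame with one row per sample, as described above.

```python
def summarise_sample_metadata(pf7):
    """Return (n_samples, column_names) for a Pf7 data accessor.

    Hypothetical helper for illustration; not part of malariagen_data.
    """
    df = pf7.sample_metadata()
    # The dataframe has one row per sample, so its length is the sample count.
    n_samples = len(df)
    # Column names are read from the dataframe rather than assumed here.
    column_names = list(df.columns)
    return n_samples, column_names


# Typical use (requires malariagen_data and access to the Pf7 data):
#   import malariagen_data
#   pf7 = malariagen_data.Pf7()
#   n_samples, column_names = summarise_sample_metadata(pf7)
```

Inspecting the column names first avoids hard-coding names that may differ between releases.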

genome_sequence()

Pf7.genome_sequence(region='*', inline_array=True, chunks='native')

Access the reference genome sequence.

Parameters
region : str or list of str or Region or list of Region, optional

Chromosome (e.g., “Pf3D7_07_v3”), gene name (e.g., “PF3D7_0709000”), or genomic region defined with coordinates (e.g., “Pf3D7_07_v3:1-500”). Multiple values can be provided as a list, in which case data will be concatenated, e.g., ["Pf3D7_07_v3:1-500", "Pf3D7_02_v3:15-20", "Pf3D7_03_v3:40-50"]. Defaults to “*”.

inline_array : bool, optional

Passed through to dask.array.from_array(). Defaults to True.

chunks : str, optional

If ‘auto’, let dask decide the chunk size. If ‘native’, use native zarr chunks. Can also be a target size, e.g., ‘200 MiB’. Defaults to ‘native’.

Returns
d : dask.array.Array

An array of nucleotides giving the reference genome sequence for the given region/gene/contig.
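The sketch below shows one way the region forms described above might be passed to `genome_sequence()` and the lazy result loaded into memory. The helper name `load_reference_sequence` is hypothetical and not part of malariagen_data; the only calls assumed are `genome_sequence(region=...)` as documented here and the standard dask `compute()` method on the returned array.

```python
def load_reference_sequence(pf7, region="Pf3D7_07_v3:1-500"):
    """Load the reference sequence for `region` into memory.

    Hypothetical helper for illustration; not part of malariagen_data.
    """
    # genome_sequence() returns a dask array, so no data are read yet.
    seq = pf7.genome_sequence(region=region)
    # compute() materialises the dask array as an in-memory NumPy array.
    return seq.compute()


# Typical use (requires malariagen_data and access to the Pf7 data):
#   import malariagen_data
#   pf7 = malariagen_data.Pf7()
#   whole_contig = load_reference_sequence(pf7, "Pf3D7_07_v3")
#   one_gene = load_reference_sequence(pf7, "PF3D7_0709000")
#   several = load_reference_sequence(
#       pf7, ["Pf3D7_07_v3:1-500", "Pf3D7_02_v3:15-20"]
#   )
```

Calling `compute()` on a small region keeps memory use low; for whole-genome access it is usually preferable to keep working with the dask array.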

genome_features()

Pf7.genome_features(attributes=('ID', 'Parent', 'Name', 'alias'))

Access genome feature annotations.

Parameters
attributes : list of str, optional

Attribute keys to unpack into columns. Provide “*” to unpack all attributes.

Returns
df : pandas.DataFrame
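A minimal sketch of choosing between the default attribute keys and unpacking every attribute is given below. The helper name `load_genome_features` and its `unpack_all` flag are hypothetical and not part of malariagen_data; the sketch assumes only the `attributes` parameter and the “*” value described above.

```python
DEFAULT_ATTRIBUTES = ("ID", "Parent", "Name", "alias")


def load_genome_features(pf7, unpack_all=False):
    """Load genome feature annotations from a Pf7 data accessor.

    Hypothetical helper for illustration; not part of malariagen_data.
    """
    # "*" asks for every attribute key to be unpacked into its own column;
    # otherwise only the documented default keys are unpacked.
    attributes = "*" if unpack_all else DEFAULT_ATTRIBUTES
    return pf7.genome_features(attributes=attributes)


# Typical use (requires malariagen_data and access to the Pf7 data):
#   import malariagen_data
#   pf7 = malariagen_data.Pf7()
#   df_default = load_genome_features(pf7)
#   df_all = load_genome_features(pf7, unpack_all=True)
```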

variant_calls()

Pf7.variant_calls(extended=False, inline_array=True, chunks='native')

Access variant sites, site filters and genotype calls.

Parameters
extended : bool, optional

If False, only the default variables are returned. If True, all variables from the zarr are returned. Defaults to False.

inline_array : bool, optional

Passed through to dask.array.from_array(). Defaults to True.

chunks : str, optional

If ‘auto’, let dask decide the chunk size. If ‘native’, use native zarr chunks. Can also be a target size, e.g., ‘200 MiB’. Defaults to ‘native’.

Returns
ds : xarray.Dataset

Dataset containing either default or extended variables from the variant calls Zarr.
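To see which variables the default and extended datasets contain, something like the following sketch could be used. The helper name `list_variant_call_variables` is hypothetical and not part of malariagen_data; it assumes only that `variant_calls()` returns an xarray Dataset, whose standard `data_vars` mapping holds the variable names.

```python
def list_variant_call_variables(pf7, extended=False):
    """Return the sorted names of data variables in the variant calls dataset.

    Hypothetical helper for illustration; not part of malariagen_data.
    """
    # variant_calls() returns an xarray.Dataset; listing its variables
    # does not require loading the underlying arrays.
    ds = pf7.variant_calls(extended=extended)
    # data_vars maps each variable name to its DataArray.
    return sorted(ds.data_vars)


# Typical use (requires malariagen_data and access to the Pf7 data):
#   import malariagen_data
#   pf7 = malariagen_data.Pf7()
#   default_vars = list_variant_call_variables(pf7)
#   all_vars = list_variant_call_variables(pf7, extended=True)
```

Comparing the two lists shows which variables are added by `extended=True` without hard-coding any variable names.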