Pv4 API
This page documents the functions in the malariagen_data Python package for accessing Plasmodium vivax data.
Pv4()
malariagen_data.pv4.Pv4(data_config=None, **kwargs)
Provides access to data from the Pv4 release.
- Parameters
- url : str, optional
Base path to data. Default uses Google Cloud Storage “gs://pv4_release/”, or specify a local path on your file system if data have been downloaded.
- data_config : str, optional
Path to config for structure of Pv4 data resource. Defaults to config included with the malariagen_data package.
- **kwargs
Passed through to fsspec when setting up file system access.
Examples
Access data from Google Cloud Storage (default):
>>> import malariagen_data
>>> pv4 = malariagen_data.Pv4()
Access data downloaded to a local file system:
>>> pv4 = malariagen_data.Pv4("/local/path/to/pv4_release/")
sample_metadata()
Pv4.sample_metadata()
Access sample metadata and return as a pandas dataframe.
- Returns
- df : pandas.DataFrame
A dataframe of sample metadata on the samples that were sequenced as part of this resource. Includes the time and place of collection, quality metrics, and accession numbers. One row per sample.
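Because the dataframe has one row per sample, per-group tallies are a natural first step. The sketch below is illustrative only: it uses a few made-up records in place of real metadata, and the column names `sample`, `country` and `qc_pass` are assumptions, not confirmed column names of the Pv4 resource. With a real dataframe, equivalent records could be obtained with `df.to_dict("records")`, and pandas itself offers the same tally through `df[column].value_counts()`.

```python
from collections import Counter

# Made-up records standing in for rows of the sample metadata dataframe.
# The keys "sample", "country" and "qc_pass" are hypothetical column names.
records = [
    {"sample": "S0001", "country": "Country A", "qc_pass": True},
    {"sample": "S0002", "country": "Country A", "qc_pass": False},
    {"sample": "S0003", "country": "Country B", "qc_pass": True},
]


def count_by(rows, column):
    """Count how many rows (samples) share each value of the given column."""
    return Counter(row[column] for row in rows)


def passing_samples(rows):
    """Return the sample identifiers of rows flagged as passing quality control."""
    return [row["sample"] for row in rows if row["qc_pass"]]


print(count_by(records, "country"))
print(passing_samples(records))
```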
genome_sequence()
Pv4.genome_sequence(region='*', inline_array=True, chunks='native')
Access the reference genome sequence.
- Parameters
- region : str or list of str or Region or list of Region, optional
Chromosome (e.g., “Pf3D7_07_v3”), gene name (e.g., “PF3D7_0709000”), or genomic region defined with coordinates (e.g., “Pf3D7_07_v3:1-500”). Multiple values can be provided as a list, in which case data will be concatenated, e.g., ["Pf3D7_07_v3:1-500", "Pf3D7_02_v3:15-20", "Pf3D7_03_v3:40-50"]. Defaults to ‘*’.
- inline_array : bool, optional
Passed through to dask.array.from_array(). Defaults to True.
- chunks : str, optional
If ‘auto’ let dask decide chunk size. If ‘native’ use native zarr chunks. Can also be a target size, e.g., ‘200 MiB’. Defaults to ‘native’.
- Returns
- d : dask.array.Array
An array of nucleotides giving the reference genome sequence for the given region/gene/contig.
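To make the coordinate form of the region parameter concrete, here is a minimal sketch of splitting a "contig:start-end" string into its parts. The `ParsedRegion` type and `parse_region` function are hypothetical helpers written for illustration. They are not part of malariagen_data, and the package's own parsing may differ (for example in how it resolves gene names).

```python
import re
from typing import NamedTuple, Optional


class ParsedRegion(NamedTuple):
    """Hypothetical container for a parsed region string (not a malariagen_data type)."""

    contig: str
    start: Optional[int]
    end: Optional[int]


# Matches "contig:start-end" where start and end are non-negative integers.
_REGION_PATTERN = re.compile(r"^(?P<contig>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(region: str) -> ParsedRegion:
    """Split "contig:start-end" into parts; a bare name is returned without coordinates."""
    match = _REGION_PATTERN.match(region)
    if match is None:
        # No coordinates given: treat the whole string as a contig or gene name.
        return ParsedRegion(region, None, None)
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"start {start} is greater than end {end} in {region!r}")
    return ParsedRegion(match["contig"], start, end)


# A list of regions, as in the parameter description, is parsed item by item.
regions = ["Pf3D7_07_v3:1-500", "Pf3D7_02_v3:15-20", "Pf3D7_07_v3"]
parsed = [parse_region(r) for r in regions]
print(parsed)
```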
genome_features()
Pv4.genome_features(attributes=('ID', 'Parent', 'Name', 'alias'))
Access genome feature annotations.
- Parameters
- attributes : list of str, optional
Attribute keys to unpack into columns. Provide “*” to unpack all attributes.
- Returns
- df : pandas.DataFrame
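The idea of unpacking attribute keys into columns can be sketched as follows, assuming the annotations follow the GFF3 convention of a single attributes field holding semicolon-separated, percent-encoded `key=value` pairs. The `unpack_attributes` function and the example field are hypothetical. The sketch shows the general technique, not the package's actual implementation.

```python
from urllib.parse import unquote


def unpack_attributes(attribute_field, keys=("ID", "Parent", "Name", "alias")):
    """Turn a GFF3-style attributes string into a dict of selected keys.

    Keys that are absent from the field map to None. Passing keys="*"
    returns every key found in the field.
    """
    pairs = {}
    for item in attribute_field.strip().split(";"):
        if not item:
            continue  # tolerate a trailing or doubled semicolon
        key, _, value = item.partition("=")
        # GFF3 percent-encodes reserved characters inside values.
        pairs[key.strip()] = unquote(value.strip())
    if keys == "*":
        return pairs
    return {key: pairs.get(key) for key in keys}


# Made-up attributes field for illustration.
field = "ID=gene1;Name=example%20gene;biotype=protein_coding"
print(unpack_attributes(field))
print(unpack_attributes(field, keys="*"))
```

Each returned key corresponds to one column in the resulting table, which is why missing keys are kept with a None value rather than dropped.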
variant_calls()
Pv4.variant_calls(extended=False, inline_array=True, chunks='native')
Access variant sites, site filters and genotype calls.
- Parameters
- extended : bool, optional
If False only the default variables are returned. If True all variables from the zarr are returned. Defaults to False.
- inline_array : bool, optional
Passed through to dask.array.from_array(). Defaults to True.
- chunks : str, optional
If ‘auto’ let dask decide chunk size. If ‘native’ use native zarr chunks. Can also be a target size, e.g., ‘200 MiB’. Defaults to ‘native’.
- Returns
- ds : xarray.Dataset
Dataset containing either default or extended variables from the variant calls Zarr.
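The chunks parameter accepts two keywords and a size string, which the sketch below tells apart and, for a size string, converts to a byte count. The functions `target_size_to_bytes` and `describe_chunks` are hypothetical helpers for illustration only. How malariagen_data interprets the value internally is not stated here. dask provides a comparable helper, `dask.utils.parse_bytes`.

```python
import re

# Multipliers for common decimal and binary size units.
_UNITS = {
    "B": 1,
    "kB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "KiB": 2**10,
    "MiB": 2**20,
    "GiB": 2**30,
}


def target_size_to_bytes(size: str) -> int:
    """Convert a size string such as '200 MiB' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", size)
    if match is None:
        raise ValueError(f"cannot parse size {size!r}")
    number, unit = match.groups()
    if unit not in _UNITS:
        raise ValueError(f"unknown unit {unit!r} in {size!r}")
    return int(float(number) * _UNITS[unit])


def describe_chunks(chunks: str) -> str:
    """Describe what a given chunks value asks for."""
    if chunks == "auto":
        return "let dask decide the chunk size"
    if chunks == "native":
        return "use the native zarr chunks"
    return f"aim for chunks of about {target_size_to_bytes(chunks)} bytes"


for value in ("native", "auto", "200 MiB"):
    print(value, "->", describe_chunks(value))
```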