malariagen_data.ag3.Ag3.pca

Ag3.pca(region: str | Region | Mapping | List[str | Region | Mapping] | Tuple[str | Region | Mapping, ...], n_snps: int, thin_offset: int = 0, sample_sets: Sequence[str] | str | None = None, sample_query: str | None = None, site_mask: str | None = 'default', min_minor_ac: int = 2, max_missing_an: int = 0, n_components: int = 20) → Tuple[DataFrame, ndarray]

Run a principal components analysis (PCA) using biallelic SNPs from the selected genome region and samples.

Parameters

region : str or Region or Mapping or list of str or Region or Mapping or tuple of str or Region or Mapping

Region of the reference genome. Can be a contig name, region string (formatted like “{contig}:{start}-{end}”), or identifier of a genome feature such as a gene or transcript. Can also be a sequence (e.g., list) of regions.
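To make the region string format concrete, here is a minimal sketch of splitting such a string into its parts. The `parse_region` helper and the `ParsedRegion` type are hypothetical and are not the library's own parser. The sketch only handles a bare contig name or the “{contig}:{start}-{end}” form; it does not resolve gene or transcript identifiers.

```python
from typing import NamedTuple, Optional


class ParsedRegion(NamedTuple):
    """Hypothetical container for the parts of a region string."""
    contig: str
    start: Optional[int]
    end: Optional[int]


def parse_region(region: str) -> ParsedRegion:
    """Hypothetical helper: split "contig" or "contig:start-end" into parts."""
    if ":" not in region:
        # No coordinates given: treat the whole string as a contig name.
        return ParsedRegion(region, None, None)
    contig, _, span = region.partition(":")
    start, _, end = span.partition("-")
    return ParsedRegion(contig, int(start), int(end))


whole_contig = parse_region("3L")
sub_region = parse_region("3L:15000000-16000000")
```

A sequence of regions, as accepted by the `region` parameter, could be handled in the same spirit by applying such a helper to each element in turn.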

n_snps : int

The desired number of SNPs to use when running the analysis. SNPs will be evenly thinned to approximately this number.

thin_offset : int, optional, default: 0

Starting index for SNP thinning. Change this to repeat the analysis using a different set of SNPs.
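The interaction between `n_snps` and `thin_offset` can be illustrated with a simple stride rule. This is a sketch under the assumption that thinning keeps every `step`-th SNP starting at `thin_offset`; the `thin_indices` function is hypothetical, and the library's exact rule may differ.

```python
def thin_indices(n_available: int, n_snps: int, thin_offset: int = 0) -> list:
    """Assumed thinning rule: keep every step-th SNP, starting at thin_offset."""
    # Integer step chosen so that roughly n_snps SNPs are kept; never below 1.
    step = max(n_available // n_snps, 1)
    return list(range(thin_offset, n_available, step))


# Two runs over the same 1000 candidate SNPs with different offsets.
first = thin_indices(n_available=1000, n_snps=100)
second = thin_indices(n_available=1000, n_snps=100, thin_offset=5)
```

Under this assumed rule the two runs select disjoint sets of SNPs, which is why changing `thin_offset` is a cheap way to check that a PCA result does not depend on one particular SNP subset. The number of SNPs kept is only approximately `n_snps` when the number of available SNPs is not an exact multiple of it.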

sample_sets : sequence of str or str or None, optional

List of sample sets and/or releases. Can also be a single sample set or release.

sample_query : str or None, optional

A pandas query string to be evaluated against the sample metadata, to select samples to be included in the returned data.

site_mask : str or None, optional, default: ‘default’

Which site filters mask to apply. See the site_mask_ids property for available values.

min_minor_ac : int, optional, default: 2

The minimum minor allele count. SNPs with a minor allele count below this value will be excluded prior to thinning.

max_missing_an : int, optional, default: 0

The maximum number of missing allele calls to accept. SNPs with more missing allele calls than this value will be excluded prior to thinning. Set to 0 (default) to require no missing calls.
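The two filters, `min_minor_ac` and `max_missing_an`, can be sketched for a single biallelic SNP as follows. The `passes_filters` function and the toy `snps` records are hypothetical and only illustrate the thresholds as described here; they are not the library's implementation.

```python
def passes_filters(allele_counts, n_missing, min_minor_ac=2, max_missing_an=0):
    """Assumed filter for one biallelic SNP.

    allele_counts: counts of the two observed alleles, e.g. (ref, alt)
    n_missing: number of missing allele calls at this SNP
    """
    # The minor allele is the less frequent of the two alleles.
    minor_ac = min(allele_counts)
    return minor_ac >= min_minor_ac and n_missing <= max_missing_an


# Toy data: allele counts and missing allele calls for three SNPs.
snps = [
    {"ac": (18, 2), "missing": 0},  # passes both filters
    {"ac": (19, 1), "missing": 0},  # fails: minor allele count below 2
    {"ac": (14, 4), "missing": 2},  # fails: has missing allele calls
]
kept_default = [i for i, s in enumerate(snps) if passes_filters(s["ac"], s["missing"])]
kept_relaxed = [
    i for i, s in enumerate(snps)
    if passes_filters(s["ac"], s["missing"], max_missing_an=2)
]
```

With the defaults only the first SNP survives; raising `max_missing_an` to 2 also admits the third.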

n_components : int, optional, default: 20

Number of components to return.

Returns

df_pca : DataFrame

A dataframe of sample metadata, with columns “PC1”, “PC2”, “PC3”, etc., added.

evr : ndarray

An array of explained variance ratios, one per component.
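As a rough sketch of how the two return values might be used together, the function below calls `pca()` on an existing `Ag3` instance, lists the names of the added component columns, and accumulates the explained variance ratios. The `run_pca` wrapper is hypothetical, and the region and `n_snps` values are illustrative assumptions rather than recommendations.

```python
from itertools import accumulate


def run_pca(ag3, n_components=3):
    """Hypothetical wrapper: call pca() on an Ag3 instance and summarise it."""
    df_pca, evr = ag3.pca(
        region="3L:15000000-16000000",  # illustrative region (assumption)
        n_snps=10_000,  # illustrative SNP target (assumption)
        n_components=n_components,
    )
    # Names of the component columns added to the sample metadata.
    pc_columns = [f"PC{i + 1}" for i in range(n_components)]
    # Running total of explained variance across components.
    cumulative = list(accumulate(float(x) for x in evr))
    return df_pca, pc_columns, cumulative
```

Here `ag3` is assumed to be an already-instantiated `Ag3` object. The cumulative totals give a quick view of how much variance the first few components capture before plotting, for example, the “PC1” column against “PC2”.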

Notes

This computation may take some time to run, depending on your computing environment. Results of this computation will be cached and re-used if the results_cache parameter was set when instantiating the Ag3 class.