malariagen_data.ag3.Ag3.snp_allele_frequencies_advanced

Ag3.snp_allele_frequencies_advanced(transcript: str, area_by: str, period_by: Literal['year', 'quarter', 'month'], sample_sets: Sequence[str] | str | None = None, sample_query: str | None = None, min_cohort_size: int = 10, drop_invariant: bool = True, variant_query: str | None = None, site_mask: str | None = None, nobs_mode: Literal['called', 'fixed'] = 'called', ci_method: Literal['normal', 'agresti_coull', 'beta', 'wilson', 'binom_test'] | None = 'wilson') → Dataset

Group samples by taxon, area (space) and period (time), then compute SNP allele frequencies.

Parameters

transcript : str

Gene transcript identifier.

area_by : str

Column name in the sample metadata to use to group samples spatially. E.g., use “admin1_iso” or “admin1_name” to group by level 1 administrative divisions, or use “admin2_name” to group by level 2 administrative divisions.

period_by : {‘year’, ‘quarter’, ‘month’}

Length of time to group samples temporally.

sample_sets : sequence of str or str or None, optional

List of sample sets and/or releases. Can also be a single sample set or release.

sample_query : str or None, optional

A pandas query string to be evaluated against the sample metadata, to select samples to be included in the returned data.

min_cohort_size : int, optional, default: 10

Minimum cohort size. Raise an error if the number of samples is less than this value.

drop_invariant : bool, optional, default: True

If True, drop variants not observed in the selected samples.

variant_query : str or None, optional

A pandas query to be evaluated against variants.

site_mask : str or None, optional

Which site filters mask to apply. See the site_mask_ids property for available values.

nobs_mode : {‘called’, ‘fixed’}, optional, default: ‘called’

Method for calculating the denominator when computing frequencies. If “called” then use the number of called alleles, i.e., number of samples with non-missing genotype calls multiplied by 2. If “fixed” then use the number of samples multiplied by 2.

ci_method : {‘normal’, ‘agresti_coull’, ‘beta’, ‘wilson’, ‘binom_test’} or None, optional, default: ‘wilson’

Method to use for computing confidence intervals, passed through to statsmodels.stats.proportion.proportion_confint.
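To make the nobs_mode and ci_method descriptions above concrete, the following is a minimal sketch in plain Python, not the library's implementation. It computes the frequency denominator for one variant in one cohort under both “called” and “fixed”, and then a Wilson score interval using the textbook formula. The toy genotype list and the helper names allele_frequency and wilson_interval are assumptions made for illustration; the library itself delegates the interval to statsmodels.stats.proportion.proportion_confint.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical toy data: number of alternate alleles per diploid sample for
# one variant in one cohort; None marks a missing genotype call.
alt_allele_calls = [0, 1, 2, None, 1, 0, None, 1, 0, 1]


def allele_frequency(calls, nobs_mode="called"):
    """Return (count, nobs, frequency) for one variant in one cohort."""
    called = [c for c in calls if c is not None]
    count = sum(called)
    if nobs_mode == "called":
        # Samples with non-missing genotype calls, multiplied by 2.
        nobs = 2 * len(called)
    elif nobs_mode == "fixed":
        # All samples in the cohort, multiplied by 2.
        nobs = 2 * len(calls)
    else:
        raise ValueError(f"unknown nobs_mode: {nobs_mode!r}")
    return count, nobs, count / nobs


def wilson_interval(count, nobs, alpha=0.05):
    """Wilson score interval for a proportion (standard textbook formula)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = count / nobs
    denom = 1 + z**2 / nobs
    centre = (p + z**2 / (2 * nobs)) / denom
    half_width = z * sqrt(p * (1 - p) / nobs + z**2 / (4 * nobs**2)) / denom
    return centre - half_width, centre + half_width


for mode in ("called", "fixed"):
    count, nobs, freq = allele_frequency(alt_allele_calls, nobs_mode=mode)
    low, upp = wilson_interval(count, nobs)
    print(f"{mode}: count={count} nobs={nobs} freq={freq:.3f} ci=({low:.3f}, {upp:.3f})")
```

With missing calls present, “fixed” yields a larger denominator and therefore a lower frequency estimate than “called” for the same allele count.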

Returns

Dataset

The resulting dataset has dimensions “cohorts” and “variants”. Variables prefixed with “cohort” are 1-dimensional arrays with data about the cohorts, such as the area, period, taxon and cohort size. Variables prefixed with “variant” are 1-dimensional arrays with data about the variants, such as the contig, position, reference and alternate alleles. Variables prefixed with “event” are 2-dimensional arrays with the allele counts and frequency calculations.
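As a hedged usage sketch, the function below calls the method with parameters taken from the signature above and then groups the returned variable names by the “cohort”, “variant” and “event” prefixes. The transcript identifier, sample set and query values are illustrative assumptions, as is the assumption that the returned Dataset is an xarray Dataset exposing dims and data_vars. The helper group_by_prefix is hypothetical and not part of malariagen_data. Running example_usage requires the malariagen_data package and access to the data.

```python
def group_by_prefix(names):
    """Group variable names by the text before the first underscore."""
    groups = {}
    for name in names:
        prefix = str(name).split("_", 1)[0]
        groups.setdefault(prefix, []).append(str(name))
    return groups


def example_usage():
    """Sketch of a call; argument values are illustrative assumptions."""
    import malariagen_data  # imported here so the helper above works without it

    ag3 = malariagen_data.Ag3()
    ds = ag3.snp_allele_frequencies_advanced(
        transcript="AGAP004707-RA",  # assumed example transcript
        area_by="admin1_iso",  # group by level 1 administrative divisions
        period_by="year",
        sample_sets="3.0",  # assumed example release
        sample_query="country == 'Burkina Faso'",  # assumed metadata column
        min_cohort_size=10,
    )
    # Assumes an xarray Dataset: inspect dimensions and variable groups.
    print(ds.dims)
    for prefix, names in group_by_prefix(ds.data_vars).items():
        print(prefix, names)
    return ds
```

Grouping by prefix is a quick way to see which variables describe cohorts, which describe variants, and which hold the 2-dimensional “event” arrays.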