malariagen_data.af1.Af1.gene_cnv_frequencies_advanced

Af1.gene_cnv_frequencies_advanced(region: str | Region | Mapping | List[str | Region | Mapping] | Tuple[str | Region | Mapping, ...], area_by, period_by, sample_sets=None, sample_query=None, min_cohort_size=10, variant_query=None, drop_invariant=True, max_coverage_variance=0.2, ci_method='wilson')

Group samples by taxon, area (space) and period (time), then compute gene CNV counts and frequencies.
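The grouping step can be pictured with a small, self-contained sketch. This is not the library's implementation: the sample records, the "year" column and the helper name group_cohorts are hypothetical, chosen only to show how a (taxon, area, period) key defines a cohort and how min_cohort_size filters out small cohorts.

```python
from collections import defaultdict

# Hypothetical sample metadata rows, made up for illustration only.
samples = [
    {"sample_id": "S1", "taxon": "funestus", "admin1_iso": "MZ-P", "year": 2015},
    {"sample_id": "S2", "taxon": "funestus", "admin1_iso": "MZ-P", "year": 2015},
    {"sample_id": "S3", "taxon": "funestus", "admin1_iso": "MZ-P", "year": 2015},
    {"sample_id": "S4", "taxon": "funestus", "admin1_iso": "MZ-P", "year": 2016},
    {"sample_id": "S5", "taxon": "funestus", "admin1_iso": "KE-01", "year": 2015},
    {"sample_id": "S6", "taxon": "funestus", "admin1_iso": "KE-01", "year": 2015},
]


def group_cohorts(samples, area_by, min_cohort_size):
    """Group samples by (taxon, area, year) and drop cohorts that are too small.

    Hypothetical helper: it only mimics the idea of cohort grouping and
    assumes a yearly period stored in a "year" column.
    """
    cohorts = defaultdict(list)
    for sample in samples:
        key = (sample["taxon"], sample[area_by], sample["year"])
        cohorts[key].append(sample["sample_id"])
    # Keep only cohorts with at least min_cohort_size samples.
    return {k: v for k, v in cohorts.items() if len(v) >= min_cohort_size}


cohorts = group_cohorts(samples, area_by="admin1_iso", min_cohort_size=2)
for key, members in sorted(cohorts.items()):
    print(key, len(members))
```

With min_cohort_size=2, the single-sample cohort for 2016 is omitted, while the other two cohorts are kept.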

Parameters

region: str or list of str or Region or list of Region

Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].

area_by: str

Column name in the sample metadata to use to group samples spatially. E.g., use “admin1_iso” or “admin1_name” to group by level 1 administrative divisions, or use “admin2_name” to group by level 2 administrative divisions.

period_by: {“year”, “quarter”, “month”}

Length of time to group samples temporally.

sample_sets: str or list of str, optional

Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.

sample_query: str, optional

A pandas query string which will be evaluated against the sample metadata, e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.

min_cohort_size: int, optional

Minimum cohort size. Any cohorts with fewer samples than this are omitted.

variant_query: str, optional

A pandas query string which will be evaluated against variants.

drop_invariant: bool, optional

If True, drop any rows where there is no evidence of variation.

max_coverage_variance: float, optional

Remove samples if coverage variance exceeds this value.

ci_method: {“normal”, “agresti_coull”, “beta”, “wilson”, “binom_test”}, optional

Method to use for computing confidence intervals, passed through to statsmodels.stats.proportion.proportion_confint.
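To give a feel for the default ci_method, the following sketch computes a Wilson score interval in plain Python. It is an illustration of the standard Wilson formula, not the code path used by the library, which delegates to statsmodels; the function name wilson_interval and the example counts are hypothetical.

```python
import math


def wilson_interval(count, nobs, z=1.959963984540054):
    """Wilson score interval for a proportion count / nobs.

    Hypothetical helper for illustration; z defaults to the two-sided
    95% quantile of the standard normal distribution.
    """
    p = count / nobs
    denom = 1 + z**2 / nobs
    center = (p + z**2 / (2 * nobs)) / denom
    half_width = z * math.sqrt(p * (1 - p) / nobs + z**2 / (4 * nobs**2)) / denom
    return center - half_width, center + half_width


# Example: a CNV observed in 3 of 20 samples in a cohort (made-up numbers).
low, high = wilson_interval(count=3, nobs=20)
print(f"frequency = {3 / 20:.2f}, 95% CI = ({low:.3f}, {high:.3f})")
```

Unlike the “normal” approximation, the Wilson interval stays within [0, 1] and behaves sensibly for small cohorts and for frequencies near 0 or 1, which is a plausible reason for it being the default here.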

Returns

ds: xarray.Dataset

The resulting dataset has dimensions “cohorts” and “variants”. Variables prefixed with “cohort” are 1-dimensional arrays with data about the cohorts, such as the area, period, taxon and cohort size. Variables prefixed with “variant” are 1-dimensional arrays with data about the variants, such as the contig, position, reference and alternate alleles. Variables prefixed with “event” are 2-dimensional arrays with the allele counts and frequency calculations.