malariagen_data.af1.Af1.gene_cnv_frequencies_advanced#

Af1.gene_cnv_frequencies_advanced(region, area_by, period_by, sample_sets=None, sample_query=None, sample_query_options=None, min_cohort_size=10, variant_query=None, drop_invariant=True, max_coverage_variance=0.2, ci_method='wilson', chunks='native', inline_array=True)#

Group samples by taxon, area (space) and period (time), then compute gene CNV counts and frequencies.
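
For orientation, here is a minimal usage sketch. The region string and the sample sets release identifier (“1.0”) are illustrative assumptions rather than values documented on this page.

>>> import malariagen_data
>>> af1 = malariagen_data.Af1()
>>> ds = af1.gene_cnv_frequencies_advanced(
...     region="2RL:2,000,000-3,000,000",  # assumed example region string
...     area_by="admin1_iso",              # group spatially by admin level 1
...     period_by="year",                  # group temporally by year
...     sample_sets="1.0",                 # assumed release identifier
... )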

Parameters#

region: str or list of str or Region or list of Region

Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].

area_by: str

Column name in the sample metadata to use to group samples spatially. E.g., use “admin1_iso” or “admin1_name” to group by level 1 administrative divisions, or use “admin2_name” to group by level 2 administrative divisions.

period_by: {“year”, “quarter”, “month”}

Length of time to group samples temporally.

sample_sets: str or list of str, optional

Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.

sample_query: str, optional

A pandas query string which will be evaluated against the sample metadata, e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.

sample_query_options: dict, optional

A dictionary of arguments that will be passed through to pandas query() or eval(), e.g. parser, engine, local_dict, global_dict, resolvers. See the example following this parameter list.

min_cohort_size: int, optional

Minimum cohort size. Any cohorts below this size are omitted.

variant_query: str, optional

A pandas query string which will be evaluated against variants.

drop_invariant: bool, optional

If True, drop any rows where there is no evidence of variation.

max_coverage_variance: float, optional

Remove samples if coverage variance exceeds this value.

ci_method: {“normal”, “agresti_coull”, “beta”, “wilson”, “binom_test”}, optional

Method to use for computing confidence intervals, passed through to statsmodels.stats.proportion.proportion_confint.

chunks: int or str or tuple of int or str or callable, optional

Define how input data read from zarr should be divided into chunks for a dask computation. If ‘native’, use the underlying zarr chunks. If a string specifying a target memory size, e.g., ‘300 MiB’, resize chunks in arrays with more than one dimension to match this size. If ‘auto’, let dask decide the chunk size. If ‘ndauto’, let dask decide the chunk size but only for arrays with more than one dimension. If ‘ndauto0’, as ‘ndauto’ but only vary the first chunk dimension. If ‘ndauto1’, as ‘ndauto’ but only vary the second chunk dimension. If ‘ndauto01’, as ‘ndauto’ but only vary the first and second chunk dimensions. Can also be a tuple of integers, or a callable which accepts the native chunks as a single argument and returns a valid dask chunks value.

inline_array: bool, optional

Passed through to dask from_array().
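
As a further sketch, the call below shows how several of the optional parameters above might be combined. The contig names, release identifier and country names are assumptions for illustration only.

>>> ds = af1.gene_cnv_frequencies_advanced(
...     region=["2RL", "3RL"],             # assumed contig names; data are concatenated
...     area_by="admin1_iso",
...     period_by="quarter",
...     sample_sets="1.0",                 # assumed release identifier
...     sample_query="country in @countries",
...     sample_query_options={
...         # @countries in the query resolves against this local_dict
...         "local_dict": {"countries": ["Ghana", "Mozambique"]},
...     },
...     min_cohort_size=20,                # require at least 20 samples per cohort
...     max_coverage_variance=0.2,         # drop samples with noisy coverage
...     ci_method="agresti_coull",         # alternative confidence interval method
... )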

Returns#

ds: xarray.Dataset

The resulting dataset has dimensions “cohorts” and “variants”. Variables prefixed with “cohort” are 1-dimensional arrays with data about the cohorts, such as the area, period, taxon and cohort size. Variables prefixed with “variant” are 1-dimensional arrays with data about the variants, such as the contig, position, reference and alternate alleles. Variables prefixed with “event” are 2-dimensional arrays with the allele counts and frequency calculations.
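
As an illustration of how the returned dataset might be inspected, the sketch below assumes variable names following the prefixes described above (e.g., “cohort_size”, “event_frequency”); the exact names should be checked against the returned dataset.

>>> ds.sizes            # dimension lengths, e.g. {"variants": ..., "cohorts": ...}
>>> list(ds.data_vars)  # variable names, prefixed with "cohort", "variant" or "event"
>>> ds["cohort_size"].values      # assumed name: number of samples in each cohort
>>> ds["event_frequency"].values  # assumed name: 2-D array of CNV frequencies per cohort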