malariagen_data.af1.Af1.gene_cnv_frequencies

Af1.gene_cnv_frequencies(region: Annotated[str | Region | Mapping | List[str | Region | Mapping] | Tuple[str | Region | Mapping, ...], 'Region of the reference genome. Can be a contig name, region string (formatted like "{contig}:{start}-{end}"), or identifier of a genome feature such as a gene or transcript. Can also be a sequence (e.g., list) of regions.'], cohorts, sample_query=None, sample_query_options=None, min_cohort_size=10, sample_sets=None, drop_invariant=True, max_coverage_variance=0.2, chunks: Annotated[int | str | Tuple[int | str, ...] | Callable[[Tuple[int, ...]], int | str | Tuple[int | str, ...]], "Define how input data being read from zarr should be divided into chunks for a dask computation. If 'native', use underlying zarr chunks. If a string specifying a target memory size, e.g., '300 MiB', resize chunks in arrays with more than one dimension to match this size. If 'auto', let dask decide chunk size. If 'ndauto', let dask decide chunk size but only for arrays with more than one dimension. If 'ndauto0', as 'ndauto' but only vary the first chunk dimension. If 'ndauto1', as 'ndauto' but only vary the second chunk dimension. If 'ndauto01', as 'ndauto' but only vary the first and second chunk dimensions. Also, can be a tuple of integers, or a callable which accepts the native chunks as a single argument and returns a valid dask chunks value."] = 'native', inline_array: Annotated[bool, 'Passed through to dask `from_array()`.'] = True)

Compute modal copy number by gene, then compute the frequency of amplifications and deletions in one or more cohorts, from HMM data.
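To make the two steps concrete, here is a minimal sketch, not the library's implementation. It assumes per-window HMM copy-number states for each gene and sample, takes the mode across a gene's windows, and treats a modal copy number above 2 as an amplification and below 2 as a deletion. The diploid baseline of 2, the treatment of negative states as missing, and the helper names `modal_copy_number` and `cnv_frequencies` are all assumptions made for illustration.

```python
from collections import Counter


def modal_copy_number(window_states):
    """Return the most common copy-number state across a gene's windows.

    Negative states are treated as missing (assumption). Returns -1 when no
    window has a call. Ties are broken in favour of the smaller state.
    """
    called = [s for s in window_states if s >= 0]
    if not called:
        return -1
    counts = Counter(called)
    top = max(counts.values())
    return min(state for state, n in counts.items() if n == top)


def cnv_frequencies(modal_cn, expected_cn=2):
    """Amplification and deletion frequencies among samples with a call."""
    called = [cn for cn in modal_cn if cn >= 0]
    if not called:
        return {"amp": float("nan"), "del": float("nan")}
    n_amp = sum(cn > expected_cn for cn in called)
    n_del = sum(cn < expected_cn for cn in called)
    return {"amp": n_amp / len(called), "del": n_del / len(called)}


# Toy data: one gene, four samples, HMM states for the windows over the gene.
windows_by_sample = [
    [2, 2, 2],     # modal copy number 2 (no CNV)
    [3, 3, 2],     # modal copy number 3 (amplification)
    [1, 1, 1],     # modal copy number 1 (deletion)
    [-1, -1, -1],  # no call, excluded from the denominator
]
modal = [modal_copy_number(w) for w in windows_by_sample]
freqs = cnv_frequencies(modal)
```

In this toy cohort one of the three called samples carries an amplification and one carries a deletion, so both frequencies come out at one third.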

Parameters

region: str or list of str or Region or list of Region

Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].
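As an illustration of the coordinate format, the sketch below parses "{contig}:{start}-{end}" strings with a regular expression. The `parse_region` helper and the `Region` named tuple defined here are hypothetical stand-ins, not the library's own classes. Gene identifiers are not resolved, because that would require the genome annotation; any string without coordinates is simply treated as a contig name.

```python
import re
from typing import NamedTuple, Optional


class Region(NamedTuple):
    """Hypothetical stand-in for a genomic region."""
    contig: str
    start: Optional[int] = None
    end: Optional[int] = None


_REGION_PATTERN = re.compile(r"^([^:]+):(\d+)-(\d+)$")


def parse_region(text: str) -> Region:
    """Parse 'contig:start-end'; fall back to a whole-contig region."""
    match = _REGION_PATTERN.match(text)
    if match is None:
        # No coordinates given: treat the string as a contig name (assumption).
        return Region(text)
    contig, start, end = match.groups()
    start_pos, end_pos = int(start), int(end)
    if start_pos > end_pos:
        raise ValueError(f"start is greater than end in region {text!r}")
    return Region(contig, start_pos, end_pos)


regions = [parse_region(r) for r in ["2L:44989425-44998059", "3R"]]
```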

cohorts: str or dict

If a string, gives the name of a predefined cohort set, e.g., one of {“admin1_month”, “admin1_year”, “admin2_month”, “admin2_year”}. If a dict, should map cohort labels to sample queries, e.g., {"bf_2012_col": "country == 'Burkina Faso' and year == 2012 and taxon == 'coluzzii'"}.

sample_query: str, optional

A pandas query string which will be evaluated against the sample metadata, e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.

sample_query_options: dict, optional

A dictionary of arguments that will be passed through to pandas query() or eval(), e.g., parser, engine, local_dict, global_dict, resolvers.

min_cohort_size: int

Minimum cohort size, below which cohorts are dropped.
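The interaction between a cohorts dict and min_cohort_size can be sketched with plain pandas. This is an illustration under assumptions, not the library's code: the `samples` table is toy metadata invented for the example, and the threshold is lowered to 3 so that the small table can demonstrate both a kept and a dropped cohort.

```python
import pandas as pd

# Toy sample metadata (invented for illustration).
samples = pd.DataFrame(
    {
        "sample_id": ["S1", "S2", "S3", "S4", "S5"],
        "country": ["Burkina Faso"] * 3 + ["Mali"] * 2,
        "year": [2012, 2012, 2012, 2014, 2014],
        "taxon": ["coluzzii"] * 3 + ["gambiae"] * 2,
    }
)

# Cohort labels mapped to pandas query strings, as in the cohorts parameter.
cohorts = {
    "bf_2012_col": "country == 'Burkina Faso' and year == 2012 and taxon == 'coluzzii'",
    "ml_2014_gam": "country == 'Mali' and year == 2014 and taxon == 'gambiae'",
}

min_cohort_size = 3  # the documented default is 10; lowered for the toy data

cohort_samples = {}
for label, query in cohorts.items():
    selected = samples.query(query)
    # Cohorts smaller than the minimum size are dropped.
    if len(selected) >= min_cohort_size:
        cohort_samples[label] = selected["sample_id"].tolist()
```

Here the Mali cohort contains only two samples, so it falls below the threshold and does not appear in `cohort_samples`.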

sample_sets: str or list of str, optional

Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.

drop_invariant: bool, optional

If True, drop any rows where there is no evidence of variation.
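One plausible reading of "no evidence of variation" is that a row's frequency is zero in every cohort. The sketch below applies that reading to a list of row dicts. The `frq_` column prefix, the cohort names, and the `drop_invariant_rows` helper are hypothetical, chosen only to illustrate the filter.

```python
# Toy result rows: one row per gene and CNV type, one frequency per cohort.
rows = [
    {"gene_id": "gene_A", "cnv_type": "amp", "frq_cohort_1": 0.25, "frq_cohort_2": 0.0},
    {"gene_id": "gene_A", "cnv_type": "del", "frq_cohort_1": 0.0, "frq_cohort_2": 0.0},
    {"gene_id": "gene_B", "cnv_type": "amp", "frq_cohort_1": 0.0, "frq_cohort_2": 0.0},
    {"gene_id": "gene_B", "cnv_type": "del", "frq_cohort_1": 0.0, "frq_cohort_2": 0.5},
]


def drop_invariant_rows(rows, prefix="frq_"):
    """Keep only rows with a non-zero frequency in at least one cohort."""
    kept = []
    for row in rows:
        freqs = [value for key, value in row.items() if key.startswith(prefix)]
        if max(freqs, default=0.0) > 0:
            kept.append(row)
    return kept


variant_rows = drop_invariant_rows(rows)
kept_keys = [(r["gene_id"], r["cnv_type"]) for r in variant_rows]
```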

max_coverage_variance: float, optional

Remove samples if coverage variance exceeds this value.
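The coverage-variance filter amounts to a simple threshold on a per-sample quality metric. A minimal sketch, assuming each sample has a precomputed coverage variance (the values and sample names below are invented):

```python
max_coverage_variance = 0.2

# Hypothetical per-sample coverage variance values.
sample_coverage_variance = {"S1": 0.05, "S2": 0.31, "S3": 0.2, "S4": 0.18}

# Samples are removed only if their variance exceeds the threshold,
# so a sample exactly at the threshold is retained.
retained = [
    sample
    for sample, variance in sample_coverage_variance.items()
    if variance <= max_coverage_variance
]
```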

Returns

df: pandas.DataFrame

A dataframe of CNV amplification (amp) and deletion (del) frequencies in the specified cohorts, one row per gene and CNV type (amp/del).