malariagen_data.af1.Af1.roh_hmm
- Af1.roh_hmm(sample: str | int, region: str | Region | Mapping, window_size: int = 20000, site_mask: str | None = 'default', sample_set: str | None = None, phet_roh: float = 0.001, phet_nonroh: Tuple[float, ...] = (0.003, 0.01), transition: float = 0.001, chunks: int | str | Tuple[int | str, ...] | Callable[[Tuple[int, ...]], int | str | Tuple[int | str, ...]] = 'native', inline_array: bool = True) → DataFrame
Infer runs of homozygosity for a single sample over a genome region.
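A minimal sketch of a call follows. The wrapper name infer_roh is hypothetical, and the keyword values are simply the documented defaults written out. The commented usage lines assume that the malariagen_data package is installed and that data access is configured; the region string is a placeholder, not a real coordinate.

```python
def infer_roh(af1, sample, region, sample_set=None):
    """Call roh_hmm with the documented default settings written out."""
    return af1.roh_hmm(
        sample=sample,
        region=region,
        window_size=20000,
        site_mask="default",
        sample_set=sample_set,
        phet_roh=0.001,
        phet_nonroh=(0.003, 0.01),
        transition=0.001,
    )


# Assumed usage; needs the malariagen_data package and access to the data:
#
#   import malariagen_data
#   af1 = malariagen_data.Af1()
#   df_roh = infer_roh(af1, sample=0, region="<contig>:<start>-<end>")
```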
Parameters
- sample : str or int
Sample identifier or index within the sample set.
- region : str or Region or Mapping
Region of the reference genome. Can be a contig name, region string (formatted like “{contig}:{start}-{end}”), or identifier of a genome feature such as a gene or transcript.
- window_size : int, optional, default: 20000
Number of sites per window.
- site_mask : str or None, optional, default: ‘default’
Which site filters mask to apply. See the site_mask_ids property for available values.
- sample_set : str or None, optional
Sample set identifier.
- phet_roh : float, optional, default: 0.001
Probability of observing a heterozygote in a ROH.
- phet_nonroh : tuple of float, optional, default: (0.003, 0.01)
One or more probabilities of observing a heterozygote outside a ROH.
- transition : float, optional, default: 0.001
Probability of moving between states. A larger window size may call for a larger transition probability.
- chunks : int or str or tuple of int or str or Callable[[typing.Tuple[int, …]], int or str or tuple of int or str], optional, default: ‘native’
Define how input data being read from zarr should be divided into chunks for a dask computation. If ‘native’, use underlying zarr chunks. If a string specifying a target memory size, e.g., ‘300 MiB’, resize chunks in arrays with more than one dimension to match this size. If ‘auto’, let dask decide chunk size. If ‘ndauto’, let dask decide chunk size but only for arrays with more than one dimension. If ‘ndauto0’, as ‘ndauto’ but only vary the first chunk dimension. If ‘ndauto1’, as ‘ndauto’ but only vary the second chunk dimension. If ‘ndauto01’, as ‘ndauto’ but only vary the first and second chunk dimensions. It can also be a tuple of integers, or a callable which accepts the native chunks as a single argument and returns a valid dask chunks value.
- inline_array : bool, optional, default: True
Passed through to dask from_array().
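To illustrate how window_size, phet_roh, phet_nonroh and transition can interact, here is a small standalone sketch of a hidden Markov model decoded with the Viterbi algorithm. It is not the library's implementation. The Poisson emission model, the uniform starting probabilities, the symmetric transition matrix and the function names viterbi_roh and roh_window_runs are all assumptions made for illustration, and the heterozygote counts are invented.

```python
import math
from typing import List, Sequence, Tuple


def viterbi_roh(
    het_counts: Sequence[int],
    window_size: int = 20000,
    phet_roh: float = 0.001,
    phet_nonroh: Tuple[float, ...] = (0.003, 0.01),
    transition: float = 0.001,
) -> List[int]:
    """Return the most likely state per window (0 = ROH, >0 = non-ROH)."""
    if not het_counts:
        return []
    # Assumption: each state emits a Poisson-distributed number of
    # heterozygous sites per window, with rate = probability * window_size.
    rates = [p * window_size for p in (phet_roh,) + tuple(phet_nonroh)]
    n = len(rates)
    stay = 1.0 - (n - 1) * transition
    if stay <= 0 or transition <= 0 or min(rates) <= 0:
        raise ValueError("probabilities must be positive and sum below 1")
    log_stay, log_move = math.log(stay), math.log(transition)

    def log_emit(state: int, count: int) -> float:
        # Log of the Poisson probability mass function.
        lam = rates[state]
        return count * math.log(lam) - lam - math.lgamma(count + 1)

    # Assumption: uniform prior over the starting state.
    score = [math.log(1.0 / n) + log_emit(s, het_counts[0]) for s in range(n)]
    back: List[List[int]] = []
    for count in het_counts[1:]:
        new_score, pointers = [], []
        for s in range(n):
            best_prev = max(
                range(n),
                key=lambda r: score[r] + (log_stay if r == s else log_move),
            )
            step = log_stay if best_prev == s else log_move
            new_score.append(score[best_prev] + step + log_emit(s, count))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def roh_window_runs(path: Sequence[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive ROH windows into (first, last) window indices."""
    runs, start = [], None
    for i, state in enumerate(path):
        if state == 0 and start is None:
            start = i
        elif state != 0 and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(path) - 1))
    return runs


# Invented counts of heterozygous sites in ten windows of 20000 sites.
counts = [60, 65, 58, 20, 18, 22, 19, 21, 200, 190]
path = viterbi_roh(counts)
print(path)
print(roh_window_runs(path))
```

With the default values, the expected number of heterozygous sites per window is 20 inside a ROH and 60 or 200 outside one, so windows with counts near 20 are assigned to the ROH state. Because the expected counts scale with window_size, the same per-site probabilities separate the states more sharply as windows get larger.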
Returns
- DataFrame
A DataFrame in which each row describes a single run of homozygosity. The columns are: sample_id, the identifier of the sample; contig, the contig; roh_start, the start of the run; roh_stop, the end of the run; roh_length, the length of the run; and roh_is_marginal, whether the run is marginal.
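The returned columns lend themselves to simple per-sample summaries. The sketch below works on plain dictionaries keyed by the documented column names so that it runs without any dependencies. The function name summarise_roh is hypothetical, the rows are invented rather than real results, and dropping runs flagged as marginal by default is a choice made here for illustration.

```python
from typing import Any, Dict, Iterable, Mapping


def summarise_roh(
    rows: Iterable[Mapping[str, Any]], include_marginal: bool = False
) -> Dict[str, int]:
    """Count runs and sum their lengths, optionally keeping marginal runs."""
    kept = [r for r in rows if include_marginal or not r["roh_is_marginal"]]
    lengths = [int(r["roh_length"]) for r in kept]
    return {
        "n_roh": len(lengths),
        "total_length": sum(lengths),
        "max_length": max(lengths, default=0),
    }


# Invented rows that use the documented column names; not real results.
rows = [
    {"sample_id": "S1", "contig": "c1", "roh_start": 1000,
     "roh_stop": 51000, "roh_length": 50000, "roh_is_marginal": False},
    {"sample_id": "S1", "contig": "c1", "roh_start": 200000,
     "roh_stop": 320000, "roh_length": 120000, "roh_is_marginal": False},
    {"sample_id": "S1", "contig": "c1", "roh_start": 970000,
     "roh_stop": 1000000, "roh_length": 30000, "roh_is_marginal": True},
]

print(summarise_roh(rows))
# {'n_roh': 2, 'total_length': 170000, 'max_length': 120000}
```

With a pandas DataFrame, rows of this shape can be obtained with df.to_dict("records").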