malariagen_data.af1.Af1.cohort_diversity_stats
- Af1.cohort_diversity_stats(cohort: str | Tuple[str, str], cohort_size: int, region: str | Region | Mapping | List[str | Region | Mapping] | Tuple[str | Region | Mapping, ...], min_cohort_size: int | None = None, max_cohort_size: int | None = None, site_mask: str | None = 'default', site_class: str | None = None, sample_sets: Sequence[str] | str | None = None, random_seed: int = 42, n_jack: int = 200, confidence_level: float = 0.95) → Series
Compute genetic diversity summary statistics for a cohort of individuals.
Parameters
- cohort : str or tuple[str, str]
Either a string giving one of the predefined cohort labels, or a pair of strings giving a custom cohort label and a sample query.
- cohort_size : int
Randomly down-sample to this value if the number of samples in the cohort is greater. Raise an error if the number of samples is less than this value.
- region : str or Region or Mapping or list of str or Region or Mapping or tuple of str or Region or Mapping
Region of the reference genome. Can be a contig name, region string (formatted like “{contig}:{start}-{end}”), or identifier of a genome feature such as a gene or transcript. Can also be a sequence (e.g., list) of regions.
- min_cohort_size : int or None, optional
Minimum cohort size. Raise an error if the number of samples is less than this value.
- max_cohort_size : int or None, optional
Randomly down-sample to this value if the number of samples in the cohort is greater.
- site_mask : str or None, optional, default: ‘default’
Which site filter mask to apply. See the site_mask_ids property for available values.
- site_class : str or None, optional
Select sites belonging to one of the following classes: CDS_DEG_4 (4-fold degenerate coding sites), CDS_DEG_2_SIMPLE (2-fold simple degenerate coding sites), CDS_DEG_0 (non-degenerate coding sites), INTRON_SHORT (introns shorter than 100 bp), INTRON_LONG (introns longer than 200 bp), INTRON_SPLICE_5PRIME (intron within 2 bp of 5’ splice site), INTRON_SPLICE_3PRIME (intron within 2 bp of 3’ splice site), UTR_5PRIME (5’ untranslated region), UTR_3PRIME (3’ untranslated region), INTERGENIC (intergenic, more than 10 kbp from a gene).
- sample_sets : sequence of str or str or None, optional
List of sample sets and/or releases. Can also be a single sample set or release.
- random_seed : int, optional, default: 42
Random seed used for reproducible down-sampling.
- n_jack : int, optional, default: 200
Number of blocks to divide the data into for the block jackknife estimation of confidence intervals. N.B., larger is not necessarily better.
- confidence_level : float, optional, default: 0.95
Confidence level to use for confidence interval calculation. E.g., 0.95 means 95% confidence interval.
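To make the roles of n_jack and confidence_level more concrete, here is a minimal sketch of a generic delete-one block jackknife. It is not taken from the malariagen_data source, and the library's own procedure may differ in detail. The function name block_jackknife_ci, its arguments values and statistic, and the normal-approximation interval are all assumptions made for illustration.

```python
import numpy as np
from statistics import NormalDist


def block_jackknife_ci(values, statistic, n_jack=200, confidence_level=0.95):
    """Generic delete-one block jackknife (illustrative, hypothetical helper).

    values    : 1-D array of per-site values (assumed input layout)
    statistic : function mapping an array to a single float
    Returns (estimate, standard_error, ci_low, ci_high).
    """
    values = np.asarray(values, dtype=float)
    # Split the data into n_jack contiguous blocks of near-equal size.
    blocks = np.array_split(values, n_jack)
    # Recompute the statistic n_jack times, leaving one block out each time.
    leave_one_out = np.array([
        statistic(np.concatenate(blocks[:i] + blocks[i + 1:]))
        for i in range(n_jack)
    ])
    estimate = statistic(values)
    # Jackknife standard error: sqrt((n - 1) / n * sum of squared deviations).
    n = n_jack
    deviations = leave_one_out - leave_one_out.mean()
    se = np.sqrt((n - 1) / n * np.sum(deviations ** 2))
    # Two-sided interval from a normal approximation (an assumption here).
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    return estimate, se, estimate - z * se, estimate + z * se


# Example with simulated per-site values (not real data).
rng = np.random.default_rng(42)
per_site = rng.random(10_000)
print(block_jackknife_ci(per_site, np.mean, n_jack=200, confidence_level=0.95))
```

In this sketch, raising n_jack makes each block shorter. Short blocks of neighbouring sites are less likely to behave as independent units, which is one plausible reading of the note that larger is not necessarily better.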
Returns
- Series
A pandas Series containing the summary statistics and their confidence intervals.
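A call might look like the following sketch. The parameter names come from the signature above, but the specific values are assumptions chosen for illustration: the cohort label "ghana_example", the sample query "country == 'Ghana'", and the region "3RL:1000000-2000000" are hypothetical and may need adjusting to the data actually available. Running the call requires the malariagen_data package and access to the data, so the helper takes an already constructed Af1 object.

```python
# Keyword arguments for the call; every value here is an illustrative assumption.
CALL_KWARGS = dict(
    cohort=("ghana_example", "country == 'Ghana'"),  # custom label + sample query
    cohort_size=20,                 # down-sample to 20; error if fewer samples
    region="3RL:1000000-2000000",   # "{contig}:{start}-{end}" region string
    site_class="CDS_DEG_4",         # restrict to 4-fold degenerate coding sites
    random_seed=42,                 # reproducible down-sampling
    n_jack=200,                     # number of jackknife blocks
    confidence_level=0.95,          # 95% confidence intervals
)


def run_example(af1):
    """Run the call on an Af1 instance and return the resulting Series.

    `af1` is expected to be created with `malariagen_data.Af1()`.
    """
    return af1.cohort_diversity_stats(**CALL_KWARGS)


# Typical use (commented out because it needs the package and data access):
# import malariagen_data
# af1 = malariagen_data.Af1()
# stats = run_example(af1)
# print(stats)
```

Printing the returned Series is the simplest way to see which statistics and interval bounds it contains.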