malariagen_data.ag3.Ag3.biallelic_diplotypes

Ag3.biallelic_diplotypes(region: str | Region | Mapping | List[str | Region | Mapping] | Tuple[str | Region | Mapping, ...], sample_sets: Sequence[str] | str | None = None, sample_query: str | None = None, sample_indices: List[int] | None = None, site_mask: str | None = None, site_class: str | None = None, cohort_size: int | None = None, min_cohort_size: int | None = None, max_cohort_size: int | None = None, random_seed: int = 42, min_minor_ac: int | None = None, max_missing_an: int | None = None, n_snps: int | None = None, thin_offset: int = 0, inline_array: bool = True, chunks: str | Tuple[int, ...] | Callable[[Tuple[int, ...]], Tuple[int, ...]] = 'native') → Tuple[ndarray, ndarray]

Load biallelic SNP genotypes.

Parameters

region : str or Region or Mapping or list of str or Region or Mapping or tuple of str or Region or Mapping

Region of the reference genome. Can be a contig name, region string (formatted like “{contig}:{start}-{end}”), or identifier of a genome feature such as a gene or transcript. Can also be a sequence (e.g., list) of regions.
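
To make the region string format concrete, here is a minimal sketch that splits a "{contig}:{start}-{end}" string into its parts. The helper `parse_region_string` is hypothetical and is not part of malariagen_data; the library performs its own region resolution, including lookup of gene and transcript identifiers, which this sketch does not attempt.

```python
import re

# Hypothetical helper for illustration only; not part of malariagen_data.
_REGION_PATTERN = re.compile(r"^(?P<contig>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region_string(region):
    """Split "{contig}:{start}-{end}" into (contig, start, end).

    Any other string (e.g., a bare contig name or a feature identifier)
    is returned unchanged with start and end set to None.
    """
    match = _REGION_PATTERN.match(region)
    if match is None:
        return region, None, None
    return match["contig"], int(match["start"]), int(match["end"])


print(parse_region_string("3L:15000000-16000000"))
print(parse_region_string("2R"))
```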

sample_sets : sequence of str or str or None, optional

List of sample sets and/or releases. Can also be a single sample set or release.

sample_query : str or None, optional

A pandas query string to be evaluated against the sample metadata, to select samples to be included in the returned data.

sample_indices : list of int or None, optional

Advanced usage parameter. A list of indices of samples to select, corresponding to the order in which the samples are found within the sample metadata. Either provide this parameter or sample_query, not both.

site_mask : str or None, optional

Which site filters mask to apply. See the site_mask_ids property for available values.

site_class : str or None, optional

Select sites belonging to one of the following classes: CDS_DEG_4 (4-fold degenerate coding sites), CDS_DEG_2_SIMPLE (2-fold simple degenerate coding sites), CDS_DEG_0 (non-degenerate coding sites), INTRON_SHORT (introns shorter than 100 bp), INTRON_LONG (introns longer than 200 bp), INTRON_SPLICE_5PRIME (intron within 2 bp of 5’ splice site), INTRON_SPLICE_3PRIME (intron within 2 bp of 3’ splice site), UTR_5PRIME (5’ untranslated region), UTR_3PRIME (3’ untranslated region), INTERGENIC (intergenic, more than 10 kbp from a gene).

cohort_size : int or None, optional

Randomly down-sample to this value if the number of samples in the cohort is greater. Raise an error if the number of samples is less than this value.

min_cohort_size : int or None, optional

Minimum cohort size. Raise an error if the number of samples is less than this value.

max_cohort_size : int or None, optional

Randomly down-sample to this value if the number of samples in the cohort is greater.

random_seed : int, optional, default: 42

Random seed used for reproducible down-sampling.
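
The interaction between cohort_size, min_cohort_size, max_cohort_size and random_seed can be sketched in plain Python. This is an illustration of the documented semantics under stated assumptions, not the library's implementation: the function `downsample_cohort` is hypothetical, and the exact random procedure used by malariagen_data may differ.

```python
import random


def downsample_cohort(sample_ids, cohort_size=None, min_cohort_size=None,
                      max_cohort_size=None, random_seed=42):
    """Hypothetical sketch of the documented cohort size rules."""
    n = len(sample_ids)
    if cohort_size is not None:
        # cohort_size behaves as both a minimum and a maximum.
        min_cohort_size = cohort_size
        max_cohort_size = cohort_size
    if min_cohort_size is not None and n < min_cohort_size:
        raise ValueError(f"only {n} samples, need at least {min_cohort_size}")
    if max_cohort_size is not None and n > max_cohort_size:
        # A seeded generator makes the random selection reproducible.
        rng = random.Random(random_seed)
        keep = sorted(rng.sample(range(n), max_cohort_size))
        return [sample_ids[i] for i in keep]
    return list(sample_ids)


samples = [f"S{i}" for i in range(10)]
first = downsample_cohort(samples, max_cohort_size=4)
second = downsample_cohort(samples, max_cohort_size=4)
print(first == second)  # same seed, same selection
```

Passing a different random_seed draws a different subset, which is one way to check that a result does not depend on a particular selection of samples.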

min_minor_ac : int or None, optional

The minimum minor allele count. SNPs with a minor allele count below this value will be excluded.

max_missing_an : int or None, optional

The maximum number of missing allele calls to accept. SNPs with more than this value will be excluded. Set to 0 to require no missing calls.
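
As a rough illustration of the two site filters, the sketch below applies min_minor_ac and max_missing_an to a small, made-up set of diploid genotype calls. The function `filter_biallelic_sites` and the data are hypothetical, and the encoding (0 for reference, 1 for alternate, -1 for a missing allele call) is an assumption made for this example.

```python
def filter_biallelic_sites(genotypes, min_minor_ac=None, max_missing_an=None):
    """Return the indices of sites that pass both filters (hypothetical sketch).

    genotypes: list of sites, each a list of (allele, allele) calls,
    with 0 = reference, 1 = alternate, -1 = missing (assumed encoding).
    """
    kept = []
    for index, calls in enumerate(genotypes):
        alleles = [allele for call in calls for allele in call]
        n_missing = sum(1 for allele in alleles if allele < 0)
        # For a biallelic site the minor allele is the rarer of the two.
        minor_ac = min(alleles.count(0), alleles.count(1))
        if min_minor_ac is not None and minor_ac < min_minor_ac:
            continue
        if max_missing_an is not None and n_missing > max_missing_an:
            continue
        kept.append(index)
    return kept


example_genotypes = [
    [(0, 0), (0, 1), (1, 1)],    # minor allele count 3, no missing calls
    [(0, 0), (0, 0), (0, 1)],    # minor allele count 1, no missing calls
    [(0, 1), (-1, -1), (1, 1)],  # minor allele count 1, 2 missing allele calls
]
print(filter_biallelic_sites(example_genotypes, min_minor_ac=2))
print(filter_biallelic_sites(example_genotypes, max_missing_an=0))
```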

n_snps : int or None, optional

The desired number of SNPs to use when running the analysis. SNPs will be evenly thinned to approximately this number.

thin_offset : int, optional, default: 0

Starting index for SNP thinning. Change this to repeat the analysis using a different set of SNPs.
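
One plausible way to thin sites evenly is to take every k-th site starting from thin_offset, with the step chosen so that roughly n_snps sites remain. The sketch below assumes that scheme; `thin_indices` is a hypothetical helper and the library's own thinning logic may differ in detail.

```python
def thin_indices(n_sites, n_snps, thin_offset=0):
    """Pick evenly spaced site indices (hypothetical sketch).

    The step is the integer ratio of available sites to requested sites,
    so the number of selected sites is only approximately n_snps.
    """
    step = max(n_sites // n_snps, 1)
    return list(range(thin_offset, n_sites, step))


print(thin_indices(10, 3))                 # [0, 3, 6, 9]
print(thin_indices(10, 3, thin_offset=1))  # [1, 4, 7]
```

Under this scheme, changing thin_offset shifts the selection to a different set of sites, which matches the documented use of repeating an analysis with different SNPs.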

inline_array : bool, optional, default: True

Passed through to dask from_array().

chunks : str or tuple of int or Callable[[typing.Tuple[int, …]], tuple of int], optional, default: ‘native’

If ‘auto’, let dask decide the chunk size. If ‘native’, use native zarr chunks. Can also be a target size, e.g., ‘200 MiB’, or a tuple of integers.

Returns

gn : ndarray

An array of shape (variants, samples) where each value counts the number of alternate alleles per genotype call.
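
To show what "counts the number of alternate alleles per genotype call" means, the sketch below converts made-up diploid calls into values of 0, 1 or 2. The function `count_alt_alleles` is hypothetical, and the input encoding (0 for reference, 1 for alternate) is an assumption for this example; how the library represents missing calls in the returned array is not covered here.

```python
def count_alt_alleles(genotypes):
    """Count alternate alleles per call (hypothetical sketch).

    genotypes: list of sites, each a list of (allele, allele) calls.
    Returns a nested list laid out as (variants, samples).
    Only alleles greater than 0 are counted as alternate.
    """
    return [
        [sum(1 for allele in call if allele > 0) for call in calls]
        for calls in genotypes
    ]


calls_by_site = [
    [(0, 0), (0, 1), (1, 1)],  # three samples at the first site
    [(0, 1), (0, 0), (1, 0)],  # the same three samples at the second site
]
print(count_alt_alleles(calls_by_site))  # [[0, 1, 2], [1, 0, 1]]
```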

samples : ndarray

Sample identifiers.
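
A call might look like the following sketch. The specific argument values are assumptions chosen for illustration: the region, the sample set "AG1000G-BF-A", the query on a taxon column, and the "gamb_colu" site mask should all be checked against the data release you are using (see the site_mask_ids property for valid masks). Running the function requires the malariagen_data package and access to the data, so the import is kept inside the function body.

```python
def load_example_diplotypes():
    """Load diplotypes for one region (illustrative values, see assumptions above)."""
    # Imported here so that defining this function does not require
    # the package to be installed or the data to be reachable.
    import malariagen_data

    ag3 = malariagen_data.Ag3()
    gn, samples = ag3.biallelic_diplotypes(
        region="3L:15000000-16000000",       # assumed example region
        sample_sets="AG1000G-BF-A",          # assumed sample set identifier
        sample_query="taxon == 'coluzzii'",  # assumed metadata column and value
        site_mask="gamb_colu",               # assumed site mask identifier
        max_cohort_size=50,
        min_minor_ac=2,
        max_missing_an=0,
        n_snps=10_000,
    )
    return gn, samples
```

Because gn has shape (variants, samples), its second dimension is expected to match the length of samples.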