Ag3 API

This page provides documentation for functions in the malariagen_data Python package for accessing Anopheles gambiae data.

Ag3()

malariagen_data.Ag3(url='gs://vo_agam_release/', **kwargs)

Provides access to data from Ag 3 releases.

Parameters
url: str

Base path to data. Give “gs://vo_agam_release/” to use Google Cloud Storage, or a local path on your file system if data have been downloaded.

**kwargs

Passed through to fsspec when setting up file system access.

Examples

Access data from Google Cloud Storage (default):

>>> import malariagen_data
>>> ag3 = malariagen_data.Ag3()

Access data downloaded to a local file system:

>>> ag3 = malariagen_data.Ag3("/local/path/to/vo_agam_release/")

sample_sets()

Ag3.sample_sets(release=None)

Access a dataframe of sample sets.

Parameters
release: str, optional

Release identifier. Give “3.0” to access the Ag1000G phase 3 data release.

Returns
df: pandas.DataFrame

A dataframe of sample sets, one row per sample set.

sample_metadata()

Ag3.sample_metadata(sample_sets=None, species_analysis='aim_20200422', cohorts_analysis='20211101')

Access sample metadata for one or more sample sets.

Parameters
sample_sets: str or list of str, optional

Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.

species_analysis: {“aim_20200422”, “pca_20200422”}, optional

Include species calls in metadata.

cohorts_analysis: str, optional

Cohort analysis identifier (date of analysis); default is the latest version. Includes sample cohort calls in metadata.

Returns
df: pandas.DataFrame

A dataframe of sample metadata, one row per sample.
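As a hedged usage sketch, the helper below counts samples per country from the dataframe returned by sample_metadata(). The helper name samples_per_country is hypothetical, and the "country" column is an assumption based on the sample query examples elsewhere on this page. It expects an already constructed Ag3 instance.

```python
def samples_per_country(ag3, sample_sets="3.0"):
    """Count samples per country for the given sample sets.

    `ag3` is expected to be a malariagen_data.Ag3 instance. The "country"
    column is an assumption taken from the query examples on this page.
    """
    df = ag3.sample_metadata(sample_sets=sample_sets)
    # value_counts() returns a Series indexed by country, largest count first.
    return df["country"].value_counts()
```

With a real instance this might be called as samples_per_country(malariagen_data.Ag3()), which reads metadata from the configured storage location.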

sample_cohorts()

Ag3.sample_cohorts(sample_sets=None, cohorts_analysis='20211101')

Access cohorts metadata for one or more sample sets.

Parameters
sample_sets: str or list of str, optional

Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.

cohorts_analysis: str

Cohort analysis identifier (date of analysis), default is the latest version.

Returns
df: pandas.DataFrame

A dataframe of cohort metadata, one row per sample.

cross_metadata()

Ag3.cross_metadata()

Load a dataframe containing metadata about samples in colony crosses, including which samples are parents or progeny in which crosses.

Returns
df: pandas.DataFrame

A dataframe of sample metadata for colony crosses.

species_calls()

Ag3.species_calls(sample_sets=None, analysis='aim_20200422')

Access species calls for one or more sample sets.

Parameters
sample_sets: str or list of str, optional

Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.

analysis: {“aim_20200422”, “pca_20200422”}

Species calling analysis.

Returns
df: pandas.DataFrame

A dataframe of species calls for one or more sample sets, one row per sample.

genome_sequence()

Ag3.genome_sequence(region, inline_array=True, chunks='native')

Access the reference genome sequence.

Parameters
region: str or list of str or Region or list of Region

Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].

inline_array: bool, optional

Passed through to dask.array.from_array().

chunks: str, optional

If ‘auto’ let dask decide chunk size. If ‘native’ use native zarr chunks. Also, can be a target size, e.g., ‘200 MiB’.

Returns
d: dask.array.Array

An array of nucleotides giving the reference genome sequence for the given region.
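The region argument accepts several textual forms. The sketch below shows one way the chromosome arm and coordinate forms could be parsed into a named tuple. The Region class and parse_region function here are hypothetical stand-ins written for illustration, not the package's own implementation. Gene names such as “AGAP007280” would need a lookup against the gene annotations and are not handled.

```python
import re
from typing import NamedTuple, Optional


class Region(NamedTuple):
    """Hypothetical stand-in for the Region named tuple described above."""

    contig: str
    start: Optional[int]
    end: Optional[int]


# Matches "contig:start-end", e.g. "2L:44989425-44998059".
_COORDS = re.compile(r"^([^:]+):(\d+)-(\d+)$")


def parse_region(text: str) -> Region:
    """Parse a chromosome arm or a coordinate string into a Region."""
    match = _COORDS.match(text)
    if match is None:
        # No coordinates given: treat the whole string as a contig name.
        return Region(text, None, None)
    contig, start, end = match.groups()
    start, end = int(start), int(end)
    if start > end:
        raise ValueError(f"start {start} is after end {end} in {text!r}")
    return Region(contig, start, end)


print(parse_region("2L:44989425-44998059"))
print(parse_region("3R"))
```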

geneset()

Ag3.geneset(region=None, attributes=('ID', 'Parent', 'Name', 'description'))

Access genome feature annotations (AgamP4.12).

Parameters
region: str or list of str or Region or list of Region

Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].

attributes: list of str, optional

Attribute keys to unpack into columns. Provide “*” to unpack all attributes.

Returns
df: pandas.DataFrame

A dataframe of genome annotations, one row per feature.
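To illustrate what unpacking attributes into columns means, here is a minimal sketch. It assumes the annotations carry a GFF3-style attributes string of key=value pairs separated by semicolons. The function name unpack_attributes and the example attribute values are hypothetical.

```python
def unpack_attributes(attr_string, keys=("ID", "Parent", "Name", "description")):
    """Unpack a GFF3-style attributes string into a dict of selected keys.

    Pass keys="*" to keep every attribute found in the string.
    """
    pairs = {}
    for item in attr_string.split(";"):
        if "=" not in item:
            continue  # skip empty or malformed fragments
        key, value = item.split("=", 1)
        pairs[key.strip()] = value.strip()
    if keys == "*":
        return pairs
    # Missing attributes are reported as None so every row has the same keys.
    return {key: pairs.get(key) for key in keys}


# Invented attribute string for illustration only.
example = "ID=AGAP007280;Name=example_gene;biotype=protein_coding"
print(unpack_attributes(example))
print(unpack_attributes(example, keys="*"))
```

Returning None for absent keys mirrors the idea that each requested attribute becomes a column, whether or not a given feature defines it.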

snp_calls()

Ag3.snp_calls(region, sample_sets=None, site_mask=None, site_filters_analysis='dt_20200416', inline_array=True, chunks='native')

Access SNP sites, site filters and genotype calls.

Parameters
region: str or list of str or Region or list of Region

Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].

sample_sets: str or list of str, optional

Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.

site_mask: {“gamb_colu_arab”, “gamb_colu”, “arab”}

Site filters mask to apply.

site_filters_analysis: str

Site filters analysis version.

inline_array: bool, optional

Passed through to dask.array.from_array().

chunks: str, optional

If ‘auto’ let dask decide chunk size. If ‘native’ use native zarr chunks. Also, can be a target size, e.g., ‘200 MiB’.

Returns
ds: xarray.Dataset

A dataset containing SNP sites, site filters and genotype calls.

snp_sites()

Ag3.snp_sites(region, field, site_mask=None, site_filters_analysis='dt_20200416', inline_array=True, chunks='native')

Access SNP site data (positions and alleles).

Parameters
region: str or list of str or Region or list of Region

Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].

field: {“POS”, “REF”, “ALT”}

Array to access.

site_mask: {“gamb_colu_arab”, “gamb_colu”, “arab”}

Site filters mask to apply.

site_filters_analysis: str

Site filters analysis version.

inline_array: bool, optional

Passed through to dask.array.from_array().

chunks: str, optional

If ‘auto’ let dask decide chunk size. If ‘native’ use native zarr chunks. Also, can be a target size, e.g., ‘200 MiB’.

Returns
d: dask.array.Array

An array of either SNP positions, reference alleles or alternate alleles.

snp_genotypes()

Ag3.snp_genotypes(region, sample_sets=None, field='GT', site_mask=None, site_filters_analysis='dt_20200416', inline_array=True, chunks='native')

Access SNP genotypes and associated data.

Parameters
region: str or list of str or Region or list of Region

Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].

sample_sets: str or list of str, optional

Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.

field: {“GT”, “GQ”, “AD”, “MQ”}

Array to access.

site_mask: {“gamb_colu_arab”, “gamb_colu”, “arab”}

Site filters mask to apply.

site_filters_analysis: str, optional

Site filters analysis version.

inline_array: bool, optional

Passed through to dask.array.from_array().

chunks: str, optional

If ‘auto’ let dask decide chunk size. If ‘native’ use native zarr chunks. Also, can be a target size, e.g., ‘200 MiB’.

Returns
d: dask.array.Array

An array of either genotypes (GT), genotype quality (GQ), allele depths (AD) or mapping quality (MQ) values.

site_filters()

Ag3.site_filters(region, mask, field='filter_pass', analysis='dt_20200416', inline_array=True, chunks='native')

Access SNP site filters.

Parameters
region: str or list of str or Region or list of Region

Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].

mask: {“gamb_colu_arab”, “gamb_colu”, “arab”}

Mask to use.

field: str, optional

Array to access.

analysis: str, optional

Site filters analysis version.

inline_array: bool, optional

Passed through to dask.array.from_array().

chunks: str, optional

If ‘auto’ let dask decide chunk size. If ‘native’ use native zarr chunks. Also, can be a target size, e.g., ‘200 MiB’.

Returns
d: dask.array.Array

An array of boolean values identifying sites that pass the filters.

is_accessible()

Ag3.is_accessible(region, site_mask, site_filters_analysis='dt_20200416')

Compute genome accessibility array.

Parameters
region: str or list of str or Region or list of Region

Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].

site_mask: {“gamb_colu_arab”, “gamb_colu”, “arab”}

Site filters mask to apply.

site_filters_analysis: str, optional

Site filters analysis version.

Returns
a: numpy.ndarray

An array of boolean values identifying accessible genome sites.

snp_effects()

Ag3.snp_effects(transcript, site_mask=None, site_filters_analysis='dt_20200416')

Compute variant effects for a gene transcript.

Parameters
transcript: str

Gene transcript ID (AgamP4.12), e.g., “AGAP004707-RA”.

site_mask: {“gamb_colu_arab”, “gamb_colu”, “arab”}, optional

Site filters mask to apply.

site_filters_analysis: str, optional

Site filters analysis version.

Returns
df: pandas.DataFrame

A dataframe of all possible SNP variants and their effects, one row per variant.

site_annotations()

Ag3.site_annotations(region, field, site_mask=None, site_filters_analysis='dt_20200416', inline_array=True, chunks='native')

Load site annotations.

Parameters
region: str or list of str or Region or list of Region

Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].

field: str

One of “codon_degeneracy”, “codon_nonsyn”, “codon_position”, “seq_cls”, “seq_flen”, “seq_relpos_start”, “seq_relpos_stop”.

site_mask: {“gamb_colu_arab”, “gamb_colu”, “arab”}

Site filters mask to apply.

site_filters_analysis: str

Site filters analysis version.

inline_array: bool, optional

Passed through to dask.array.from_array().

chunks: str, optional

If ‘auto’ let dask decide chunk size. If ‘native’ use native zarr chunks. Also, can be a target size, e.g., ‘200 MiB’.

Returns
d: dask.array.Array

An array of site annotations.

cnv_hmm()

Ag3.cnv_hmm(region, sample_sets=None, inline_array=True, chunks='native')

Access CNV HMM data from CNV calling.

Parameters
region: str or list of str or Region or list of Region

Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].

sample_sets: str or list of str, optional

Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.

inline_array: bool, optional

Passed through to dask.array.from_array().

chunks: str, optional

If ‘auto’ let dask decide chunk size. If ‘native’ use native zarr chunks. Also, can be a target size, e.g., ‘200 MiB’.

Returns
ds: xarray.Dataset

A dataset of CNV HMM calls and associated data.

cnv_coverage_calls()

Ag3.cnv_coverage_calls(region, sample_set, analysis, inline_array=True, chunks='native')

Access CNV coverage calls from genome-wide CNV discovery and filtering.

Parameters
region: str or list of str or Region or list of Region

Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].

sample_set: str

Sample set identifier.

analysis: {‘gamb_colu’, ‘arab’, ‘crosses’}

Name of CNV analysis.

inline_array: bool, optional

Passed through to dask.array.from_array().

chunks: str, optional

If ‘auto’ let dask decide chunk size. If ‘native’ use native zarr chunks. Also, can be a target size, e.g., ‘200 MiB’.

Returns
ds: xarray.Dataset

A dataset of CNV alleles and genotypes.

cnv_discordant_read_calls()

Ag3.cnv_discordant_read_calls(contig, sample_sets=None, inline_array=True, chunks='native')

Access CNV discordant read calls data.

Parameters
contig: str or list of str

Chromosome arm, e.g., “3R”. Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“2R”, “3R”].

sample_sets: str or list of str, optional

Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.

inline_array: bool, optional

Passed through to dask.array.from_array().

chunks: str, optional

If ‘auto’ let dask decide chunk size. If ‘native’ use native zarr chunks. Also, can be a target size, e.g., ‘200 MiB’.

Returns
ds: xarray.Dataset

A dataset of CNV alleles and genotypes.

gene_cnv()

Ag3.gene_cnv(region, sample_sets=None)

Compute modal copy number by gene, from HMM data.

Parameters
region: str or list of str or Region or list of Region

Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].

sample_sets: str or list of str

Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.

Returns
ds: xarray.Dataset

A dataset of modal copy number per gene and associated data.
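The idea of a modal copy number can be sketched as follows. This assumes the HMM data provide one copy number state per genomic window for each sample, and that the caller has already selected the windows overlapping a gene. The function name modal_copy_number and the tie-breaking rule are choices made for this illustration only.

```python
from collections import Counter


def modal_copy_number(window_states):
    """Return the most common copy number state among a gene's windows.

    Ties are broken in favour of the smaller copy number (an arbitrary
    choice for this sketch). Returns None if there are no windows.
    """
    if not window_states:
        return None
    counts = Counter(window_states)
    best = max(counts.values())
    return min(state for state, n in counts.items() if n == best)


# Five windows overlapping one gene in one sample (invented values).
print(modal_copy_number([2, 2, 3, 3, 3]))
```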

haplotypes()

Ag3.haplotypes(region, analysis, sample_sets=None, inline_array=True, chunks='native')

Access haplotype data.

Parameters
region: str or list of str or Region or list of Region

Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].

analysis: {“arab”, “gamb_colu”, “gamb_colu_arab”}

Which phasing analysis to use. If analysing only An. arabiensis, the “arab” analysis is best. If analysing only An. gambiae and An. coluzzii, the “gamb_colu” analysis is best. Otherwise, use the “gamb_colu_arab” analysis.

sample_sets: str or list of str, optional

Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.

inline_array: bool, optional

Passed through to dask.array.from_array().

chunks: str, optional

If ‘auto’ let dask decide chunk size. If ‘native’ use native zarr chunks. Also, can be a target size, e.g., ‘200 MiB’.

Returns
ds: xarray.Dataset

A dataset of haplotypes and associated data.
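The guidance on choosing a phasing analysis can be written as a small decision helper. This is a sketch only: the function name choose_phasing_analysis is hypothetical, and the taxon labels "gambiae", "coluzzii" and "arabiensis" are assumed spellings.

```python
def choose_phasing_analysis(taxa):
    """Pick a haplotype phasing analysis name for a collection of taxa."""
    taxa = set(taxa)
    if not taxa:
        raise ValueError("at least one taxon is required")
    if taxa == {"arabiensis"}:
        return "arab"
    if taxa <= {"gambiae", "coluzzii"}:
        return "gamb_colu"
    # Any mixture involving An. arabiensis and another taxon.
    return "gamb_colu_arab"


print(choose_phasing_analysis(["gambiae", "coluzzii"]))
```

The returned string could then be passed as the analysis argument of haplotypes().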

snp_allele_frequencies()

Ag3.snp_allele_frequencies(transcript, cohorts, sample_query=None, cohorts_analysis='20211101', min_cohort_size=10, site_mask=None, site_filters_analysis='dt_20200416', species_analysis='aim_20200422', sample_sets=None, drop_invariant=True, effects=True)

Compute per variant allele frequencies for a gene transcript.

Parameters
transcript: str

Gene transcript ID (AgamP4.12), e.g., “AGAP004707-RD”.

cohorts: str or dict

If a string, gives the name of a predefined cohort set, e.g., one of {“admin1_month”, “admin1_year”, “admin2_month”, “admin2_year”}. If a dict, should map cohort labels to sample queries, e.g., {“bf_2012_col”: “country == ‘Burkina Faso’ and year == 2012 and taxon == ‘coluzzii’”}.

sample_query: str, optional

A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.

cohorts_analysis: str

Cohort analysis version, default is the latest version.

min_cohort_size: int

Minimum cohort size. Any cohorts below this size are omitted.

site_mask: {“gamb_colu_arab”, “gamb_colu”, “arab”}

Site filters mask to apply.

site_filters_analysis: str, optional

Site filters analysis version.

species_analysis: {“aim_20200422”, “pca_20200422”}, optional

Species calls analysis version.

sample_sets: str or list of str, optional

Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.

drop_invariant: bool, optional

If True, variants with no alternate allele calls in any cohorts are dropped from the result.

effects: bool, optional

If True, add SNP effect columns.

Returns
df: pandas.DataFrame

A dataframe of SNP frequencies, one row per variant.

Notes

Cohorts with fewer samples than min_cohort_size will be excluded from output.
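To show how a cohorts dict and min_cohort_size interact, the sketch below evaluates cohort queries against a toy metadata table with pandas DataFrame.query(). The table contents, the sample_id column and the helper select_cohorts are invented for illustration and do not reproduce the package's internals.

```python
import pandas as pd

# Toy sample metadata; values are invented for illustration only.
df_samples = pd.DataFrame(
    {
        "sample_id": ["s1", "s2", "s3", "s4", "s5"],
        "country": ["Burkina Faso", "Burkina Faso", "Burkina Faso", "Mali", "Mali"],
        "year": [2012, 2012, 2012, 2012, 2014],
        "taxon": ["coluzzii", "coluzzii", "gambiae", "coluzzii", "gambiae"],
    }
)

# Cohort labels mapped to pandas query strings, as described above.
cohorts = {
    "bf_2012_col": "country == 'Burkina Faso' and year == 2012 and taxon == 'coluzzii'",
    "ml_2014_gam": "country == 'Mali' and year == 2014 and taxon == 'gambiae'",
}


def select_cohorts(df, cohorts, min_cohort_size):
    """Evaluate each cohort query and keep cohorts that are large enough."""
    selected = {}
    for label, query in cohorts.items():
        members = df.query(query)["sample_id"].tolist()
        if len(members) >= min_cohort_size:
            selected[label] = members
    return selected


selected = select_cohorts(df_samples, cohorts, min_cohort_size=2)
print(selected)
```

Here the Mali cohort contains a single sample, so it is left out when min_cohort_size is 2.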

aa_allele_frequencies()

Ag3.aa_allele_frequencies(transcript, cohorts, sample_query=None, cohorts_analysis='20211101', min_cohort_size=10, site_mask=None, site_filters_analysis='dt_20200416', species_analysis='aim_20200422', sample_sets=None, drop_invariant=True)

Compute per amino acid allele frequencies for a gene transcript.

Parameters
transcript: str

Gene transcript ID (AgamP4.12), e.g., “AGAP004707-RA”.

cohorts: str or dict

If a string, gives the name of a predefined cohort set, e.g., one of {“admin1_month”, “admin1_year”, “admin2_month”, “admin2_year”}. If a dict, should map cohort labels to sample queries, e.g., {“bf_2012_col”: “country == ‘Burkina Faso’ and year == 2012 and taxon == ‘coluzzii’”}.

sample_query: str, optional

A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.

cohorts_analysis: str

Cohort analysis identifier (date of analysis), default is the latest version.

min_cohort_size: int

Minimum cohort size, below which allele frequencies are not calculated for cohorts. Note that NaNs will be returned for any cohorts with fewer samples than min_cohort_size; these can be removed from the output dataframe using pandas df.dropna(axis='columns').

site_mask: {“gamb_colu_arab”, “gamb_colu”, “arab”}

Site filters mask to apply.

site_filters_analysis: str, optional

Site filters analysis version.

species_analysis: {“aim_20200422”, “pca_20200422”}, optional

Include species calls in metadata.

sample_sets: str or list of str, optional

Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.

drop_invariant: bool, optional

If True, variants with no alternate allele calls in any cohorts are dropped from the result.

Returns
df: pandas.DataFrame

A dataframe of amino acid allele frequencies, one row per replacement.

Notes

Cohorts with fewer samples than min_cohort_size will be excluded from output.
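The min_cohort_size description mentions removing NaN columns with pandas. A minimal sketch of that step on a toy frequency table follows. The column names and values are invented and may not match the real output.

```python
import pandas as pd

# Toy frequency table; column names and values are invented for illustration.
df_freqs = pd.DataFrame(
    {
        "frq_cohort_a": [0.10, 0.25],
        "frq_cohort_b": [float("nan"), float("nan")],  # e.g. an undersized cohort
    },
    index=["variant_1", "variant_2"],
)

# Drop every column that contains at least one NaN (the default how="any").
df_clean = df_freqs.dropna(axis="columns")
print(list(df_clean.columns))
```

Because the default is how="any", a column with even a single NaN is removed. Passing how="all" would keep partially filled columns.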

gene_cnv_frequencies()

Ag3.gene_cnv_frequencies(region, cohorts, sample_query=None, cohorts_analysis='20211101', min_cohort_size=10, species_analysis='aim_20200422', sample_sets=None, drop_invariant=True, max_coverage_variance=0.2)

Compute modal copy number by gene, then compute the frequency of amplifications and deletions in one or more cohorts, from HMM data.

Parameters
region: str or list of str or Region or list of Region

Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].

cohorts: str or dict

If a string, gives the name of a predefined cohort set, e.g., one of {“admin1_month”, “admin1_year”, “admin2_month”, “admin2_year”}. If a dict, should map cohort labels to sample queries, e.g., {“bf_2012_col”: “country == ‘Burkina Faso’ and year == 2012 and taxon == ‘coluzzii’”}.

sample_query: str, optional

A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.

cohorts_analysis: str

Cohort analysis identifier (date of analysis), default is the latest version.

min_cohort_size: int

Minimum cohort size, below which cohorts are dropped.

species_analysis: {“aim_20200422”, “pca_20200422”}, optional

Include species calls in metadata.

sample_sets: str or list of str, optional

Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.

drop_invariant: bool, optional

If True, drop any rows where there is no evidence of variation.

max_coverage_variance: float, optional

Remove samples if coverage variance exceeds this value.

Returns
df: pandas.DataFrame

A dataframe of CNV amplification (amp) and deletion (del) frequencies in the specified cohorts, one row per gene and CNV type (amp/del).
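As a rough sketch of how amplification and deletion frequencies could be derived from modal copy numbers, the function below compares each sample's value for one gene against an expected copy number. The diploid baseline of 2, the exclusion of missing calls from the denominator and the name cnv_frequencies are all assumptions of this illustration.

```python
def cnv_frequencies(modal_copy_numbers, expected=2):
    """Return (amp_frequency, del_frequency) for one gene in one cohort.

    `modal_copy_numbers` holds one modal copy number per sample; None marks
    a sample with no call, which is left out of the denominator.
    """
    called = [cn for cn in modal_copy_numbers if cn is not None]
    if not called:
        return (float("nan"), float("nan"))
    n_amp = sum(cn > expected for cn in called)
    n_del = sum(cn < expected for cn in called)
    return (n_amp / len(called), n_del / len(called))


# Five samples, one without a call (invented values).
print(cnv_frequencies([2, 3, 4, 1, None]))
```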

plot_frequencies_heatmap()

static Ag3.plot_frequencies_heatmap(df, index='label', max_len=100, x_label='Cohorts', y_label='Variants', colorbar=True, col_width=40, width=None, row_height=20, height=None, text_auto='.0%', aspect='auto', color_continuous_scale='Reds', title=True, **kwargs)

Plot a heatmap from a pandas DataFrame of frequencies, e.g., output from Ag3.snp_allele_frequencies() or Ag3.gene_cnv_frequencies(). It’s recommended to filter the input DataFrame to just rows of interest, i.e., fewer rows than max_len.

Parameters
df: pandas.DataFrame

A DataFrame of frequencies, e.g., output from snp_allele_frequencies() or gene_cnv_frequencies().

index: str or list of str

One or more column headers that are present in the input dataframe. This becomes the heatmap y-axis row labels. The column/s must produce a unique index.

max_len: int, optional

Maximum number of rows accepted in the input DataFrame. Displaying large styled dataframes may cause IPython notebooks to crash.

x_label: str, optional

This is the x-axis label that will be displayed on the heatmap.

y_label: str, optional

This is the y-axis label that will be displayed on the heatmap.

colorbar: bool, optional

If False, colorbar is not output.

col_width: int, optional

Plot width per column in pixels (px).

width: int, optional

Plot width in pixels (px), overrides col_width.

row_height: int, optional

Plot height per row in pixels (px).

height: int, optional

Plot height in pixels (px), overrides row_height.

text_auto: str, optional

Formatting for frequency values.

aspect: str, optional

Control the aspect ratio of the heatmap.

color_continuous_scale: str, optional

Color scale to use.

title: bool or str, optional

If True, attempt to use metadata from input dataset as a plot title. Otherwise, use supplied value as a title.

**kwargs

Other parameters are passed through to px.imshow().

plot_frequencies_time_series()

static Ag3.plot_frequencies_time_series(ds, height=None, width=None, title=True, **kwargs)

Create a time series plot of variant frequencies using plotly.

Parameters
ds: xarray.Dataset

A dataset of variant frequencies, such as returned by Ag3.snp_allele_frequencies_advanced(), Ag3.aa_allele_frequencies_advanced() or Ag3.gene_cnv_frequencies_advanced().

height: int, optional

Height of plot in pixels.

width: int, optional

Width of plot in pixels.

title: bool or str, optional

If True, attempt to use metadata from input dataset as a plot title. Otherwise, use supplied value as a title.

**kwargs

Passed through to px.line().

Returns
fig: plotly.graph_objects.Figure

A plotly figure containing line graphs. The resulting figure will have one panel per cohort, grouped into columns by taxon, and grouped into rows by area. Markers and lines show frequencies of variants.

plot_frequencies_interactive_map()

static Ag3.plot_frequencies_interactive_map(ds, center=(-2, 20), zoom=3, title=True, epilogue='\n            Variant frequencies are shown as coloured markers. Opacity of color\n            denotes frequency. Click on a marker for more information.\n        ')

Create an interactive map with markers showing variant frequencies for cohorts grouped by area (space), period (time) and taxon.

Parameters
ds: xarray.Dataset

A dataset of variant frequencies, such as returned by Ag3.snp_allele_frequencies_advanced(), Ag3.aa_allele_frequencies_advanced() or Ag3.gene_cnv_frequencies_advanced().

center: tuple of int, optional

Location to center the map.

zoom: int, optional

Initial zoom level.

title: bool or str, optional

If True, attempt to use metadata from input dataset as a plot title. Otherwise, use supplied value as a title.

epilogue: str, optional

Additional text to display below the map.

Returns
out: ipywidgets.Widget

An interactive map with widgets for selecting which variant, taxon and time period to display.

plot_frequencies_map_markers()

static Ag3.plot_frequencies_map_markers(m, ds, variant, taxon, period, clear=True)

Plot markers on a map showing variant frequencies for cohorts grouped by area (space), period (time) and taxon.

Parameters
m: ipyleaflet.Map

The map on which to add the markers.

ds: xarray.Dataset

A dataset of variant frequencies, such as returned by Ag3.snp_allele_frequencies_advanced(), Ag3.aa_allele_frequencies_advanced() or Ag3.gene_cnv_frequencies_advanced().

variant: int or str

Index or label of variant to plot.

taxon: str

Taxon to show markers for.

period: pd.Period

Time period to show markers for.

clear: bool, optional

If True, clear all layers (except the base layer) from the map before adding new markers.

snp_allele_frequencies_advanced()

Ag3.snp_allele_frequencies_advanced(transcript, area_by, period_by, sample_sets=None, sample_query=None, min_cohort_size=10, drop_invariant=True, variant_query=None, site_mask=None, nobs_mode='called', ci_method='wilson', cohorts_analysis='20211101', species_analysis='aim_20200422', site_filters_analysis='dt_20200416')

Group samples by taxon, area (space) and period (time), then compute SNP allele counts and frequencies.

Parameters
transcript: str

Gene transcript ID (AgamP4.12), e.g., “AGAP004707-RD”.

area_by: str

Column name in the sample metadata to use to group samples spatially. E.g., use “admin1_iso” or “admin1_name” to group by level 1 administrative divisions, or use “admin2_name” to group by level 2 administrative divisions.

period_by: {“year”, “quarter”, “month”}

Length of time to group samples temporally.

sample_sets: str or list of str, optional

Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.

sample_query: str, optional

A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.

min_cohort_size: int, optional

Minimum cohort size. Any cohorts below this size are omitted.

drop_invariant: bool, optional

If True, variants with no alternate allele calls in any cohorts are dropped from the result.

variant_query: str, optional

site_mask: str, optional

Site filters mask to apply.

nobs_mode: {“called”, “fixed”}

Method for calculating the denominator when computing frequencies. If “called” then use the number of called alleles, i.e., number of samples with non-missing genotype calls multiplied by 2. If “fixed” then use the number of samples multiplied by 2.

ci_method: {“normal”, “agresti_coull”, “beta”, “wilson”, “binom_test”}, optional

Method to use for computing confidence intervals, passed through to statsmodels.stats.proportion.proportion_confint.

cohorts_analysis: str, optional

Cohort analysis version, default is the latest version.

species_analysis: str, optional

Species calls analysis version.

site_filters_analysis: str, optional

Site filters analysis version.

Returns
ds: xarray.Dataset

The resulting dataset has dimensions “cohorts” and “variants”. Variables prefixed with “cohort” are 1-dimensional arrays with data about the cohorts, such as the area, period, taxon and cohort size. Variables prefixed with “variant” are 1-dimensional arrays with data about the variants, such as the contig, position, reference and alternate alleles. Variables prefixed with “event” are 2-dimensional arrays with the allele counts and frequency calculations.
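The difference between the two nobs_mode settings can be illustrated with a short sketch. It assumes diploid genotypes stored as pairs of allele indices with -1 marking a missing call. The function name allele_frequency and this encoding are assumptions made for the example.

```python
def allele_frequency(genotypes, allele, nobs_mode="called"):
    """Frequency of `allele` among diploid genotype calls.

    `genotypes` is a list of (allele, allele) pairs, one per sample, with
    -1 marking a missing call (an assumption of this sketch).
    """
    called = [gt for gt in genotypes if -1 not in gt]
    count = sum(gt.count(allele) for gt in called)
    if nobs_mode == "called":
        # Denominator: samples with non-missing calls, times 2.
        nobs = 2 * len(called)
    elif nobs_mode == "fixed":
        # Denominator: all samples, times 2.
        nobs = 2 * len(genotypes)
    else:
        raise ValueError(f"unknown nobs_mode: {nobs_mode!r}")
    return count / nobs if nobs else float("nan")


# Four samples, one with a missing genotype call (invented values).
genotypes = [(0, 1), (1, 1), (-1, -1), (0, 0)]
print(allele_frequency(genotypes, allele=1, nobs_mode="called"))
print(allele_frequency(genotypes, allele=1, nobs_mode="fixed"))
```

With missing calls present, the "fixed" denominator is larger, so the estimated frequency is lower than in "called" mode.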

aa_allele_frequencies_advanced()

Ag3.aa_allele_frequencies_advanced(transcript, area_by, period_by, sample_sets=None, sample_query=None, min_cohort_size=10, variant_query=None, site_mask=None, nobs_mode='called', ci_method='wilson', cohorts_analysis='20211101', species_analysis='aim_20200422', site_filters_analysis='dt_20200416')

Group samples by taxon, area (space) and period (time), then compute amino acid change allele counts and frequencies.

Parameters
transcript: str

Gene transcript ID (AgamP4.12), e.g., “AGAP004707-RD”.

area_by: str

Column name in the sample metadata to use to group samples spatially. E.g., use “admin1_iso” or “admin1_name” to group by level 1 administrative divisions, or use “admin2_name” to group by level 2 administrative divisions.

period_by: {“year”, “quarter”, “month”}

Length of time to group samples temporally.

sample_sets: str or list of str, optional

Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.

sample_query: str, optional

A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.

min_cohort_size: int, optional

Minimum cohort size. Any cohorts below this size are omitted.

variant_query: str, optional

site_mask: str, optional

Site filters mask to apply.

nobs_mode: {“called”, “fixed”}

Method for calculating the denominator when computing frequencies. If “called” then use the number of called alleles, i.e., number of samples with non-missing genotype calls multiplied by 2. If “fixed” then use the number of samples multiplied by 2.

ci_method: {“normal”, “agresti_coull”, “beta”, “wilson”, “binom_test”}, optional

Method to use for computing confidence intervals, passed through to statsmodels.stats.proportion.proportion_confint.

cohorts_analysis: str, optional

Cohort analysis version, default is the latest version.

species_analysis: str, optional

Species calls analysis version.

site_filters_analysis: str, optional

Site filters analysis version.

Returns
ds: xarray.Dataset

The resulting dataset has dimensions “cohorts” and “variants”. Variables prefixed with “cohort” are 1-dimensional arrays with data about the cohorts, such as the area, period, taxon and cohort size. Variables prefixed with “variant” are 1-dimensional arrays with data about the variants, such as the contig, position, reference and alternate alleles. Variables prefixed with “event” are 2-dimensional arrays with the allele counts and frequency calculations.
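For readers unfamiliar with the default ci_method, the following sketch computes a Wilson score interval by hand using only the standard library. It is an independent illustration of the formula, not the statsmodels implementation that the method delegates to. The function name wilson_interval and the 95% default level are choices made here.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(count, nobs, alpha=0.05):
    """Wilson score interval for a binomial proportion count / nobs."""
    if nobs <= 0:
        raise ValueError("nobs must be positive")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = count / nobs
    denom = 1 + z**2 / nobs
    centre = (p + z**2 / (2 * nobs)) / denom
    half_width = z * sqrt(p * (1 - p) / nobs + z**2 / (4 * nobs**2)) / denom
    return (centre - half_width, centre + half_width)


# Five copies of an allele observed among ten called alleles.
lower, upper = wilson_interval(5, 10)
print(round(lower, 4), round(upper, 4))
```

Unlike the "normal" approximation, the Wilson interval stays within [0, 1] and behaves sensibly for small cohorts and frequencies near 0 or 1.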

gene_cnv_frequencies_advanced()

Ag3.gene_cnv_frequencies_advanced(region, area_by, period_by, sample_sets=None, sample_query=None, min_cohort_size=10, variant_query=None, drop_invariant=True, max_coverage_variance=0.2, ci_method='wilson', cohorts_analysis='20211101', species_analysis='aim_20200422')

Group samples by taxon, area (space) and period (time), then compute gene CNV counts and frequencies.

Parameters
region: str or list of str or Region or list of Region

Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].

area_bystr

Column name in the sample metadata to use to group samples spatially. E.g., use “admin1_iso” or “admin1_name” to group by level 1 administrative divisions, or use “admin2_name” to group by level 2 administrative divisions.

period_by{“year”, “quarter”, “month”}

Length of time to group samples temporally.

sample_setsstr or list of str, optional

Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.

sample_querystr, optional

A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.

min_cohort_sizeint, optional

Minimum cohort size. Any cohorts below this size are omitted.

variant_querystr, optional

A pandas query string which will be evaluated against the variants.

drop_invariantbool, optional

If True, drop any rows where there is no evidence of variation.

max_coverage_variancefloat, optional

Remove samples if coverage variance exceeds this value.

ci_method{“normal”, “agresti_coull”, “beta”, “wilson”, “binom_test”}, optional

Method to use for computing confidence intervals, passed through to statsmodels.stats.proportion.proportion_confint.

cohorts_analysisstr, optional

Cohort analysis version, default is the latest version.

species_analysisstr, optional

Species calls analysis version.

Returns
dsxarray.Dataset

The resulting dataset has dimensions “cohorts” and “variants”. Variables prefixed with “cohort” are 1-dimensional arrays with data about the cohorts, such as the area, period, taxon and cohort size. Variables prefixed with “variant” are 1-dimensional arrays with data about the variants, such as the contig, position, reference and alternate alleles. Variables prefixed with “event” are 2-dimensional arrays with the allele counts and frequency calculations.
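
Examples

The grouping of samples into cohorts by taxon, area and period, followed by the min_cohort_size filter, can be pictured with plain Python. This is only a sketch of the idea under stated assumptions: the sample records, the “year” and “month” field names, the period label formats and the helper names are all hypothetical and are not taken from the package.

```python
from collections import Counter

# Hypothetical sample metadata records (values invented for illustration).
samples = (
    [{"taxon": "coluzzii", "admin1_iso": "BF-09", "year": 2012, "month": 7}] * 3
    + [{"taxon": "coluzzii", "admin1_iso": "BF-09", "year": 2012, "month": 10}] * 2
    + [{"taxon": "gambiae", "admin1_iso": "BF-09", "year": 2012, "month": 7}] * 1
    + [{"taxon": "gambiae", "admin1_iso": "ML-2", "year": 2014, "month": 8}] * 4
)


def period_label(sample, period_by):
    """Label the time period a sample falls into (label format is assumed)."""
    year, month = sample["year"], sample["month"]
    if period_by == "year":
        return f"{year}"
    if period_by == "quarter":
        return f"{year}Q{(month - 1) // 3 + 1}"
    if period_by == "month":
        return f"{year}-{month:02d}"
    raise ValueError(f"unknown period_by: {period_by!r}")


def cohort_sizes(samples, area_by, period_by, min_cohort_size):
    """Count samples per (taxon, area, period) and omit small cohorts."""
    sizes = Counter(
        (s["taxon"], s[area_by], period_label(s, period_by)) for s in samples
    )
    # Cohorts below the minimum size are omitted.
    return {cohort: n for cohort, n in sizes.items() if n >= min_cohort_size}


for period_by in ("year", "quarter"):
    print(period_by, cohort_sizes(samples, "admin1_iso", period_by, 3))
```

Grouping by a shorter period splits the same samples into more, smaller cohorts, so more of them may fall below min_cohort_size and drop out of the result.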

plot_genes()

Ag3.plot_genes(region, width=800, height=120, show=True, toolbar_location='above', x_range=None, title='Genes')

Plot a genes track, using bokeh.

Parameters
regionstr

Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”) or genomic region defined with coordinates (e.g., “2L:44989425-44998059”).

widthint, optional

Plot width in pixels (px).

heightint, optional

Plot height in pixels (px).

showbool, optional

If true, show the plot.

toolbar_locationstr, optional

Location of bokeh toolbar.

x_rangebokeh.models.Range1d, optional

X axis range (for linking to other tracks).

titlestr, optional

Plot title.

Returns
figFigure

Bokeh figure.

plot_transcript()

Ag3.plot_transcript(transcript, width=700, height=120, show=True, x_range=None, toolbar_location='above', title=True)

Plot a transcript, using bokeh.

Parameters
transcriptstr

Transcript identifier, e.g., “AGAP004707-RD”.

widthint, optional

Plot width in pixels (px).

heightint, optional

Plot height in pixels (px).

showbool, optional

If true, show the plot.

toolbar_locationstr, optional

Location of bokeh toolbar.

x_rangebokeh.models.Range1d, optional

X axis range (for linking to other tracks).

titlestr, optional

Plot title.

Returns
figFigure

Bokeh figure.
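
Examples

The x_range parameter of plot_genes() and plot_transcript() is described as a way to link tracks. A possible pattern, sketched here under assumptions, is to create one track with show=False and pass its X axis range to the other. The helper name linked_gene_and_transcript_tracks is hypothetical, and the sketch assumes that the returned bokeh figure exposes its range as the x_range attribute.

```python
def linked_gene_and_transcript_tracks(ag3, region, transcript):
    """Build a transcript track and a genes track that share one X axis range.

    `ag3` is expected to be a malariagen_data.Ag3 instance (or any object
    offering plot_genes() and plot_transcript() with the documented
    parameters).
    """
    # show=False so nothing is rendered until the tracks are combined.
    fig_genes = ag3.plot_genes(region=region, show=False)
    fig_transcript = ag3.plot_transcript(
        transcript=transcript,
        x_range=fig_genes.x_range,  # reuse the genes track's range
        show=False,
    )
    return [fig_transcript, fig_genes]
```

With a real Ag3 instance, the returned figures could then be stacked, for example with bokeh.layouts.gridplot(figs, ncols=1), and displayed with bokeh.plotting.show(), so that panning or zooming one track moves the other.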

plot_cnv_hmm_coverage()

Ag3.plot_cnv_hmm_coverage(sample, sample_set, region, y_max='auto', width=800, track_height=170, genes_height=100, circle_kwargs=None, line_kwargs=None, show=True)

Plot CNV HMM data for a single sample, together with a genes track, using bokeh.

Parameters
samplestr or int

Sample identifier or index within sample set.

sample_setstr

Sample set identifier.

regionstr

Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”) or genomic region defined with coordinates (e.g., “2L:44989425-44998059”).

y_maxstr or int, optional

Maximum Y axis value.

widthint, optional

Plot width in pixels (px).

track_heightint, optional

Height of CNV HMM track in pixels (px).

genes_heightint, optional

Height of genes track in pixels (px).

circle_kwargsdict, optional

Passed through to bokeh circle() function.

line_kwargsdict, optional

Passed through to bokeh line() function.

showbool, optional

If true, show the plot.

Returns
figFigure

Bokeh figure.

plot_cnv_hmm_heatmap()

Ag3.plot_cnv_hmm_heatmap(region, sample_sets=None, sample_query=None, width=800, row_height=3, track_height=None, genes_height=100, show=True, species_analysis='aim_20200422', cohorts_analysis='20211101')

Plot CNV HMM data for multiple samples as a heatmap, with a genes track, using bokeh.

Parameters
regionstr

Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”) or genomic region defined with coordinates (e.g., “2L:44989425-44998059”).

sample_setsstr or list of str, optional

Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.

sample_querystr, optional

A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.

widthint, optional

Plot width in pixels (px).

row_heightint, optional

Plot height per row (sample) in pixels (px).

track_heightint, optional

Absolute plot height for HMM track in pixels (px), overrides row_height.

genes_heightint, optional

Height of genes track in pixels (px).

showbool, optional

If true, show the plot.

species_analysis{“aim_20200422”, “pca_20200422”}, optional

Include species calls in metadata.

cohorts_analysisstr, optional

Cohort analysis identifier (date of analysis), default is the latest version.

Returns
figFigure

Bokeh figure.

resolve_region()

Ag3.resolve_region(region)

Convert a genome region into a standard data structure.

Parameters
region: str

Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”) or genomic region defined with coordinates (e.g., “2L:44989425-44998059”).

Returns
outRegion

A named tuple with attributes contig, start and end.
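
Examples

To make the returned structure concrete, here is a minimal sketch of turning a coordinate string into a named tuple with contig, start and end. It is an illustration only, not the package's implementation: the function name parse_region, the use of None for the start and end of a whole chromosome arm, and the contig list are assumptions, and gene names are not handled because resolving them requires a genome annotation lookup.

```python
import re
from typing import NamedTuple, Optional


class Region(NamedTuple):
    """Illustrative stand-in for the named tuple described above."""
    contig: str
    start: Optional[int]
    end: Optional[int]


# Chromosome arms assumed for this sketch.
CONTIGS = ("2R", "2L", "3R", "3L", "X")

# Matches strings such as "2L:44989425-44998059".
_REGION_PATTERN = re.compile(r"^([^:]+):(\d+)-(\d+)$")


def parse_region(region: str) -> Region:
    """Parse a chromosome arm or a 'contig:start-end' string."""
    match = _REGION_PATTERN.match(region)
    if match:
        contig, start, end = match.group(1), int(match.group(2)), int(match.group(3))
        if contig not in CONTIGS:
            raise ValueError(f"unknown contig: {contig!r}")
        if start > end:
            raise ValueError("region start must not exceed region end")
        return Region(contig, start, end)
    if region in CONTIGS:
        # Whole chromosome arm: no coordinate bounds (an assumption here).
        return Region(region, None, None)
    raise ValueError(
        f"cannot parse {region!r}; gene names need an annotation lookup"
    )


region = parse_region("2L:44989425-44998059")
print(region.contig, region.start, region.end)
```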