Ag3#

This page provides a curated list of functions and properties available in the malariagen_data API for data on mosquitoes from the Anopheles gambiae complex.

To set up the API, use the following code:

import malariagen_data
ag3 = malariagen_data.Ag3()

All the functions below can then be accessed as methods on the ag3 object. For example, to call the sample_metadata() function, use:

df_samples = ag3.sample_metadata()
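
As a minimal sketch of building on such a method call, the helper below counts samples per taxon from whatever sample_metadata() returns. The `taxon` column name and the `OfflineStandIn` class are assumptions made for illustration; the stand-in only exists so the sketch runs without network access.

```python
from collections import Counter


def taxon_counts(api, sample_query=None):
    """Count samples per taxon using the sample_metadata() method of `api`."""
    df = api.sample_metadata(sample_query=sample_query)
    # Iterating over a column yields one value per sample.
    return dict(Counter(df["taxon"]))


class OfflineStandIn:
    """Hypothetical stand-in for the ag3 object, so the sketch runs offline."""

    def sample_metadata(self, sample_query=None):
        # Assumed column name "taxon"; values are made up.
        return {"taxon": ["gambiae", "coluzzii", "gambiae"]}


counts = taxon_counts(OfflineStandIn())
print(counts)
```

With the real API, the ag3 object created above would be passed in place of the stand-in, optionally with a sample_query string to restrict the samples.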

For more information about the data and terms of use, please see the MalariaGEN Anopheles gambiae genomic surveillance project home page.

Basic data access#

`releases`	Currently available data releases.
`sample_sets`([release])	Access a dataframe of sample sets.
`lookup_release`(sample_set)	Find which release a sample set was included in.
`lookup_study`(sample_set)	Find which study a sample set belongs to.

Reference genome data access#

`contigs`	Contigs in the reference genome.
`genome_sequence`(region[, inline_array, chunks])	Access the reference genome sequence.
`genome_features`([region, attributes])	Access genome feature annotations.
`plot_transcript`(transcript[, sizing_mode, ...])	Plot a transcript, using bokeh.
`plot_genes`(region[, sizing_mode, width, ...])	Plot a genes track, using bokeh.

Sample metadata access#

`sample_metadata`([sample_sets, sample_query, ...])	Access sample metadata for one or more sample sets.
`add_extra_metadata`(data[, on])	Add extra sample metadata, e.g., including additional columns which you would like to use to query and group samples.
`clear_extra_metadata`()	Clear any extra metadata previously added.
`count_samples`([sample_sets, sample_query, ...])	Create a pivot table showing numbers of samples available by space, time and taxon.
`lookup_sample`(sample[, sample_set])	Get the metadata for a specific sample and sample set.
`plot_samples_bar`(x[, color, sort, ...])	Plot a bar chart showing the number of samples available, grouped by some variable such as country or year.
`plot_samples_interactive_map`([sample_sets, ...])	Plot an interactive map showing sampling locations using ipyleaflet.
`wgs_data_catalog`(sample_set)	Load a data catalog providing URLs for downloading BAM, VCF and Zarr files for samples in a given sample set.
`cohorts`(cohort_set)	Read data for a specific cohort set, including cohort size, country code, taxon, administrative units name, ISO code, geoBoundaries shape ID and representative latitude and longitude points.
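
To illustrate the kind of pivot table that count_samples() describes (sample numbers by space, time and taxon), here is a small sketch in plain Python. The records and the field names `country`, `year` and `taxon` are hypothetical, and the function is not the library's implementation.

```python
from collections import Counter

# Hypothetical sample records; field names and values are assumptions.
samples = [
    {"country": "Ghana", "year": 2012, "taxon": "gambiae"},
    {"country": "Ghana", "year": 2012, "taxon": "coluzzii"},
    {"country": "Ghana", "year": 2012, "taxon": "gambiae"},
    {"country": "Mali", "year": 2014, "taxon": "coluzzii"},
]


def pivot_counts(records, index=("country", "year"), column="taxon"):
    """Count records per (index values, column value) and lay them out as a table."""
    cells = Counter((tuple(r[k] for k in index), r[column]) for r in records)
    rows = sorted({row for row, _ in cells})
    columns = sorted({col for _, col in cells})
    # Fill cells with no samples with zero so every row has every column.
    return {row: {col: cells.get((row, col), 0) for col in columns} for row in rows}


table = pivot_counts(samples)
for row, counts in table.items():
    print(row, counts)
```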

SNP data access#

`site_mask_ids`	Identifiers for the different site masks that are available.
`snp_calls`(region[, sample_sets, ...])	Access SNP sites, site filters and genotype calls.
`snp_allele_counts`(region[, sample_sets, ...])	Compute SNP allele counts.
`plot_snps`(region[, sample_sets, ...])	Plot SNPs in a given genome region.
`site_annotations`(region[, site_mask, ...])	Load site annotations.
`is_accessible`(region[, site_mask, ...])	Compute genome accessibility array.
`biallelic_snp_calls`(region[, sample_sets, ...])	Access SNP calls at sites which are biallelic within the selected samples.
`biallelic_diplotypes`(region[, sample_sets, ...])	Load biallelic SNP genotypes.
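
The relationship between genotype calls and allele counts can be sketched as follows. This is an illustration under assumed conventions (allele index 0 for the reference allele, 1 to 3 for alternates, -1 for a missing call), not the code behind snp_allele_counts().

```python
def count_alleles(genotypes, max_allele=3):
    """Count alleles per variant.

    genotypes[v][s] is a tuple of allele indices for variant v and sample s;
    a negative index marks a missing allele call and is skipped.
    """
    counts = []
    for variant in genotypes:
        row = [0] * (max_allele + 1)
        for call in variant:
            for allele in call:
                if allele >= 0:
                    row[allele] += 1
        counts.append(row)
    return counts


# Two variants, three diploid samples (made-up data).
genotypes = [
    [(0, 0), (0, 1), (1, 1)],
    [(0, 2), (-1, -1), (0, 0)],
]
allele_counts = count_alleles(genotypes)
print(allele_counts)
```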

Haplotype data access#

`phasing_analysis_ids`	Identifiers for the different phasing analyses that are available.
`haplotypes`(region[, analysis, sample_sets, ...])	Access haplotype data.
`haplotype_sites`(region, field[, analysis, ...])	Access haplotype site data (positions or alleles).

AIM data access#

`aim_ids`	Identifiers for the different sets of ancestry informative markers (AIMs) that are available.
`aim_variants`(aims)	Access ancestry informative marker variants.
`aim_calls`(aims[, sample_sets, sample_query])	Access ancestry informative marker SNP sites, alleles and genotype calls.
`plot_aim_heatmap`(aims[, sample_sets, ...])	Plot a heatmap of ancestry-informative marker (AIM) genotypes.

CNV data access#

`coverage_calls_analysis_ids`	Identifiers for the different coverage calls analyses that are available.
`cnv_hmm`(region[, sample_sets, sample_query, ...])	Access CNV HMM data from CNV calling.
`cnv_coverage_calls`(region, sample_set, analysis)	Access CNV coverage calls data from genome-wide CNV discovery and filtering.
`cnv_discordant_read_calls`(contig[, ...])	Access CNV discordant read calls data.
`plot_cnv_hmm_coverage`(sample, region[, ...])	Plot CNV HMM data for a single sample, together with a genes track, using bokeh.
`plot_cnv_hmm_heatmap`(region[, sample_sets, ...])	Plot CNV HMM data for multiple samples as a heatmap, with a genes track, using bokeh.
`gene_cnv`(region[, sample_sets, ...])	Compute modal copy number by gene, from HMM data.
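
The idea of a modal copy number by gene can be sketched as taking the most common HMM copy number state among the windows that overlap a gene. The window layout, coordinates and tie handling below are assumptions for illustration, not the behaviour of gene_cnv().

```python
from collections import Counter


def modal_copy_number(windows, gene_start, gene_end):
    """Return the most common copy number state over windows overlapping a gene.

    windows is a list of (start, end, cn_state) tuples with inclusive coordinates.
    Returns None if no window overlaps the gene.
    """
    states = [
        cn for start, end, cn in windows
        if start <= gene_end and end >= gene_start
    ]
    if not states:
        return None
    return Counter(states).most_common(1)[0][0]


# Made-up HMM output for one sample: five consecutive windows.
windows = [(1, 300, 2), (301, 600, 3), (601, 900, 3), (901, 1200, 3), (1201, 1500, 2)]
cn = modal_copy_number(windows, gene_start=350, gene_end=1000)
print(cn)
```

For a diploid genome, a modal state above 2 would suggest an amplification and a state below 2 a deletion.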

Integrative genomics viewer (IGV)#

`igv`(region[, tracks, init])	Create an IGV browser and inject into the current notebook.
`view_alignments`(region, sample[, ...])	Launch IGV and view sequence read alignments and SNP genotypes from the given sample.

SNP and CNV frequency analysis#

`snp_allele_frequencies`(transcript, cohorts)	Compute SNP allele frequencies for a gene transcript.
`snp_allele_frequencies_advanced`(transcript, ...)	Group samples by taxon, area (space) and period (time), then compute SNP allele frequencies.
`aa_allele_frequencies`(transcript, cohorts[, ...])	Compute amino acid substitution frequencies for a gene transcript.
`aa_allele_frequencies_advanced`(transcript, ...)	Group samples by taxon, area (space) and period (time), then compute amino acid change allele frequencies.
`gene_cnv_frequencies`(region, cohorts[, ...])	Compute modal copy number by gene, then compute the frequency of amplifications and deletions in one or more cohorts, from HMM data.
`gene_cnv_frequencies_advanced`(region, ...[, ...])	Group samples by taxon, area (space) and period (time), then compute gene CNV counts and frequencies.
`plot_frequencies_heatmap`(df[, index, ...])	Plot a heatmap from a pandas DataFrame of frequencies, e.g., output from snp_allele_frequencies() or gene_cnv_frequencies().
`plot_frequencies_time_series`(ds[, height, ...])	Create a time series plot of variant frequencies using plotly.
`plot_frequencies_interactive_map`(ds[, ...])	Create an interactive map with markers showing variant frequencies or cohorts grouped by area (space), period (time) and taxon.
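
As a rough illustration of what a per-cohort allele frequency means, the sketch below computes the frequency of non-reference alleles at a single variant within each cohort. The cohort names and genotype data are made up, and missing calls are assumed to be coded as -1.

```python
def alt_frequency(calls):
    """Frequency of non-reference alleles among the called alleles of a cohort."""
    alleles = [a for call in calls for a in call if a >= 0]
    if not alleles:
        return float("nan")
    return sum(1 for a in alleles if a > 0) / len(alleles)


# Genotype calls at one variant, grouped by hypothetical cohort.
cohorts = {
    "cohort_a": [(0, 1), (1, 1), (0, 0), (-1, -1)],
    "cohort_b": [(0, 0), (0, 0)],
}
freqs = {name: alt_frequency(calls) for name, calls in cohorts.items()}
print(freqs)
```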

Principal components analysis (PCA)#

`pca`(region, n_snps[, n_components, ...])	Run a principal components analysis (PCA) using biallelic SNPs from the selected genome region and samples.
`plot_pca_variance`(evr[, width, height, ...])	Plot explained variance ratios from a principal components analysis (PCA) using a plotly bar plot.
`plot_pca_coords`(data[, x, y, color, symbol, ...])	Plot sample coordinates from a principal components analysis (PCA) as a plotly scatter plot.
`plot_pca_coords_3d`(data[, x, y, z, color, ...])	Plot sample coordinates from a principal components analysis (PCA) as a plotly 3D scatter plot.

Genetic distance and neighbour-joining trees (NJT)#

`plot_njt`(region, n_snps[, color, symbol, ...])	Plot an unrooted neighbour-joining tree, computed from pairwise distances between samples using biallelic SNP genotypes.
`biallelic_diplotype_pairwise_distances`(...)	Compute pairwise distances between samples using biallelic SNP genotypes.
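
A pairwise distance between samples from biallelic SNP genotypes can be sketched by recoding each genotype as the number of alternate alleles it carries and then comparing samples. The city-block (Manhattan) metric used here is an assumption chosen for illustration, and the sketch assumes no missing calls.

```python
from itertools import combinations


def to_alt_counts(calls):
    """Convert biallelic genotype calls such as (0, 1) to alt allele counts (0, 1 or 2)."""
    return [sum(call) for call in calls]


def cityblock(d1, d2):
    """Sum of absolute differences between two diplotype vectors."""
    return sum(abs(a - b) for a, b in zip(d1, d2))


# Made-up genotypes for three samples at three biallelic SNPs.
samples = {
    "s1": [(0, 0), (0, 1), (1, 1)],
    "s2": [(0, 1), (0, 1), (0, 0)],
    "s3": [(1, 1), (0, 0), (0, 0)],
}
diplotypes = {name: to_alt_counts(calls) for name, calls in samples.items()}
distances = {
    (a, b): cityblock(diplotypes[a], diplotypes[b])
    for a, b in combinations(sorted(diplotypes), 2)
}
print(distances)
```

A distance matrix of this kind is the input that a neighbour-joining algorithm works from.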

Heterozygosity analysis#

`plot_heterozygosity`(sample, region[, ...])	Plot windowed heterozygosity for a single sample over a genome region.
`roh_hmm`(sample, region[, window_size, ...])	Infer runs of homozygosity for a single sample over a genome region.
`plot_roh`(sample, region[, window_size, ...])	Plot windowed heterozygosity and inferred runs of homozygosity for a single sample over a genome region.
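
Windowed heterozygosity can be sketched as the fraction of heterozygous sites in consecutive, non-overlapping windows along the genome. Defining windows by a fixed number of sites is an assumption here, and the data are made up; the library's own windowing may differ.

```python
def windowed_heterozygosity(positions, is_het, window_size):
    """Return (start_pos, end_pos, het_fraction) for each full window of sites."""
    out = []
    for i in range(0, len(positions) - window_size + 1, window_size):
        window = is_het[i:i + window_size]
        out.append(
            (positions[i], positions[i + window_size - 1], sum(window) / window_size)
        )
    return out


# One sample: site positions and whether the genotype at each site is heterozygous.
positions = [10, 20, 30, 40, 50, 60, 70, 80]
is_het = [True, False, False, True, False, False, False, False]
result = windowed_heterozygosity(positions, is_het, window_size=4)
print(result)
```

A long stretch of windows with heterozygosity at or near zero is the kind of signal that a runs-of-homozygosity method looks for.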

Diversity analysis#

`cohort_diversity_stats`(cohort, cohort_size, ...)	Compute genetic diversity summary statistics for a cohort of individuals.
`diversity_stats`(cohorts, cohort_size, region)	Compute genetic diversity summary statistics for multiple cohorts.
`plot_diversity_stats`(df_stats[, color, ...])	Plot diversity summary statistics for multiple cohorts.
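
Two widely used diversity summary statistics are nucleotide diversity and Watterson's theta. The sketch below computes both from alternate allele counts at biallelic sites, using their textbook definitions. Which statistics the functions above actually report is not stated on this page, so treat this as background rather than a description of their output.

```python
def diversity(alt_counts, n_chromosomes, n_sites):
    """Return (pi, theta_w) per site.

    alt_counts: alternate allele count at each biallelic site.
    n_chromosomes: number of sampled chromosomes (2 x number of diploid samples).
    n_sites: number of sites considered, including invariant ones.
    """
    n = n_chromosomes
    # Nucleotide diversity: mean number of pairwise differences.
    pairwise = sum(2 * k * (n - k) / (n * (n - 1)) for k in alt_counts)
    # Watterson's theta: number of segregating sites scaled by a harmonic number.
    a_n = sum(1 / i for i in range(1, n))
    segregating = sum(1 for k in alt_counts if 0 < k < n)
    return pairwise / n_sites, segregating / a_n / n_sites


pi, theta_w = diversity([1, 2, 3], n_chromosomes=4, n_sites=100)
print(pi, theta_w)
```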

Genome-wide selection scans#

`h12_calibration`(contig[, analysis, ...])	Generate H12 GWSS calibration data for different window sizes.
`plot_h12_calibration`(contig[, analysis, ...])	Plot H12 GWSS calibration data for different window sizes.
`h12_gwss`(contig, window_size[, analysis, ...])	Run an H12 genome-wide selection scan.
`plot_h12_gwss`(contig, window_size[, ...])	Plot H12 GWSS data.
`h1x_gwss`(contig, window_size, cohort1_query, ...)	Run an H1X genome-wide scan to detect genome regions with shared selective sweeps between two cohorts.
`plot_h1x_gwss`(contig, window_size, ...[, ...])	Run and plot an H1X genome-wide scan to detect genome regions with shared selective sweeps between two cohorts.
`g123_calibration`(contig[, sites, site_mask, ...])	Generate G123 GWSS calibration data for different window sizes.
`plot_g123_calibration`(contig, sites[, ...])	Plot G123 GWSS calibration data for different window sizes.
`g123_gwss`(contig, window_size[, sites, ...])	Run a G123 genome-wide selection scan.
`plot_g123_gwss`(contig, window_size[, sites, ...])	Plot G123 GWSS data.
`ihs_gwss`(contig[, analysis, sample_sets, ...])	Run iHS GWSS.
`plot_ihs_gwss`(contig[, analysis, ...])	Run and plot iHS GWSS data.
`xpehh_gwss`(contig[, analysis, sample_sets, ...])	Run XP-EHH GWSS.
`plot_xpehh_gwss`(contig[, analysis, ...])	Run and plot XP-EHH GWSS data.
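
For readers unfamiliar with H12, the statistic is computed within a window from haplotype frequencies: the two most common haplotypes are pooled, and the squared frequencies are then summed. The sketch below follows that standard definition on made-up haplotypes; it is not the library's implementation, and how windows are chosen is left out.

```python
from collections import Counter


def h12(haplotypes):
    """H12 for one window: (p1 + p2)^2 + sum of p_i^2 for i > 2."""
    counts = Counter(haplotypes)
    n = len(haplotypes)
    freqs = sorted((c / n for c in counts.values()), reverse=True)
    top_two = sum(freqs[:2])
    return top_two ** 2 + sum(f ** 2 for f in freqs[2:])


# Eight haplotypes in one window, each written as a string of alleles.
window = ["0101", "0101", "0101", "0110", "0110", "1111", "0000", "1010"]
value = h12(window)
print(value)
```

G123 is the analogous statistic for unphased data: it pools the three most common multi-locus genotypes instead of the two most common haplotypes.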

Haplotype clustering and network analysis#

`plot_haplotype_clustering`(region[, ...])	Hierarchically cluster haplotypes in region and produce an interactive plot.
`plot_haplotype_network`(region[, analysis, ...])	Construct a median-joining haplotype network and display it using Cytoscape.
`haplotype_pairwise_distances`(region[, ...])	Compute pairwise distances between haplotypes.

Fst analysis#

`average_fst`(region, cohort1_query, cohort2_query)	Compute average Hudson's Fst between two specified cohorts.
`pairwise_average_fst`(region, cohorts[, ...])	Compute pairwise average Hudson's Fst between a set of specified cohorts.
`plot_pairwise_average_fst`(fst_df[, ...])	Plot a heatmap of pairwise average Fst values.
`fst_gwss`(contig, window_size, cohort1_query, ...)	Run an Fst genome-wide scan to investigate genetic differentiation between two cohorts.
`plot_fst_gwss`(contig, window_size, ...[, ...])	Run and plot an Fst genome-wide scan to investigate genetic differentiation between two cohorts.
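
Hudson's Fst for two cohorts is commonly estimated per SNP as a numerator and a denominator, with the average over a region taken as a ratio of sums rather than a mean of per-SNP ratios. The sketch below follows that formulation on made-up allele counts at biallelic SNPs; it is an illustration, not the code behind the functions above.

```python
def hudson_fst_terms(ac1, ac2):
    """Numerator and denominator of Hudson's Fst at one biallelic SNP.

    ac1 and ac2 are (ref_count, alt_count) tuples for the two cohorts.
    """
    n1, n2 = sum(ac1), sum(ac2)
    p1, p2 = ac1[1] / n1, ac2[1] / n2
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num, den


def average_hudson_fst(counts1, counts2):
    """Average Fst over SNPs as the ratio of summed numerators to summed denominators."""
    terms = [hudson_fst_terms(a, b) for a, b in zip(counts1, counts2)]
    return sum(n for n, _ in terms) / sum(d for _, d in terms)


# Two SNPs: one fixed for different alleles, one with equal frequencies.
cohort1 = [(10, 0), (5, 5)]
cohort2 = [(0, 10), (5, 5)]
fst = average_hudson_fst(cohort1, cohort2)
print(round(fst, 4))
```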