Ag3#

This page provides a curated list of functions and properties available in the malariagen_data API for data on mosquitoes from the Anopheles gambiae complex.

To set up the API, use the following code:

import malariagen_data
ag3 = malariagen_data.Ag3()

All the functions below can then be accessed as methods on the ag3 object. For example, to call the sample_metadata() function:

df_samples = ag3.sample_metadata()

For more information about the data and terms of use, please see the MalariaGEN Anopheles gambiae genomic surveillance project home page.

Basic data access#

releases

Currently available data releases.

sample_sets([release])

Access a dataframe of sample sets.

lookup_release(sample_set)

Find which release a sample set was included in.

lookup_study(sample_set)

Find which study a sample set belongs to.

Reference genome data access#

contigs

Contigs in the reference genome.

genome_sequence(region[, inline_array, chunks])

Access the reference genome sequence.

genome_features([region, attributes])

Access genome feature annotations.

plot_transcript(transcript[, sizing_mode, ...])

Plot a transcript, using bokeh.

plot_genes(region[, sizing_mode, width, ...])

Plot a genes track, using bokeh.

Sample metadata access#

sample_metadata([sample_sets, sample_query, ...])

Access sample metadata for one or more sample sets.

add_extra_metadata(data[, on])

Add extra sample metadata, e.g., including additional columns which you would like to use to query and group samples.

clear_extra_metadata()

Clear any extra metadata previously added.

cross_metadata()

Load a dataframe containing metadata about samples in colony crosses, including which samples are parents or progeny in which crosses.

count_samples([sample_sets, sample_query, ...])

Create a pivot table showing numbers of samples available by space, time and taxon.

lookup_sample(sample[, sample_set])

Get the metadata for a specific sample and sample set.

plot_samples_bar(x[, color, sort, ...])

Plot a bar chart showing the number of samples available, grouped by some variable such as country or year.

plot_samples_interactive_map([sample_sets, ...])

Plot an interactive map showing sampling locations using ipyleaflet.

plot_sample_location_mapbox(*, sample_sets)

Plot markers on a map showing sample locations as a Mapbox scatter plot.

plot_sample_location_geo(*, sample_sets[, ...])

Plot markers on a map showing sample locations as a geographic scatter plot.

wgs_data_catalog(sample_set)

Load a data catalog providing URLs for downloading BAM, VCF and Zarr files for samples in a given sample set.

cohorts(cohort_set)

Read data for a specific cohort set, including cohort size, country code, taxon, administrative units name, ISO code, geoBoundaries shape ID and representative latitude and longitude points.
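
To give a feel for the kind of table that count_samples() describes, the sketch below pivots a few made-up sample records by country and year against taxon, using only the Python standard library. The records, the field names and the pivot_counts helper are hypothetical and are not part of the malariagen_data API; real metadata would come from ag3.sample_metadata().

```python
from collections import Counter

# Hypothetical sample records, invented for illustration only.
samples = [
    {"country": "Ghana", "year": 2012, "taxon": "gambiae"},
    {"country": "Ghana", "year": 2012, "taxon": "coluzzii"},
    {"country": "Ghana", "year": 2012, "taxon": "gambiae"},
    {"country": "Mali", "year": 2014, "taxon": "coluzzii"},
]


def pivot_counts(records, index=("country", "year"), columns="taxon"):
    """Count records per (index, column) combination, as a nested dict."""
    counts = Counter(
        (tuple(r[k] for k in index), r[columns]) for r in records
    )
    table = {}
    for (row_key, col_key), n in counts.items():
        table.setdefault(row_key, {})[col_key] = n
    return table


table = pivot_counts(samples)
```

Each key of table is a (country, year) pair, and each value maps a taxon to the number of samples in that group.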

SNP data access#

site_mask_ids

Identifiers for the different site masks that are available.

snp_calls(region[, sample_sets, ...])

Access SNP sites, site filters and genotype calls.

snp_allele_counts(region[, sample_sets, ...])

Compute SNP allele counts.

plot_snps(region[, sample_sets, ...])

Plot SNPs in a given genome region.

site_annotations(region[, site_mask, ...])

Load site annotations.

is_accessible(region[, site_mask, ...])

Compute genome accessibility array.

biallelic_snp_calls(region[, sample_sets, ...])

Access SNP calls at sites which are biallelic within the selected samples.

biallelic_diplotypes(region[, sample_sets, ...])

Load biallelic SNP genotypes.

biallelic_snps_to_plink(output_dir, region, ...)

Write Anopheles biallelic SNP data to the Plink binary file format.
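
As a conceptual aid for snp_allele_counts(), the following sketch counts alleles per site from a small, made-up set of diploid genotype calls. The nested-list layout, the use of -1 for a missing call and the allele_counts helper are assumptions made for illustration; this is not the library's implementation.

```python
def allele_counts(genotypes, n_alleles=4):
    """Count alleles per site.

    genotypes[site][sample] is a pair of allele indices, with -1 marking
    a missing call (an encoding assumed here for illustration).
    """
    counts = []
    for site in genotypes:
        ac = [0] * n_alleles
        for call in site:
            for allele in call:
                if allele >= 0:  # skip missing calls
                    ac[allele] += 1
        counts.append(ac)
    return counts


# Two sites, three samples; the second sample is missing at the second site.
genotypes = [
    [(0, 0), (0, 1), (1, 1)],
    [(0, 2), (-1, -1), (0, 0)],
]
ac = allele_counts(genotypes)
```

Each row of ac holds the number of observations of the reference allele followed by each alternate allele at that site.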

Haplotype data access#

phasing_analysis_ids

Identifiers for the different phasing analyses that are available.

haplotypes(region[, analysis, sample_sets, ...])

Access haplotype data.

haplotype_sites(region, field[, analysis, ...])

Access haplotype site data (positions or alleles).

AIM data access#

aim_ids

Identifiers for the different sets of ancestry informative markers (AIMs) that are available.

aim_variants(aims)

Access ancestry informative marker variants.

aim_calls(aims[, sample_sets, sample_query, ...])

Access ancestry informative marker SNP sites, alleles and genotype calls.

plot_aim_heatmap(aims[, sample_sets, ...])

Plot a heatmap of ancestry-informative marker (AIM) genotypes.

CNV data access#

coverage_calls_analysis_ids

Identifiers for the different coverage calls analyses that are available.

cnv_hmm(region[, sample_sets, sample_query, ...])

Access CNV HMM data from CNV calling.

cnv_coverage_calls(region, sample_set, analysis)

Access CNV coverage calls data from genome-wide CNV discovery and filtering.

cnv_discordant_read_calls(contig[, ...])

Access CNV discordant read calls data.

plot_cnv_hmm_coverage(sample, region[, ...])

Plot CNV HMM data for a single sample, together with a genes track, using bokeh.

plot_cnv_hmm_heatmap(region[, sample_sets, ...])

Plot CNV HMM data for multiple samples as a heatmap, with a genes track, using bokeh.

gene_cnv(region[, sample_sets, ...])

Compute modal copy number by gene, from HMM data.
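
The idea of a modal copy number per gene, as described for gene_cnv(), can be sketched as follows: take the copy number state of each window overlapping the gene and report the most common one. The window tuples, coordinates and the modal_copy_number helper are hypothetical, and the tie-breaking rule (first state encountered) is an assumption rather than the library's behaviour.

```python
from collections import Counter


def modal_copy_number(windows, gene_start, gene_end):
    """Most common copy number among windows overlapping a gene.

    windows is a list of (start, end, copy_number) tuples with inclusive
    coordinates. Returns None if no window overlaps the gene.
    """
    overlapping = [
        cn
        for start, end, cn in windows
        if start <= gene_end and end >= gene_start
    ]
    if not overlapping:
        return None
    # most_common() breaks ties by first occurrence.
    return Counter(overlapping).most_common(1)[0][0]


# Made-up per-window copy number states along a contig.
windows = [(1, 300, 2), (301, 600, 3), (601, 900, 3), (901, 1200, 2)]
cn = modal_copy_number(windows, gene_start=250, gene_end=850)
```

Here three windows overlap the gene, two of them with copy number 3, so cn is 3.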

Integrative genomics viewer (IGV)#

igv(region[, tracks, init])

Create an IGV browser and inject into the current notebook.

view_alignments(region, sample[, ...])

Launch IGV and view sequence read alignments and SNP genotypes from the given sample.

SNP and CNV frequency analysis#

snp_allele_frequencies(transcript, cohorts)

Compute SNP allele frequencies for a gene transcript.

snp_allele_frequencies_advanced(transcript, ...)

Group samples by taxon, area (space) and period (time), then compute SNP allele frequencies.

aa_allele_frequencies(transcript, cohorts[, ...])

Compute amino acid substitution frequencies for a gene transcript.

aa_allele_frequencies_advanced(transcript, ...)

Group samples by taxon, area (space) and period (time), then compute amino acid substitution frequencies.

gene_cnv_frequencies(region, cohorts[, ...])

Compute modal copy number by gene, then compute the frequency of amplifications and deletions in one or more cohorts, from HMM data.

gene_cnv_frequencies_advanced(region, ...[, ...])

Group samples by taxon, area (space) and period (time), then compute gene CNV counts and frequencies.

haplotypes_frequencies(region, cohorts[, ...])

Compute haplotype frequencies for a region.

haplotypes_frequencies_advanced(region, ...)

Group samples by taxon, area (space) and period (time), then compute haplotype frequencies.

plot_frequencies_heatmap(df[, index, ...])

Plot a heatmap from a pandas DataFrame of frequencies, e.g., output from snp_allele_frequencies() or gene_cnv_frequencies().

plot_frequencies_time_series(ds[, height, ...])

Create a time series plot of variant frequencies using plotly.

plot_frequencies_interactive_map(ds[, ...])

Create an interactive map with markers showing variant frequencies or cohorts grouped by area (space), period (time) and taxon.
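
The frequency functions in this section all reduce to the same basic calculation: within each cohort, divide the number of observations of a variant by the number of called alleles. A minimal sketch of that calculation for a single biallelic SNP is given below. The cohort labels, the genotype encoding (with -1 for missing) and the cohort_allele_frequency helper are illustrative assumptions, not part of the API.

```python
def cohort_allele_frequency(calls_by_cohort, allele=1):
    """Frequency of one allele in each cohort, ignoring missing calls."""
    freqs = {}
    for cohort, calls in calls_by_cohort.items():
        alleles = [a for call in calls for a in call if a >= 0]
        if alleles:
            freqs[cohort] = alleles.count(allele) / len(alleles)
        else:
            freqs[cohort] = float("nan")  # no called alleles in this cohort
    return freqs


# Made-up diploid genotype calls at one site, grouped by cohort.
calls_by_cohort = {
    "cohort_a": [(0, 1), (1, 1), (0, 0), (-1, -1)],
    "cohort_b": [(0, 0), (0, 0)],
}
freqs = cohort_allele_frequency(calls_by_cohort)
```

The missing call in cohort_a is excluded from the denominator, so the frequency there is computed over six alleles rather than eight.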

Principal components analysis (PCA)#

pca(region, n_snps[, n_components, ...])

Run a principal components analysis (PCA) using biallelic SNPs from the selected genome region and samples.

plot_pca_variance(evr[, width, height, ...])

Plot explained variance ratios from a principal components analysis (PCA) using a plotly bar plot.

plot_pca_coords(data[, x, y, color, symbol, ...])

Plot sample coordinates from a principal components analysis (PCA) as a plotly scatter plot.

plot_pca_coords_3d(data[, x, y, z, color, ...])

Plot sample coordinates from a principal components analysis (PCA) as a plotly 3D scatter plot.

Genetic distance and neighbour-joining trees (NJT)#

plot_njt(region, n_snps[, color, symbol, ...])

Plot an unrooted neighbour-joining tree, computed from pairwise distances between samples using biallelic SNP genotypes.

njt(region, n_snps[, algorithm, metric, ...])

Construct a neighbour-joining tree between samples using biallelic SNP genotypes.

biallelic_diplotype_pairwise_distances(...)

Compute pairwise distances between samples using biallelic SNP genotypes.
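
To illustrate what a pairwise distance between samples based on biallelic SNP genotypes can look like, the sketch below codes each genotype as the number of alternate alleles (0, 1 or 2) and sums the absolute differences between two samples. The choice of city block (Manhattan) distance, the data and the pairwise_cityblock helper are assumptions for illustration; the library may use other metrics or defaults.

```python
from itertools import combinations


def pairwise_cityblock(diplotypes):
    """City block distance between every pair of samples.

    diplotypes[sample] is a list of alternate allele counts (0, 1 or 2),
    one per site. Returns a dict keyed by (i, j) sample index pairs.
    """
    return {
        (i, j): sum(abs(a - b) for a, b in zip(diplotypes[i], diplotypes[j]))
        for i, j in combinations(range(len(diplotypes)), 2)
    }


# Three made-up samples genotyped at four sites.
diplotypes = [
    [0, 1, 2, 0],
    [0, 1, 2, 1],
    [2, 1, 0, 0],
]
dist = pairwise_cityblock(diplotypes)
```

A distance matrix of this kind is the input that a neighbour-joining algorithm would then turn into a tree.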

Heterozygosity analysis#

plot_heterozygosity(sample, region[, ...])

Plot windowed heterozygosity for a single sample over a genome region.

roh_hmm(sample, region[, window_size, ...])

Infer runs of homozygosity for a single sample over a genome region.

plot_roh(sample, region[, window_size, ...])

Plot windowed heterozygosity and inferred runs of homozygosity for a single sample over a genome region.
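
Windowed heterozygosity can be pictured as the fraction of heterozygous sites within consecutive windows along the genome. The sketch below uses non-overlapping windows of a fixed number of sites over made-up data. The windowing scheme, the input layout and the windowed_heterozygosity helper are simplifying assumptions and may differ from how the functions above handle accessibility and window definitions.

```python
def windowed_heterozygosity(is_het, window_size):
    """Fraction of heterozygous sites in non-overlapping windows.

    is_het is a list of booleans, one per site, for a single sample.
    Trailing sites that do not fill a whole window are dropped.
    """
    out = []
    for start in range(0, len(is_het) - window_size + 1, window_size):
        window = is_het[start:start + window_size]
        out.append(sum(window) / window_size)
    return out


# Made-up heterozygosity flags for nine consecutive sites.
is_het = [True, False, False, True, True, True, False, False, False]
het = windowed_heterozygosity(is_het, window_size=3)
```

A long run of windows with values at or near zero, like the last window here, is the kind of signal that a run of homozygosity would produce.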

Diversity analysis#

cohort_diversity_stats(cohort, cohort_size, ...)

Compute genetic diversity summary statistics for a cohort of individuals.

diversity_stats(cohorts, cohort_size, region)

Compute genetic diversity summary statistics for multiple cohorts.

plot_diversity_stats(df_stats[, color, ...])

Plot diversity summary statistics for multiple cohorts.
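
One commonly used diversity summary statistic is nucleotide diversity, the mean number of pairwise differences per site. The sketch below computes it from biallelic allele counts as an illustration of the kind of quantity involved. The source does not list which statistics these functions report, so the choice of statistic, the data and the nucleotide_diversity helper are assumptions.

```python
def nucleotide_diversity(allele_counts, n_sites):
    """Mean pairwise differences per site from biallelic allele counts.

    allele_counts is a list of (ref_count, alt_count) pairs for the
    variable or sampled sites; n_sites is the total number of sites
    considered, including invariant ones.
    """
    total = 0.0
    for ref, alt in allele_counts:
        n = ref + alt
        if n < 2:
            continue  # need at least two chromosomes to form a pair
        n_pairs = n * (n - 1) / 2
        total += ref * alt / n_pairs  # fraction of pairs that differ
    return total / n_sites


# Made-up allele counts at three sites, within a region of ten sites.
ac = [(4, 0), (2, 2), (3, 1)]
pi = nucleotide_diversity(ac, n_sites=10)
```

The first site is invariant and contributes nothing, while the other two contribute 4/6 and 3/6 differing pairs respectively.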

Genome-wide selection scans#

h12_calibration(contig[, analysis, ...])

Generate H12 GWSS calibration data for different window sizes.

plot_h12_calibration(contig[, analysis, ...])

Plot H12 GWSS calibration data for different window sizes.

h12_gwss(contig, window_size[, analysis, ...])

Run an H12 genome-wide selection scan.

plot_h12_gwss(contig, window_size[, ...])

Plot H12 GWSS data.

plot_h12_gwss_multi_panel(contig, cohorts, ...)

Plot H12 GWSS data with multiple tracks.

plot_h12_gwss_multi_overlay(contig, cohorts, ...)

Plot H12 GWSS data with multiple traces overlaid.

h1x_gwss(contig, window_size, cohort1_query, ...)

Run an H1X genome-wide scan to detect genome regions with shared selective sweeps between two cohorts.

plot_h1x_gwss(contig, window_size, ...[, ...])

Run and plot an H1X genome-wide scan to detect genome regions with shared selective sweeps between two cohorts.

g123_calibration(contig[, sites, site_mask, ...])

Generate G123 GWSS calibration data for different window sizes.

plot_g123_calibration(contig, sites[, ...])

Plot G123 GWSS calibration data for different window sizes.

g123_gwss(contig, window_size[, sites, ...])

Run a G123 genome-wide selection scan.

plot_g123_gwss(contig, window_size[, sites, ...])

Plot G123 GWSS data.

ihs_gwss(contig[, analysis, sample_sets, ...])

Run iHS GWSS.

plot_ihs_gwss(contig[, analysis, ...])

Run and plot iHS GWSS data.

xpehh_gwss(contig[, analysis, sample_sets, ...])

Run XP-EHH GWSS.

plot_xpehh_gwss(contig[, analysis, ...])

Run and plot XP-EHH GWSS data.
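
For readers unfamiliar with the H12 statistic behind the H12 scans above, the sketch below computes it for a single window of made-up haplotypes: the two most common haplotype frequencies are pooled and squared, then the squared frequencies of all other haplotypes are added. The string encoding of haplotypes and the h12 helper are illustrative assumptions, and the sketch covers one window only, not a genome-wide scan.

```python
from collections import Counter


def h12(haplotypes):
    """H12 statistic for the haplotypes observed in one window.

    haplotypes is a list of hashable haplotype identifiers, e.g. strings.
    H12 = (p1 + p2)**2 + sum(p_i**2 for i > 2), where p1 >= p2 >= ... are
    the haplotype frequencies.
    """
    n = len(haplotypes)
    freqs = sorted((c / n for c in Counter(haplotypes).values()), reverse=True)
    h1 = sum(f * f for f in freqs)
    if len(freqs) < 2:
        return h1
    # (p1 + p2)**2 expands to p1**2 + p2**2 + 2*p1*p2.
    return h1 + 2 * freqs[0] * freqs[1]


# Eight made-up haplotypes: one common, one intermediate, two singletons.
haps = ["AAT", "AAT", "AAT", "AAT", "GAT", "GAT", "GAC", "TTC"]
value = h12(haps)
```

Values near 1 indicate that one or two haplotypes dominate the window, which is the pattern expected under a recent selective sweep.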

Haplotype clustering and network analysis#

plot_haplotype_clustering(region[, ...])

Hierarchically cluster haplotypes in a genome region and produce an interactive plot.

plot_haplotype_network(region[, analysis, ...])

Construct a median-joining haplotype network and display it using Cytoscape.

haplotype_pairwise_distances(region[, ...])

Compute pairwise distances between haplotypes.

Diplotype clustering#

plot_diplotype_clustering(region[, ...])

Hierarchically cluster diplotypes in a genome region and produce an interactive plot.

plot_diplotype_clustering_advanced(region[, ...])

Perform diplotype clustering, annotated with heterozygosity, gene copy number and amino acid variants.

Fst analysis#

average_fst(region, cohort1_query, cohort2_query)

Compute average Hudson's Fst between two specified cohorts.

pairwise_average_fst(region, cohorts[, ...])

Compute pairwise average Hudson's Fst between a set of specified cohorts.

plot_pairwise_average_fst(fst_df[, ...])

Plot a heatmap of pairwise average Fst values.

fst_gwss(contig, window_size, cohort1_query, ...)

Run an Fst genome-wide scan to investigate genetic differentiation between two cohorts.

plot_fst_gwss(contig, window_size, ...[, ...])

Run and plot an Fst genome-wide scan to investigate genetic differentiation between two cohorts.
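
As background to the Hudson's Fst functions above, the sketch below shows one standard way to compute an average Hudson's Fst from biallelic allele counts in two cohorts: at each site take the mean pairwise difference within and between cohorts, then divide the summed numerators by the summed denominators. The data and the hudson_fst helper are illustrative assumptions, and this "ratio of sums" averaging is one common convention rather than a statement of what the library does internally.

```python
def hudson_fst(ac1, ac2):
    """Average Hudson's Fst between two cohorts from biallelic counts.

    ac1 and ac2 are lists of (ref_count, alt_count) pairs, one per site,
    for cohort 1 and cohort 2 respectively.
    """
    num_sum = 0.0
    den_sum = 0.0
    for (a1, b1), (a2, b2) in zip(ac1, ac2):
        n1, n2 = a1 + b1, a2 + b2
        if n1 < 2 or n2 < 2:
            continue  # too few chromosomes to estimate within-cohort diversity
        within1 = a1 * b1 / (n1 * (n1 - 1) / 2)
        within2 = a2 * b2 / (n2 * (n2 - 1) / 2)
        between = (a1 * b2 + b1 * a2) / (n1 * n2)
        num_sum += between - (within1 + within2) / 2
        den_sum += between
    if den_sum == 0:
        return float("nan")  # no informative sites
    return num_sum / den_sum


# Made-up counts: a fixed difference at site 1, identical frequencies at site 2.
ac1 = [(4, 0), (2, 2)]
ac2 = [(0, 4), (2, 2)]
fst = hudson_fst(ac1, ac2)
```

A site with a fixed difference pushes the estimate towards 1, while a site with identical frequencies contributes a slightly negative numerator, so the average here falls in between.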