Ag3#
This page provides a curated list of functions and properties available in the malariagen_data API for data on mosquitoes from the Anopheles gambiae complex.
To set up the API, use the following code:
import malariagen_data
ag3 = malariagen_data.Ag3()
All the functions below can then be accessed as methods on the ag3 object. For example, to call the sample_metadata() function, do:
df_samples = ag3.sample_metadata()
For more information about the data and terms of use, please see the MalariaGEN Anopheles gambiae genomic surveillance project home page.
Basic data access#
Currently available data releases.
Access a dataframe of sample sets.
Find which release a sample set was included in.
Find which study a sample set belongs to.
Reference genome data access#
Contigs in the reference genome.
Access the reference genome sequence.
Access genome feature annotations.
Plot a transcript, using bokeh.
Plot a genes track, using bokeh.
Sample metadata access#
Access sample metadata for one or more sample sets.
Add extra sample metadata, e.g., including additional columns which you would like to use to query and group samples.
Clear any extra metadata previously added.
Load a dataframe containing metadata about samples in colony crosses, including which samples are parents or progeny in which crosses.
Create a pivot table showing numbers of samples available by space, time and taxon.
Get the metadata for a specific sample and sample set.
Plot a bar chart showing the number of samples available, grouped by some variable such as country or year.
Plot an interactive map showing sampling locations using ipyleaflet.
Plot markers on a map showing sample locations as a Mapbox scatter plot.
Plot markers on a map showing sample locations as a geographic scatter plot.
Load a data catalog providing URLs for downloading BAM, VCF and Zarr files for samples in a given sample set.
Read data for a specific cohort set, including cohort size, country code, taxon, administrative unit names, ISO code, geoBoundaries shape ID and representative latitude and longitude points.
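As a rough illustration of the tabulation described above (numbers of samples by space, time and taxon), the following sketch counts a few made-up sample records using only the Python standard library. The records, the field names and the `count_by` helper are hypothetical and are not part of the malariagen_data API.

```python
from collections import Counter

# Hypothetical sample records; real sample metadata has many more columns.
samples = [
    {"country": "Ghana", "year": 2012, "taxon": "gambiae"},
    {"country": "Ghana", "year": 2012, "taxon": "coluzzii"},
    {"country": "Ghana", "year": 2012, "taxon": "gambiae"},
    {"country": "Mali", "year": 2014, "taxon": "coluzzii"},
]


def count_by(records, index, column):
    """Count records per (index values, column value) as a nested dict."""
    counts = Counter(
        (tuple(r[k] for k in index), r[column]) for r in records
    )
    table = {}
    for (row_key, col_key), n in counts.items():
        table.setdefault(row_key, {})[col_key] = n
    return table


# Rows are (country, year) pairs, columns are taxa.
pivot = count_by(samples, index=("country", "year"), column="taxon")
for row_key, row in sorted(pivot.items()):
    print(row_key, row)
```

Each row key is a (country, year) pair and each inner dict maps taxon to the number of samples, which mirrors the shape of a pivot table without depending on pandas.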
SNP data access#
Identifiers for the different site masks that are available.
Access SNP sites, site filters and genotype calls.
Compute SNP allele counts.
Plot SNPs in a given genome region.
Load site annotations.
Compute genome accessibility array.
Access SNP calls at sites which are biallelic within the selected samples.
Load biallelic SNP genotypes.
Write Anopheles biallelic SNP data to the Plink binary file format.
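To make the idea of allele counts and biallelic sites concrete, here is a minimal sketch in plain Python. It assumes diploid genotype calls encoded as pairs of allele indices (0 for the reference allele, 1 to 3 for alternates, -1 for a missing call). The `allele_counts` and `is_biallelic` helpers are hypothetical illustrations, not the functions provided by the API.

```python
def allele_counts(genotypes, max_allele=3):
    """Count alleles per site.

    genotypes: list of sites, each a list of per-sample (a1, a2) calls.
    Returns one list of counts per site, indexed by allele.
    """
    counts = []
    for site in genotypes:
        ac = [0] * (max_allele + 1)
        for call in site:
            for allele in call:
                if allele >= 0:  # skip missing calls
                    ac[allele] += 1
        counts.append(ac)
    return counts


def is_biallelic(ac):
    """True if exactly two alleles are observed at the site."""
    return sum(1 for c in ac if c > 0) == 2


# Two sites, three samples; the second sample is missing at site 2.
gt = [
    [(0, 0), (0, 1), (1, 1)],
    [(0, 2), (-1, -1), (0, 0)],
]
ac = allele_counts(gt)
biallelic = [is_biallelic(site_ac) for site_ac in ac]
```

Whether a site counts as biallelic depends on which samples are selected, since alleles absent from the selection contribute zero counts.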
Haplotype data access#
Identifiers for the different phasing analyses that are available.
Access haplotype data.
Access haplotype site data (positions or alleles).
AIM data access#
Access ancestry informative marker variants.
Access ancestry informative marker SNP sites, alleles and genotype calls.
Plot a heatmap of ancestry-informative marker (AIM) genotypes.
CNV data access#
Identifiers for the different coverage calls analyses that are available.
Access CNV HMM data from CNV calling.
Access CNV HMM data from genome-wide CNV discovery and filtering.
Access CNV discordant read calls data.
Plot CNV HMM data for a single sample, together with a genes track, using bokeh.
Plot CNV HMM data for multiple samples as a heatmap, with a genes track, using bokeh.
Compute modal copy number by gene, from HMM data.
Integrative genomics viewer (IGV)#
Create an IGV browser and inject into the current notebook.
Launch IGV and view sequence read alignments and SNP genotypes from the given sample.
SNP and CNV frequency analysis#
Compute SNP allele frequencies for a gene transcript.
Group samples by taxon, area (space) and period (time), then compute SNP allele frequencies.
Compute amino acid substitution frequencies for a gene transcript.
Group samples by taxon, area (space) and period (time), then compute amino acid change allele frequencies.
Compute modal copy number by gene, then compute the frequency of amplifications and deletions in one or more cohorts, from HMM data.
Group samples by taxon, area (space) and period (time), then compute gene CNV counts and frequencies.
Compute haplotype frequencies for a region.
Group samples by taxon, area (space) and period (time), then compute haplotype frequencies.
Plot a heatmap from a pandas DataFrame of frequencies, e.g., output from snp_allele_frequencies() or gene_cnv_frequencies().
Create a time series plot of variant frequencies using plotly.
Create an interactive map with markers showing variant frequencies or cohorts grouped by area (space), period (time) and taxon.
Principal components analysis (PCA)#
Run a principal components analysis (PCA) using biallelic SNPs from the selected genome region and samples.
Plot explained variance ratios from a principal components analysis (PCA) using a plotly bar plot.
Plot sample coordinates from a principal components analysis (PCA) as a plotly scatter plot.
Plot sample coordinates from a principal components analysis (PCA) as a plotly 3D scatter plot.
Genetic distance and neighbour-joining trees (NJT)#
Plot an unrooted neighbour-joining tree, computed from pairwise distances between samples using biallelic SNP genotypes.
Construct a neighbour-joining tree between samples using biallelic SNP genotypes.
Compute pairwise distances between samples using biallelic SNP genotypes.
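The sketch below shows one simple way pairwise distances between samples could be computed from biallelic SNP genotypes. It assumes each genotype is encoded as the number of alternate alleles carried (0, 1 or 2) and uses the city-block (Manhattan) metric. Both the encoding and the metric are assumptions made for illustration, and the helper names are hypothetical; the API's own computation may differ.

```python
def cityblock(u, v):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def pairwise_distances(samples):
    """Symmetric matrix of city-block distances between all samples."""
    n = len(samples)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = cityblock(samples[i], samples[j])
            dist[i][j] = dist[j][i] = d
    return dist


# Three samples genotyped at four biallelic SNPs (alt allele counts).
alt_counts = [
    [0, 1, 2, 0],
    [0, 1, 1, 0],
    [2, 0, 0, 1],
]
dist = pairwise_distances(alt_counts)
```

A distance matrix of this form is the usual input to neighbour-joining tree construction.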
Heterozygosity analysis#
Plot windowed heterozygosity for a single sample over a genome region.
Infer runs of homozygosity for a single sample over a genome region.
Plot windowed heterozygosity and inferred runs of homozygosity for a single sample over a genome region.
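As an illustration of windowed heterozygosity for a single sample, the following sketch computes the fraction of called sites that are heterozygous within fixed-size, non-overlapping windows of genome coordinates. The window definition, the input layout and the `windowed_heterozygosity` helper are assumptions made for this example, not the API's implementation.

```python
def windowed_heterozygosity(positions, is_het, start, stop, window_size):
    """Fraction of heterozygous sites per window.

    positions: 1-based site positions for one sample.
    is_het: booleans, True where the sample's genotype is heterozygous.
    Returns (window_start, window_stop, heterozygosity) tuples;
    heterozygosity is None for windows containing no sites.
    """
    results = []
    for w_start in range(start, stop + 1, window_size):
        w_stop = min(w_start + window_size - 1, stop)
        flags = [h for p, h in zip(positions, is_het) if w_start <= p <= w_stop]
        het = sum(flags) / len(flags) if flags else None
        results.append((w_start, w_stop, het))
    return results


positions = [1, 5, 9, 12, 15, 18, 25]
is_het = [True, False, False, True, True, False, False]
windows = windowed_heterozygosity(positions, is_het, start=1, stop=30, window_size=10)
```

Long stretches of windows with heterozygosity at or near zero are what a run of homozygosity looks like in this kind of output.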
Diversity analysis#
Compute genetic diversity summary statistics for a cohort of individuals.
Compute genetic diversity summary statistics for multiple cohorts.
Plot diversity summary statistics for multiple cohorts.
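Two widely used diversity summary statistics are nucleotide diversity (mean pairwise differences per site) and Watterson's theta. The sketch below computes both from alternate allele counts at biallelic sites, assuming no missing data so that every site has the same number of sampled chromosomes. Which statistics the API reports, and how, is not stated here; the `diversity_stats` helper is hypothetical.

```python
def diversity_stats(alt_counts, n, n_sites):
    """Per-site nucleotide diversity and Watterson's theta.

    alt_counts: alternate allele count at each variable biallelic site.
    n: number of chromosomes sampled (same at every site).
    n_sites: total number of sites considered, variable or not.
    """
    n_pairs = n * (n - 1) / 2
    # Each site contributes the number of chromosome pairs that differ.
    pi = sum(k * (n - k) for k in alt_counts) / n_pairs / n_sites
    # Number of segregating sites, scaled by the harmonic number a1.
    s = sum(1 for k in alt_counts if 0 < k < n)
    a1 = sum(1 / i for i in range(1, n))
    theta_w = s / a1 / n_sites
    return pi, theta_w


# Four chromosomes, three segregating sites in a 100-site region.
pi, theta_w = diversity_stats([1, 2, 3], n=4, n_sites=100)
```

Comparing the two estimates is the basis of Tajima's D: an excess of intermediate-frequency variants raises nucleotide diversity relative to Watterson's theta.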
Genome-wide selection scans#
Generate H12 GWSS calibration data for different window sizes.
Plot H12 GWSS calibration data for different window sizes.
Run an H12 genome-wide selection scan.
Plot H12 GWSS data.
Plot H12 GWSS data with multiple tracks.
Plot H12 GWSS data with multiple traces overlaid.
Run an H1X genome-wide scan to detect genome regions with shared selective sweeps between two cohorts.
Run and plot an H1X genome-wide scan to detect genome regions with shared selective sweeps between two cohorts.
Generate G123 GWSS calibration data for different window sizes.
Plot G123 GWSS calibration data for different window sizes.
Run a G123 genome-wide selection scan.
Plot G123 GWSS data.
Run iHS GWSS.
Run and plot iHS GWSS data.
Run XP-EHH GWSS.
Run and plot XP-EHH GWSS data.
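For readers unfamiliar with the H12 statistic used in the scans above, this sketch computes H1 and H12 for the haplotypes in a single window, following the standard definitions: H1 is the sum of squared haplotype frequencies, and H12 pools the two most frequent haplotypes before squaring. The haplotype strings and the `h_statistics` helper are hypothetical, and a real scan would repeat this over sliding windows along the genome.

```python
from collections import Counter


def h_statistics(haplotypes):
    """Return (H1, H12) for a collection of haplotypes in one window."""
    counts = Counter(haplotypes)
    n = len(haplotypes)
    # Haplotype frequencies, most common first.
    freqs = sorted((c / n for c in counts.values()), reverse=True)
    h1 = sum(f * f for f in freqs)
    # Pool the two most frequent haplotypes, then add the rest.
    h12 = sum(freqs[:2]) ** 2 + sum(f * f for f in freqs[2:])
    return h1, h12


# Ten haplotypes over four SNPs, with three distinct haplotypes.
haps = ["0101"] * 5 + ["0111"] * 3 + ["1100"] * 2
h1, h12 = h_statistics(haps)
```

Here the haplotype frequencies are 0.5, 0.3 and 0.2, so H1 is 0.38 and H12 is 0.68. High H12 values indicate that one or two haplotypes dominate the window, as expected under a hard or soft selective sweep.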
Haplotype clustering and network analysis#
Hierarchically cluster haplotypes in a region and produce an interactive plot.
Construct a median-joining haplotype network and display it using Cytoscape.
Compute pairwise distances between haplotypes.
Diplotype clustering#
Hierarchically cluster diplotypes in a region and produce an interactive plot.
Perform diplotype clustering, annotated with heterozygosity, gene copy number and amino acid variants.
Fst analysis#
Compute average Hudson's Fst between two specified cohorts.
Compute pairwise average Hudson's Fst between a set of specified cohorts.
Plot a heatmap of pairwise average Fst values.
Run an Fst genome-wide scan to investigate genetic differentiation between two cohorts.
Run and plot an Fst genome-wide scan to investigate genetic differentiation between two cohorts.
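As a sketch of what an average Hudson's Fst between two cohorts involves, the code below uses the estimator in the form given by Bhatia et al. (2013), computed from allele counts at biallelic SNPs and averaged as a ratio of sums across sites. The input layout and the `hudson_fst` helper are assumptions for illustration; the API's own implementation, including any windowing or standard error estimation, is not reproduced here.

```python
def hudson_fst(ac1, ac2):
    """Average Hudson's Fst across sites as a ratio of sums.

    ac1, ac2: per-site (ref_count, alt_count) pairs for the two cohorts.
    Sites with fewer than two called alleles in either cohort are skipped.
    """
    num_sum = 0.0
    den_sum = 0.0
    for (r1, a1), (r2, a2) in zip(ac1, ac2):
        n1, n2 = r1 + a1, r2 + a2
        if n1 < 2 or n2 < 2:
            continue
        p1, p2 = a1 / n1, a2 / n2
        # Numerator: squared frequency difference, corrected for sampling.
        num_sum += (
            (p1 - p2) ** 2
            - p1 * (1 - p1) / (n1 - 1)
            - p2 * (1 - p2) / (n2 - 1)
        )
        # Denominator: probability that alleles drawn from each cohort differ.
        den_sum += p1 * (1 - p2) + p2 * (1 - p1)
    return num_sum / den_sum if den_sum > 0 else float("nan")


# Fixed difference between cohorts at a single site.
fst_fixed = hudson_fst([(10, 0)], [(0, 10)])
# Identical allele frequencies in both cohorts.
fst_same = hudson_fst([(5, 5)], [(5, 5)])
```

A fixed difference gives an Fst of 1, while identical frequencies give a slightly negative value because of the sampling correction; negative estimates are normally interpreted as no differentiation.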