Ag3 API
This page provides documentation for functions in the malariagen_data Python package for accessing Anopheles gambiae data.
Ag3()#
- malariagen_data.Ag3(bokeh_output_notebook=True, results_cache=None, log=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, debug=False, show_progress=True, check_location=True, cohorts_analysis=None, species_analysis=None, site_filters_analysis=None, pre=False, **kwargs)#
Provides access to data from Ag3.x releases.
- Parameters
- urlstr
Base path to data. Give “gs://vo_agam_release/” to use Google Cloud Storage, or a local path on your file system if data have been downloaded.
- cohorts_analysisstr
Cohort analysis version.
- species_analysis{“aim_20200422”, “pca_20200422”}, optional
Species analysis version.
- site_filters_analysisstr, optional
Site filters analysis version.
- bokeh_output_notebookbool, optional
If True (default), configure bokeh to output plots to the notebook.
- results_cachestr, optional
Path to directory on local file system to save results.
- logstr or stream, optional
File path or stream output for logging messages.
- debugbool, optional
Set to True to enable debug level logging.
- show_progressbool, optional
If True, show a progress bar during longer-running computations.
- check_locationbool, optional
If True, use ipinfo to check the location of the client system.
- **kwargs
Passed through to fsspec when setting up file system access.
Examples
Access data from Google Cloud Storage (default):
>>> import malariagen_data
>>> ag3 = malariagen_data.Ag3()
Access data downloaded to a local file system:
>>> ag3 = malariagen_data.Ag3("/local/path/to/vo_agam_release/")
Access data from Google Cloud Storage, with caching on the local file system in a directory named “gcs_cache”:
>>> ag3 = malariagen_data.Ag3(
...     "simplecache::gs://vo_agam_release",
...     simplecache=dict(cache_storage="gcs_cache"),
... )
Set up caching of some longer-running computations on the local file system, in a directory named “results_cache”:
>>> ag3 = malariagen_data.Ag3(results_cache="results_cache")
sample_sets()#
- Ag3.sample_sets(release=None)#
Access a dataframe of sample sets.
- Parameters
- releasestr, optional
Release identifier, e.g. give “3.0” to access the v3.0 data release.
- Returns
- dfpandas.DataFrame
A dataframe of sample sets, one row per sample set.
sample_metadata()#
- Ag3.sample_metadata(sample_sets=None, sample_query=None)#
Access sample metadata for one or more sample sets.
- Parameters
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “country == ‘Burkina Faso’”.
- Returns
- df_samplespandas.DataFrame
A dataframe of sample metadata, one row per sample.
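The sample_query parameter takes an ordinary pandas query string. As a minimal sketch, the helper below assembles such a string from keyword filters; build_sample_query is a hypothetical name used only for illustration and is not part of malariagen_data.

```python
def build_sample_query(**filters):
    """Join `column == value` conditions into one pandas query string.

    Hypothetical helper for illustration; not part of malariagen_data.
    """
    parts = []
    for column, value in filters.items():
        # repr() quotes string values and leaves numbers bare.
        parts.append(f"{column} == {value!r}")
    return " and ".join(parts)


query = build_sample_query(country="Burkina Faso", year=2012)
print(query)
```

The resulting string could then be supplied as the sample_query argument described above.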
sample_cohorts()#
- Ag3.sample_cohorts(sample_sets=None)#
Access cohorts metadata for one or more sample sets.
- Parameters
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- Returns
- dfpandas.DataFrame
A dataframe of cohort metadata, one row per sample.
count_samples()#
- Ag3.count_samples(sample_sets=None, sample_query=None, index=('country', 'admin1_iso', 'admin1_name', 'admin2_name', 'year'), columns='taxon')#
Create a pivot table showing numbers of samples available by space, time and taxon.
- Parameters
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- indexstr or tuple of str
Sample metadata columns to use for the index.
- columnsstr or tuple of str
Sample metadata columns to use for the columns.
- Returns
- dfpandas.DataFrame
Pivot table of sample counts.
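To make the role of index and columns concrete, here is a small standard-library sketch of the same kind of count. The records are made up for illustration, and count_by is a hypothetical stand-in, not the library's implementation (which returns a pandas pivot table).

```python
from collections import Counter

# Hypothetical sample metadata records (not real data).
samples = [
    {"country": "Burkina Faso", "year": 2012, "taxon": "coluzzii"},
    {"country": "Burkina Faso", "year": 2012, "taxon": "gambiae"},
    {"country": "Burkina Faso", "year": 2012, "taxon": "coluzzii"},
    {"country": "Angola", "year": 2009, "taxon": "coluzzii"},
]


def count_by(records, index, columns):
    """Count records per (index values, column value) pair."""
    counts = Counter()
    for rec in records:
        row_key = tuple(rec[k] for k in index)
        counts[(row_key, rec[columns])] += 1
    return counts


pivot = count_by(samples, index=("country", "year"), columns="taxon")
for (row_key, taxon), n in sorted(pivot.items()):
    print(row_key, taxon, n)
```

Each distinct combination of index values corresponds to one row of the pivot table, and each distinct value of the columns field to one column.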
plot_samples_interactive_map()#
- Ag3.plot_samples_interactive_map(sample_sets=None, sample_query=None, basemap=None, center=(-2, 20), zoom=3, min_samples=1)#
Plot an interactive map showing sampling locations using ipyleaflet.
- Parameters
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- basemapdict
Basemap description coming from ipyleaflet.basemaps.
- centertuple of int, optional
Location to center the map.
- zoomint, optional
Initial zoom level.
- min_samplesint, optional
Minimum number of samples required to show a marker for a given location.
- Returns
- samples_mapipyleaflet.Map
Ipyleaflet map widget.
cross_metadata()#
- Ag3.cross_metadata()#
Load a dataframe containing metadata about samples in colony crosses, including which samples are parents or progeny in which crosses.
- Returns
- dfpandas.DataFrame
A dataframe of sample metadata for colony crosses.
species_calls()#
- Ag3.species_calls(sample_sets=None)#
Access species calls for one or more sample sets.
- Parameters
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- Returns
- dfpandas.DataFrame
A dataframe of species calls for one or more sample sets, one row per sample.
genome_sequence()#
- Ag3.genome_sequence(region, inline_array=True, chunks='native')#
Access the reference genome sequence.
- Parameters
- region: str or list of str or Region or list of Region
Contig name (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].
- inline_arraybool, optional
Passed through to dask.array.from_array().
- chunksstr, optional
If ‘auto’ let dask decide chunk size. If ‘native’ use native zarr chunks. Also, can be a target size, e.g., ‘200 MiB’.
- Returns
- ddask.array.Array
An array of nucleotides giving the reference genome sequence for the given region.
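The region parameter accepts several string forms. The sketch below shows one way the contig and coordinate forms could be split into a (contig, start, end) tuple. The Region namedtuple and parse_region function here are local illustrations and are not the library's own classes; gene names are deliberately left out because they need an annotation lookup.

```python
from collections import namedtuple

# Mirrors the Region(contig, start, end) shape described above; this local
# namedtuple is an illustration, not the library's own class.
Region = namedtuple("Region", ["contig", "start", "end"])


def parse_region(text):
    """Parse "contig" or "contig:start-end" into a Region.

    Gene names such as "AGAP007280" require an annotation lookup and are
    not handled by this sketch.
    """
    if ":" not in text:
        # Whole contig: no coordinates given.
        return Region(text, None, None)
    contig, span = text.split(":", 1)
    start, end = span.split("-", 1)
    return Region(contig, int(start), int(end))


print(parse_region("2L:44989425-44998059"))
print([parse_region(r) for r in ["3R", "3L"]])
```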
geneset()#
- Ag3.geneset(*args, **kwargs)#
Deprecated: this method has been renamed to genome_features().
snp_calls()#
- Ag3.snp_calls(region, sample_sets=None, sample_query=None, site_mask=None, site_class=None, inline_array=True, chunks='native', cohort_size=None, random_seed=42)#
Access SNP sites, site filters and genotype calls.
- Parameters
- region: str or list of str or Region or list of Region
Contig name (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “country == ‘Burkina Faso’”.
- site_maskstr, optional
Site filters mask to apply, e.g., “gamb_colu”.
- site_classstr, optional
Select sites belonging to one of the following classes: CDS_DEG_4 (4-fold degenerate coding sites), CDS_DEG_2_SIMPLE (2-fold simple degenerate coding sites), CDS_DEG_0 (non-degenerate coding sites), INTRON_SHORT (introns shorter than 100 bp), INTRON_LONG (introns longer than 200 bp), INTRON_SPLICE_5PRIME (intron within 2 bp of 5’ splice site), INTRON_SPLICE_3PRIME (intron within 2 bp of 3’ splice site), UTR_5PRIME (5’ untranslated region), UTR_3PRIME (3’ untranslated region), INTERGENIC (intergenic, more than 10 kbp from a gene).
- inline_arraybool, optional
Passed through to dask.array.from_array().
- chunksstr, optional
If ‘auto’ let dask decide chunk size. If ‘native’ use native zarr chunks. Also, can be a target size, e.g., ‘200 MiB’.
- cohort_sizeint, optional
If provided, randomly down-sample to the given cohort size.
- random_seedint, optional
Random seed used for down-sampling.
- Returns
- dsxarray.Dataset
A dataset containing SNP sites, site filters and genotype calls.
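The cohort_size and random_seed parameters together give reproducible down-sampling. The following sketch shows the general idea using only the standard library; downsample_indices is a hypothetical name, and the library's actual sampling procedure and error handling may differ.

```python
import random


def downsample_indices(n_samples, cohort_size, random_seed=42):
    """Pick cohort_size sample indices reproducibly, in sorted order."""
    if cohort_size > n_samples:
        # Assumption of this sketch: asking for more samples than exist
        # is treated as an error.
        raise ValueError("cohort_size exceeds the number of samples")
    rng = random.Random(random_seed)
    return sorted(rng.sample(range(n_samples), cohort_size))


first = downsample_indices(100, 10)
second = downsample_indices(100, 10)
print(first == second)  # True: the same seed gives the same selection
```

Keeping the seed fixed means repeated analyses use the same subset of samples; changing it draws a different subset.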
snp_sites()#
- Ag3.snp_sites(region, field, site_mask=None, inline_array=True, chunks='native')#
Access SNP site data (positions and alleles).
- Parameters
- region: str or list of str or Region or list of Region
Contig name (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].
- field{“POS”, “REF”, “ALT”}
Array to access.
- site_maskstr, optional
Site filters mask to apply, e.g., “gamb_colu”.
- inline_arraybool, optional
Passed through to dask.array.from_array().
- chunksstr, optional
If ‘auto’ let dask decide chunk size. If ‘native’ use native zarr chunks. Also, can be a target size, e.g., ‘200 MiB’.
- Returns
- ddask.array.Array
An array of either SNP positions, reference alleles or alternate alleles.
snp_genotypes()#
- Ag3.snp_genotypes(region, sample_sets=None, sample_query=None, field='GT', site_mask=None, inline_array=True, chunks='native')#
Access SNP genotypes and associated data.
- Parameters
- region: str or list of str or Region or list of Region
Contig name (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- field{“GT”, “GQ”, “AD”, “MQ”}
Array to access.
- site_maskstr, optional
Site filters mask to apply, e.g., “gamb_colu”.
- inline_arraybool, optional
Passed through to dask.array.from_array().
- chunksstr, optional
If ‘auto’ let dask decide chunk size. If ‘native’ use native zarr chunks. Also, can be a target size, e.g., ‘200 MiB’.
- Returns
- ddask.array.Array
An array of either genotypes (GT), genotype quality (GQ), allele depths (AD) or mapping quality (MQ) values.
site_filters()#
- Ag3.site_filters(region, mask, field='filter_pass', inline_array=True, chunks='native')#
Access SNP site filters.
- Parameters
- region: str or list of str or Region or list of Region
Contig name (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].
- maskstr
Mask to use, e.g., “gamb_colu”.
- fieldstr, optional
Array to access.
- inline_arraybool, optional
Passed through to dask.array.from_array().
- chunksstr, optional
If ‘auto’ let dask decide chunk size. If ‘native’ use native zarr chunks. Also, can be a target size, e.g., ‘200 MiB’.
- Returns
- ddask.array.Array
An array of boolean values identifying sites that pass the filters.
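A boolean filter array of this kind is typically used to select the sites that pass. The sketch below illustrates the selection with plain Python lists and made-up values; with the real dask arrays the equivalent would be boolean indexing.

```python
from itertools import compress

# Hypothetical values for illustration only.
positions = [101, 102, 103, 104, 105]
filter_pass = [True, False, True, True, False]

# Keep only the positions whose filter value is True.
passing = list(compress(positions, filter_pass))
print(passing)  # [101, 103, 104]
```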
is_accessible()#
- Ag3.is_accessible(region, site_mask)#
Compute genome accessibility array.
- Parameters
- region: str or list of str or Region or list of Region
Contig name (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].
- site_maskstr
Site filters mask to apply, e.g., “gamb_colu”.
- Returns
- anumpy.ndarray
An array of boolean values identifying accessible genome sites.
snp_effects()#
- Ag3.snp_effects(transcript, site_mask=None)#
Compute variant effects for a gene transcript.
- Parameters
- transcriptstr
Gene transcript ID (AgamP4.12), e.g., “AGAP004707-RA”.
- site_maskstr, optional
Site filters mask to apply, e.g., “gamb_colu”.
- Returns
- dfpandas.DataFrame
A dataframe of all possible SNP variants and their effects, one row per variant.
site_annotations()#
- Ag3.site_annotations(region, site_mask=None, inline_array=True, chunks='auto')#
Load site annotations.
- Parameters
- region: str or list of str or Region or list of Region
Contig name (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].
- site_maskstr, optional
Site filters mask to apply, e.g., “gamb_colu”.
- inline_arraybool, optional
Passed through to dask.array.from_array().
- chunksstr, optional
If ‘auto’ let dask decide chunk size. If ‘native’ use native zarr chunks. Also, can be a target size, e.g., ‘200 MiB’.
- Returns
- dsxarray.Dataset
A dataset of site annotations.
cnv_hmm()#
- Ag3.cnv_hmm(region, sample_sets=None, sample_query=None, max_coverage_variance=0.2, inline_array=True, chunks='native')#
Access CNV HMM data from CNV calling.
- Parameters
- region: str or list of str or Region or list of Region
Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- max_coverage_variancefloat, optional
Remove samples if coverage variance exceeds this value.
- inline_arraybool, optional
Passed through to dask.array.from_array().
- chunksstr, optional
If ‘auto’ let dask decide chunk size. If ‘native’ use native zarr chunks. Also, can be a target size, e.g., ‘200 MiB’.
- Returns
- dsxarray.Dataset
A dataset of CNV HMM calls and associated data.
cnv_coverage_calls()#
- Ag3.cnv_coverage_calls(region, sample_set, analysis, inline_array=True, chunks='native')#
Access CNV coverage calls data from genome-wide CNV discovery and filtering.
- Parameters
- region: str or list of str or Region or list of Region
Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].
- sample_setstr
Sample set identifier.
- analysis{‘gamb_colu’, ‘arab’, ‘crosses’}
Name of CNV analysis.
- inline_arraybool, optional
Passed through to dask.array.from_array().
- chunksstr, optional
If ‘auto’ let dask decide chunk size. If ‘native’ use native zarr chunks. Also, can be a target size, e.g., ‘200 MiB’.
- Returns
- dsxarray.Dataset
A dataset of CNV alleles and genotypes.
cnv_discordant_read_calls()#
- Ag3.cnv_discordant_read_calls(contig, sample_sets=None, inline_array=True, chunks='native')#
Access CNV discordant read calls data.
- Parameters
- contigstr or list of str
Chromosome arm, e.g., “3R”. Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“2R”, “3R”].
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- inline_arraybool, optional
Passed through to dask.array.from_array().
- chunksstr, optional
If ‘auto’ let dask decide chunk size. If ‘native’ use native zarr chunks. Also, can be a target size, e.g., ‘200 MiB’.
- Returns
- dsxarray.Dataset
A dataset of CNV alleles and genotypes.
gene_cnv()#
- Ag3.gene_cnv(region, sample_sets=None, sample_query=None, max_coverage_variance=0.2)#
Compute modal copy number by gene, from HMM data.
- Parameters
- region: str or list of str or Region or list of Region
Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- max_coverage_variancefloat, optional
Remove samples if coverage variance exceeds this value.
- Returns
- dsxarray.Dataset
A dataset of modal copy number per gene and associated data.
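As a rough illustration of "modal copy number by gene", the sketch below takes made-up per-window HMM copy number states within a single gene, drops samples whose coverage variance exceeds max_coverage_variance, and reports the most common state for each remaining sample. All names and values are hypothetical, and the tie-breaking rule is an assumption of the sketch rather than documented library behaviour.

```python
from collections import Counter


def modal_copy_number(states):
    """Return the most common copy number state (first seen wins ties)."""
    return Counter(states).most_common(1)[0][0]


# Hypothetical HMM copy number states for the windows spanning one gene.
hmm_states = {
    "sample_a": [2, 2, 3, 3, 3],
    "sample_b": [2, 2, 2, 1, 2],
    "sample_c": [4, 4, 4, 4, 2],
}
# Hypothetical per-sample coverage variance values.
coverage_variance = {"sample_a": 0.05, "sample_b": 0.10, "sample_c": 0.35}
max_coverage_variance = 0.2

# Remove samples whose coverage variance exceeds the threshold, then take
# the mode of the remaining samples' states.
modal = {
    sample: modal_copy_number(states)
    for sample, states in hmm_states.items()
    if coverage_variance[sample] <= max_coverage_variance
}
print(modal)  # {'sample_a': 3, 'sample_b': 2}
```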
haplotypes()#
- Ag3.haplotypes(region, analysis, sample_sets=None, sample_query=None, inline_array=True, chunks='native', cohort_size=None, random_seed=42)#
Access haplotype data.
- Parameters
- region: str or list of str or Region or list of Region
Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].
- analysis{“arab”, “gamb_colu”, “gamb_colu_arab”}
Which phasing analysis to use. If analysing only An. arabiensis, the “arab” analysis is best. If analysing only An. gambiae and An. coluzzii, the “gamb_colu” analysis is best. Otherwise, use the “gamb_colu_arab” analysis.
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- inline_arraybool, optional
Passed through to dask.array.from_array().
- chunksstr, optional
If ‘auto’ let dask decide chunk size. If ‘native’ use native zarr chunks. Also, can be a target size, e.g., ‘200 MiB’.
- cohort_sizeint, optional
If provided, randomly down-sample to the given cohort size.
- random_seedint, optional
Random seed used for down-sampling.
- Returns
- dsxarray.Dataset
A dataset of haplotypes and associated data.
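The guidance on choosing a phasing analysis can be written as a small decision function. This is a sketch only: choose_phasing_analysis is a hypothetical helper, and the taxon labels "gambiae" and "arabiensis" are assumed spellings ("coluzzii" is the one used in the query examples on this page).

```python
def choose_phasing_analysis(taxa):
    """Map the set of taxa being analysed to a phasing analysis name."""
    taxa = set(taxa)
    if not taxa:
        raise ValueError("at least one taxon is required")
    if taxa == {"arabiensis"}:
        return "arab"
    if taxa <= {"gambiae", "coluzzii"}:
        # Only An. gambiae and/or An. coluzzii.
        return "gamb_colu"
    # Any other mixture of taxa.
    return "gamb_colu_arab"


print(choose_phasing_analysis(["arabiensis"]))
print(choose_phasing_analysis(["gambiae", "coluzzii"]))
print(choose_phasing_analysis(["coluzzii", "arabiensis"]))
```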
snp_allele_frequencies()#
- Ag3.snp_allele_frequencies(transcript, cohorts, sample_query=None, min_cohort_size=10, site_mask=None, sample_sets=None, drop_invariant=True, effects=True)#
Compute per variant allele frequencies for a gene transcript.
- Parameters
- transcriptstr
Gene transcript ID (AgamP4.12), e.g., “AGAP004707-RD”.
- cohortsstr or dict
If a string, gives the name of a predefined cohort set, e.g., one of {“admin1_month”, “admin1_year”, “admin2_month”, “admin2_year”}. If a dict, should map cohort labels to sample queries, e.g., {“bf_2012_col”: “country == ‘Burkina Faso’ and year == 2012 and taxon == ‘coluzzii’”}.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- min_cohort_sizeint, optional
Minimum cohort size. Any cohorts below this size are omitted.
- site_mask{“gamb_colu_arab”, “gamb_colu”, “arab”}, optional
Site filters mask to apply.
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- drop_invariantbool, optional
If True, variants with no alternate allele calls in any cohorts are dropped from the result.
- effectsbool, optional
If True, add SNP effect columns.
- Returns
- dfpandas.DataFrame
A dataframe of SNP frequencies, one row per variant.
Notes
Cohorts with fewer samples than min_cohort_size will be excluded from output.
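The interaction between cohorts, min_cohort_size and drop_invariant can be sketched with plain Python. The data below are invented, the predicates stand in for the pandas query strings used in the dict form of cohorts, and cohort_frequencies is a hypothetical function rather than the library's implementation.

```python
# Hypothetical samples: metadata plus alternate allele counts (0, 1 or 2)
# at two variants. Not real data.
samples = [
    {"country": "Burkina Faso", "year": 2012, "alt": [1, 0]},
    {"country": "Burkina Faso", "year": 2012, "alt": [2, 0]},
    {"country": "Burkina Faso", "year": 2014, "alt": [0, 0]},
    {"country": "Angola", "year": 2009, "alt": [0, 0]},
]

# Cohort labels mapped to predicates; these stand in for query strings.
cohorts = {
    "bf_2012": lambda s: s["country"] == "Burkina Faso" and s["year"] == 2012,
    "ao_2009": lambda s: s["country"] == "Angola" and s["year"] == 2009,
}


def cohort_frequencies(samples, cohorts, min_cohort_size, drop_invariant=True):
    """Alternate allele frequency per cohort and variant."""
    n_variants = len(samples[0]["alt"])
    freqs = {}
    for label, predicate in cohorts.items():
        members = [s for s in samples if predicate(s)]
        if len(members) < min_cohort_size:
            continue  # cohort too small: excluded from the output
        n_alleles = 2 * len(members)  # diploid samples, no missing calls
        freqs[label] = [
            sum(s["alt"][i] for s in members) / n_alleles
            for i in range(n_variants)
        ]
    keep = list(range(n_variants))
    if drop_invariant:
        # Drop variants with no alternate allele in any retained cohort.
        keep = [i for i in keep if any(f[i] > 0 for f in freqs.values())]
    return {label: {i: f[i] for i in keep} for label, f in freqs.items()}


result = cohort_frequencies(samples, cohorts, min_cohort_size=2)
print(result)  # {'bf_2012': {0: 0.75}}
```

Here the single-sample cohort is dropped by min_cohort_size, and the second variant is dropped by drop_invariant because no retained cohort carries an alternate allele there.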
aa_allele_frequencies()#
- Ag3.aa_allele_frequencies(transcript, cohorts, sample_query=None, min_cohort_size=10, site_mask=None, sample_sets=None, drop_invariant=True)#
Compute per amino acid allele frequencies for a gene transcript.
- Parameters
- transcriptstr
Gene transcript ID (AgamP4.12), e.g., “AGAP004707-RA”.
- cohortsstr or dict
If a string, gives the name of a predefined cohort set, e.g., one of {“admin1_month”, “admin1_year”, “admin2_month”, “admin2_year”}. If a dict, should map cohort labels to sample queries, e.g., {“bf_2012_col”: “country == ‘Burkina Faso’ and year == 2012 and taxon == ‘coluzzii’”}.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- min_cohort_sizeint, optional
Minimum cohort size, below which allele frequencies are not calculated for cohorts.
- site_mask{“gamb_colu_arab”, “gamb_colu”, “arab”}, optional
Site filters mask to apply.
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- drop_invariantbool, optional
If True, variants with no alternate allele calls in any cohorts are dropped from the result.
- Returns
- dfpandas.DataFrame
A dataframe of amino acid allele frequencies, one row per replacement.
Notes
Cohorts with fewer samples than min_cohort_size will be excluded from output.
gene_cnv_frequencies()#
- Ag3.gene_cnv_frequencies(region, cohorts, sample_query=None, min_cohort_size=10, sample_sets=None, drop_invariant=True, max_coverage_variance=0.2)#
Compute modal copy number by gene, then compute the frequency of amplifications and deletions in one or more cohorts, from HMM data.
- Parameters
- region: str or list of str or Region or list of Region
Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].
- cohortsstr or dict
If a string, gives the name of a predefined cohort set, e.g., one of {“admin1_month”, “admin1_year”, “admin2_month”, “admin2_year”}. If a dict, should map cohort labels to sample queries, e.g., {“bf_2012_col”: “country == ‘Burkina Faso’ and year == 2012 and taxon == ‘coluzzii’”}.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- min_cohort_sizeint, optional
Minimum cohort size, below which cohorts are dropped.
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- drop_invariantbool, optional
If True, drop any rows where there is no evidence of variation.
- max_coverage_variancefloat, optional
Remove samples if coverage variance exceeds this value.
- Returns
- dfpandas.DataFrame
A dataframe of CNV amplification (amp) and deletion (del) frequencies in the specified cohorts, one row per gene and CNV type (amp/del).
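The step from modal copy number to amplification and deletion frequencies can be sketched as follows. This assumes an expected copy number of 2, which is an assumption of the sketch (it would not hold, for example, for X-linked genes in males); cnv_frequencies is a hypothetical function and the values are invented.

```python
def cnv_frequencies(modal_copy_numbers, expected=2):
    """Return (amp_frequency, del_frequency) across samples for one gene.

    `expected` is the assumed normal copy number; the value 2 is an
    assumption of this sketch.
    """
    n = len(modal_copy_numbers)
    if n == 0:
        raise ValueError("no samples in cohort")
    amp = sum(cn > expected for cn in modal_copy_numbers) / n
    dele = sum(cn < expected for cn in modal_copy_numbers) / n
    return amp, dele


# Hypothetical modal copy numbers for one gene in one cohort.
cohort_cn = [2, 3, 4, 2, 1, 2, 2, 3]
amp_freq, del_freq = cnv_frequencies(cohort_cn)
print(amp_freq, del_freq)  # 0.375 0.125
```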
plot_frequencies_heatmap()#
- Ag3.plot_frequencies_heatmap(df, index='label', max_len=100, x_label='Cohorts', y_label='Variants', colorbar=True, col_width=40, width=None, row_height=20, height=None, text_auto='.0%', aspect='auto', color_continuous_scale='Reds', title=True, **kwargs)#
Plot a heatmap from a pandas DataFrame of frequencies, e.g., output from Ag3.snp_allele_frequencies() or Ag3.gene_cnv_frequencies(). It’s recommended to filter the input DataFrame to just rows of interest, i.e., fewer rows than max_len.
- Parameters
- dfpandas DataFrame
A DataFrame of frequencies, e.g., output from snp_allele_frequencies() or gene_cnv_frequencies().
- indexstr or list of str
One or more column headers that are present in the input dataframe. This becomes the heatmap y-axis row labels. The column/s must produce a unique index.
- max_lenint, optional
Maximum number of rows expected in the input DataFrame. Displaying large styled dataframes may cause IPython notebooks to crash.
- x_labelstr, optional
This is the x-axis label that will be displayed on the heatmap.
- y_labelstr, optional
This is the y-axis label that will be displayed on the heatmap.
- colorbarbool, optional
If False, colorbar is not output.
- col_widthint, optional
Plot width per column in pixels (px).
- widthint, optional
Plot width in pixels (px), overrides col_width.
- row_heightint, optional
Plot height per row in pixels (px).
- heightint, optional
Plot height in pixels (px), overrides row_height.
- text_autostr, optional
Formatting for frequency values.
- aspectstr, optional
Control the aspect ratio of the heatmap.
- color_continuous_scalestr, optional
Color scale to use.
- titlebool or str, optional
If True, attempt to use metadata from input dataset as a plot title. Otherwise, use supplied value as a title.
- **kwargs
Other parameters are passed through to px.imshow().
- Returns
- figplotly.graph_objects.Figure
plot_frequencies_time_series()#
- Ag3.plot_frequencies_time_series(ds, height=None, width=None, title=True, **kwargs)#
Create a time series plot of variant frequencies using plotly.
- Parameters
- dsxarray.Dataset
A dataset of variant frequencies, such as returned by Ag3.snp_allele_frequencies_advanced(), Ag3.aa_allele_frequencies_advanced() or Ag3.gene_cnv_frequencies_advanced().
- heightint, optional
Height of plot in pixels (px).
- widthint, optional
Width of plot in pixels (px).
- titlebool or str, optional
If True, attempt to use metadata from input dataset as a plot title. Otherwise, use supplied value as a title.
- **kwargs
Passed through to px.line().
- Returns
- figplotly.graph_objects.Figure
A plotly figure containing line graphs. The resulting figure will have one panel per cohort, grouped into columns by taxon, and grouped into rows by area. Markers and lines show frequencies of variants.
plot_frequencies_interactive_map()#
- Ag3.plot_frequencies_interactive_map(ds, center=(-2, 20), zoom=3, title=True, epilogue=True)#
Create an interactive map with markers showing variant frequencies for cohorts grouped by area (space), period (time) and taxon.
- Parameters
- dsxarray.Dataset
A dataset of variant frequencies, such as returned by Ag3.snp_allele_frequencies_advanced(), Ag3.aa_allele_frequencies_advanced() or Ag3.gene_cnv_frequencies_advanced().
- centertuple of int, optional
Location to center the map.
- zoomint, optional
Initial zoom level.
- titlebool or str, optional
If True, attempt to use metadata from input dataset as a plot title. Otherwise, use supplied value as a title.
- epiloguebool or str, optional
Additional text to display below the map.
- Returns
- outipywidgets.Widget
An interactive map with widgets for selecting which variant, taxon and time period to display.
plot_frequencies_map_markers()#
- Ag3.plot_frequencies_map_markers(m, ds, variant, taxon, period, clear=True)#
Plot markers on a map showing variant frequencies for cohorts grouped by area (space), period (time) and taxon.
- Parameters
- mipyleaflet.Map
The map on which to add the markers.
- dsxarray.Dataset
A dataset of variant frequencies, such as returned by Ag3.snp_allele_frequencies_advanced(), Ag3.aa_allele_frequencies_advanced() or Ag3.gene_cnv_frequencies_advanced().
- variantint or str
Index or label of variant to plot.
- taxonstr
Taxon to show markers for.
- periodpd.Period
Time period to show markers for.
- clearbool, optional
If True, clear all layers (except the base layer) from the map before adding new markers.
snp_allele_frequencies_advanced()#
- Ag3.snp_allele_frequencies_advanced(transcript, area_by, period_by, sample_sets=None, sample_query=None, min_cohort_size=10, drop_invariant=True, variant_query=None, site_mask=None, nobs_mode='called', ci_method='wilson')#
Group samples by taxon, area (space) and period (time), then compute SNP allele counts and frequencies.
- Parameters
- transcriptstr
Gene transcript ID (AgamP4.12), e.g., “AGAP004707-RD”.
- area_bystr
Column name in the sample metadata to use to group samples spatially. E.g., use “admin1_iso” or “admin1_name” to group by level 1 administrative divisions, or use “admin2_name” to group by level 2 administrative divisions.
- period_by{“year”, “quarter”, “month”}
Length of time to group samples temporally.
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- min_cohort_sizeint, optional
Minimum cohort size. Any cohorts below this size are omitted.
- drop_invariantbool, optional
If True, variants with no alternate allele calls in any cohorts are dropped from the result.
- variant_querystr, optional
- site_maskstr, optional
Site filters mask to apply.
- nobs_mode{“called”, “fixed”}
Method for calculating the denominator when computing frequencies. If “called” then use the number of called alleles, i.e., number of samples with non-missing genotype calls multiplied by 2. If “fixed” then use the number of samples multiplied by 2.
- ci_method{“normal”, “agresti_coull”, “beta”, “wilson”, “binom_test”}, optional
Method to use for computing confidence intervals, passed through to statsmodels.stats.proportion.proportion_confint.
- Returns
- dsxarray.Dataset
The resulting dataset has dimensions “cohorts” and “variants”. Variables prefixed with “cohort” are 1-dimensional arrays with data about the cohorts, such as the area, period, taxon and cohort size. Variables prefixed with “variant” are 1-dimensional arrays with data about the variants, such as the contig, position, reference and alternate alleles. Variables prefixed with “event” are 2-dimensional arrays with the allele counts and frequency calculations.
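The effect of nobs_mode on the frequency denominator, and the shape of a Wilson confidence interval, can be illustrated without the library. In the sketch below the genotypes are invented, missing calls are marked with negative values (an assumption of the sketch), and wilson_interval is a hand-written version of the standard Wilson score formula rather than the statsmodels function the library passes through to.

```python
import math


def allele_nobs(genotypes, nobs_mode="called"):
    """Denominator used when computing allele frequencies.

    genotypes: list of (allele, allele) pairs; a missing call is marked
    here with negative values, which is an assumption of this sketch.
    """
    if nobs_mode == "called":
        n_called = sum(1 for gt in genotypes if all(a >= 0 for a in gt))
        return 2 * n_called
    if nobs_mode == "fixed":
        return 2 * len(genotypes)
    raise ValueError(f"unknown nobs_mode: {nobs_mode!r}")


def wilson_interval(count, nobs, z=1.959964):
    """Wilson score interval for a proportion (z defaults to about 95%)."""
    if nobs <= 0:
        raise ValueError("nobs must be positive")
    p = count / nobs
    denom = 1 + z**2 / nobs
    centre = (p + z**2 / (2 * nobs)) / denom
    half = z * math.sqrt(p * (1 - p) / nobs + z**2 / (4 * nobs**2)) / denom
    return centre - half, centre + half


# Hypothetical genotype calls for five samples at one variant.
genotypes = [(0, 1), (1, 1), (0, 0), (-1, -1), (0, 1)]
alt_count = sum(a == 1 for gt in genotypes for a in gt)
for mode in ("called", "fixed"):
    nobs = allele_nobs(genotypes, mode)
    print(mode, nobs, alt_count / nobs, wilson_interval(alt_count, nobs))
```

With one missing call among five samples, "called" gives a denominator of 8 and "fixed" gives 10, so the same allele count yields a lower frequency estimate under "fixed".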
aa_allele_frequencies_advanced()#
- Ag3.aa_allele_frequencies_advanced(transcript, area_by, period_by, sample_sets=None, sample_query=None, min_cohort_size=10, variant_query=None, site_mask=None, nobs_mode='called', ci_method='wilson')#
Group samples by taxon, area (space) and period (time), then compute amino acid change allele counts and frequencies.
- Parameters
- transcriptstr
Gene transcript ID (AgamP4.12), e.g., “AGAP004707-RD”.
- area_bystr
Column name in the sample metadata to use to group samples spatially. E.g., use “admin1_iso” or “admin1_name” to group by level 1 administrative divisions, or use “admin2_name” to group by level 2 administrative divisions.
- period_by{“year”, “quarter”, “month”}
Length of time to group samples temporally.
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- min_cohort_sizeint, optional
Minimum cohort size. Any cohorts below this size are omitted.
- variant_querystr, optional
- site_maskstr, optional
Site filters mask to apply.
- nobs_mode{“called”, “fixed”}
Method for calculating the denominator when computing frequencies. If “called” then use the number of called alleles, i.e., number of samples with non-missing genotype calls multiplied by 2. If “fixed” then use the number of samples multiplied by 2.
- ci_method{“normal”, “agresti_coull”, “beta”, “wilson”, “binom_test”}, optional
Method to use for computing confidence intervals, passed through to statsmodels.stats.proportion.proportion_confint.
- Returns
- dsxarray.Dataset
The resulting dataset has dimensions “cohorts” and “variants”. Variables prefixed with “cohort” are 1-dimensional arrays with data about the cohorts, such as the area, period, taxon and cohort size. Variables prefixed with “variant” are 1-dimensional arrays with data about the variants, such as the contig, position, reference and alternate alleles. Variables prefixed with “event” are 2-dimensional arrays with the allele counts and frequency calculations.
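The default ci_method is “wilson”. The library delegates this calculation to statsmodels; the sketch below only writes out the standard Wilson score interval in plain Python, to show what the interval looks like for a small cohort. The function name wilson_interval is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(count, nobs, alpha=0.05):
    """Wilson score interval for a binomial proportion count / nobs."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = count / nobs
    denom = 1 + z**2 / nobs
    centre = (p + z**2 / (2 * nobs)) / denom
    half_width = z * sqrt(p * (1 - p) / nobs + z**2 / (4 * nobs**2)) / denom
    return centre - half_width, centre + half_width


# 3 copies of an allele observed among 20 called alleles.
lower, upper = wilson_interval(count=3, nobs=20)
```

Unlike the “normal” approximation, the Wilson interval is not centred on the observed frequency when the cohort is small.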
gene_cnv_frequencies_advanced()#
- Ag3.gene_cnv_frequencies_advanced(region, area_by, period_by, sample_sets=None, sample_query=None, min_cohort_size=10, variant_query=None, drop_invariant=True, max_coverage_variance=0.2, ci_method='wilson')#
Group samples by taxon, area (space) and period (time), then compute gene CNV counts and frequencies.
- Parameters
- region: str or list of str or Region or list of Region
Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].
- area_bystr
Column name in the sample metadata to use to group samples spatially. E.g., use “admin1_iso” or “admin1_name” to group by level 1 administrative divisions, or use “admin2_name” to group by level 2 administrative divisions.
- period_by{“year”, “quarter”, “month”}
Length of time to group samples temporally.
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- min_cohort_sizeint, optional
Minimum cohort size. Any cohorts below this size are omitted.
- variant_querystr, optional
- drop_invariantbool, optional
If True, drop any rows where there is no evidence of variation.
- max_coverage_variancefloat, optional
Remove samples if coverage variance exceeds this value.
- ci_method{“normal”, “agresti_coull”, “beta”, “wilson”, “binom_test”}, optional
Method to use for computing confidence intervals, passed through to statsmodels.stats.proportion.proportion_confint.
- Returns
- dsxarray.Dataset
The resulting dataset has dimensions “cohorts” and “variants”. Variables prefixed with “cohort” are 1-dimensional arrays with data about the cohorts, such as the area, period, taxon and cohort size. Variables prefixed with “variant” are 1-dimensional arrays with data about the variants, such as the contig, position, reference and alternate alleles. Variables prefixed with “event” are 2-dimensional arrays with the allele counts and frequency calculations.
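The “advanced” frequency functions all group samples by taxon, area and period, then omit cohorts smaller than min_cohort_size. A minimal sketch of that grouping step follows. The sample records are made up, each record is assumed to already carry a field for the chosen period, and group_cohorts is a hypothetical helper rather than a library function.

```python
from collections import defaultdict


def group_cohorts(samples, area_by, period_by, min_cohort_size=10):
    """Group sample records by (taxon, area, period), dropping small cohorts."""
    cohorts = defaultdict(list)
    for sample in samples:
        key = (sample["taxon"], sample[area_by], sample[period_by])
        cohorts[key].append(sample["sample_id"])
    return {
        key: members
        for key, members in cohorts.items()
        if len(members) >= min_cohort_size
    }


# Made-up metadata: three coluzzii samples and one gambiae sample.
samples = [
    {"sample_id": f"A{i}", "taxon": "coluzzii", "admin1_iso": "BF-09", "year": 2012}
    for i in range(3)
] + [{"sample_id": "B0", "taxon": "gambiae", "admin1_iso": "BF-09", "year": 2012}]

cohorts = group_cohorts(
    samples, area_by="admin1_iso", period_by="year", min_cohort_size=2
)
```

Here the single gambiae sample forms a cohort of size 1, below the threshold of 2, so only the coluzzii cohort remains.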
plot_genes()#
- Ag3.plot_genes(region, width=800, height=120, show=True, toolbar_location='above', x_range=None, title='Genes')#
Plot a genes track, using bokeh.
- Parameters
- regionstr or Region
Contig name (e.g., “2L”), gene name (e.g., “AGAP007280”) or genomic region defined with coordinates (e.g., “2L:44989425-44998059”).
- widthint, optional
Plot width in pixels (px).
- heightint, optional
Plot height in pixels (px).
- showbool, optional
If true, show the plot.
- toolbar_locationstr, optional
Location of bokeh toolbar.
- x_rangebokeh.models.Range1d, optional
X axis range (for linking to other tracks).
- titlestr, optional
Plot title.
- Returns
- figFigure
Bokeh figure.
plot_transcript()#
- Ag3.plot_transcript(transcript, width=800, height=120, show=True, x_range=None, toolbar_location='above', title=True)#
Plot a transcript, using bokeh.
- Parameters
- transcriptstr
Transcript identifier, e.g., “AGAP004707-RD”.
- widthint, optional
Plot width in pixels (px).
- heightint, optional
Plot height in pixels (px).
- showbool, optional
If true, show the plot.
- toolbar_locationstr, optional
Location of bokeh toolbar.
- x_rangebokeh.models.Range1d, optional
X axis range (for linking to other tracks).
- titlestr, optional
Plot title.
- Returns
- figFigure
Bokeh figure.
plot_cnv_hmm_coverage()#
- Ag3.plot_cnv_hmm_coverage(sample, region, sample_set=None, y_max='auto', width=800, track_height=170, genes_height=100, circle_kwargs=None, line_kwargs=None, show=True)#
Plot CNV HMM data for a single sample, together with a genes track, using bokeh.
- Parameters
- samplestr or int
Sample identifier or index within sample set.
- regionstr
Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”) or genomic region defined with coordinates (e.g., “2L:44989425-44998059”).
- sample_setstr, optional
Sample set identifier.
- y_maxstr or int, optional
Maximum Y axis value.
- widthint, optional
Plot width in pixels (px).
- track_heightint, optional
Height of CNV HMM track in pixels (px).
- genes_heightint, optional
Height of genes track in pixels (px).
- circle_kwargsdict, optional
Passed through to bokeh circle() function.
- line_kwargsdict, optional
Passed through to bokeh line() function.
- showbool, optional
If true, show the plot.
- Returns
- figFigure
Bokeh figure.
plot_cnv_hmm_heatmap()#
- Ag3.plot_cnv_hmm_heatmap(region, sample_sets=None, sample_query=None, max_coverage_variance=0.2, width=800, row_height=7, track_height=None, genes_height=100, show=True)#
Plot CNV HMM data for multiple samples as a heatmap, with a genes track, using bokeh.
- Parameters
- regionstr
Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”) or genomic region defined with coordinates (e.g., “2L:44989425-44998059”).
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- max_coverage_variancefloat, optional
Remove samples if coverage variance exceeds this value.
- widthint, optional
Plot width in pixels (px).
- row_heightint, optional
Plot height per row (sample) in pixels (px).
- track_heightint, optional
Absolute plot height for HMM track in pixels (px), overrides row_height.
- genes_heightint, optional
Height of genes track in pixels (px).
- showbool, optional
If true, show the plot.
- Returns
- figFigure
Bokeh figure.
resolve_region()#
- Ag3.resolve_region(region)#
Convert a genome region into a standard data structure.
- Parameters
- region: str
Contig name (e.g., “2L”), gene name (e.g., “AGAP007280”) or genomic region defined with coordinates (e.g., “2L:44989425-44998059”).
- Returns
- outRegion
A named tuple with attributes contig, start and end.
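As an illustration of the returned structure, the sketch below parses a coordinate string into a named tuple with contig, start and end attributes. It is a simplified stand-in: parse_region is a hypothetical name, the Region type is redefined locally, and gene names (which resolve_region() also accepts) are not looked up. A bare string is simply treated as a contig name.

```python
import re
from collections import namedtuple

# Local stand-in for the Region named tuple described above.
Region = namedtuple("Region", ["contig", "start", "end"])

_COORDS = re.compile(r"^(\w+):(\d+)-(\d+)$")


def parse_region(region):
    """Parse 'contig:start-end' or a bare contig name into a Region."""
    match = _COORDS.match(region)
    if match:
        contig, start, end = match.groups()
        return Region(contig, int(start), int(end))
    # No coordinates given: treat the whole string as a contig name.
    return Region(region, None, None)


r = parse_region("2L:44989425-44998059")
whole_arm = parse_region("2L")
```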
igv()#
- Ag3.igv(region, tracks=None)#
Create an IGV browser and display it within the notebook.
- Parameters
- region: str or Region
Genomic region defined with coordinates, e.g., “2L:2422600-2422700”.
- trackslist of dict, optional
Configuration for any additional tracks.
- Returns
- browserigv_notebook.Browser
view_alignments()#
- Ag3.view_alignments(region, sample, visibility_window=20000)#
Launch IGV and view sequence read alignments and SNP genotypes from the given sample.
- Parameters
- region: str or Region
Genomic region defined with coordinates, e.g., “2L:2422600-2422700”.
- samplestr
Sample identifier, e.g., “AR0001-C”.
- visibility_windowint, optional
Zoom level in base pairs at which alignment and SNP data will become visible.
Notes
Only samples from the Ag3.0 release are currently supported.
wgs_data_catalog()#
- Ag3.wgs_data_catalog(sample_set)#
Load a data catalog providing URLs for downloading BAM, VCF and Zarr files for samples in a given sample set.
- Parameters
- sample_setstr
Sample set identifier.
- Returns
- dfpandas.DataFrame
One row per sample, columns provide URLs.
snp_allele_counts()#
- Ag3.snp_allele_counts(region, sample_sets=None, sample_query=None, site_mask=None, site_class=None, cohort_size=None, random_seed=42)#
Compute SNP allele counts. This returns the number of times each SNP allele was observed in the selected samples.
- Parameters
- regionstr or Region
Contig name (e.g., “2L”), gene name (e.g., “AGAP007280”) or genomic region defined with coordinates (e.g., “2L:44989425-44998059”).
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- site_mask{“gamb_colu_arab”, “gamb_colu”, “arab”}, optional
Site filters mask to apply.
- site_classstr, optional
Select sites belonging to one of the following classes: CDS_DEG_4 (4-fold degenerate coding sites), CDS_DEG_2_SIMPLE (2-fold simple degenerate coding sites), CDS_DEG_0 (non-degenerate coding sites), INTRON_SHORT (introns shorter than 100 bp), INTRON_LONG (introns longer than 200 bp), INTRON_SPLICE_5PRIME (intron within 2 bp of 5’ splice site), INTRON_SPLICE_3PRIME (intron within 2 bp of 3’ splice site), UTR_5PRIME (5’ untranslated region), UTR_3PRIME (3’ untranslated region), INTERGENIC (intergenic, more than 10 kbp from a gene).
- cohort_sizeint, optional
If provided, randomly down-sample to the given cohort size before computing allele counts.
- random_seedint, optional
Random seed used for down-sampling.
- Returns
- acnp.ndarray
A numpy array of shape (n_variants, 4), where the first column has the reference allele (0) counts, the second column has the first alternate allele (1) counts, the third column has the second alternate allele (2) counts, and the fourth column has the third alternate allele (3) counts.
Notes
This computation may take some time to run, depending on your computing environment. Results of this computation will be cached and re-used if the results_cache parameter was set when instantiating the Ag3 class.
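To make the shape of the returned array concrete, here is a small pure-Python sketch that builds an (n_variants, 4) table of allele counts from diploid genotype calls. The nested-list representation, the use of -1 for missing calls and the name count_alleles are assumptions for illustration; the library itself returns a numpy array.

```python
def count_alleles(genotypes, max_allele=3):
    """Count alleles 0..max_allele per variant.

    genotypes: list (variants) of lists (samples) of (a1, a2) tuples.
    Negative allele values are treated as missing and skipped.
    """
    counts = []
    for variant_calls in genotypes:
        row = [0] * (max_allele + 1)
        for call in variant_calls:
            for allele in call:
                if allele >= 0:
                    row[allele] += 1
        counts.append(row)
    return counts


genotypes = [
    [(0, 0), (0, 1), (1, 1)],    # variant 1: biallelic
    [(0, 2), (-1, -1), (3, 3)],  # variant 2: one missing call
]
ac = count_alleles(genotypes)
```

Each row has four columns: the reference allele count followed by the counts for the first, second and third alternate alleles.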
pca()#
- Ag3.pca(region, n_snps, thin_offset=0, sample_sets=None, sample_query=None, site_mask='default', min_minor_ac=2, max_missing_an=0, n_components=20)#
Run a principal components analysis (PCA) using biallelic SNPs from the selected genome region and samples.
- Parameters
- regionstr
Contig name (e.g., “2L”), gene name (e.g., “AGAP007280”) or genomic region defined with coordinates (e.g., “2L:44989425-44998059”).
- n_snpsint
The desired number of SNPs to use when running the analysis. SNPs will be evenly thinned to approximately this number.
- thin_offsetint, optional
Starting index for SNP thinning. Change this to repeat the analysis using a different set of SNPs.
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “country == ‘Burkina Faso’”.
- site_maskstr, optional
Site filters mask to apply, e.g., “gamb_colu”.
- min_minor_acint, optional
The minimum minor allele count. SNPs with a minor allele count below this value will be excluded prior to thinning.
- max_missing_anint, optional
The maximum number of missing allele calls to accept. SNPs with more than this value will be excluded prior to thinning. Set to 0 (default) to require no missing calls.
- n_componentsint, optional
Number of components to return.
- Returns
- df_pcapandas.DataFrame
A dataframe of sample metadata, with columns “PC1”, “PC2”, “PC3”, etc., added.
- evrnp.ndarray
An array of explained variance ratios, one per component.
Notes
This computation may take some time to run, depending on your computing environment. Results of this computation will be cached and re-used if the results_cache parameter was set when instantiating the Ag3 class.
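The n_snps and thin_offset parameters control how the SNPs that pass filtering are thinned. One plausible way to thin evenly is to take every k-th SNP starting from thin_offset, as sketched below. This is an assumption about the general approach, not a copy of the library's code, and thin_indices is a hypothetical name.

```python
def thin_indices(n_available, n_snps, thin_offset=0):
    """Indices of SNPs kept after even thinning.

    n_available: number of SNPs remaining after filtering.
    n_snps: desired (approximate) number of SNPs.
    thin_offset: index of the first SNP to keep.
    """
    step = max(n_available // n_snps, 1)
    return list(range(thin_offset, n_available, step))


idx = thin_indices(n_available=1000, n_snps=100)
idx_shifted = thin_indices(n_available=1000, n_snps=100, thin_offset=3)
idx_approx = thin_indices(n_available=1050, n_snps=100)
```

Changing thin_offset selects a different, non-overlapping set of SNPs, which is useful for checking that the PCA result is robust. The last call shows why the number of SNPs used is only approximately n_snps.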
plot_pca_variance()#
- Ag3.plot_pca_variance(evr, width=900, height=400, **kwargs)#
Plot explained variance ratios from a principal components analysis (PCA) using a plotly bar plot.
- Parameters
- evrnp.ndarray
An array of explained variance ratios, one per component.
- widthint, optional
Plot width in pixels (px).
- heightint, optional
Plot height in pixels (px).
- **kwargs
Passed through to px.bar().
- Returns
- figFigure
A plotly figure.
plot_pca_coords()#
- Ag3.plot_pca_coords(data, x='PC1', y='PC2', color=None, symbol=None, jitter_frac=0.02, random_seed=42, width=900, height=600, marker_size=10, **kwargs)#
Plot sample coordinates from a principal components analysis (PCA) as a plotly scatter plot.
- Parameters
- datapandas.DataFrame
A dataframe of sample metadata, with columns “PC1”, “PC2”, “PC3”, etc., added.
- xstr, optional
Name of principal component to plot on the X axis.
- ystr, optional
Name of principal component to plot on the Y axis.
- colorstr, optional
Name of column in the input dataframe to use to color the markers.
- symbolstr, optional
Name of column in the input dataframe to use to choose marker symbols.
- jitter_fracfloat, optional
Randomly jitter points by this fraction of their range.
- random_seedint, optional
Random seed for jitter.
- widthint, optional
Plot width in pixels (px).
- heightint, optional
Plot height in pixels (px).
- marker_sizeint, optional
Marker size.
- Returns
- figFigure
A plotly figure.
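The jitter_frac and random_seed parameters add a small reproducible random offset so that overlapping points become visible. A minimal sketch, assuming uniform noise scaled by the range of the values (the actual noise distribution used by the library is not stated here), with jitter as a hypothetical helper:

```python
import random


def jitter(values, frac, rng):
    """Add uniform noise of up to +/- frac * (max - min) to each value."""
    span = max(values) - min(values)
    return [v + rng.uniform(-frac * span, frac * span) for v in values]


pc1 = [-10.0, 0.0, 5.0, 10.0]
pc1_jittered = jitter(pc1, frac=0.02, rng=random.Random(42))

# Using the same seed again reproduces exactly the same offsets.
pc1_repeat = jitter(pc1, frac=0.02, rng=random.Random(42))
```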
plot_pca_coords_3d()#
- Ag3.plot_pca_coords_3d(data, x='PC1', y='PC2', z='PC3', color=None, symbol=None, jitter_frac=0.02, random_seed=42, width=900, height=600, marker_size=5, **kwargs)#
Plot sample coordinates from a principal components analysis (PCA) as a plotly 3D scatter plot.
- Parameters
- datapandas.DataFrame
A dataframe of sample metadata, with columns “PC1”, “PC2”, “PC3”, etc., added.
- xstr, optional
Name of principal component to plot on the X axis.
- ystr, optional
Name of principal component to plot on the Y axis.
- zstr, optional
Name of principal component to plot on the Z axis.
- colorstr, optional
Name of column in the input dataframe to use to color the markers.
- symbolstr, optional
Name of column in the input dataframe to use to choose marker symbols.
- jitter_fracfloat, optional
Randomly jitter points by this fraction of their range.
- random_seedint, optional
Random seed for jitter.
- widthint, optional
Plot width in pixels (px).
- heightint, optional
Plot height in pixels (px).
- marker_sizeint, optional
Marker size.
- Returns
- figFigure
A plotly figure.
plot_snps()#
- Ag3.plot_snps(region, sample_sets=None, sample_query=None, site_mask='default', width=800, track_height=80, genes_height=120, max_snps=200000, show=True)#
Plot SNPs in a given genome region. SNPs are shown as rectangles, with segregating and non-segregating SNPs positioned on different levels, and coloured by site filter.
- Parameters
- regionstr
Contig name (e.g., “2L”), gene name (e.g., “AGAP007280”) or genomic region defined with coordinates (e.g., “2L:44989425-44998059”).
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “country == ‘Burkina Faso’”.
- site_maskstr, optional
Site filters mask to apply, e.g., “gamb_colu”.
- widthint, optional
Width of plot in pixels (px).
- track_heightint, optional
Height of SNPs track in pixels (px).
- genes_heightint, optional
Height of genes track in pixels (px).
- max_snpsint, optional
Maximum number of SNPs to show.
- showbool, optional
If True, show the plot.
- Returns
- figFigure
Bokeh figure.
aim_variants()#
- Ag3.aim_variants(aims)#
Open ancestry informative marker variants.
- Parameters
- aims{‘gamb_vs_colu’, ‘gambcolu_vs_arab’}
Which ancestry informative markers to use.
- Returns
- dsxarray.Dataset
A dataset containing AIM positions and discriminating alleles.
aim_calls()#
- Ag3.aim_calls(aims, sample_sets=None, sample_query=None)#
Access ancestry informative marker SNP sites, alleles and genotype calls.
- Parameters
- aims{‘gamb_vs_colu’, ‘gambcolu_vs_arab’}
Which ancestry informative markers to use.
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- Returns
- dsxarray.Dataset
A dataset containing AIM SNP sites, alleles and genotype calls.
plot_aim_heatmap()#
- Ag3.plot_aim_heatmap(aims, sample_sets=None, sample_query=None, sort=True, row_height=4, colors='T10', xgap=0, ygap=0.5)#
Plot a heatmap of ancestry-informative marker (AIM) genotypes.
- Parameters
- aims{‘gamb_vs_colu’, ‘gambcolu_vs_arab’}
Which ancestry informative markers to use.
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- sortbool, optional
If true (default), sort the samples by the total fraction of AIM alleles for the second species in the comparison.
- row_heightint, optional
Height per sample in pixels (px).
- colorsstr, optional
Name of the color palette to use.
- xgapfloat, optional
Creates lines between columns (variants).
- ygapfloat, optional
Creates lines between rows (samples).
- Returns
- figplotly.graph_objects.Figure
cohort_diversity_stats()#
- Ag3.cohort_diversity_stats(cohort, cohort_size, region, site_mask, site_class, sample_sets=None, random_seed=42, n_jack=200, confidence_level=0.95)#
Compute genetic diversity summary statistics for a cohort of individuals.
- Parameters
- cohortstr or (str, str)
Either a string giving one of the predefined cohort labels, or a pair of strings giving a custom cohort label and a sample query to select samples in the cohort.
- cohort_sizeint
Number of individuals to use for computation of summary statistics. If the cohort is larger than this size, it will be randomly down-sampled.
- regionstr
Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”) or genomic region defined with coordinates (e.g., “2L:44989425-44998059”).
- site_mask{“gamb_colu_arab”, “gamb_colu”, “arab”}
Site filters mask to apply.
- site_classstr
Select sites belonging to one of the following classes: CDS_DEG_4 (4-fold degenerate coding sites), CDS_DEG_2_SIMPLE (2-fold simple degenerate coding sites), CDS_DEG_0 (non-degenerate coding sites), INTRON_SHORT (introns shorter than 100 bp), INTRON_LONG (introns longer than 200 bp), INTRON_SPLICE_5PRIME (intron within 2 bp of 5’ splice site), INTRON_SPLICE_3PRIME (intron within 2 bp of 3’ splice site), UTR_5PRIME (5’ untranslated region), UTR_3PRIME (3’ untranslated region), INTERGENIC (intergenic, more than 10 kbp from a gene).
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- random_seedint, optional
Seed for random number generator.
- n_jackint, optional
Number of blocks to divide the data into for the block jackknife estimation of confidence intervals. N.B., larger is not necessarily better.
- confidence_levelfloat, optional
Confidence level to use for confidence interval calculation. 0.95 means 95% confidence interval.
- Returns
- statspandas.Series
A series with summary statistics and their confidence intervals.
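The n_jack and confidence_level parameters refer to a block jackknife. As a rough sketch of the idea, the code below estimates a confidence interval for the mean of a sequence of per-site values by leaving out one contiguous block at a time. It assumes equal-sized blocks and a normal approximation for the interval. The function name block_jackknife_ci is hypothetical, and the library's implementation may differ in detail.

```python
from math import sqrt
from statistics import NormalDist, mean


def block_jackknife_ci(values, n_jack, confidence_level=0.95):
    """Mean of values with a delete-one-block jackknife confidence interval."""
    block_size = len(values) // n_jack
    if n_jack < 2 or block_size < 1:
        raise ValueError("need at least 2 blocks of at least 1 value each")
    used = values[: block_size * n_jack]  # drop any trailing remainder
    total = sum(used)
    estimate = total / len(used)
    # Re-estimate the mean with each block left out in turn.
    leave_one_out = []
    for j in range(n_jack):
        block_sum = sum(used[j * block_size : (j + 1) * block_size])
        leave_one_out.append((total - block_sum) / (len(used) - block_size))
    loo_mean = mean(leave_one_out)
    std_err = sqrt(
        (n_jack - 1) / n_jack * sum((x - loo_mean) ** 2 for x in leave_one_out)
    )
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    return estimate, estimate - z * std_err, estimate + z * std_err


estimate, ci_low, ci_upp = block_jackknife_ci(list(range(10)), n_jack=5)
```

More blocks means smaller blocks, which can underestimate the variance when nearby sites are correlated; this is one reason a larger n_jack is not necessarily better.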
diversity_stats()#
- Ag3.diversity_stats(cohorts, cohort_size, region, site_mask, site_class, sample_query=None, sample_sets=None, random_seed=42, n_jack=200, confidence_level=0.95)#
Compute genetic diversity summary statistics for multiple cohorts.
- Parameters
- cohortsstr or dict
Either a string giving one of the predefined cohort columns, or a dictionary mapping cohort labels to sample queries.
- cohort_sizeint
Number of individuals to use for computation of summary statistics. If the cohort is larger than this size, it will be randomly down-sampled.
- regionstr
Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”) or genomic region defined with coordinates (e.g., “2L:44989425-44998059”).
- site_mask{“gamb_colu_arab”, “gamb_colu”, “arab”}
Site filters mask to apply.
- site_classstr
Select sites belonging to one of the following classes: CDS_DEG_4 (4-fold degenerate coding sites), CDS_DEG_2_SIMPLE (2-fold simple degenerate coding sites), CDS_DEG_0 (non-degenerate coding sites), INTRON_SHORT (introns shorter than 100 bp), INTRON_LONG (introns longer than 200 bp), INTRON_SPLICE_5PRIME (intron within 2 bp of 5’ splice site), INTRON_SPLICE_3PRIME (intron within 2 bp of 3’ splice site), UTR_5PRIME (5’ untranslated region), UTR_3PRIME (3’ untranslated region), INTERGENIC (intergenic, more than 10 kbp from a gene).
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- random_seedint, optional
Seed for random number generator.
- n_jackint, optional
Number of blocks to divide the data into for the block jackknife estimation of confidence intervals. N.B., larger is not necessarily better.
- confidence_levelfloat, optional
Confidence level to use for confidence interval calculation. 0.95 means 95% confidence interval.
- Returns
- df_statspandas.DataFrame
A DataFrame where each row provides summary statistics and their confidence intervals for a single cohort.
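The cohort_size and random_seed parameters imply a reproducible random down-sampling step so that cohorts of different sizes can be compared fairly. A minimal sketch of such a step is shown below. The helper name downsample is hypothetical, and raising an error for cohorts smaller than cohort_size is an assumption made for this example.

```python
import random


def downsample(sample_ids, cohort_size, random_seed=42):
    """Randomly select cohort_size samples, preserving their original order."""
    if len(sample_ids) < cohort_size:
        # Assumed behaviour for this sketch: too-small cohorts are rejected.
        raise ValueError("cohort is smaller than the requested cohort_size")
    rng = random.Random(random_seed)
    chosen = rng.sample(range(len(sample_ids)), cohort_size)
    return [sample_ids[i] for i in sorted(chosen)]


sample_ids = [f"S{i:03d}" for i in range(100)]
subset = downsample(sample_ids, cohort_size=10)
subset_again = downsample(sample_ids, cohort_size=10)
```

Fixing the seed means repeated runs select the same individuals, so results are comparable between sessions.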
plot_diversity_stats()#
- Ag3.plot_diversity_stats(df_stats, color=None, bar_plot_height=450, bar_width=30, scatter_plot_height=500, scatter_plot_width=500, template='plotly_white', plot_kwargs=None)#
Plot diversity statistics.
- Parameters
- df_statspandas.DataFrame
Output from diversity_stats().
- colorstr, optional
Column to color by.
- bar_plot_heightint, optional
Height of bar plots in pixels (px).
- bar_widthint, optional
Width per bar in pixels (px).
- scatter_plot_heightint, optional
Height of scatter plot in pixels (px).
- scatter_plot_widthint, optional
Width of scatter plot in pixels (px).
- templatestr, optional
Plotly template.
- plot_kwargsdict, optional
Extra plotting parameters.
plot_heterozygosity()#
- Ag3.plot_heterozygosity(sample, region, site_mask, window_size, sample_set=None, y_max=0.03, width=800, track_height=170, genes_height=120, circle_kwargs=None, show=True)#
Plot windowed heterozygosity for a single sample over a genome region.
- Parameters
- samplestr or int
Sample identifier or index within sample set.
- regionstr
Contig name (e.g., “2L”), gene name (e.g., “AGAP007280”) or genomic region defined with coordinates (e.g., “2L:44989425-44998059”).
- site_maskstr
Site filters mask to apply, e.g., “gamb_colu”.
- window_sizeint
Number of sites per window.
- sample_setstr, optional
Sample set identifier. Not needed if sample parameter gives a sample identifier.
- y_maxfloat, optional
Y axis limit.
- widthint, optional
Plot width in pixels (px).
- track_heightint, optional
Heterozygosity track height in pixels (px).
- genes_heightint, optional
Genes track height in pixels (px).
- circle_kwargsdict, optional
Passed through to bokeh circle() function.
- showbool, optional
If true, show the plot.
- Returns
- figFigure
Bokeh figure.
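Because window_size counts sites rather than base pairs, windows cover different physical distances depending on site density. The sketch below shows one way to compute windowed heterozygosity under that convention, assuming non-overlapping windows and one boolean flag per site indicating whether the sample is heterozygous there. The function name and input layout are assumptions for illustration.

```python
def windowed_heterozygosity(positions, is_het, window_size):
    """Fraction of heterozygous sites in non-overlapping windows of sites.

    Returns a list of (window centre position, heterozygosity) pairs.
    Trailing sites that do not fill a whole window are ignored.
    """
    windows = []
    for start in range(0, len(is_het) - window_size + 1, window_size):
        stop = start + window_size
        centre = (positions[start] + positions[stop - 1]) / 2
        het = sum(is_het[start:stop]) / window_size
        windows.append((centre, het))
    return windows


positions = [10, 20, 30, 40, 50, 60, 70]
is_het = [True, False, False, False, True, True, False]
het_windows = windowed_heterozygosity(positions, is_het, window_size=3)
```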
roh_hmm()#
- Ag3.roh_hmm(sample, region, site_mask, window_size, sample_set=None, phet_roh=0.001, phet_nonroh=(0.003, 0.01), transition=0.001)#
Infer runs of homozygosity for a single sample over a genome region.
- Parameters
- samplestr or int
Sample identifier or index within sample set.
- regionstr
Contig name (e.g., “2L”), gene name (e.g., “AGAP007280”) or genomic region defined with coordinates (e.g., “2L:44989425-44998059”).
- site_maskstr
Site filters mask to apply, e.g., “gamb_colu”.
- window_sizeint
Number of sites per window.
- sample_setstr, optional
Sample set identifier. Not needed if sample parameter gives a sample identifier.
- phet_roh: float, optional
Probability of observing a heterozygote in a ROH.
- phet_nonroh: tuple of floats, optional
One or more probabilities of observing a heterozygote outside a ROH.
- transition: float, optional
Probability of moving between states. A larger window size may call for a larger transition probability.
- Returns
- df_rohpandas.DataFrame
A DataFrame where each row provides data about a single run of homozygosity.
plot_roh()#
- Ag3.plot_roh(sample, region, site_mask, window_size, sample_set=None, phet_roh=0.001, phet_nonroh=(0.003, 0.01), transition=0.001, y_max=0.03, width=800, heterozygosity_height=170, roh_height=50, genes_height=120, circle_kwargs=None, show=True)#
Plot windowed heterozygosity and inferred runs of homozygosity for a single sample over a genome region.
- Parameters
- samplestr or int
Sample identifier or index within sample set.
- regionstr
Contig name (e.g., “2L”), gene name (e.g., “AGAP007280”) or genomic region defined with coordinates (e.g., “2L:44989425-44998059”).
- site_maskstr
Site filters mask to apply, e.g., “gamb_colu”.
- window_sizeint
Number of sites per window.
- sample_setstr, optional
Sample set identifier. Not needed if sample parameter gives a sample identifier.
- phet_roh: float, optional
Probability of observing a heterozygote in a ROH.
- phet_nonroh: tuple of floats, optional
One or more probabilities of observing a heterozygote outside a ROH.
- transition: float, optional
Probability of moving between states. A larger window size may call for a larger transition probability.
- y_maxfloat, optional
Y axis limit.
- widthint, optional
Plot width in pixels (px).
- heterozygosity_heightint, optional
Heterozygosity track height in pixels (px).
- roh_heightint, optional
ROH track height in pixels (px).
- genes_heightint, optional
Genes track height in pixels (px).
- circle_kwargsdict, optional
Passed through to bokeh circle() function.
- showbool, optional
If true, show the plot.
- Returns
- figFigure
Bokeh figure.
h12_calibration()#
- Ag3.h12_calibration(contig, analysis, sample_query=None, sample_sets=None, cohort_size=30, window_sizes=(100, 200, 500, 1000, 2000, 5000, 10000, 20000), random_seed=42)#
Generate h12 GWSS calibration data for different window sizes.
- Parameters
- contig: str
Chromosome arm (e.g., “2L”).
- analysis{“arab”, “gamb_colu”, “gamb_colu_arab”}
Which phasing analysis to use. If analysing only An. arabiensis, the “arab” analysis is best. If analysing only An. gambiae and An. coluzzii, the “gamb_colu” analysis is best. Otherwise, use the “gamb_colu_arab” analysis.
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- cohort_sizeint, optional
If provided, randomly down-sample to the given cohort size.
- window_sizesint or list of int, optional
Window sizes over which to calculate h12. Multiple window sizes should be used to calibrate the optimal size for h12 analysis.
- random_seedint, optional
Random seed used for down-sampling.
- Returns
- calibration runslist of numpy.ndarrays
A list of h12 calibration run arrays for each window size, containing values and percentiles.
plot_h12_calibration()#
- Ag3.plot_h12_calibration(contig, analysis, sample_query=None, sample_sets=None, cohort_size=30, window_sizes=(100, 200, 500, 1000, 2000, 5000, 10000, 20000), random_seed=42, title=None)#
Plot h12 GWSS calibration data for different window sizes.
- Parameters
- contig: str
Chromosome arm (e.g., “2L”).
- analysis{“arab”, “gamb_colu”, “gamb_colu_arab”}
Which phasing analysis to use. If analysing only An. arabiensis, the “arab” analysis is best. If analysing only An. gambiae and An. coluzzii, the “gamb_colu” analysis is best. Otherwise, use the “gamb_colu_arab” analysis.
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- cohort_sizeint, optional
If provided, randomly down-sample to the given cohort size.
- window_sizesint or list of int, optional
Window sizes over which to calculate h12. Multiple window sizes should be used to calibrate the optimal size for h12 analysis.
- random_seedint, optional
Random seed used for down-sampling.
- titlestr, optional
If provided, title string is used to label plot.
- Returns
- figfigure
A plot showing h12 calibration run percentiles for different window sizes.
h12_gwss()#
- Ag3.h12_gwss(contig, analysis, window_size, sample_sets=None, sample_query=None, cohort_size=30, random_seed=42)#
Run h12 GWSS.
- Parameters
- contig: str
Chromosome arm (e.g., “2L”).
- analysis{“arab”, “gamb_colu”, “gamb_colu_arab”}
Which phasing analysis to use. If analysing only An. arabiensis, the “arab” analysis is best. If analysing only An. gambiae and An. coluzzii, the “gamb_colu” analysis is best. Otherwise, use the “gamb_colu_arab” analysis.
- window_sizeint
Window size over which to calculate h12.
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- cohort_sizeint, optional
If provided, randomly down-sample to the given cohort size.
- random_seedint, optional
Random seed used for down-sampling.
- Returns
- xnumpy.ndarray
An array containing the window centre point genomic positions.
- h12numpy.ndarray
An array with h12 statistic values for each window.
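Because h12_gwss() returns two parallel arrays, a common follow-up is to find the window with the strongest signal. The helper below is a hypothetical sketch, not part of the package; it only assumes that `ag3` exposes h12_gwss() with the signature documented above.

```python
def strongest_h12_window(ag3, contig, analysis, window_size, **kwargs):
    """Return (position, h12) for the window with the highest h12 value.

    `ag3` is any object exposing h12_gwss() as documented above.
    """
    x, h12 = ag3.h12_gwss(
        contig=contig, analysis=analysis, window_size=window_size, **kwargs
    )
    # Index of the largest h12 value; works for numpy arrays and plain lists.
    i = max(range(len(h12)), key=lambda j: h12[j])
    return x[i], h12[i]


# Example call (needs data access; the window size here is illustrative):
# import malariagen_data
# ag3 = malariagen_data.Ag3()
# pos, peak = strongest_h12_window(ag3, "2L", "gamb_colu", window_size=1000)
```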
plot_h12_gwss()#
- Ag3.plot_h12_gwss(contig, analysis, window_size, sample_sets=None, sample_query=None, cohort_size=30, random_seed=42, title=None, width=800, track_height=170, genes_height=100)#
Plot h12 GWSS data.
- Parameters
- contig: str
Chromosome arm (e.g., “2L”)
- analysis{“arab”, “gamb_colu”, “gamb_colu_arab”}
Which phasing analysis to use. If analysing only An. arabiensis, the “arab” analysis is best. If analysing only An. gambiae and An. coluzzii, the “gamb_colu” analysis is best. Otherwise, use the “gamb_colu_arab” analysis.
- window_sizeint
Size of the windows over which h12 is calculated.
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- cohort_sizeint, optional
If provided, randomly down-sample to the given cohort size.
- random_seedint, optional
Random seed used for down-sampling.
- titlestr, optional
If provided, title string is used to label plot.
- widthint, optional
Plot width in pixels (px).
- track_heightint, optional
GWSS track height in pixels (px).
- genes_heightint, optional
Gene track height in pixels (px).
- Returns
- figfigure
A plot showing windowed h12 statistic with gene track on x-axis.
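The cohort_size and random_seed parameters work together: a fixed seed makes the random down-sampling reproducible. The sketch below illustrates the idea with the standard library only. The function `downsample` is hypothetical, and the package may select samples differently, so the same seed will not necessarily pick the same samples.

```python
import random


def downsample(sample_ids, cohort_size=None, random_seed=42):
    """Randomly choose cohort_size samples, reproducibly for a given seed."""
    if cohort_size is None:
        # No down-sampling requested: keep every sample.
        return list(sample_ids)
    rng = random.Random(random_seed)
    # rng.sample raises ValueError if cohort_size exceeds the number of samples.
    chosen = sorted(rng.sample(range(len(sample_ids)), cohort_size))
    # Sorting the chosen indices preserves the original sample order.
    return [sample_ids[i] for i in chosen]


samples = [f"sample_{i:03d}" for i in range(100)]  # hypothetical identifiers
cohort_a = downsample(samples, cohort_size=30)
cohort_b = downsample(samples, cohort_size=30)
print(cohort_a == cohort_b)
```

Running the same call twice with the same seed selects the same 30 samples, which is what makes a plot reproducible between sessions.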
h1x_gwss()#
- Ag3.h1x_gwss(contig, analysis, window_size, cohort1_query, cohort2_query, sample_sets=None, cohort_size=30, random_seed=42)#
Run an H1X genome-wide scan to detect genome regions with shared selective sweeps between two cohorts.
- Parameters
- contig: str
Chromosome arm (e.g., “2L”)
- analysis{“arab”, “gamb_colu”, “gamb_colu_arab”}
Which phasing analysis to use. If analysing only An. arabiensis, the “arab” analysis is best. If analysing only An. gambiae and An. coluzzii, the “gamb_colu” analysis is best. Otherwise, use the “gamb_colu_arab” analysis.
- window_sizeint
Size of the windows over which H1X is calculated.
- cohort1_querystr
A pandas query string selecting the first cohort, evaluated against the sample metadata, e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- cohort2_querystr
A pandas query string selecting the second cohort, evaluated against the sample metadata, e.g., “taxon == ‘gambiae’ and country == ‘Burkina Faso’”.
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- cohort_sizeint, optional
If provided, randomly down-sample to the given cohort size.
- random_seedint, optional
Random seed used for down-sampling.
- Returns
- xnumpy.ndarray
An array containing the window centre point genomic positions.
- h1xnumpy.ndarray
An array with H1X statistic values for each window.
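This page does not define H1X. The sketch below assumes the common reading: the probability that two haplotypes, one drawn at random from each cohort, are identical within a window. The function names and the toy data are illustrative, and the package's implementation may differ in detail.

```python
from collections import Counter


def haplotype_frequencies(haplotypes):
    """Map each distinct haplotype to its frequency within one cohort."""
    n = len(haplotypes)
    return {h: c / n for h, c in Counter(haplotypes).items()}


def h1x(cohort1, cohort2):
    """Assumed H1X: sum over shared haplotypes of the product of frequencies."""
    f1 = haplotype_frequencies(cohort1)
    f2 = haplotype_frequencies(cohort2)
    # Haplotypes seen in only one cohort contribute nothing.
    return sum(f1[h] * f2[h] for h in f1.keys() & f2.keys())


# Toy window with 3 SNPs; the cohorts share one haplotype (hypothetical data).
cohort1 = [(0, 1, 1)] * 6 + [(1, 0, 0)] * 4
cohort2 = [(0, 1, 1)] * 5 + [(1, 1, 1)] * 5
shared = h1x(cohort1, cohort2)
print(round(shared, 2))
```

Here the shared haplotype has frequency 0.6 in the first cohort and 0.5 in the second, giving 0.3. Cohorts with no haplotypes in common give 0.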
plot_h1x_gwss()#
- Ag3.plot_h1x_gwss(contig, analysis, window_size, cohort1_query, cohort2_query, sample_sets=None, cohort_size=30, random_seed=42, title=None, width=800, track_height=190, genes_height=100)#
Run and plot an H1X genome-wide scan to detect genome regions with shared selective sweeps between two cohorts.
- Parameters
- contig: str
Chromosome arm (e.g., “2L”)
- analysis{“arab”, “gamb_colu”, “gamb_colu_arab”}
Which phasing analysis to use. If analysing only An. arabiensis, the “arab” analysis is best. If analysing only An. gambiae and An. coluzzii, the “gamb_colu” analysis is best. Otherwise, use the “gamb_colu_arab” analysis.
- window_sizeint
Size of the windows over which H1X is calculated.
- cohort1_querystr
A pandas query string selecting the first cohort, evaluated against the sample metadata, e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- cohort2_querystr
A pandas query string selecting the second cohort, evaluated against the sample metadata, e.g., “taxon == ‘gambiae’ and country == ‘Burkina Faso’”.
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- cohort_sizeint, optional
If provided, randomly down-sample to the given cohort size.
- random_seedint, optional
Random seed used for down-sampling.
- titlestr, optional
If provided, title string is used to label plot.
- widthint, optional
Plot width in pixels (px).
- track_heightint, optional
GWSS track height in pixels (px).
- genes_heightint, optional
Gene track height in pixels (px).
- Returns
- figfigure
A plot showing windowed H1X statistic with gene track on x-axis.
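The two cohort queries are ordinary pandas query strings, so they can be built programmatically. The helpers below are a hypothetical convenience sketch, not part of the package; they only assume that `ag3` exposes plot_h1x_gwss() with the signature documented above, and that the metadata columns `taxon` and `country` exist as in the examples.

```python
def cohort_query(taxon, country):
    """Build a pandas query string selecting samples by taxon and country."""
    return f"taxon == '{taxon}' and country == '{country}'"


def plot_shared_sweeps(ag3, contig, analysis, window_size, cohort1, cohort2, **kwargs):
    """Plot H1X between two cohorts, each given as a (taxon, country) pair."""
    return ag3.plot_h1x_gwss(
        contig=contig,
        analysis=analysis,
        window_size=window_size,
        cohort1_query=cohort_query(*cohort1),
        cohort2_query=cohort_query(*cohort2),
        **kwargs,
    )


# Example call (needs data access; the window size here is illustrative):
# fig = plot_shared_sweeps(
#     ag3, "2L", "gamb_colu", 1000,
#     cohort1=("coluzzii", "Burkina Faso"),
#     cohort2=("gambiae", "Burkina Faso"),
# )
```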
plot_haplotype_clustering()#
- Ag3.plot_haplotype_clustering(region, analysis, sample_sets=None, sample_query=None, color=None, symbol=None, linkage_method='single', count_sort=True, distance_sort=False, cohort_size=None, random_seed=42, width=1000, height=500, **kwargs)#
Hierarchically cluster haplotypes in region and produce an interactive plot.
- Parameters
- region: str or list of str or Region or list of Region
Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].
- analysis{“arab”, “gamb_colu”, “gamb_colu_arab”}
Which phasing analysis to use. If analysing only An. arabiensis, the “arab” analysis is best. If analysing only An. gambiae and An. coluzzii, the “gamb_colu” analysis is best. Otherwise, use the “gamb_colu_arab” analysis.
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- colorstr, optional
Identifies a column in the sample metadata which determines the colour of dendrogram leaves (haplotypes).
- symbolstr, optional
Identifies a column in the sample metadata which determines the shape of dendrogram leaves (haplotypes).
- linkage_method: str, optional
The linkage algorithm to use, valid options are ‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’ and ‘ward’. See the Linkage Methods section of the scipy.cluster.hierarchy.linkage docs for full descriptions.
- count_sort: bool, optional
For each node n, this parameter determines the order (visually, from left to right) in which n’s two descendant links are plotted. If True, the child with the minimum number of original objects in its cluster is plotted first. Note that distance_sort and count_sort cannot both be True.
- distance_sort: bool, optional
For each node n, this parameter determines the order (visually, from left to right) in which n’s two descendant links are plotted. If True, the child with the minimum distance between its direct descendants is plotted first.
- cohort_sizeint, optional
If provided, randomly down-sample to the given cohort size.
- random_seedint, optional
Random seed used for down-sampling.
- widthint, optional
The figure width in pixels (px).
- height: int, optional
The figure height in pixels (px).
- Returns
- figFigure
Plotly figure.
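The linkage_method options differ in how they measure the distance between two clusters. The sketch below illustrates three of them on toy haplotypes. It assumes the distance between two haplotypes is the number of sites at which they differ (Hamming distance), which this page does not state; the function names are illustrative.

```python
from itertools import product


def hamming(h1, h2):
    """Number of sites at which two haplotypes differ."""
    return sum(a != b for a, b in zip(h1, h2))


def cluster_distance(cluster_a, cluster_b, method="single"):
    """Distance between two clusters of haplotypes under a linkage method."""
    distances = [hamming(a, b) for a, b in product(cluster_a, cluster_b)]
    if method == "single":
        return min(distances)  # closest pair of members
    if method == "complete":
        return max(distances)  # most distant pair of members
    if method == "average":
        return sum(distances) / len(distances)  # mean over all pairs
    raise ValueError(f"method not covered by this sketch: {method!r}")


# Two toy clusters over 4 SNPs (hypothetical data).
cluster_a = [(0, 0, 0, 0), (0, 0, 0, 1)]
cluster_b = [(0, 0, 1, 1), (1, 1, 1, 1)]
for method in ("single", "complete", "average"):
    print(method, cluster_distance(cluster_a, cluster_b, method))
```

With the default ‘single’ method, two clusters merge as soon as any pair of their haplotypes is close, so near-identical haplotypes tend to group together at low distances.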
plot_haplotype_network()#
- Ag3.plot_haplotype_network(region, analysis, sample_sets=None, sample_query=None, max_dist=2, color=None, color_discrete_sequence=None, color_discrete_map=None, category_orders=None, node_size_factor=50, server_mode='inline', height=650, width='100%', layout='cose', layout_params=None, server_port=None)#
Construct a median-joining haplotype network and display it using Cytoscape.
A haplotype network provides a visualisation of the genetic distance between haplotypes. Each node in the network represents a unique haplotype. The size (area) of the node is scaled by the number of times that unique haplotype was observed within the selected samples. A connection between two nodes represents a single SNP difference between the corresponding haplotypes.
- Parameters
- region: str or list of str or Region or list of Region
Chromosome arm (e.g., “2L”), gene name (e.g., “AGAP007280”), genomic region defined with coordinates (e.g., “2L:44989425-44998059”) or a named tuple with genomic location Region(contig, start, end). Multiple values can be provided as a list, in which case data will be concatenated, e.g., [“3R”, “3L”].
- analysis{“arab”, “gamb_colu”, “gamb_colu_arab”}
Which phasing analysis to use. If analysing only An. arabiensis, the “arab” analysis is best. If analysing only An. gambiae and An. coluzzii, the “gamb_colu” analysis is best. Otherwise, use the “gamb_colu_arab” analysis.
- sample_setsstr or list of str, optional
Can be a sample set identifier (e.g., “AG1000G-AO”) or a list of sample set identifiers (e.g., [“AG1000G-BF-A”, “AG1000G-BF-B”]) or a release identifier (e.g., “3.0”) or a list of release identifiers.
- sample_querystr, optional
A pandas query string which will be evaluated against the sample metadata e.g., “taxon == ‘coluzzii’ and country == ‘Burkina Faso’”.
- max_distint, optional
Maximum distance, in SNP differences, up to which network components are joined. Defaults to 2.
- colorstr, optional
Identifies a column in the sample metadata which determines the colour of pie chart segments within nodes.
- color_discrete_sequencelist, optional
Provide a list of colours to use.
- color_discrete_mapdict, optional
Provide an explicit mapping from values to colours.
- category_orderslist, optional
Control the order in which values appear in the legend.
- node_size_factorint, optional
Control the sizing of nodes.
- server_mode{“inline”, “external”, “jupyterlab”}
Controls how the Jupyter Dash app will be launched. See https://medium.com/plotly/introducing-jupyterdash-811f1f57c02e for more information.
- heightint, optional
Height of the plot.
- widthint or str, optional
Width of the plot, either in pixels or as a string such as “100%” (the default).
- layoutstr
Name of the network layout to use to position nodes.
- layout_params
Additional parameters to the layout algorithm.
- server_port
Manually override the port on which the Dash app will run.
- Returns
- app
The running Dash app.
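The description above says that each node is a unique haplotype, sized by how often it was observed, and that a connection represents a single SNP difference. The sketch below builds just those two ingredients from toy data. It is a deliberate simplification and not the median-joining algorithm: it adds no inferred intermediate nodes and ignores max_dist. The function name is illustrative.

```python
from collections import Counter
from itertools import combinations


def simple_haplotype_graph(haplotypes):
    """Return (node counts, edges) linking haplotypes one SNP apart."""
    # One node per unique haplotype, with the number of times it was observed.
    counts = Counter(haplotypes)
    edges = []
    for h1, h2 in combinations(sorted(counts), 2):
        n_diff = sum(a != b for a, b in zip(h1, h2))
        if n_diff == 1:
            edges.append((h1, h2))
    return counts, edges


# Toy data: 7 haplotypes over 3 SNPs (hypothetical).
haps = [(0, 0, 0)] * 4 + [(0, 0, 1)] * 2 + [(1, 1, 1)]
nodes, edges = simple_haplotype_graph(haps)
print(dict(nodes))
print(edges)
```

In this toy example the haplotype (1, 1, 1) is two SNP differences from its nearest neighbour, so it stays unconnected.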