Ag3 data downloads

This notebook provides information about how to download data from the Anopheles gambiae 1000 Genomes project (Ag1000G) phase 3. This includes sample metadata, raw sequence reads, sequence read alignments, and single nucleotide polymorphism (SNP) calls.

Code examples that are intended to be run via a Linux command line are prefixed with an exclamation mark (!). If you are running these commands directly from a terminal, remove the exclamation mark.

Examples in this notebook assume you are downloading data to a local folder within your home directory at the path ~/vo_agam_release/. Change this if you want to download to a different folder on the local file system.

Data hosting

Ag3 data are hosted by several different services.

Raw sequence reads in FASTQ format and sequence read alignments in BAM format are hosted by the European Nucleotide Archive (ENA). This guide provides examples of downloading data from ENA via FTP using the wget command line tool, but please note that there are several other options for downloading data, see the ENA documentation on how to download data files for more information.

Sample metadata in CSV format and SNP calls in VCF and Zarr formats are hosted on Google Cloud Storage (GCS) in the vo_agam_release bucket, which is a multi-region bucket located in the United States. All data hosted on GCS are publicly accessible and do not require any authentication to access. This guide provides examples of downloading data from GCS to a local computer using the wget and gsutil command line tools. For more information about gsutil, see the gsutil tool documentation.

Sample sets

Data in this release are organised into 28 sample sets. Each of these sample sets corresponds to a set of mosquito specimens contributed by a collaborating study. Depending on your objectives, you may want to download data from only specific sample sets, or all sample sets. For convenience there is a tab-delimited manifest file listing all sample sets in the release. Here is a direct download link for the sample set manifest:

The sample set manifest can also be downloaded via gsutil to a directory on the local file system, e.g.:

!mkdir -pv ~/vo_agam_release/v3/
!gsutil cp gs://vo_agam_release/v3/manifest.tsv ~/vo_agam_release/v3/

Here are the file contents:

!cat ~/vo_agam_release/v3/manifest.tsv
sample_set	sample_count
AG1000G-AO	81
AG1000G-BF-A	181
AG1000G-BF-B	102
AG1000G-BF-C	13
AG1000G-CD	76
AG1000G-CF	73
AG1000G-CI	80
AG1000G-CM-A	303
AG1000G-CM-B	97
AG1000G-CM-C	44
AG1000G-FR	23
AG1000G-GA-A	69
AG1000G-GH	100
AG1000G-GM-A	74
AG1000G-GM-B	31
AG1000G-GM-C	174
AG1000G-GN-A	45
AG1000G-GN-B	185
AG1000G-GQ	10
AG1000G-GW	101
AG1000G-KE	86
AG1000G-ML-A	60
AG1000G-ML-B	71
AG1000G-MW	41
AG1000G-MZ	74
AG1000G-TZ	300
AG1000G-UG	290
AG1000G-X	297

For more information about these sample sets, see the section on sample sets in the introduction to Ag1000G phase 3.

Sample metadata

Data about the samples that were sequenced to generate this data resource are available, including the time and place of collection, the gender of the specimen, and our call regarding the species of the specimen.

Specimen collection metadata

Specimen collection metadata can be downloaded from GCS. E.g., here is the download link for the sample metadata for sample set AG1000G-BF-A:

Sample metadata for all sample sets can also be downloaded using gsutil:

!mkdir -pv ~/vo_agam_release/v3/metadata/
!gsutil -m rsync -r gs://vo_agam_release/v3/metadata/ ~/vo_agam_release/v3/metadata/

Here are the first few rows of the sample metadata for sample set AG1000G-BF-A:

!head ~/vo_agam_release/v3/metadata/general/AG1000G-BF-A/samples.meta.csv
sample_id,partner_sample_id,contributor,country,location,year,month,latitude,longitude,sex_call
AB0085-Cx,BF2-4,Austin Burt,Burkina Faso,Pala,2012,7,11.150,-4.235,F
AB0086-Cx,BF2-6,Austin Burt,Burkina Faso,Pala,2012,7,11.150,-4.235,F
AB0087-C,BF3-3,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F
AB0088-C,BF3-5,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F
AB0089-Cx,BF3-8,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F
AB0090-C,BF3-10,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F
AB0091-C,BF3-12,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F
AB0092-C,BF3-13,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F
AB0094-Cx,BF3-17,Austin Burt,Burkina Faso,Bana,2012,7,11.233,-4.472,F

The sample_id columns gives the sample identifier used throughout all Ag1000G analyses.

The country, location, latitude and longitude columns give the location where the specimen was collected.

The year and month columns give the approximate date when the specimen was collected.

The sex_call column gives the gender as determined from the sequence data.

Species calls

We have made a call for each specimen as to which species it belongs to (Anopheles gambiae, Anopheles coluzzii, Anopheles arabiensis) based on the genotypes of the samples. These calls were made from the sequence data, and there are cases where the species is not easy to determine. We report species calls using two methods, principal components analysis (PCA) and ancestry informative markers (AIMs).

Species calls can be downloaded from GCS, e.g., for sample set AG1000G-BF-A:

Alternatively if you ran the gsutil rsync command above to download sample metadata then this file will already be present on your local file system.

Here are the first few rows of the AIM species calls for sample set AG1000G-BF-A:

!head ~/vo_agam_release/v3/metadata/species_calls_20200422/AG1000G-BF-A/samples.species_aim.csv
sample_id,aim_fraction_colu,aim_fraction_arab,species_gambcolu_arabiensis,species_gambiae_coluzzii
AB0085-Cx,0.024,0.002,gamb_colu,gambiae
AB0086-Cx,0.038,0.002,gamb_colu,gambiae
AB0087-C,0.982,0.002,gamb_colu,coluzzii
AB0088-C,0.990,0.002,gamb_colu,coluzzii
AB0089-Cx,0.975,0.002,gamb_colu,coluzzii
AB0090-C,0.977,0.002,gamb_colu,coluzzii
AB0091-C,0.974,0.002,gamb_colu,coluzzii
AB0092-C,0.978,0.002,gamb_colu,coluzzii
AB0094-Cx,0.986,0.002,gamb_colu,coluzzii

The species_gambcolu_arabiensis column provides a call as to whether the specimen is arabiensis or not (gamb_colu).

The species_gambiae_coluzzii column applies to samples that are not arabiensis, and differentiates gambiae versus coluzzii.

Raw sequence reads (FASTQ format)

The raw sequence reads used in this data release can be downloaded from ENA. Note that for most samples there were multiple sequencing runs, and hence there are usually multiple ENA run accessions per sample. For most samples there were 3 sequencing runs, but some samples have 4 and some have a single sequencing run.

To find the ENA run accessions for a given sample, first download the catalog of run accessions:

Alternatively if you ran the gsutil rsync command above to download sample metadata then this file will already be present on your local file system. Inspect the file:

!head ~/vo_agam_release/v3/metadata/ena_runs.csv
sample_id,ena_run
AR0001-C,ERR347035
AR0001-C,ERR347047
AR0001-C,ERR352136
AR0002-C,ERR328585
AR0002-C,ERR323844
AR0002-C,ERR328597
AR0004-C,ERR343648
AR0004-C,ERR343636
AR0004-C,ERR343468

For example, the sequence reads for sample AR0001-C are available from three ENA accessions: ERR347035, ERR347047 and ERR352136. To download the sequence reads, visit the ENA website and search for these accessions. E.g., links to download sequence reads for run ERR352136 are available from this web page: https://www.ebi.ac.uk/ena/browser/view/ERR352136. To download the FASTQ files for this run via wget:

!wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR352/ERR352136/ERR352136_1.fastq.gz
!wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR352/ERR352136/ERR352136_2.fastq.gz

Note that FASTQ files are relatively large, several GB per sample, so they may take a long time to download, and may require a substantial amount of disk space on your local system.

Sequence read alignments (BAM format)

Analysis-ready sequence read alignments are available in BAM format for all samples in the release and can be downloaded from ENA. A catalog file mapping sample identifiers to ENA accessions is available at this link:

Alternatively if you ran the gsutil rsync command above to download sample metadata then this file will already be present on your local file system. Here are the first few rows:

!head ~/vo_agam_release/v3/metadata/ena_alignments.csv
sample_id,ena_analysis
AR0001-C,ERZ1695275
AR0002-C,ERZ1695276
AR0004-C,ERZ1695277
AR0006-C,ERZ1695278
AR0007-C,ERZ1695279
AR0008-C,ERZ1695280
AR0009-C,ERZ1695281
AR0010-Cx,ERZ1695282
AR0011-C,ERZ1695283

Each row in this file provides a mapping from Ag1000G sample identifiers to ENA analysis accessions. To find links for downloading the data, visit the ENA website and search for the corresponding analysis accession. E.g., the analysis-ready BAM file for sample AR0001-C can be downloaded from this web page: https://www.ebi.ac.uk/ena/browser/view/ERZ1695275. To download the BAM file via wget:

!wget ftp://ftp.sra.ebi.ac.uk/vol1/ERZ169/ERZ1695275/AR0001-C.bam

Note that BAM files are relatively large, approximately 10G per sample, so they may take a long time to download, and may require a substantial amount of disk space on your local system.

SNP calls (VCF format)

SNP genotypes

SNP genotypes for individual mosquitoes in VCF format are available for download from GCS. A VCF file is available for each individual sample. To download a VCF file for a given sample, you will need the sample identifier and the sample set in which the sample belongs. For example, to download the VCF file for sample AR0001-C in sample set AG1000G-AO, run the following:

!wget --no-clobber https://storage.googleapis.com/vo_agam_release/v3/snp_genotypes/per_sample/AG1000G-AO/vcf/AR0001-C.vcf.gz
!wget --no-clobber https://storage.googleapis.com/vo_agam_release/v3/snp_genotypes/per_sample/AG1000G-AO/vcf/AR0001-C.vcf.gz.tbi

Alternatively, VCF files for all samples in a given sample set can be downloaded using the gsutil rsync command. E.g., to download all VCFs for sample set AG1000G-AO, run the following:

!mkdir -pv ~/vo_agam_release/v3/snp_genotypes/per_sample/AG1000G-AO/vcf/
!gsutil -m rsync -r \
    gs://vo_agam_release/v3/snp_genotypes/per_sample/AG1000G-AO/vcf/ \
    ~/vo_agam_release/v3/snp_genotypes/per_sample/AG1000G-AO/vcf/

Note that each of these VCF files is around 3 Gb, so downloading may take some time, and sufficient local storage will be needed.

Each of these VCF files is an “all sites” VCF file, meaning that genotypes have been called at all genomic positions where the reference nucleotide is not “N”, regardless of whether variation is observed in the given sample. This means that VCFs from multiple samples can be merged easily to create a multi-sample VCF, which may be required for certain analyses. For example, the code below downloads VCFs for three samples, then uses the bcftools merge command to merge the calls for chromosome arm 3R, up to 1 Mbp:

!wget --no-clobber https://storage.googleapis.com/vo_agam_release/v3/snp_genotypes/per_sample/AG1000G-AO/vcf/AR0001-C.vcf.gz
!wget --no-clobber https://storage.googleapis.com/vo_agam_release/v3/snp_genotypes/per_sample/AG1000G-AO/vcf/AR0001-C.vcf.gz.tbi
!wget --no-clobber https://storage.googleapis.com/vo_agam_release/v3/snp_genotypes/per_sample/AG1000G-AO/vcf/AR0002-C.vcf.gz
!wget --no-clobber https://storage.googleapis.com/vo_agam_release/v3/snp_genotypes/per_sample/AG1000G-AO/vcf/AR0002-C.vcf.gz.tbi
!wget --no-clobber https://storage.googleapis.com/vo_agam_release/v3/snp_genotypes/per_sample/AG1000G-AO/vcf/AR0004-C.vcf.gz
!wget --no-clobber https://storage.googleapis.com/vo_agam_release/v3/snp_genotypes/per_sample/AG1000G-AO/vcf/AR0004-C.vcf.gz.tbi
!bcftools merge --output-type z --regions 3R:1-1000000 --output merged.vcf.gz AR0001-C.vcf.gz AR0002-C.vcf.gz AR0004-C.vcf.gz 

If you are just interested in analysing variants within a given set of samples, you might like to filter the merged VCF to remove non-variant sites and alleles, e.g., using bcftools view:

!bcftools view --output-type z --output-file merged_variant.vcf.gz --min-ac 1:nonmajor --trim-alt-alleles merged.vcf.gz

SNP calls in VCF format will also shortly be available from the European Variation Archive (EVA), check back soon for more information.

Site filters

SNP calling is not always reliable, and we have created some site filters to allow excluding low quality SNPs. We have created some sites-only VCF files with site filter information in the FILTER column. These VCF files are hosted on GCS.

Because different species may have different genome accessibility issues, we have created three separate site filters:

  • The “gamb_colu” site filter is designed for working only with samples that are not An. arabiensis.

  • The “arab” filter is designed for when only working with samples that are An. arabiensis.

  • The “gamb_colu_arab” filter is suitable for when analysing samples of any species together.

Each filter is available as a set of VCF files, one per chromosome arm. E.g., here is the direct download link for the gamb_colu_arab filters on chromosome arm 3R:

Alternatively, all site filters VCFs can be downloaded using gsutil, e.g.:

!mkdir -pv ~/vo_agam_release/v3/site_filters/dt_20200416/vcf/
!gsutil -m rsync -r \
    gs://vo_agam_release/v3/site_filters/dt_20200416/vcf/ \
    ~/vo_agam_release/v3/site_filters/dt_20200416/vcf/

SNP calls (Zarr format)

SNP data are also available in Zarr format, which can be convenient and efficient to use for certain types of analysis. These data can be analysed directly in the cloud without downloading to the local system, see the Ag3 cloud data access guide for more information. The data can also be downloaded to your own system for local analysis if that is more convenient. Below are examples of how to download the Zarr data to your local system.

The data are organised into several Zarr hierarchies.

SNP sites and alleles

Data on the genomic positions (sites) and reference and alternate alleles that were genotyped can be downloaded as follows:

!mkdir -pv ~/vo_agam_release/v3/snp_genotypes/all/sites/
!gsutil -m rsync -r \
    gs://vo_agam_release/v3/snp_genotypes/all/sites/ \
    ~/vo_agam_release/v3/snp_genotypes/all/sites/

Site filters

SNP calling is not always reliable, and we have created some site filters to allow excluding low quality SNPs. To download site filters data in Zarr format, excluding some parts of the data that you probably won’t need:

!mkdir -pv ~/vo_agam_release/v3/site_filters/
!gsutil -m rsync -r \
    -x '.*vcf.*|.*crosses_stats.*|.*[MG]Q10.*|.*[MG]Q30.*|.*[MG]Q_mean.*|.*[MG]Q_std.*|.*/lo_.*|.*/hi_.*|.*no_cov.*|.*allele_consistency.*|.*heterozygosity.*' \
    gs://vo_agam_release/v3/site_filters/ \
    ~/vo_agam_release/v3/site_filters/

SNP genotypes

SNP genotypes are available for each sample set separately. E.g., to download SNP genotypes in Zarr format for sample set AG1000G-BF-A, excluding some data you probably won’t need:

!mkdir -pv ~/vo_agam_release/v3/snp_genotypes/all/AG1000G-BF-A/
!gsutil -m rsync -r \
        -x '.*/calldata/(AD|GQ|MQ)/.*' \
        gs://vo_agam_release/v3/snp_genotypes/all/AG1000G-BF-A/ \
        ~/vo_agam_release/v3/snp_genotypes/all/AG1000G-BF-A/

Copy number variation (CNV) data

Data on copy number variation within the Ag3 cohort are available as three separate data types:

  • HMM – Genome-wide inferences of copy number state within each individual mosquito in 300 bp non-overlapping windows.

  • Coverage calls – Genome-wide copy number variant calls, derived from the HMM outputs by analysing contiguous regions of elevated copy number state then clustering of variants across individuals based on breakpoint proximity.

  • Discordant read calls – Copy number variant calls at selected insecticide resistance loci, based on analysis of read alignments at CNV breakpoints.

For more information on the methods used to generate these data, see the variant-calling methods page.

For each of these data types, data can be downloaded from Google Cloud Storage, and are available in either VCF or Zarr format.

CNV HMM

The HMM inferences of copy number state are available in VCF, Zarr and text formats, and are organised by sample set.

For example, the VCF file for sample set AG1000G-AO can be downloaded from the following URL:

VCF files for all samples sets can be downloaded via gsutil as follows:

# create a local directory to hold downloaded CNV data
!mkdir -pv ~/vo_agam_release/v3/cnv/
# download the HMM data in VCF format for all sample sets
!gsutil -m rsync -r \
    -x '.*/coverage_calls/.*|.*/discordant_read_calls/.*|.*/hmm/zarr/.*|.*/hmm/per_sample/.*' \
    gs://vo_agam_release/v3/cnv/ ~/vo_agam_release/v3/cnv/

Zarr files for all sample sets can be downloaded as follows:

# download HMM data in Zarr format for all sample sets
!gsutil -m rsync -r \
    -x '.*/coverage_calls/.*|.*/discordant_read_calls/.*|.*/hmm/vcf/.*|.*/hmm/per_sample/.*' \
    gs://vo_agam_release/v3/cnv/ ~/vo_agam_release/v3/cnv/

CNV coverage calls

Coverage-based CNV calls are available in VCF and Zarr formats, and are organised by sample set. Additionally the coverage calls were performed separately for An. arabiensis and (An. gambiae + An. coluzzii), and so are subdivided into “arab” and “gamb_colu” datasets, depending on which species are present in a given sample set.

Note that some samples were excluded from coverage calling because of high coverage variance.

For example, the VCF file for sample set AG1000G-AO and the gamb_colu callset can be downloaded from:

VCF files for all sample sets can be downloaded with:

# download coverage calls in VCF format for all sample sets
!gsutil -m rsync -r \
    -x '.*/hmm/.*|.*/discordant_read_calls/.*|.*/coverage_calls/.*/zarr/.*' \
    gs://vo_agam_release/v3/cnv/ ~/vo_agam_release/v3/cnv/

Zarr files for all sample sets can be downloaded with:

# download coverage calls in Zarr format for all sample sets
!gsutil -m rsync -r \
    -x '.*/hmm/.*|.*/discordant_read_calls/.*|.*/coverage_calls/.*/vcf/.*' \
    gs://vo_agam_release/v3/cnv/ ~/vo_agam_release/v3/cnv/

CNV discordant read calls

CNV calls for selected insecticide resistance loci are available in VCF and Zarr formats, and are organised by sample set.

For example, the VCF file for sample set AG1000G-AO can be downloaded from:

VCF and Zarr files for all sample sets can be downloaded with:

# download discordant read calls for all sample sets
!gsutil -m rsync -r \
    -x '.*/hmm/.*|.*/coverage_calls/.*' \
    gs://vo_agam_release/v3/cnv/ ~/vo_agam_release/v3/cnv/

Accessing downloaded data from Python

There are a wide variety of tools available for analysing the data from Ag1000G phase 3 once downloaded to your local system. If you would like to ask about possible approaches for a given analysis, please feel free to start a new discussion on the malariagen/vector-public-data repo on GitHub.

Within the MalariaGEN vector genomics team we primarily use the Python programming language for analysing the SNP data, making use of software packages in the Scientific Python / PyData ecosystem. This section gives some simple examples illustrating how to use these tools to read downloaded data.

These examples use data from two sample sets, AG1000G-BF-A and AG1000G-BF-B.

Firstly, here are all the commands again to download the data needed to run these examples (requires about 10G of local storage):

# download sample set manifest
!mkdir -pv ~/vo_agam_release/v3/
!gsutil cp gs://vo_agam_release/v3/manifest.tsv ~/vo_agam_release/v3/

# download sample metadata
!mkdir -pv ~/vo_agam_release/v3/metadata/
!gsutil -m rsync -r gs://vo_agam_release/v3/metadata/ ~/vo_agam_release/v3/metadata/

# download sites data
!mkdir -pv ~/vo_agam_release/v3/snp_genotypes/all/sites/
!gsutil -m rsync -r \
    gs://vo_agam_release/v3/snp_genotypes/all/sites/ \
    ~/vo_agam_release/v3/snp_genotypes/all/sites/

# download site filters data
!mkdir -pv ~/vo_agam_release/v3/site_filters/
!gsutil -m rsync -r \
    -x '.*vcf.*|.*crosses_stats.*|.*[MG]Q10.*|.*[MG]Q30.*|.*[MG]Q_mean.*|.*[MG]Q_std.*|.*/lo_.*|.*/hi_.*|.*no_cov.*|.*allele_consistency.*|.*heterozygosity.*' \
    gs://vo_agam_release/v3/site_filters/ \
    ~/vo_agam_release/v3/site_filters/

# download SNP genotype data for selected sample sets
!mkdir -pv ~/vo_agam_release/v3/snp_genotypes/all/AG1000G-BF-A/
!gsutil -m rsync -r \
        -x '.*/calldata/(AD|GQ|MQ)/.*' \
        gs://vo_agam_release/v3/snp_genotypes/all/AG1000G-BF-A/ \
        ~/vo_agam_release/v3/snp_genotypes/all/AG1000G-BF-A/
!mkdir -pv ~/vo_agam_release/v3/snp_genotypes/all/AG1000G-BF-B/
!gsutil -m rsync -r \
        -x '.*/calldata/(AD|GQ|MQ)/.*' \
        gs://vo_agam_release/v3/snp_genotypes/all/AG1000G-BF-B/ \
        ~/vo_agam_release/v3/snp_genotypes/all/AG1000G-BF-B/
    
# download CNV HMM data for all sample sets
!mkdir -pv ~/vo_agam_release/v3/cnv/
!gsutil -m rsync -r \
    -x '.*/coverage_calls/.*|.*/discordant_read_calls/.*|.*/hmm/vcf/.*|.*/hmm/per_sample/.*' \
    gs://vo_agam_release/v3/cnv/ ~/vo_agam_release/v3/cnv/

The following Python packages need to be installed on the local system: numpy, dask, zarr, gcsfs, fsspec, scikit-allel. These packages can be installed via pip or conda. E.g.:

!pip install -q \
    zarr==2.6.1 \
    dask==2021.03.0 \
    xarray==0.18.0 \
    scikit-allel==1.3.5 \
    malariagen_data==0.8.0

To make accessing these data more convenient, we’ve also created a malariagen_data Python package, which is available from PyPI. This is experimental so please let us know if you find any bugs or have any suggestions. The malariagen_data package can be installed via pip, e.g.:

Set up access to the Ag1000G phase 3 data, at the local path where data were downloaded:

import malariagen_data
ag3 = malariagen_data.Ag3("~/vo_agam_release/")

Examples of reading/opening various files:

# read sample set manifest into a pandas dataframe
df_sample_sets = ag3.sample_sets()
df_sample_sets
sample_set sample_count release
0 AG1000G-AO 81 v3
1 AG1000G-BF-A 181 v3
2 AG1000G-BF-B 102 v3
3 AG1000G-BF-C 13 v3
4 AG1000G-CD 76 v3
5 AG1000G-CF 73 v3
6 AG1000G-CI 80 v3
7 AG1000G-CM-A 303 v3
8 AG1000G-CM-B 97 v3
9 AG1000G-CM-C 44 v3
10 AG1000G-FR 23 v3
11 AG1000G-GA-A 69 v3
12 AG1000G-GH 100 v3
13 AG1000G-GM-A 74 v3
14 AG1000G-GM-B 31 v3
15 AG1000G-GM-C 174 v3
16 AG1000G-GN-A 45 v3
17 AG1000G-GN-B 185 v3
18 AG1000G-GQ 10 v3
19 AG1000G-GW 101 v3
20 AG1000G-KE 86 v3
21 AG1000G-ML-A 60 v3
22 AG1000G-ML-B 71 v3
23 AG1000G-MW 41 v3
24 AG1000G-MZ 74 v3
25 AG1000G-TZ 300 v3
26 AG1000G-UG 290 v3
27 AG1000G-X 297 v3
# read sample metadata for AG1000G-BF-A and AG1000G-BF-B into a pandas dataframe
df_samples = ag3.sample_metadata(sample_sets=['AG1000G-BF-A', 'AG1000G-BF-B'])
df_samples
sample_id partner_sample_id contributor country location year month latitude longitude sex_call sample_set release aim_fraction_colu aim_fraction_arab species_gambcolu_arabiensis species_gambiae_coluzzii species
0 AB0085-Cx BF2-4 Austin Burt Burkina Faso Pala 2012 7 11.150 -4.235 F AG1000G-BF-A v3 0.024 0.002 gamb_colu gambiae gambiae
1 AB0086-Cx BF2-6 Austin Burt Burkina Faso Pala 2012 7 11.150 -4.235 F AG1000G-BF-A v3 0.038 0.002 gamb_colu gambiae gambiae
2 AB0087-C BF3-3 Austin Burt Burkina Faso Bana 2012 7 11.233 -4.472 F AG1000G-BF-A v3 0.982 0.002 gamb_colu coluzzii coluzzii
3 AB0088-C BF3-5 Austin Burt Burkina Faso Bana 2012 7 11.233 -4.472 F AG1000G-BF-A v3 0.990 0.002 gamb_colu coluzzii coluzzii
4 AB0089-Cx BF3-8 Austin Burt Burkina Faso Bana 2012 7 11.233 -4.472 F AG1000G-BF-A v3 0.975 0.002 gamb_colu coluzzii coluzzii
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
278 AB0533-C BF13-18 Austin Burt Burkina Faso Souroukoudinga 2014 7 11.235 -4.535 F AG1000G-BF-B v3 0.021 0.002 gamb_colu gambiae gambiae
279 AB0536-C BF13-31 Austin Burt Burkina Faso Souroukoudinga 2014 7 11.235 -4.535 F AG1000G-BF-B v3 0.025 0.002 gamb_colu gambiae gambiae
280 AB0537-C BF13-32 Austin Burt Burkina Faso Souroukoudinga 2014 7 11.235 -4.535 F AG1000G-BF-B v3 0.029 0.002 gamb_colu gambiae gambiae
281 AB0538-C BF13-33 Austin Burt Burkina Faso Souroukoudinga 2014 7 11.235 -4.535 F AG1000G-BF-B v3 0.018 0.002 gamb_colu gambiae gambiae
282 AB0408-C BF14-20 Austin Burt Burkina Faso Bana 2014 7 11.233 -4.472 F AG1000G-BF-B v3 0.986 0.002 gamb_colu coluzzii coluzzii

283 rows × 17 columns

# inspect number of samples by species
df_samples.groupby("species").size()
species
arabiensis                         3
coluzzii                         135
gambiae                          144
intermediate_gambiae_coluzzii      1
dtype: int64
# access SNP positions and alleles for chromosome arm 3R as dask arrays
pos, ref, alt = ag3.snp_sites("3R")
pos
Array Chunk
Bytes 208.91 MB 2.10 MB
Shape (52226568,) (524288,)
Count 101 Tasks 100 Chunks
Type int32 numpy.ndarray
52226568 1
ref
Array Chunk
Bytes 52.23 MB 524.29 kB
Shape (52226568,) (524288,)
Count 101 Tasks 100 Chunks
Type |S1 numpy.ndarray
52226568 1
alt
Array Chunk
Bytes 156.68 MB 1.57 MB
Shape (52226568, 3) (524288, 3)
Count 101 Tasks 100 Chunks
Type |S1 numpy.ndarray
3 52226568
# read first 10 SNP positions into a numpy array
p = pos[:10].compute()
p
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=int32)
# read first 10 SNP reference alleles
r = ref[:10].compute()
r
array([b'C', b'C', b'T', b'C', b'T', b'A', b'C', b'G', b'T', b'T'],
      dtype='|S1')
# read first 10 SNP alternate alleles
a = alt[:10].compute()
a
array([[b'A', b'T', b'G'],
       [b'A', b'T', b'G'],
       [b'A', b'C', b'G'],
       [b'A', b'T', b'G'],
       [b'A', b'C', b'G'],
       [b'C', b'T', b'G'],
       [b'A', b'T', b'G'],
       [b'A', b'C', b'T'],
       [b'A', b'C', b'G'],
       [b'A', b'C', b'G']], dtype='|S1')
# access gamb_colu_arab site filters for chromosome arm 3R as a dask array
filter_pass = ag3.site_filters("3R", mask="gamb_colu_arab")
filter_pass
Array Chunk
Bytes 52.23 MB 300.00 kB
Shape (52226568,) (300000,)
Count 176 Tasks 175 Chunks
Type bool numpy.ndarray
52226568 1
# read filter values for first 10 SNPs (True means the site passes filters)
f = filter_pass[:10].compute()
f
array([False, False, False, False, False, False, False, False, False,
       False])
# access SNP genotypes for AG1000G-BF-A and AG1000G-BF-B as a dask array
gt = ag3.snp_genotypes("3R", sample_sets=["AG1000G-BF-A", "AG1000G-BF-B"])
gt
Array Chunk
Bytes 29.56 GB 30.00 MB
Shape (52226568, 283, 2) (300000, 50, 2)
Count 2452 Tasks 1225 Chunks
Type int8 numpy.ndarray
2 283 52226568

Genotypes are stored as a three-dimensional array, where the first dimension corresponds to genomic positions, the second dimension is samples, and the third dimension is ploidy (2). Values coded as integers, where -1 represents a missing value, 0 represents the reference allele, and 1, 2, and 3 represent alternate alleles.

# e.g., read genotypes for the first 5 SNPs and the first 3 samples
g = gt[:5, :3, :].compute()
g
array([[[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [0, 0]]], dtype=int8)

Here’s an example computation to count the number of segregating SNPs on chromosome arm 3R that also pass gamb_colu_arab site filters:

# import scikit-allel
import allel

# import dask 
import dask.array as da
from dask.diagnostics.progress import ProgressBar

# import numpy
import numpy as np

# choose chromosome arm
contig = "3R"

# choose site filter mask
mask = "gamb_colu_arab"

# choose sample sets
sample_sets = ["AG1000G-BF-A", "AG1000G-BF-B"]

# locate pass sites
loc_pass = ag3.site_filters(contig=contig, mask=mask).compute()

# perform an allele count over genotypes
gt = ag3.snp_genotypes(contig=contig, sample_sets=sample_sets)
gt = allel.GenotypeDaskArray(gt)
ac = gt.count_alleles(max_allele=3)

# locate segregating sites
loc_seg = ac.is_segregating()

# count segregating and pass sites
n_pass_seg = da.count_nonzero(loc_pass & loc_seg)

# run the computation
with ProgressBar():
    n_pass_seg = n_pass_seg.compute()

n_pass_seg
[########################################] | 100% Completed | 10.0s
11226723

Feedback and suggestions

If there are particular analyses you would like to run, or if you have other suggestions for useful documentation we could add to this site, we would love to know, please get in touch via the malariagen/vector-data GitHub discussion board.