Adar1.0 data downloads#

This notebook describes how to download data from the MalariaGEN Vector Observatory Anopheles darlingi Genomic Surveillance Project. These data are the first release (v1.0) and include sample metadata, raw sequence reads, sequence read alignments, and single nucleotide polymorphism (SNP) calls.

Code examples that are intended to be run via a Linux command line are prefixed with an exclamation mark (!). If you are running these commands directly from a terminal, remove the exclamation mark.

Examples in this notebook assume you are downloading data to a local folder within your home directory at the path ~/vo_adar_release/. Change this if you want to download to a different folder on the local file system.

Data hosting#

Adar1 data are hosted by several different services.

Raw sequence reads in FASTQ format and sequence read alignments in BAM format are hosted by the European Nucleotide Archive (ENA). This guide provides examples of downloading data from ENA via FTP using the wget command line tool. There are several other options for downloading data; see the ENA documentation on how to download data files for more information.

SNP calls in VCF and Zarr formats are hosted on S3-compatible object storage at the Sanger Institute. This guide provides examples of downloading these data using wget.

Sample metadata in CSV format are hosted on Google Cloud Storage (GCS) in the vo_adar_release_master_us_central1 bucket, which is a multi-region bucket located in the United States. All data hosted on GCS are publicly accessible but do require an authentication step. Please see the Vector Observatory Data Access page for details.

The guide below provides examples of downloading data from GCS to a local computer using the wget and gsutil command line tools. For more information about gsutil, see the gsutil tool documentation.

Sample sets#

Data in this release are organised into sample sets. Each sample set corresponds to a set of mosquito specimens contributed by a collaborating study. Depending on your objectives, you may want to download data from only specific sample sets, or from all sample sets. For convenience, there is a tab-delimited manifest file listing all sample sets in the release. It can be downloaded via gsutil to a directory on the local file system, e.g.:

!mkdir -pv ~/vo_adar_release/v1.0/
!gsutil cp gs://vo_adar_release_master_us_central1/v1.0/manifest.tsv ~/vo_adar_release/v1.0/

Here are the file contents:

!cat ~/vo_adar_release/v1.0/manifest.tsv
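
If you prefer to work with the manifest programmatically, it can also be read from Python. The following is a minimal sketch using only the standard library. It assumes the manifest was downloaded to the path used above and that its first line is a header row; the function name read_manifest is hypothetical, and no particular column names are assumed.

```python
import csv
from pathlib import Path


def read_manifest(path):
    """Read a tab-delimited manifest into a list of dicts keyed by the header row."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f, delimiter="\t"))


# Path used in the gsutil example above; adjust if you downloaded elsewhere.
manifest_path = Path.home() / "vo_adar_release" / "v1.0" / "manifest.tsv"

if manifest_path.exists():
    rows = read_manifest(manifest_path)
    print(f"{len(rows)} sample sets listed")
    for row in rows:
        # Keys are taken from the file's own header line.
        print(row)
```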

For more information about these sample sets, you can explore the Adar1.0 data user guide.

Sample metadata#

Data about the samples that were sequenced to generate this data resource are available, including the time and place of collection, the sex of the specimen, and our call regarding the species of the specimen.

Specimen collection metadata#

Specimen collection metadata can be downloaded from GCS using gsutil, either for all sample sets or for a single sample set. To download the metadata for a single sample set, include the sample set name in the GCS path. E.g., the metadata for 1357-VO-BR-SALLUM-VMF00326 are located at gs://vo_adar_release_master_us_central1/v1.0/metadata/general/1357-VO-BR-SALLUM-VMF00326/ and can be downloaded as follows:

!mkdir -pv ~/vo_adar_release/v1.0/metadata/1357-VO-BR-SALLUM-VMF00326/
!gsutil -m rsync -r gs://vo_adar_release_master_us_central1/v1.0/metadata/general/1357-VO-BR-SALLUM-VMF00326/ ~/vo_adar_release/v1.0/metadata/1357-VO-BR-SALLUM-VMF00326/
mkdir: created directory '/home/jupyter/vo_adar_release/v1.0/metadata/1357-VO-BR-SALLUM-VMF00326/'
Building synchronization state...
Starting synchronization...
Copying gs://vo_adar_release_master_us_central1/v1.0/metadata/general/1357-VO-BR-SALLUM-VMF00326/samples.meta.csv...
Copying gs://vo_adar_release_master_us_central1/v1.0/metadata/general/1357-VO-BR-SALLUM-VMF00326/wgs_snp_data.csv...
Copying gs://vo_adar_release_master_us_central1/v1.0/metadata/general/1357-VO-BR-SALLUM-VMF00326/surveillance.flags.csv...
/ [3/3 files][118.0 KiB/118.0 KiB] 100% Done                                    
Operation completed over 3 objects/118.0 KiB.                                    

Here are the first few rows of the sample metadata for sample set 1357-VO-BR-SALLUM-VMF00326:

!head ~/vo_adar_release/v1.0/metadata/1357-VO-BR-SALLUM-VMF00326/samples.meta.csv

The sample_id column gives the sample identifier used throughout all analyses.

The country, location, latitude and longitude columns give the location where the specimen was collected.

The year and month columns give the approximate date when the specimen was collected.
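
As a quick check on a downloaded metadata file, the columns described above can be used to tally samples. The following is a minimal sketch using only the standard library. It assumes the file path from the download example above and relies on the country and year columns named in this section; the function name count_samples is hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path


def count_samples(path, columns=("country", "year")):
    """Count rows of a samples.meta.csv file per combination of the given columns."""
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            counts[tuple(row[c] for c in columns)] += 1
    return counts


# Path used in the gsutil example above; adjust if you downloaded elsewhere.
meta_path = (
    Path.home()
    / "vo_adar_release/v1.0/metadata/1357-VO-BR-SALLUM-VMF00326/samples.meta.csv"
)

if meta_path.exists():
    for key, n in sorted(count_samples(meta_path).items()):
        print(key, n)
```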

Site filters#

SNP calling is not always reliable, and we have created some site filters to allow excluding low quality SNPs. For An. funestus and An. gambiae, these are available as a VCF file. For An. darlingi, they are only available as a Zarr array (see below). If you would like to filter your VCF to sites passing the filter, you will need to extract the data from the Zarr array and subset your VCF to those positions (e.g. using the --regions option of bcftools view).
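
The extraction step described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the project's own tooling. It assumes you have already read, for one contig, the site positions and the matching boolean filter values from the downloaded Zarr data into two equal-length sequences (for example with the zarr package; the array paths inside the hierarchies are not given here and should be checked against your download), and that positions are sorted and 1-based as in a VCF. The function names, the contig name and the output file name are hypothetical.

```python
def passing_regions(positions, filter_pass):
    """Yield (start, end) intervals, 1-based inclusive, covering runs of
    adjacent positions that pass the site filter."""
    start = end = None
    for pos, ok in zip(positions, filter_pass):
        if not ok:
            continue
        pos = int(pos)
        if start is None:
            start = end = pos
        elif pos == end + 1:
            # Extend the current run of adjacent passing sites.
            end = pos
        else:
            yield start, end
            start = end = pos
    if start is not None:
        yield start, end


def write_regions_file(contig, positions, filter_pass, out_path):
    """Write a tab-delimited CHROM, BEG, END file and return the number of intervals."""
    n = 0
    with open(out_path, "w") as f:
        for start, end in passing_regions(positions, filter_pass):
            f.write(f"{contig}\t{start}\t{end}\n")
            n += 1
    return n


# Toy data standing in for arrays read from the Zarr hierarchies.
positions = [1, 2, 3, 5, 6, 9]
filter_pass = [True, True, False, True, True, True]
print(list(passing_regions(positions, filter_pass)))
# prints [(1, 2), (5, 6), (9, 9)]
```

A file written by write_regions_file can then be passed to bcftools view via its --regions-file option, which expects a bgzip-compressed and indexed VCF. Coordinates in a tab-delimited regions file are 1-based and inclusive.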

SNP calls (Zarr format)#

SNP data are also available in Zarr format, which can be convenient and efficient to use for certain types of analysis. These data can be analysed directly in the cloud without downloading to the local system; see the Adar1 cloud data access guide for more information. The data can also be downloaded to your own system for local analysis if that is more convenient. Below are examples of how to download the Zarr data to your local system.

The data are organised into several Zarr hierarchies.

SNP sites and alleles#

Data on the genomic positions (sites) and reference and alternate alleles that were genotyped can be downloaded as follows:

!mkdir -pv ~/vo_adar_release/v1.0/snp_genotypes/all/sites/
!gsutil -m rsync -r \
    gs://vo_adar_release_master_us_central1/v1.0/snp_genotypes/all/sites/ \
    ~/vo_adar_release/v1.0/snp_genotypes/all/sites/


mkdir: created directory '/home/jupyter/vo_adar_release/v1.0/snp_genotypes'
mkdir: created directory '/home/jupyter/vo_adar_release/v1.0/snp_genotypes/all'
mkdir: created directory '/home/jupyter/vo_adar_release/v1.0/snp_genotypes/all/sites/'
Building synchronization state...
Reauthentication required.
Caught non-retryable exception while listing gs://vo_adar_release_master_us_central1/v1.0/snp_genotypes/all/sites/: Reauthentication challenge could not be answered because you are not in an interactive session.
CommandException: Caught non-retryable exception - aborting rsync
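
Once a hierarchy has been downloaded successfully, it can be useful to check which arrays it contains before starting an analysis. The following is a minimal sketch using only the standard library. It assumes the data are stored in the Zarr version 2 directory layout, in which each array directory holds a .zarray JSON metadata file with shape and dtype entries; if your download uses a different Zarr version, this will find nothing. The function name list_zarr_arrays is hypothetical.

```python
import json
from pathlib import Path


def list_zarr_arrays(root):
    """Map each array path (relative to root) to its (shape, dtype),
    read from the .zarray metadata files of a Zarr v2 directory store."""
    root = Path(root)
    arrays = {}
    for meta_file in sorted(root.rglob(".zarray")):
        meta = json.loads(meta_file.read_text())
        name = meta_file.parent.relative_to(root).as_posix()
        arrays[name] = (tuple(meta["shape"]), meta["dtype"])
    return arrays


# Path used in the gsutil example above; adjust if you downloaded elsewhere.
sites_root = Path.home() / "vo_adar_release/v1.0/snp_genotypes/all/sites"

if sites_root.exists():
    for name, (shape, dtype) in list_zarr_arrays(sites_root).items():
        print(name, shape, dtype)
```

For actual analysis, the arrays themselves would normally be opened with the zarr Python package rather than by reading the metadata files directly.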

Site filters#

SNP calling is not always reliable, and we have created some site filters to allow excluding low quality SNPs. To download site filters data in Zarr format:

!mkdir -pv ~/vo_adar_release/v1.0/site_filters/sc_20250610/darlingi/
!gsutil -m rsync -r \
    gs://vo_adar_release_master_us_central1/v1.0/site_filters/sc_20250610/darlingi/ \
    ~/vo_adar_release/v1.0/site_filters/sc_20250610/darlingi/

SNP genotypes#

SNP genotypes are available for each sample set separately. E.g., to download SNP genotypes in Zarr format for sample set 1357-VO-BR-SALLUM-VMF00326, excluding some data you probably won’t need:

!mkdir -pv ~/vo_adar_release/v1.0/snp_genotypes/all/1357-VO-BR-SALLUM-VMF00326/
!gsutil -m rsync -r \
        -x '.*/calldata/(AD|GQ|MQ)/.*' \
        gs://vo_adar_release_master_us_central1/v1.0/snp_genotypes/all/1357-VO-BR-SALLUM-VMF00326/ \
        ~/vo_adar_release/v1.0/snp_genotypes/all/1357-VO-BR-SALLUM-VMF00326/
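
The -x option takes a Python regular expression that gsutil applies to object paths relative to the source. To see which objects a pattern would skip before starting a large transfer, it can be tried out locally. The following is a minimal sketch; it assumes the pattern is matched from the start of the relative path, and the example paths (including the contig name) are hypothetical and only illustrate the general shape of Zarr chunk paths.

```python
import re

# Same pattern as passed to gsutil rsync -x in the command above.
EXCLUDE = re.compile(r".*/calldata/(AD|GQ|MQ)/.*")


def is_excluded(relative_path):
    """Return True if the pattern matches from the start of the relative path."""
    return EXCLUDE.match(relative_path) is not None


# Hypothetical relative paths, for illustration only.
examples = [
    "contig_x/calldata/GT/0.0.0",
    "contig_x/calldata/AD/0.0.0",
    "contig_x/calldata/GQ/0.0",
    "contig_x/variants/POS/0",
]

for path in examples:
    print(path, "excluded" if is_excluded(path) else "kept")
```

With this pattern, chunks of the AD, GQ and MQ arrays are skipped, while other arrays under calldata and everything outside calldata are downloaded.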

Feedback and suggestions#

If there are particular analyses you would like to run, or if you have other suggestions for useful documentation we could add to this site, we would love to know. Please get in touch via the malariagen/vector-data GitHub discussion board.