As1 data downloads#

This notebook provides information about how to download data from the Controlling Emergent Anopheles stephensi in Sudan and Ethiopia (CEASE) project, released in collaboration with the MalariaGEN Vector Observatory.

This includes sample metadata, raw sequence reads, sequence read alignments, and single nucleotide polymorphism (SNP) calls.

Code examples that are intended to be run via a Linux command line are prefixed with an exclamation mark (!). If you are running these commands directly from a terminal, remove the exclamation mark.

Examples in this notebook assume you are downloading data to a local folder within your home directory at the path ~/vo_aste_release_master_us_central1/. Change this if you want to download to a different folder on the local file system.

Data hosting#

As1 data are hosted by several different services.

Raw sequence reads in FASTQ format and sequence read alignments in BAM format are hosted by the European Nucleotide Archive (ENA). These can be accessed on the ENA portal.

SNP calls in VCF and Zarr formats are hosted on S3-compatible object storage. This guide provides examples of downloading these data using wget.

Sample metadata in CSV format are hosted on Google Cloud Storage (GCS) in the vo_aste_release_master_us_central1 bucket, which is a multi-region bucket located in the United States. All data hosted on GCS are publicly accessible but do require an authentication step. Please see the details on the Vector Observatory Data Access page.

The guide below provides examples of downloading data from GCS to a local computer using the wget and gsutil command line tools. For more information about gsutil, see the gsutil tool documentation.

Sample sets#

Data in these releases are organised into sample sets. Each of these sample sets corresponds to a set of mosquito specimens contributed by a collaborating study. Depending on your objectives, you may want to download data from only specific sample sets, or all sample sets. For convenience there is a tab-delimited manifest file listing all sample sets in the release. This can be downloaded via gsutil to a directory on the local file system, e.g.:

!mkdir -pv ~/vo_aste_release_master_us_central1/v1.0/
!gsutil cp gs://vo_aste_release_master_us_central1/v1.0/manifest.tsv ~/vo_aste_release_master_us_central1/v1.0/

Copying gs://vo_aste_release_master_us_central1/v1.0/manifest.tsv...
/ [1 files][  2.4 KiB/  2.4 KiB]                                                
Operation completed over 1 objects/2.4 KiB.                                      

Here are the file contents:

!cat ~/vo_aste_release_master_us_central1/v1.0/manifest.tsv
sample_set	sample_count	study_id	study_url	terms_of_use_expiry_date	terms_of_use_url
1363-VO-ET-GADISA-VMF00316	111	1363-VO-ET-GADISA	https://www.malariagen.net/network/where-we-work/1363-VO-ET-GADISA	2028-04-05	https://malariagen.github.io/vector-data/as1/as1.0.html
1364-VO-SD-KAFY-VMF00317	226	1364-VO-SD-KAFY	https://www.malariagen.net/network/where-we-work/1364-VO-SD-KAFY	2028-04-05	https://malariagen.github.io/vector-data/as1/as1.0.html
1365-VO-DJ-ADBI-VMF00318	21	1365-VO-DJ-ADBI	https://www.malariagen.net/network/where-we-work/1365-VO-DJ-ADBI	2028-04-05	https://malariagen.github.io/vector-data/as1/as1.0.html
1366-VO-YE-ALLAN-VMF00319	22	1366-VO-YE-ALLAN	https://www.malariagen.net/network/where-we-work/1366-VO-YE-ALLAN	2028-04-05	https://malariagen.github.io/vector-data/as1/as1.0.html
1367-VO-AF-DONNELLY-VMF00320	24	1367-VO-AF-DONNELLY	https://www.malariagen.net/network/where-we-work/1367-VO-AF-DONNELLY	2028-04-05	https://malariagen.github.io/vector-data/as1/as1.0.html
1368-VO-PK-DONNELLY-VMF00321	15	1368-VO-PK-DONNELLY	https://www.malariagen.net/network/where-we-work/1368-VO-PK-DONNELLY	2028-04-05	https://malariagen.github.io/vector-data/as1/as1.0.html
1369-VO-SA-AL-NAZAWI-VMF00322	42	1369-VO-SA-AL-NAZAWI	https://www.malariagen.net/network/where-we-work/1369-VO-SA-AL-NAZAWI	2028-04-05	https://malariagen.github.io/vector-data/as1/as1.0.html
1370-VO-IR-ENAYATI-VMF00323	72	1370-VO-IR-ENAYATI	https://www.malariagen.net/network/where-we-work/1370-VO-IR-ENAYATI	2028-04-05	https://malariagen.github.io/vector-data/as1/as1.0.html
1385-VO-DJ-WEETMAN-VMF00338	14	1385-VO-DJ-WEETMAN	https://www.malariagen.net/network/where-we-work/1385-VO-DJ-WEETMAN	2028-04-05	https://malariagen.github.io/vector-data/as1/as1.0.html
1386-VO-KE-OCHOMO-VMF00339	29	1386-VO-KE-OCHOMO	https://www.malariagen.net/network/where-we-work/1386-VO-KE-OCHOMO	2028-04-05	https://malariagen.github.io/vector-data/as1/as1.0.html
1458-VO-ET-YEWHALAW-VMF00340	23	1458-VO-ET-YEWHALAW	https://www.malariagen.net/network/where-we-work/1458-VO-ET-YEWHALAW	2028-04-05	https://malariagen.github.io/vector-data/as1/as1.0.html
1459-VO-SD-AHMED-VMF00342	25	1459-VO-SD-AHMED	https://www.malariagen.net/network/where-we-work/1459-VO-SD-AHMED	2028-04-05	https://malariagen.github.io/vector-data/as1/as1.0.html
thakare-2022	15	thakare-2022	https://www.malariagen.net/network/where-we-work/thakare-2022		https://www.nature.com/articles/s41598-022-07462-3

For more information about these sample sets, you can explore the As1.0 data user guide.
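
If you want to work with the manifest programmatically, for example to decide which sample sets to download, it can be parsed with the Python standard library. The following is a minimal sketch rather than part of the release tooling: it uses a few rows copied from the manifest above as stand-in input, and the helper name `read_manifest` is hypothetical.

```python
import csv
import io

# A few rows and columns copied from the manifest shown above (stand-in input).
# In practice, open the downloaded manifest.tsv file instead of this string.
MANIFEST_TEXT = (
    "sample_set\tsample_count\tstudy_id\n"
    "1363-VO-ET-GADISA-VMF00316\t111\t1363-VO-ET-GADISA\n"
    "1364-VO-SD-KAFY-VMF00317\t226\t1364-VO-SD-KAFY\n"
    "1368-VO-PK-DONNELLY-VMF00321\t15\t1368-VO-PK-DONNELLY\n"
)


def read_manifest(handle):
    """Return a dict mapping each sample set name to its sample count."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sample_set"]: int(row["sample_count"]) for row in reader}


counts = read_manifest(io.StringIO(MANIFEST_TEXT))
for sample_set, n_samples in counts.items():
    print(f"{sample_set}: {n_samples} samples")
print("total samples:", sum(counts.values()))
```

To read the real file, pass an open file handle for the downloaded manifest.tsv in place of the `io.StringIO` object.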

Sample metadata#

Data about the samples that were sequenced to generate this data resource are available, including the time and place of collection, the sex of the specimen, and our call regarding the species of the specimen.

Specimen collection metadata#

Specimen collection metadata can be downloaded from GCS. If you only want the sample metadata for a single sample set, include the sample set name in the path. E.g., to access the metadata for 1368-VO-PK-DONNELLY-VMF00321, you would use gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1368-VO-PK-DONNELLY-VMF00321/samples.meta.csv. Sample metadata for all sample sets can be downloaded using gsutil, e.g.:

!mkdir -pv ~/vo_aste_release_master_us_central1/v1.0/metadata/
!gsutil -m rsync -r gs://vo_aste_release_master_us_central1/v1.0/metadata/general/ ~/vo_aste_release_master_us_central1/v1.0/metadata/

mkdir: created directory '/home/jupyter/vo_aste_release_master_us_central1/v1.0/metadata/'
Building synchronization state...
Starting synchronization...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1365-VO-DJ-ADBI-VMF00318/samples.meta.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1364-VO-SD-KAFY-VMF00317/samples.meta.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1363-VO-ET-GADISA-VMF00316/wgs_snp_data.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1365-VO-DJ-ADBI-VMF00318/wgs_snp_data.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1363-VO-ET-GADISA-VMF00316/samples.meta.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1366-VO-YE-ALLAN-VMF00319/wgs_snp_data.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1367-VO-AF-DONNELLY-VMF00320/samples.meta.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1366-VO-YE-ALLAN-VMF00319/samples.meta.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1367-VO-AF-DONNELLY-VMF00320/wgs_snp_data.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1364-VO-SD-KAFY-VMF00317/wgs_snp_data.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1368-VO-PK-DONNELLY-VMF00321/samples.meta.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1368-VO-PK-DONNELLY-VMF00321/wgs_snp_data.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1369-VO-SA-AL-NAZAWI-VMF00322/samples.meta.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1369-VO-SA-AL-NAZAWI-VMF00322/wgs_snp_data.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1370-VO-IR-ENAYATI-VMF00323/samples.meta.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1385-VO-DJ-WEETMAN-VMF00338/samples.meta.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1370-VO-IR-ENAYATI-VMF00323/wgs_snp_data.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1386-VO-KE-OCHOMO-VMF00339/samples.meta.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1385-VO-DJ-WEETMAN-VMF00338/wgs_snp_data.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1386-VO-KE-OCHOMO-VMF00339/wgs_snp_data.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1458-VO-ET-YEWHALAW-VMF00340/samples.meta.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1458-VO-ET-YEWHALAW-VMF00340/wgs_snp_data.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1459-VO-SD-AHMED-VMF00342/samples.meta.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1459-VO-SD-AHMED-VMF00342/wgs_snp_data.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/README.md...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/thakare-2022/samples.meta.csv...
Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/thakare-2022/wgs_snp_data.csv...
/ [27/27 files][282.1 KiB/282.1 KiB] 100% Done                                  
Operation completed over 27 objects/282.1 KiB.                                   

Here are the first few rows of the sample metadata for sample set 1368-VO-PK-DONNELLY-VMF00321:

!head ~/vo_aste_release_master_us_central1/v1.0/metadata/1368-VO-PK-DONNELLY-VMF00321/samples.meta.csv
sample_id,partner_sample_id,contributor,country,location,year,month,latitude,longitude,sex_call
VMF00321-0001,DAsgPak035,Martin Donnelly,Pakistan,Asgharo,2005,-1,33.487,70.974,F
VMF00321-0002,DAzCPak017,Martin Donnelly,Pakistan,Azakhel,2005,-1,34.018,71.879,F
VMF00321-0003,DAzCPak018,Martin Donnelly,Pakistan,Azakhel,2005,-1,34.018,71.879,F
VMF00321-0004,DAzCPak019,Martin Donnelly,Pakistan,Azakhel,2005,-1,34.018,71.879,F
VMF00321-0005,DLakPak026,Martin Donnelly,Pakistan,Lakhti Banda,2005,-1,33.519,71.071,F
VMF00321-0006,PMJDE2,Martin Donnelly,Pakistan,Azakhel,2005,4,34.018,71.879,F
VMF00321-0007,PMJDE4,Martin Donnelly,Pakistan,Azakhel,2005,4,34.018,71.879,F
VMF00321-0008,PMJDF4,Martin Donnelly,Pakistan,Azakhel,2005,4,34.018,71.879,F
VMF00321-0009,PMJDG2,Martin Donnelly,Pakistan,Azakhel,2005,4,34.018,71.879,F

The sample_id column gives the sample identifier used throughout all analyses.

The country, location, latitude and longitude columns give the location where the specimen was collected.

The year and month columns give the approximate date when the specimen was collected.

The sex_call column gives the sex of the specimen as determined from the sequence data.
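
The columns described above can be loaded and summarised with the Python standard library. This is a minimal sketch using rows copied from the excerpt above as stand-in input. The helper name `load_samples` is hypothetical, and treating a negative month as "not recorded" is an assumption based on the -1 values in the excerpt, not something the release documentation confirms here.

```python
import csv
import io
from collections import Counter

# Rows copied from the samples.meta.csv excerpt above (stand-in input).
SAMPLES_TEXT = """\
sample_id,partner_sample_id,contributor,country,location,year,month,latitude,longitude,sex_call
VMF00321-0001,DAsgPak035,Martin Donnelly,Pakistan,Asgharo,2005,-1,33.487,70.974,F
VMF00321-0002,DAzCPak017,Martin Donnelly,Pakistan,Azakhel,2005,-1,34.018,71.879,F
VMF00321-0005,DLakPak026,Martin Donnelly,Pakistan,Lakhti Banda,2005,-1,33.519,71.071,F
VMF00321-0006,PMJDE2,Martin Donnelly,Pakistan,Azakhel,2005,4,34.018,71.879,F
"""


def load_samples(handle):
    """Parse sample metadata rows into dicts with typed fields."""
    samples = []
    for row in csv.DictReader(handle):
        month = int(row["month"])
        samples.append(
            {
                "sample_id": row["sample_id"],
                "location": row["location"],
                "year": int(row["year"]),
                # Assumption: a negative month means the month was not recorded.
                "month": month if month > 0 else None,
                "latitude": float(row["latitude"]),
                "longitude": float(row["longitude"]),
                "sex_call": row["sex_call"],
            }
        )
    return samples


samples = load_samples(io.StringIO(SAMPLES_TEXT))
by_location = Counter(sample["location"] for sample in samples)
for location, n_samples in sorted(by_location.items()):
    print(f"{location}: {n_samples}")
```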

SNP calls (VCF format)#

SNP genotypes#

SNP genotypes for individual mosquitoes in VCF format are available for download from Sanger S3-compatible object storage. A VCF file is available for each individual sample. To download a VCF file for a given sample, you will need the sample identifier and the sample set to which the sample belongs. You can then look up the download URL in the data catalog in the metadata. E.g., for sample set 1368-VO-PK-DONNELLY-VMF00321:

!head ~/vo_aste_release_master_us_central1/v1.0/metadata/1368-VO-PK-DONNELLY-VMF00321/wgs_snp_data.csv | cut -f1,4 -d,
sample_id,snp_genotypes_vcf
VMF00321-0001,https://1368-vo-pk-donnelly-vmf00321-aste1.cog.sanger.ac.uk/VMF00321-0001.vcf.gz
VMF00321-0002,https://1368-vo-pk-donnelly-vmf00321-aste1.cog.sanger.ac.uk/VMF00321-0002.vcf.gz
VMF00321-0003,https://1368-vo-pk-donnelly-vmf00321-aste1.cog.sanger.ac.uk/VMF00321-0003.vcf.gz
VMF00321-0004,https://1368-vo-pk-donnelly-vmf00321-aste1.cog.sanger.ac.uk/VMF00321-0004.vcf.gz
VMF00321-0005,https://1368-vo-pk-donnelly-vmf00321-aste1.cog.sanger.ac.uk/VMF00321-0005.vcf.gz
VMF00321-0006,https://1368-vo-pk-donnelly-vmf00321-aste1.cog.sanger.ac.uk/VMF00321-0006.vcf.gz
VMF00321-0007,https://1368-vo-pk-donnelly-vmf00321-aste1.cog.sanger.ac.uk/VMF00321-0007.vcf.gz
VMF00321-0008,https://1368-vo-pk-donnelly-vmf00321-aste1.cog.sanger.ac.uk/VMF00321-0008.vcf.gz
VMF00321-0009,https://1368-vo-pk-donnelly-vmf00321-aste1.cog.sanger.ac.uk/VMF00321-0009.vcf.gz

A VCF file and associated tabix index can be downloaded via wget, e.g.:

!wget --no-clobber https://1368-vo-pk-donnelly-vmf00321-aste1.cog.sanger.ac.uk/VMF00321-0001.vcf.gz
!wget --no-clobber https://1368-vo-pk-donnelly-vmf00321-aste1.cog.sanger.ac.uk/VMF00321-0001.vcf.gz.tbi

Note that each of these VCF files is around 3 GB, so downloading may take some time, and sufficient local storage will be needed.
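
The lookup from sample identifier to download URL can also be scripted. The sketch below assumes only the two catalog columns shown above (sample_id and snp_genotypes_vcf) and assumes, following the wget example, that the tabix index URL is the VCF URL with .tbi appended. The helper name `vcf_urls` is hypothetical, and the catalog rows are copied from the excerpt above as stand-in input.

```python
import csv
import io

# Two catalog rows copied from the wgs_snp_data.csv excerpt above (stand-in input).
CATALOG_TEXT = """\
sample_id,snp_genotypes_vcf
VMF00321-0001,https://1368-vo-pk-donnelly-vmf00321-aste1.cog.sanger.ac.uk/VMF00321-0001.vcf.gz
VMF00321-0002,https://1368-vo-pk-donnelly-vmf00321-aste1.cog.sanger.ac.uk/VMF00321-0002.vcf.gz
"""


def vcf_urls(handle, sample_id):
    """Return (vcf_url, tbi_url) for one sample, or raise KeyError if absent."""
    for row in csv.DictReader(handle):
        if row["sample_id"] == sample_id:
            vcf_url = row["snp_genotypes_vcf"]
            # Assumption: the index sits next to the VCF with a .tbi suffix.
            return vcf_url, vcf_url + ".tbi"
    raise KeyError(sample_id)


vcf_url, tbi_url = vcf_urls(io.StringIO(CATALOG_TEXT), "VMF00321-0002")
print(vcf_url)
print(tbi_url)
```

The two URLs can then be passed to wget as in the commands above.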

Each of these VCF files is an “all sites” VCF file, meaning that genotypes have been called at all genomic positions where the reference nucleotide is not “N”, regardless of whether variation is observed in the given sample. This means that VCFs from multiple samples can be merged easily to create a multi-sample VCF, which may be required for certain analyses. For example, the code below merges VCFs for two samples for the first 1 Mbp of contig 3RL:

!bcftools merge --output-type z --regions 3RL:1-1000000 --output merged.vcf.gz VMF00316-0002.vcf.gz VMF00316-0003.vcf.gz

If you are just interested in analysing variants within a given set of samples, you might like to filter the merged VCF to remove non-variant sites and alleles, e.g., using bcftools view:

!bcftools view --output-type z --output-file merged_variant.vcf.gz --min-ac 1:nonmajor --trim-alt-alleles merged.vcf.gz

SNP calls (Zarr format)#

SNP data are also available in Zarr format, which can be convenient and efficient to use for certain types of analysis. These data can be analysed directly in the cloud without downloading to the local system; see the As1 cloud data access guide for more information. The data can also be downloaded to your own system for local analysis if that is more convenient. Below are examples of how to download the Zarr data to your local system.

The data are organised into several Zarr hierarchies.

SNP sites and alleles#

Data on the genomic positions (sites) and reference and alternate alleles that were genotyped can be downloaded as follows:

!mkdir -pv ~/vo_aste_release_master_us_central1/v1.0/snp_genotypes/all/sites/
!gsutil -m rsync -r \
    gs://vo_aste_release_master_us_central1/v1.0/snp_genotypes/all/sites/ \
    ~/vo_aste_release_master_us_central1/v1.0/snp_genotypes/all/sites/

SNP genotypes#

SNP genotypes are available for each sample set separately. E.g., to download SNP genotypes in Zarr format for sample set 1368-VO-PK-DONNELLY-VMF00321 in release 1.0, excluding some data you probably won’t need (substitute another sample set name from the manifest as required):

# N.B., large data download
!mkdir -pv ~/vo_aste_release_master_us_central1/v1.0/snp_genotypes/all/1368-VO-PK-DONNELLY-VMF00321/
!gsutil -m rsync -r \
        -x '.*/calldata/(AD|GQ|MQ)/.*' \
        gs://vo_aste_release_master_us_central1/v1.0/snp_genotypes/all/1368-VO-PK-DONNELLY-VMF00321/ \
        ~/vo_aste_release_master_us_central1/v1.0/snp_genotypes/all/1368-VO-PK-DONNELLY-VMF00321/
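
To repeat this download for several sample sets, the same gsutil command can be generated from a list of sample set names. This is a sketch only: the helper name `rsync_command` is hypothetical, and it builds and prints the commands rather than running them, since each one is a large download.

```python
import os
import shlex

BUCKET = "gs://vo_aste_release_master_us_central1"
LOCAL_ROOT = "~/vo_aste_release_master_us_central1"


def rsync_command(release, sample_set, local_root=None):
    """Build the gsutil argument list for one sample set's SNP genotypes."""
    if local_root is None:
        # Expand "~" here because no shell will do it for an argument list.
        local_root = os.path.expanduser(LOCAL_ROOT)
    subpath = f"v{release}/snp_genotypes/all/{sample_set}/"
    return [
        "gsutil", "-m", "rsync", "-r",
        # Same exclusion pattern as the command above.
        "-x", ".*/calldata/(AD|GQ|MQ)/.*",
        f"{BUCKET}/{subpath}",
        f"{local_root}/{subpath}",
    ]


for name in ["1368-VO-PK-DONNELLY-VMF00321", "thakare-2022"]:
    # shlex.join quotes the regular expression so the line is safe to paste.
    print(shlex.join(rsync_command("1.0", name)))
```

As with the mkdir command above, the local destination directory needs to be created before each rsync is run.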

Site filters#

SNP calling is not always reliable, so we have created site filters that allow low-quality SNPs to be excluded. The filters use static cutoffs on site summary statistics. These data are available as Zarr datastores, one per chromosome.

These can be downloaded using gsutil, e.g.:

!mkdir -pv ~/vo_aste_release_master_us_central1/v1.0/site_filters/sc_20260401/2RL/
!gsutil -m rsync -r \
    gs://vo_aste_release_master_us_central1/v1.0/site_filters/sc_20260401/2RL/ \
    ~/vo_aste_release_master_us_central1/v1.0/site_filters/sc_20260401/2RL/

Feedback and suggestions#

If there are particular analyses you would like to run, or if you have other suggestions for useful documentation we could add to this site, we would love to know. Please get in touch via the malariagen/vector-data GitHub discussion board.