Amin1.0 data downloads

This notebook explains how to download data from the Amin1.0 resource, including sample metadata, sequence read alignments and single nucleotide polymorphism (SNP) calls.

Code examples that are intended to be run via a Linux command line are prefixed with an exclamation mark (!). If you are running these commands directly from a terminal, remove the exclamation mark.

Examples in this notebook assume you are downloading data to a local folder within your home directory at the path ~/vo_amin_release/. Change this if you want to download to a different folder on the local file system.

Data hosting

Amin1.0 metadata files are hosted on Google Cloud Storage (GCS) in the vo_amin_release bucket, which is a multi-region bucket located in the United States. All data hosted on GCS are publicly accessible and require no authentication. This guide provides examples of downloading data to a local computer using the wget and gsutil command line tools. For more information about gsutil, see the gsutil tool documentation.

BAM and VCF files and associated index files are stored on S3-compatible object storage hosted at the Sanger Institute. These files can be downloaded with tools such as wget.

Sample metadata

Data about the samples that were sequenced to generate this data resource are available, including the time and place of specimen collection. These data are provided as a CSV file which can be downloaded from the following URL:

  • https://storage.googleapis.com/vo_amin_release/v1/metadata/samples.meta.csv

Download this file, together with the other metadata files in the same folder, using gsutil:

!mkdir -pv ~/vo_amin_release/v1/metadata
!gsutil rsync -r gs://vo_amin_release/v1/metadata/ ~/vo_amin_release/v1/metadata/
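
If gsutil is not installed, a single public file can also be fetched over HTTPS. The sketch below assumes the usual mapping from a gs:// path to a https://storage.googleapis.com/ URL, which matches the samples.meta.csv link given above. The helper names gs_to_https and download_public_object are hypothetical and not part of any Amin1.0 tooling.

```python
from pathlib import Path
from urllib.parse import quote
from urllib.request import urlretrieve


def gs_to_https(gs_uri: str) -> str:
    """Convert a gs://bucket/object URI into its public HTTPS URL."""
    prefix = "gs://"
    if not gs_uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {gs_uri!r}")
    bucket, _, object_name = gs_uri[len(prefix):].partition("/")
    if not bucket or not object_name:
        raise ValueError(f"expected gs://<bucket>/<object>, got {gs_uri!r}")
    # Percent-encode the object name but keep "/" as the path separator.
    return f"https://storage.googleapis.com/{bucket}/{quote(object_name)}"


def download_public_object(gs_uri: str, dest_dir: str) -> Path:
    """Download one public object into dest_dir, skipping files already present."""
    dest = Path(dest_dir).expanduser() / gs_uri.rsplit("/", 1)[-1]
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urlretrieve(gs_to_https(gs_uri), str(dest))
    return dest
```

For example, download_public_object("gs://vo_amin_release/v1/metadata/samples.meta.csv", "~/vo_amin_release/v1/metadata") would place the sample metadata file in the same local folder used by the gsutil command.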

Inspect the first few rows of the sample metadata file:

!head ~/vo_amin_release/v1/metadata/samples.meta.csv
sample_id,original_sample_id,sanger_sample_id,partner_sample_id,contributor,country,location,year,month,latitude,longitude,season,PCA_cohort,cohort,subsampled_cohort
VBS09378-4248STDY7308980,VBS09378,4248STDY7308980,CB-2-00264,Brandy St. Laurent,Cambodia,Preah Kleang,2016,3,13.667,104.982,Feb-Apr (late dry),A,PV,
VBS09382-4248STDY7308981,VBS09382,4248STDY7308981,CB-2-00258,Brandy St. Laurent,Cambodia,Preah Kleang,2016,3,13.667,104.982,Feb-Apr (late dry),A,PV,
VBS09397-4248STDY7308982,VBS09397,4248STDY7308982,CB-2-00384,Brandy St. Laurent,Cambodia,Preah Kleang,2016,3,13.667,104.982,Feb-Apr (late dry),A,PV,PV
VBS09460-4248STDY7308986,VBS09460,4248STDY7308986,CB-2-02960,Brandy St. Laurent,Cambodia,Preah Kleang,2016,6,13.667,104.982,May-Jul (early wet),A,PV,
VBS09466-4248STDY7308989,VBS09466,4248STDY7308989,CB-2-04070,Brandy St. Laurent,Cambodia,Preah Kleang,2016,11,13.667,104.982,Nov-Jan (early dry),A,PV,
VBS09467-4248STDY7308990,VBS09467,4248STDY7308990,CB-2-04121,Brandy St. Laurent,Cambodia,Preah Kleang,2016,11,13.667,104.982,Nov-Jan (early dry),A,PV,
VBS09477-4248STDY7308994,VBS09477,4248STDY7308994,CB-2-05011,Brandy St. Laurent,Cambodia,Preah Kleang,2016,12,13.667,104.982,Nov-Jan (early dry),A,PV,PV
VBS09482-4248STDY7308996,VBS09482,4248STDY7308996,CB-2-05167,Brandy St. Laurent,Cambodia,Preah Kleang,2016,12,13.667,104.982,Nov-Jan (early dry),A,PV,PV
VBS09483-4248STDY7308997,VBS09483,4248STDY7308997,CB-2-03873,Brandy St. Laurent,Cambodia,Preah Kleang,2016,12,13.667,104.982,Nov-Jan (early dry),A,PV,PV

The sample_id column gives the sample identifier used throughout all analyses.

The country, location, latitude and longitude columns give the location where the specimen was collected.

The year and month columns give the approximate date when the specimen was collected.

The cohort column gives an assignment of individual mosquitoes to populations based on location of sampling and genetic population structure.
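
The columns described above can be summarised with a few lines of Python. This is a minimal sketch using only the standard library; count_by_column is a hypothetical helper, and the inline rows are copied from the listing above so that the example runs without a download.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_file, column):
    """Count rows per distinct value of `column` in an open CSV file."""
    return Counter(row[column] for row in csv.DictReader(csv_file))


# Header and three rows copied from the head output shown earlier.
SAMPLE_ROWS = """\
sample_id,original_sample_id,sanger_sample_id,partner_sample_id,contributor,country,location,year,month,latitude,longitude,season,PCA_cohort,cohort,subsampled_cohort
VBS09378-4248STDY7308980,VBS09378,4248STDY7308980,CB-2-00264,Brandy St. Laurent,Cambodia,Preah Kleang,2016,3,13.667,104.982,Feb-Apr (late dry),A,PV,
VBS09382-4248STDY7308981,VBS09382,4248STDY7308981,CB-2-00258,Brandy St. Laurent,Cambodia,Preah Kleang,2016,3,13.667,104.982,Feb-Apr (late dry),A,PV,
VBS09460-4248STDY7308986,VBS09460,4248STDY7308986,CB-2-02960,Brandy St. Laurent,Cambodia,Preah Kleang,2016,6,13.667,104.982,May-Jul (early wet),A,PV,
"""

season_counts = count_by_column(io.StringIO(SAMPLE_ROWS), "season")
for season, n in season_counts.items():
    print(season, n)
```

To summarise the full file instead, open ~/vo_amin_release/v1/metadata/samples.meta.csv (with the home directory expanded, for example via os.path.expanduser) using newline="" and pass the file object to count_by_column with a column such as "cohort" or "country".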

Sequence read alignments (BAM format) and SNP calls (VCF format)

Analysis-ready sequence read alignments are available in BAM format for all samples in the release and can be downloaded from the object storage hosted at the Sanger Institute. SNP calls are also available for download in VCF format.

A catalog file mapping sample identifiers to download URLs is available at this URL:

  • https://storage.googleapis.com/vo_amin_release/v1/metadata/wgs_snp_data.csv

Alternatively, if you ran the gsutil rsync command above to download the metadata, this file will already be present on your local file system.

Here are the first few rows, showing the columns with the sample IDs and the BAM file URLs:

!head ~/vo_amin_release/v1/metadata/wgs_snp_data.csv | cut -d, -f1,2
sample_id,alignments_bam
VBS09378-4248STDY7308980,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09378-4248STDY7308980-2019-03-03.bam
VBS09382-4248STDY7308981,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09382-4248STDY7308981-2019-03-03.bam
VBS09397-4248STDY7308982,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09397-4248STDY7308982-2019-03-04.bam
VBS09460-4248STDY7308986,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09460-4248STDY7308986-2019-03-07.bam
VBS09466-4248STDY7308989,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09466-4248STDY7308989-2019-03-06.bam
VBS09467-4248STDY7308990,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09467-4248STDY7308990-2019-03-06.bam
VBS09477-4248STDY7308994,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09477-4248STDY7308994-2019-03-06.bam
VBS09482-4248STDY7308996,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09482-4248STDY7308996-2019-03-06.bam
VBS09483-4248STDY7308997,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09483-4248STDY7308997-2019-03-06.bam

For example, the first row provides information about sample VBS09378-4248STDY7308980, and the value of the alignments_bam field gives the download URL for the BAM file. To download this file locally:

# N.B., large data download
!wget --no-clobber https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09378-4248STDY7308980-2019-03-03.bam
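
Copying a URL by hand becomes tedious for more than a few samples. As a minimal sketch, the lookup could instead be done in Python against the catalog file. The helper lookup_url is hypothetical, and the inline catalog is trimmed to the two columns and first two rows shown above.

```python
import csv
import io


def lookup_url(catalog_file, sample_id, column="alignments_bam"):
    """Return the download URL in `column` for `sample_id`, or raise KeyError."""
    for row in csv.DictReader(catalog_file):
        if row["sample_id"] == sample_id:
            return row[column]
    raise KeyError(sample_id)


# Two rows copied from the catalog listing above (sample_id and alignments_bam only).
CATALOG_ROWS = """\
sample_id,alignments_bam
VBS09378-4248STDY7308980,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09378-4248STDY7308980-2019-03-03.bam
VBS09382-4248STDY7308981,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09382-4248STDY7308981-2019-03-03.bam
"""

bam_url = lookup_url(io.StringIO(CATALOG_ROWS), "VBS09382-4248STDY7308981")
print(bam_url)
```

With the real catalog, pass an open handle on wgs_snp_data.csv in place of the io.StringIO object. The same function returns VCF links when called with column="snp_genotypes_vcf".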

SNP calls in VCF format can also be downloaded. A VCF file is available for each individual sample. The download links for the VCF files are given by the snp_genotypes_vcf field in the catalog file.

For example, here are the first few rows of the catalog file, this time showing the sample_id and snp_genotypes_vcf columns:

!head ~/vo_amin_release/v1/metadata/wgs_snp_data.csv | cut -d, -f1,4
sample_id,snp_genotypes_vcf
VBS09378-4248STDY7308980,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09378-4248STDY7308980-2019-03-04.vcf.gz
VBS09382-4248STDY7308981,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09382-4248STDY7308981-2019-03-04.vcf.gz
VBS09397-4248STDY7308982,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09397-4248STDY7308982-2019-03-04.vcf.gz
VBS09460-4248STDY7308986,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09460-4248STDY7308986-2019-03-07.vcf.gz
VBS09466-4248STDY7308989,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09466-4248STDY7308989-2019-03-07.vcf.gz
VBS09467-4248STDY7308990,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09467-4248STDY7308990-2019-03-07.vcf.gz
VBS09477-4248STDY7308994,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09477-4248STDY7308994-2019-03-07.vcf.gz
VBS09482-4248STDY7308996,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09482-4248STDY7308996-2019-03-07.vcf.gz
VBS09483-4248STDY7308997,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09483-4248STDY7308997-2019-03-07.vcf.gz

For example, the first row provides information about sample VBS09378-4248STDY7308980, and the value of the snp_genotypes_vcf field gives the download URL for the VCF file for this sample. To download this file locally:

# N.B., large data download
!wget --no-clobber https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09378-4248STDY7308980-2019-03-04.vcf.gz
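
To download files for a subset of samples rather than one at a time, the sample metadata and the catalog can be joined on sample_id. The sketch below is one possible approach, not part of the release tooling: urls_for_samples is a hypothetical helper, and the inline tables are trimmed copies of rows shown earlier in this notebook.

```python
import csv
import io


def urls_for_samples(metadata_file, catalog_file, url_column, **criteria):
    """Return URLs from `url_column` for samples whose metadata match `criteria`.

    Values are compared as strings, exactly as they appear in the CSV.
    """
    wanted = {
        row["sample_id"]
        for row in csv.DictReader(metadata_file)
        if all(row[key] == value for key, value in criteria.items())
    }
    return [
        row[url_column]
        for row in csv.DictReader(catalog_file)
        if row["sample_id"] in wanted
    ]


# Metadata rows trimmed to the columns needed for this example.
METADATA = """\
sample_id,year,month
VBS09378-4248STDY7308980,2016,3
VBS09460-4248STDY7308986,2016,6
VBS09466-4248STDY7308989,2016,11
"""

# Catalog rows trimmed to sample_id and snp_genotypes_vcf.
CATALOG = """\
sample_id,snp_genotypes_vcf
VBS09378-4248STDY7308980,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09378-4248STDY7308980-2019-03-04.vcf.gz
VBS09460-4248STDY7308986,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09460-4248STDY7308986-2019-03-07.vcf.gz
VBS09466-4248STDY7308989,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09466-4248STDY7308989-2019-03-07.vcf.gz
"""

vcf_urls = urls_for_samples(
    io.StringIO(METADATA), io.StringIO(CATALOG), "snp_genotypes_vcf", month="3"
)
print("\n".join(vcf_urls))
```

Writing the resulting URLs to a text file, one per line, allows them to be fetched in a single command with wget --no-clobber --input-file=urls.txt, where urls.txt is whatever file name you chose.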

Feedback and suggestions

If there are particular analyses you would like to run, or if you have other suggestions for useful documentation we could add to this site, we would love to know. Please get in touch via the malariagen/vector-data GitHub discussion forum.