Amin1.0 data downloads#
This notebook provides information about how to download data from the Amin1.0 resource. This includes sample metadata, sequence read alignments and single nucleotide polymorphism (SNP) calls.
Code examples that are intended to be run via a Linux command line are prefixed with an exclamation mark (!). If you are running these commands directly from a terminal, remove the exclamation mark.
Examples in this notebook assume you are downloading data to a local folder within your home directory at the path ~/vo_amin_release/. Change this if you want to download to a different folder on the local file system.
Amin1.0 metadata files are hosted on Google Cloud Storage (GCS) in the vo_amin_release bucket, which is a multi-region bucket located in the United States. All data hosted on GCS are publicly accessible and do not require any authentication to access. This guide provides examples of downloading data from GCS to a local computer using the gsutil command line tools. For more information about gsutil, see the gsutil tool documentation.
BAM and VCF files and associated index files are stored on S3-compatible object storage hosted at the Sanger Institute. These files can be downloaded with tools such as wget, as in the examples below.
Data about the samples that were sequenced to generate this data resource are available, including the time and place of specimen collection. These data are provided as a CSV file which can be downloaded from the following URL:
Download this file:
!mkdir -pv ~/vo_amin_release/v1/metadata
!gsutil rsync -r gs://vo_amin_release/v1/metadata/ ~/vo_amin_release/v1/metadata/
Inspect the first few rows of the sample metadata file:
sample_id,original_sample_id,sanger_sample_id,partner_sample_id,contributor,country,location,year,month,latitude,longitude,season,PCA_cohort,cohort,subsampled_cohort
VBS09378-4248STDY7308980,VBS09378,4248STDY7308980,CB-2-00264,Brandy St. Laurent,Cambodia,Preah Kleang,2016,3,13.667,104.982,Feb-Apr (late dry),A,PV,
VBS09382-4248STDY7308981,VBS09382,4248STDY7308981,CB-2-00258,Brandy St. Laurent,Cambodia,Preah Kleang,2016,3,13.667,104.982,Feb-Apr (late dry),A,PV,
VBS09397-4248STDY7308982,VBS09397,4248STDY7308982,CB-2-00384,Brandy St. Laurent,Cambodia,Preah Kleang,2016,3,13.667,104.982,Feb-Apr (late dry),A,PV,PV
VBS09460-4248STDY7308986,VBS09460,4248STDY7308986,CB-2-02960,Brandy St. Laurent,Cambodia,Preah Kleang,2016,6,13.667,104.982,May-Jul (early wet),A,PV,
VBS09466-4248STDY7308989,VBS09466,4248STDY7308989,CB-2-04070,Brandy St. Laurent,Cambodia,Preah Kleang,2016,11,13.667,104.982,Nov-Jan (early dry),A,PV,
VBS09467-4248STDY7308990,VBS09467,4248STDY7308990,CB-2-04121,Brandy St. Laurent,Cambodia,Preah Kleang,2016,11,13.667,104.982,Nov-Jan (early dry),A,PV,
VBS09477-4248STDY7308994,VBS09477,4248STDY7308994,CB-2-05011,Brandy St. Laurent,Cambodia,Preah Kleang,2016,12,13.667,104.982,Nov-Jan (early dry),A,PV,PV
VBS09482-4248STDY7308996,VBS09482,4248STDY7308996,CB-2-05167,Brandy St. Laurent,Cambodia,Preah Kleang,2016,12,13.667,104.982,Nov-Jan (early dry),A,PV,PV
VBS09483-4248STDY7308997,VBS09483,4248STDY7308997,CB-2-03873,Brandy St. Laurent,Cambodia,Preah Kleang,2016,12,13.667,104.982,Nov-Jan (early dry),A,PV,PV
The sample_id column gives the sample identifier used throughout all analyses.
The latitude and longitude columns give the location where the specimen was collected.
The year and month columns give the approximate date when the specimen was collected.
The cohort column gives an assignment of individual mosquitoes to populations based on location of sampling and genetic population structure.
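As a minimal sketch of how these columns might be used, the following Python snippet parses three rows copied from the output above with the standard-library csv module and tallies specimens by year and month of collection. The inline SAMPLE_ROWS excerpt and the count_by_month helper are illustrative assumptions, not part of the Amin1.0 release. In practice you would open the sample metadata CSV downloaded above instead of the inline excerpt.

```python
import csv
import io
from collections import Counter

# Excerpt copied from the sample metadata output shown above (three rows only).
SAMPLE_ROWS = """\
sample_id,original_sample_id,sanger_sample_id,partner_sample_id,contributor,country,location,year,month,latitude,longitude,season,PCA_cohort,cohort,subsampled_cohort
VBS09378-4248STDY7308980,VBS09378,4248STDY7308980,CB-2-00264,Brandy St. Laurent,Cambodia,Preah Kleang,2016,3,13.667,104.982,Feb-Apr (late dry),A,PV,
VBS09460-4248STDY7308986,VBS09460,4248STDY7308986,CB-2-02960,Brandy St. Laurent,Cambodia,Preah Kleang,2016,6,13.667,104.982,May-Jul (early wet),A,PV,
VBS09477-4248STDY7308994,VBS09477,4248STDY7308994,CB-2-05011,Brandy St. Laurent,Cambodia,Preah Kleang,2016,12,13.667,104.982,Nov-Jan (early dry),A,PV,PV
"""


def count_by_month(handle):
    """Count samples per (year, month) of collection (hypothetical helper)."""
    counts = Counter()
    for row in csv.DictReader(handle):
        counts[(int(row["year"]), int(row["month"]))] += 1
    return counts


counts = count_by_month(io.StringIO(SAMPLE_ROWS))
for (year, month), n in sorted(counts.items()):
    print(f"{year}-{month:02d}: {n} sample(s)")
```

Reading columns by name with csv.DictReader keeps the code independent of column order, which may be helpful if the metadata layout changes between releases.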
Sequence read alignments (BAM format) and SNP calls (VCF format)#
Analysis-ready sequence read alignments are available in BAM format for all samples in the release and can be downloaded from the S3-compatible object storage hosted at the Sanger Institute. SNP calls are also available for download in VCF format.
A catalog file mapping sample identifiers to download URLs is available at this URL:
Alternatively, if you ran the gsutil rsync command above to download metadata, this file will already be present on your local file system at ~/vo_amin_release/v1/metadata/wgs_snp_data.csv.
Here are the first few rows, showing the columns with the sample IDs and the BAM file URLs:
!head ~/vo_amin_release/v1/metadata/wgs_snp_data.csv | cut -d, -f1,2
sample_id,alignments_bam
VBS09378-4248STDY7308980,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09378-4248STDY7308980-2019-03-03.bam
VBS09382-4248STDY7308981,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09382-4248STDY7308981-2019-03-03.bam
VBS09397-4248STDY7308982,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09397-4248STDY7308982-2019-03-04.bam
VBS09460-4248STDY7308986,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09460-4248STDY7308986-2019-03-07.bam
VBS09466-4248STDY7308989,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09466-4248STDY7308989-2019-03-06.bam
VBS09467-4248STDY7308990,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09467-4248STDY7308990-2019-03-06.bam
VBS09477-4248STDY7308994,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09477-4248STDY7308994-2019-03-06.bam
VBS09482-4248STDY7308996,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09482-4248STDY7308996-2019-03-06.bam
VBS09483-4248STDY7308997,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09483-4248STDY7308997-2019-03-06.bam
For example, the first row provides information about sample VBS09378-4248STDY7308980, and the value of the alignments_bam field gives the download URL for the BAM file. To download this file locally:
# N.B., large data download
!wget --no-clobber https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09378-4248STDY7308980-2019-03-03.bam
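If you prefer to look up download URLs programmatically rather than reading them off the head output, the catalog can be loaded into a dictionary keyed by sample identifier. The sketch below is one possible approach using only the Python standard library. The load_urls helper and the inline CATALOG_EXCERPT (two rows copied from the catalog output above, reduced to two columns) are assumptions made for illustration.

```python
import csv
import io

# Two rows copied from the catalog output shown above, reduced to two columns.
CATALOG_EXCERPT = """\
sample_id,alignments_bam
VBS09378-4248STDY7308980,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09378-4248STDY7308980-2019-03-03.bam
VBS09382-4248STDY7308981,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09382-4248STDY7308981-2019-03-03.bam
"""


def load_urls(handle, column):
    """Map each sample_id to the URL in the named column (hypothetical helper)."""
    return {row["sample_id"]: row[column] for row in csv.DictReader(handle)}


bam_urls = load_urls(io.StringIO(CATALOG_EXCERPT), "alignments_bam")
print(bam_urls["VBS09378-4248STDY7308980"])
```

With the real catalog, you could pass an open file handle for ~/vo_amin_release/v1/metadata/wgs_snp_data.csv (expanded with os.path.expanduser) in place of the io.StringIO excerpt. Selecting the column by name avoids depending on its position in the file.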
SNP calls in VCF format can also be downloaded. A VCF file is available for each individual sample. The download links for the VCF files are given by the snp_genotypes_vcf field in the catalog file.
For example, here are the first few rows of the catalog file, this time showing the sample_id and snp_genotypes_vcf columns:
!head ~/vo_amin_release/v1/metadata/wgs_snp_data.csv | cut -d, -f1,4
sample_id,snp_genotypes_vcf
VBS09378-4248STDY7308980,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09378-4248STDY7308980-2019-03-04.vcf.gz
VBS09382-4248STDY7308981,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09382-4248STDY7308981-2019-03-04.vcf.gz
VBS09397-4248STDY7308982,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09397-4248STDY7308982-2019-03-04.vcf.gz
VBS09460-4248STDY7308986,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09460-4248STDY7308986-2019-03-07.vcf.gz
VBS09466-4248STDY7308989,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09466-4248STDY7308989-2019-03-07.vcf.gz
VBS09467-4248STDY7308990,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09467-4248STDY7308990-2019-03-07.vcf.gz
VBS09477-4248STDY7308994,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09477-4248STDY7308994-2019-03-07.vcf.gz
VBS09482-4248STDY7308996,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09482-4248STDY7308996-2019-03-07.vcf.gz
VBS09483-4248STDY7308997,https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09483-4248STDY7308997-2019-03-07.vcf.gz
For example, the first row provides information about sample VBS09378-4248STDY7308980, and the value of the snp_genotypes_vcf field gives the download URL for the VCF file for this sample. To download this file locally:
# N.B., large data download
!wget --no-clobber https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/VBS09378-4248STDY7308980-2019-03-04.vcf.gz
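If wget is not available, a similar skip-if-present download can be approximated in Python. The following is a rough sketch using only the standard library, not an exact equivalent of wget --no-clobber: it writes to a temporary .part file first, which wget does not do. The local_name and download_no_clobber helpers and the ~/vo_amin_release/v1/vcf destination folder are assumptions introduced for illustration.

```python
import os
import urllib.request
from urllib.parse import urlparse


def local_name(url, dest_dir):
    """Derive the local file path from the last component of the URL path."""
    return os.path.join(dest_dir, os.path.basename(urlparse(url).path))


def download_no_clobber(url, dest_dir):
    """Download url into dest_dir unless the file already exists (hypothetical helper).

    Returns the local path and whether a download was performed.
    """
    os.makedirs(dest_dir, exist_ok=True)
    path = local_name(url, dest_dir)
    if os.path.exists(path):
        return path, False
    # Download to a temporary name first so an interrupted transfer
    # is not mistaken for a complete file on the next run.
    tmp_path = path + ".part"
    urllib.request.urlretrieve(url, tmp_path)
    os.replace(tmp_path, path)
    return path, True


vcf_url = (
    "https://1175-vo-kh-stlaurent-minimus.cog.sanger.ac.uk/"
    "VBS09378-4248STDY7308980-2019-03-04.vcf.gz"
)
# Assumed destination folder; change to suit your local file system.
dest_dir = os.path.expanduser("~/vo_amin_release/v1/vcf")
print(local_name(vcf_url, dest_dir))
# Uncomment to perform the (large) download:
# download_no_clobber(vcf_url, dest_dir)
```

The same helper works for BAM URLs, since it only relies on the final component of the URL path to name the local file.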
Feedback and suggestions#
If there are particular analyses you would like to run, or if you have other suggestions for useful documentation we could add to this site, we would love to know. Please get in touch via the malariagen/vector-data GitHub discussion forum.