Af1 data downloads#
This notebook provides information about how to download data from the MalariaGEN Vector Observatory Anopheles funestus Genomic Surveillance Project. This includes sample metadata, raw sequence reads, sequence read alignments, and single nucleotide polymorphism (SNP) calls. Data from other releases can be accessed by changing the release in the examples (v1.0) to the specific Af release.
Code examples that are intended to be run via a Linux command line are prefixed with an exclamation mark (!). If you are running these commands directly from a terminal, remove the exclamation mark.
Examples in this notebook assume you are downloading data to a local folder within your home directory at the path ~/vo_afun_release/. Change this if you want to download to a different folder on the local file system.
Data hosting#
Af1 data are hosted by several different services.
Raw sequence reads in FASTQ format and sequence read alignments in BAM format are hosted by the European Nucleotide Archive (ENA). This guide provides examples of downloading data from ENA via FTP using the wget command line tool, but there are several other options for downloading data; see the ENA documentation on how to download data files for more information.
SNP calls in VCF and Zarr formats are hosted on S3-compatible object storage at the Sanger Institute. This guide provides examples of downloading these data using wget.
Sample metadata in CSV format are hosted on Google Cloud Storage (GCS) in the vo_afun_release_master_us_central1 bucket, which is a multi-region bucket located in the United States. All data hosted on GCS are publicly accessible but do require an authentication step; please see the Vector Observatory Data Access page for details.
The guide below provides examples of downloading data from GCS to a local computer using the wget and gsutil command line tools. For more information about gsutil, see the gsutil tool documentation.
Sample sets#
Data in these releases are organised into sample sets. Each of these sample sets corresponds to a set of mosquito specimens contributed by a collaborating study. Depending on your objectives, you may want to download data from only specific sample sets, or all sample sets. For convenience, there is a tab-delimited manifest file listing all sample sets in the release. This can be downloaded via gsutil to a directory on the local file system, e.g.:
!mkdir -pv ~/vo_afun_release/v1.0/
!gsutil cp gs://vo_afun_release_master_us_central1/v1.0/manifest.tsv ~/vo_afun_release/v1.0/
/Users/ah32/vo_afun_release
/Users/ah32/vo_afun_release/v1.0
Copying gs://vo_afun_release/v1.0/manifest.tsv...
/ [1 files][ 1015 B/ 1015 B]
Operation completed over 1 objects/1015.0 B.
Here are the file contents:
!cat ~/vo_afun_release/v1.0/manifest.tsv
sample_set sample_count study_id study_url
1229-VO-GH-DADZIE-VMF00095 36 1229-VO-GH-DADZIE https://www.malariagen.net/network/where-we-work/1229-VO-GH-DADZIE
1230-VO-GA-CF-AYALA-VMF00045 50 1230-VO-MULTI-AYALA https://www.malariagen.net/network/where-we-work/1230-VO-MULTI-AYALA
1231-VO-MULTI-WONDJI-VMF00043 320 1231-VO-MULTI-WONDJI https://www.malariagen.net/network/where-we-work/1231-VO-MULTI-WONDJI
1232-VO-KE-OCHOMO-VMF00044 81 1232-VO-KE-OCHOMO https://www.malariagen.net/network/where-we-work/1232-VO-KE-OCHOMO
1235-VO-MZ-PAAIJMANS-VMF00094 76 1235-VO-MZ-PAAIJMANS https://www.malariagen.net/network/where-we-work/1235-VO-MZ-PAAIJMANS
1236-VO-TZ-OKUMU-VMF00090 10 1236-VO-TZ-OKUMU https://www.malariagen.net/network/where-we-work/1236-VO-TZ-OKUMU
1240-VO-CD-KOEKEMOER-VMF00099 43 1240-VO-MULTI-KOEKEMOER https://www.malariagen.net/network/where-we-work/1240-VO-MULTI-KOEKEMOER
1240-VO-MZ-KOEKEMOER-VMF00101 40 1240-VO-MULTI-KOEKEMOER https://www.malariagen.net/network/where-we-work/1240-VO-MULTI-KOEKEMOER
For more information about these sample sets, you can explore the Af1.0 data user guide.
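Once the manifest has been downloaded, it can be convenient to load it programmatically, for example to loop over sample sets in later download steps. The following is a minimal sketch using only the Python standard library. It assumes the manifest is tab-delimited with the column names shown above, and that it was saved to the local path used in the download command; the function name `read_manifest` is purely illustrative.

```python
import csv
import io
from pathlib import Path


def read_manifest(text):
    """Parse tab-delimited manifest text into a list of dicts, one per sample set."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for row in reader:
        # Convert the sample count to an integer so it can be summed.
        row["sample_count"] = int(row["sample_count"])
        rows.append(row)
    return rows


# Assumed local path, matching the gsutil command above.
manifest_path = Path.home() / "vo_afun_release" / "v1.0" / "manifest.tsv"

if manifest_path.exists():
    sample_sets = read_manifest(manifest_path.read_text())
    for entry in sample_sets:
        print(entry["sample_set"], entry["sample_count"])
    print("total samples:", sum(entry["sample_count"] for entry in sample_sets))
```

The list of `sample_set` values can then be used to build per-sample-set paths for the downloads described in the rest of this guide.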
Sample metadata#
Data about the samples that were sequenced to generate this data resource are available, including the time and place of collection, the sex of the specimen, and our call regarding the species of the specimen.
Specimen collection metadata#
Specimen collection metadata can be downloaded from GCS. If you only want the sample metadata for a single sample set, include the sample set name in the path, e.g. to access the metadata for 1229-VO-GH-DADZIE-VMF00095, you would use gs://vo_afun_release_master_us_central1/v1.0/metadata/general/1229-VO-GH-DADZIE-VMF00095/samples.meta.csv. Sample metadata for all sample sets can be downloaded using gsutil, e.g.:
!mkdir -pv ~/vo_afun_release/v1.0/metadata/
!gsutil -m rsync -r gs://vo_afun_release_master_us_central1/v1.0/metadata/general/ ~/vo_afun_release/v1.0/metadata/
/Users/ah32/vo_afun_release/v1.0/metadata
Building synchronization state...
If you experience problems with multiprocessing on MacOS, they might be related to https://bugs.python.org/issue33725. You can disable multiprocessing by editing your .boto config or by adding the following flag to your command: `-o "GSUtil:parallel_process_count=1"`. Note that multithreading is still available even if you disable multiprocessing.
Starting synchronization...
If you experience problems with multiprocessing on MacOS, they might be related to https://bugs.python.org/issue33725. You can disable multiprocessing by editing your .boto config or by adding the following flag to your command: `-o "GSUtil:parallel_process_count=1"`. Note that multithreading is still available even if you disable multiprocessing.
Copying gs://vo_afun_release/v1.0/metadata/general/1229-VO-GH-DADZIE-VMF00095/samples.meta.csv...
Copying gs://vo_afun_release/v1.0/metadata/general/1230-VO-GA-CF-AYALA-VMF00045/samples.meta.csv...
Copying gs://vo_afun_release/v1.0/metadata/general/1229-VO-GH-DADZIE-VMF00095/wgs_snp_data.csv...
Copying gs://vo_afun_release/v1.0/metadata/general/1235-VO-MZ-PAAIJMANS-VMF00094/samples.meta.csv...
Copying gs://vo_afun_release/v1.0/metadata/general/1231-VO-MULTI-WONDJI-VMF00043/wgs_snp_data.csv...
Copying gs://vo_afun_release/v1.0/metadata/general/1235-VO-MZ-PAAIJMANS-VMF00094/wgs_snp_data.csv...
Copying gs://vo_afun_release/v1.0/metadata/general/1232-VO-KE-OCHOMO-VMF00044/wgs_snp_data.csv...
Copying gs://vo_afun_release/v1.0/metadata/general/1240-VO-CD-KOEKEMOER-VMF00099/samples.meta.csv...
Copying gs://vo_afun_release/v1.0/metadata/general/1231-VO-MULTI-WONDJI-VMF00043/samples.meta.csv...
Copying gs://vo_afun_release/v1.0/metadata/general/1232-VO-KE-OCHOMO-VMF00044/samples.meta.csv...
Copying gs://vo_afun_release/v1.0/metadata/general/1236-VO-TZ-OKUMU-VMF00090/wgs_snp_data.csv...
Copying gs://vo_afun_release/v1.0/metadata/general/1240-VO-CD-KOEKEMOER-VMF00099/wgs_snp_data.csv...
Copying gs://vo_afun_release/v1.0/metadata/general/1240-VO-MZ-KOEKEMOER-VMF00101/wgs_snp_data.csv...
Copying gs://vo_afun_release/v1.0/metadata/general/1230-VO-GA-CF-AYALA-VMF00045/wgs_snp_data.csv...
Copying gs://vo_afun_release/v1.0/metadata/general/1240-VO-MZ-KOEKEMOER-VMF00101/samples.meta.csv...
Copying gs://vo_afun_release/v1.0/metadata/general/README.md...
Copying gs://vo_afun_release/v1.0/metadata/general/1236-VO-TZ-OKUMU-VMF00090/samples.meta.csv...
- [17/17 files][305.0 KiB/305.0 KiB] 100% Done
Operation completed over 17 objects/305.0 KiB.
Here are the first few rows of the sample metadata for sample set 1229-VO-GH-DADZIE-VMF00095:
!head ~/vo_afun_release/v1.0/metadata/1229-VO-GH-DADZIE-VMF00095/samples.meta.csv
sample_id,partner_sample_id,contributor,country,location,year,month,latitude,longitude,sex_call
VBS24195,1229-GH-A-GH01,Samuel Dadzie,Ghana,Dimabi,2017,8,9.420,-1.083,F
VBS24196,1229-GH-A-GH02,Samuel Dadzie,Ghana,Gbullung,2017,7,9.488,-1.009,F
VBS24197,1229-GH-A-GH03,Samuel Dadzie,Ghana,Dimabi,2017,7,9.420,-1.083,F
VBS24198,1229-GH-A-GH04,Samuel Dadzie,Ghana,Dimabi,2017,8,9.420,-1.083,F
VBS24199,1229-GH-A-GH05,Samuel Dadzie,Ghana,Gupanarigu,2017,8,9.497,-0.952,F
VBS24200,1229-GH-A-GH06,Samuel Dadzie,Ghana,Gupanarigu,2017,7,9.497,-0.952,F
VBS24201,1229-GH-A-GH07,Samuel Dadzie,Ghana,Gupanarigu,2017,7,9.497,-0.952,F
VBS24202,1229-GH-A-GH08,Samuel Dadzie,Ghana,Gupanarigu,2017,7,9.497,-0.952,F
VBS24203,1229-GH-A-GH09,Samuel Dadzie,Ghana,Gupanarigu,2017,7,9.497,-0.952,F
The sample_id column gives the sample identifier used throughout all analyses.
The country, location, latitude and longitude columns give the location where the specimen was collected.
The year and month columns give the approximate date when the specimen was collected.
The sex_call column gives the sex as determined from the sequence data.
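As a quick check on a downloaded metadata file, the columns described above can be summarised with a few lines of Python. This is only a sketch, assuming the CSV layout shown above and the local path used in the earlier download command; the helper names `read_sample_metadata` and `count_by` are illustrative.

```python
import csv
from collections import Counter
from pathlib import Path


def read_sample_metadata(path):
    """Read a samples.meta.csv file into a list of dicts, one per sample."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def count_by(rows, column):
    """Count samples by the value found in the given column."""
    return Counter(row[column] for row in rows)


# Assumed local path, matching the gsutil rsync command above.
metadata_path = (
    Path.home()
    / "vo_afun_release"
    / "v1.0"
    / "metadata"
    / "1229-VO-GH-DADZIE-VMF00095"
    / "samples.meta.csv"
)

if metadata_path.exists():
    samples = read_sample_metadata(metadata_path)
    print(count_by(samples, "location"))
    print(count_by(samples, "sex_call"))
```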
SNP calls (VCF format)#
SNP genotypes#
SNP genotypes for individual mosquitoes in VCF format are available for download from Sanger S3-compatible object storage. A VCF file is available for each individual sample. To download a VCF file for a given sample, you will need the sample identifier and the sample set to which the sample belongs. The download URL for each sample is listed in the data catalog (wgs_snp_data.csv) in the metadata. E.g., for sample set 1229-VO-GH-DADZIE-VMF00095:
!head ~/vo_afun_release/v1.0/metadata/1229-VO-GH-DADZIE-VMF00095/wgs_snp_data.csv | cut -f1,4 -d,
sample_id,snp_genotypes_vcf
VBS24195,https://1229-vo-gh-dadzie-vmf00095.cog.sanger.ac.uk/VBS24195.vcf.gz
VBS24196,https://1229-vo-gh-dadzie-vmf00095.cog.sanger.ac.uk/VBS24196.vcf.gz
VBS24197,https://1229-vo-gh-dadzie-vmf00095.cog.sanger.ac.uk/VBS24197.vcf.gz
VBS24198,https://1229-vo-gh-dadzie-vmf00095.cog.sanger.ac.uk/VBS24198.vcf.gz
VBS24199,https://1229-vo-gh-dadzie-vmf00095.cog.sanger.ac.uk/VBS24199.vcf.gz
VBS24200,https://1229-vo-gh-dadzie-vmf00095.cog.sanger.ac.uk/VBS24200.vcf.gz
VBS24201,https://1229-vo-gh-dadzie-vmf00095.cog.sanger.ac.uk/VBS24201.vcf.gz
VBS24202,https://1229-vo-gh-dadzie-vmf00095.cog.sanger.ac.uk/VBS24202.vcf.gz
VBS24203,https://1229-vo-gh-dadzie-vmf00095.cog.sanger.ac.uk/VBS24203.vcf.gz
A VCF file and associated tabix index can be downloaded via wget, e.g.:
!wget --no-clobber https://1229-vo-gh-dadzie-vmf00095.cog.sanger.ac.uk/VBS24195.vcf.gz
!wget --no-clobber https://1229-vo-gh-dadzie-vmf00095.cog.sanger.ac.uk/VBS24195.vcf.gz.tbi
Note that each of these VCF files is around 3 GB, so downloading may take some time, and sufficient local storage will be needed.
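If you need VCFs for several samples, the URLs can be collected from the data catalog rather than typed by hand. The sketch below assumes the catalog has the `sample_id` and `snp_genotypes_vcf` columns shown above, and that each tabix index lives at the VCF URL with `.tbi` appended, as in the wget example. The function name `vcf_download_urls` and the output file name `vcf_urls.txt` are illustrative.

```python
import csv
import io
from pathlib import Path


def vcf_download_urls(catalog_text, sample_ids):
    """Return the VCF and tabix index URLs for the requested samples."""
    wanted = set(sample_ids)
    urls = []
    for row in csv.DictReader(io.StringIO(catalog_text)):
        if row["sample_id"] in wanted:
            vcf_url = row["snp_genotypes_vcf"]
            urls.append(vcf_url)
            # Assumption: the index sits next to the VCF with a .tbi suffix.
            urls.append(vcf_url + ".tbi")
    return urls


# Assumed local path, matching the metadata download above.
catalog_path = (
    Path.home()
    / "vo_afun_release"
    / "v1.0"
    / "metadata"
    / "1229-VO-GH-DADZIE-VMF00095"
    / "wgs_snp_data.csv"
)

if catalog_path.exists():
    urls = vcf_download_urls(catalog_path.read_text(), ["VBS24195", "VBS24196"])
    # Write one URL per line so the list can be passed to wget.
    Path("vcf_urls.txt").write_text("\n".join(urls) + "\n")
```

The resulting file can then be given to wget with `wget --no-clobber --input-file vcf_urls.txt`.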
Each of these VCF files is an “all sites” VCF file, meaning that genotypes have been called at all genomic positions where the reference nucleotide is not “N”, regardless of whether variation is observed in the given sample. This means that VCFs from multiple samples can be merged easily to create a multi-sample VCF, which may be required for certain analyses. For example, the code below merges VCFs for two samples over the first 1 Mbp of chromosome 3RL (note that bcftools merge needs a tabix index for every input file):
!bcftools merge --output-type z --regions 3RL:1-1000000 --output merged.vcf.gz VBS24195.vcf.gz VBS24196.vcf.gz
If you are just interested in analysing variants within a given set of samples, you might like to filter the merged VCF to remove non-variant sites and alleles, e.g., using bcftools view:
!bcftools view --output-type z --output-file merged_variant.vcf.gz --min-ac 1:nonmajor --trim-alt-alleles merged.vcf.gz
Site filters#
SNP calling is not always reliable, and we have created some site filters to allow excluding low quality SNPs. We have created some sites-only VCF files with site filter information in the FILTER column. These VCF files are hosted on GCS.
Each filter is available as a set of VCF files, one per chromosome. E.g., you can access the site filters on chromosome 2RL from:
gs://vo_afun_release_master_us_central1/v1.0/site_filters/dt_20200416/vcf/funestus/2RL_sitefilters.vcf.gz
Alternatively, all site filters VCFs can be downloaded using gsutil, e.g.:
!mkdir -pv ~/vo_afun_release/v1.0/site_filters/dt_20200416/vcf/funestus/
!gsutil -m rsync -r \
gs://vo_afun_release_master_us_central1/v1.0/site_filters/dt_20200416/vcf/funestus/ \
~/vo_afun_release/v1.0/site_filters/dt_20200416/vcf/funestus/
Note that site filters can be the result of different filter models. The filters above (dt_20200416) were produced by a decision tree model and are the filters used by default.
We have also produced a second set of site filters, which are the result of static cutoffs on the site summary statistics.
These hard filters can also be downloaded via gsutil, e.g.:
!mkdir -pv ~/vo_afun_release/v1.0/site_filters/sc_20220908/vcf/funestus/
!gsutil -m rsync -r \
gs://vo_afun_release_master_us_central1/v1.0/site_filters/sc_20220908/vcf/funestus/ \
~/vo_afun_release/v1.0/site_filters/sc_20220908/vcf/funestus/
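To get a feel for a downloaded site filters VCF, you can tally the values in its FILTER column, which is the seventh tab-separated column of a VCF data line. The following is a minimal sketch using only the Python standard library; it makes no assumption about which filter names appear in the file, and simply counts whatever values it finds. The local path is the one used in the download command above, and the function name `count_filter_values` is illustrative.

```python
import gzip
from collections import Counter
from pathlib import Path


def count_filter_values(lines):
    """Count values in the FILTER column (7th column) of VCF text lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta-information and the header line
        fields = line.rstrip("\n").split("\t")
        counts[fields[6]] += 1
    return counts


# Assumed local path, matching the gsutil rsync command above.
vcf_path = (
    Path.home()
    / "vo_afun_release"
    / "v1.0"
    / "site_filters"
    / "dt_20200416"
    / "vcf"
    / "funestus"
    / "2RL_sitefilters.vcf.gz"
)

if vcf_path.exists():
    # Python's gzip module can read the compressed VCF as text.
    with gzip.open(vcf_path, "rt") as f:
        print(count_filter_values(f))
```

This reads the whole file line by line, so it may take a while on a full chromosome. For region-based queries a tool such as bcftools is likely to be more practical.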
SNP calls (Zarr format)#
SNP data are also available in Zarr format, which can be convenient and efficient to use for certain types of analysis. These data can be analysed directly in the cloud without downloading to the local system; see the Af1 cloud data access guide for more information. The data can also be downloaded to your own system for local analysis if that is more convenient. Below are examples of how to download the Zarr data to your local system.
The data are organised into several Zarr hierarchies.
SNP sites and alleles#
Data on the genomic positions (sites) and reference and alternate alleles that were genotyped can be downloaded as follows:
!mkdir -pv ~/vo_afun_release/v1.0/snp_genotypes/all/sites/
!gsutil -m rsync -r \
gs://vo_afun_release_master_us_central1/v1.0/snp_genotypes/all/sites/ \
~/vo_afun_release/v1.0/snp_genotypes/all/sites/
Site filters#
SNP calling is not always reliable, and we have created some site filters to allow excluding low quality SNPs. To download site filters data in Zarr format:
!mkdir -pv ~/vo_afun_release/v1.0/site_filters/dt_20200416/funestus/
!gsutil -m rsync -r \
gs://vo_afun_release_master_us_central1/v1.0/site_filters/dt_20200416/funestus/ \
~/vo_afun_release/v1.0/site_filters/dt_20200416/funestus/
SNP genotypes#
SNP genotypes are available for each sample set separately. E.g., to download SNP genotypes in Zarr format for sample set 1229-VO-GH-DADZIE-VMF00095, excluding some data you probably won’t need:
!mkdir -pv ~/vo_afun_release/v1.0/snp_genotypes/all/1229-VO-GH-DADZIE-VMF00095/
!gsutil -m rsync -r \
-x '.*/calldata/(AD|GQ|MQ)/.*' \
gs://vo_afun_release_master_us_central1/v1.0/snp_genotypes/all/1229-VO-GH-DADZIE-VMF00095/ \
~/vo_afun_release/v1.0/snp_genotypes/all/1229-VO-GH-DADZIE-VMF00095/
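The `-x` option takes a Python regular expression, so you can preview what an exclusion pattern will skip before starting a large download. The sketch below assumes that gsutil matches the pattern from the start of each object path relative to the source directory. The example paths are hypothetical and only illustrate the general shape of a Zarr hierarchy.

```python
import re

# The exclusion pattern passed to `gsutil rsync -x` above.
exclude = re.compile(r".*/calldata/(AD|GQ|MQ)/.*")

# Hypothetical object paths, relative to the sample set directory.
example_paths = [
    "2RL/calldata/GT/0.0.0",
    "2RL/calldata/AD/0.0.0",
    "X/calldata/GQ/0.0",
    "2RL/calldata/MQ/.zarray",
]


def kept(paths, pattern):
    """Return the paths that the exclusion pattern does not match."""
    return [p for p in paths if not pattern.match(p)]


print(kept(example_paths, exclude))
# prints ['2RL/calldata/GT/0.0.0']
```

Here only the genotype calls (GT) are kept, while the AD, GQ and MQ arrays are skipped.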
Copy number variation (CNV) data#
Data on copy number variation within the Af1 cohort are available as two separate data types:
HMM: Genome-wide inferences of copy number state within each individual mosquito in 300 bp non-overlapping windows.
Coverage calls: Genome-wide copy number variant calls, derived from the HMM outputs by analysing contiguous regions of elevated copy number state and then clustering variants across individuals based on breakpoint proximity.
For more information on the methods used to generate these data, see the variant-calling methods page.
For each of these data types, data can be downloaded from Google Cloud Storage, and are available in either VCF or Zarr format.
CNV HMM#
The HMM inferences of copy number state are available in VCF, Zarr and text formats, and are organised by sample set.
For example, the VCF file for sample VBS24195 in sample set 1229-VO-GH-DADZIE-VMF00095 can be downloaded from:
gs://vo_afun_release_master_us_central1/v1.0/cnv/1229-VO-GH-DADZIE-VMF00095/hmm/vcf/VBS24195_cnv_hmm.vcf.gz
VCF files for all sample sets can be downloaded via gsutil as follows:
# create a local directory to hold downloaded CNV data
!mkdir -pv ~/vo_afun_release/v1.0/cnv/
# download the HMM data in VCF format for all sample sets
!gsutil -m rsync -r \
-x '.*/coverage_calls/.*|.*/hmm/zarr/.*|.*/hmm/per_sample/.*' \
gs://vo_afun_release_master_us_central1/v1.0/cnv/ ~/vo_afun_release/v1.0/cnv/
Zarr files for all sample sets can be downloaded as follows:
# download HMM data in Zarr format for all sample sets
!gsutil -m rsync -r \
-x '.*/coverage_calls/.*|.*/hmm/vcf/.*|.*/hmm/per_sample/.*' \
gs://vo_afun_release_master_us_central1/v1.0/cnv/ ~/vo_afun_release/v1.0/cnv/
CNV coverage calls#
Coverage-based CNV calls are available in VCF and Zarr formats, and are organised by sample set. Note that some samples were excluded from coverage calling because of high coverage variance.
For example, the VCF file for sample set 1229-VO-GH-DADZIE-VMF00095 can be downloaded from:
gs://vo_afun_release_master_us_central1/v1.0/cnv/1229-VO-GH-DADZIE-VMF00095/coverage_calls/funestus/vcf/1229-VO-GH-DADZIE-VMF00095_funestus_cnv_coverage_calls.vcf.gz
VCF files for all sample sets can be downloaded with:
# download coverage calls in VCF format for all sample sets
!gsutil -m rsync -r \
-x '.*/hmm/.*|.*/coverage_calls/.*/zarr/.*' \
gs://vo_afun_release_master_us_central1/v1.0/cnv/ ~/vo_afun_release/v1.0/cnv/
Zarr files for all sample sets can be downloaded with:
# download coverage calls in Zarr format for all sample sets
!gsutil -m rsync -r \
-x '.*/hmm/.*|.*/coverage_calls/.*/vcf/.*' \
gs://vo_afun_release_master_us_central1/v1.0/cnv/ ~/vo_afun_release/v1.0/cnv/
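The four CNV download commands above differ only in their `-x` exclusion pattern, and each is intended to leave a single data type and format. The sketch below checks this against a handful of example paths. It assumes that gsutil applies each pattern as a Python regular expression, matched from the start of the object path relative to the `cnv/` directory. The two VCF paths follow the examples given in this guide, while the Zarr and per_sample paths are hypothetical.

```python
import re

# Exclusion patterns copied from the four gsutil commands above.
patterns = {
    "hmm_vcf": r".*/coverage_calls/.*|.*/hmm/zarr/.*|.*/hmm/per_sample/.*",
    "hmm_zarr": r".*/coverage_calls/.*|.*/hmm/vcf/.*|.*/hmm/per_sample/.*",
    "coverage_vcf": r".*/hmm/.*|.*/coverage_calls/.*/zarr/.*",
    "coverage_zarr": r".*/hmm/.*|.*/coverage_calls/.*/vcf/.*",
}

sample_set = "1229-VO-GH-DADZIE-VMF00095"

# Example object paths relative to the cnv/ directory.
cnv_paths = [
    f"{sample_set}/hmm/vcf/VBS24195_cnv_hmm.vcf.gz",
    f"{sample_set}/hmm/zarr/.zgroup",  # hypothetical
    f"{sample_set}/hmm/per_sample/VBS24195.txt",  # hypothetical
    f"{sample_set}/coverage_calls/funestus/vcf/"
    f"{sample_set}_funestus_cnv_coverage_calls.vcf.gz",
    f"{sample_set}/coverage_calls/funestus/zarr/.zgroup",  # hypothetical
]


def kept_paths(pattern, paths):
    """Return the paths that would still be copied under this exclusion pattern."""
    regex = re.compile(pattern)
    return [p for p in paths if not regex.match(p)]


for name, pattern in patterns.items():
    print(name, kept_paths(pattern, cnv_paths))
```

With these example paths, each pattern keeps exactly one path, which is the one named by its key.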
Haplotypes#
The Af1 data resource also includes haplotype reference panels, which were obtained by phasing the SNP calls.
Haplotype data can be downloaded in either VCF or Zarr format. See the subsections below for further details.
Haplotype reference panels (VCF format)#
These are the VCFs created by the phasing pipeline, containing all samples included in each of the phasing runs. There is one VCF per phasing analysis per chromosome arm. The URL for each file has the following structure:
gs://vo_afun_release_master_us_central1/v1.0/snp_haplotypes/panel/funestus/af1.0_funestus_{contig}_phased.vcf.gz
…where {contig} is one of “2RL”, “3RL”, “X”.
E.g., the panel VCF for the chromosome arm 3RL can be downloaded from:
gs://vo_afun_release_master_us_central1/v1.0/snp_haplotypes/panel/funestus/af1.0_funestus_3RL_phased.vcf.gz
Note that these files can be large, up to ~5 GB.
If you’d like to download all of the panel files, you could also use gsutil, e.g.:
# create a local directory to store the data
!mkdir -pv ~/vo_afun_release/v1.0/snp_haplotypes/panel/funestus/
# copy files from cloud to local file system
!gsutil -m rsync -r \
-x '.*/.*zarr.zip' \
gs://vo_afun_release_master_us_central1/v1.0/snp_haplotypes/panel/funestus/ \
~/vo_afun_release/v1.0/snp_haplotypes/panel/funestus/
Sample set haplotypes (VCF format)#
These VCFs are subsets of the panel VCFs, containing only samples in a given sample set. There is one VCF per sample set, per phasing analysis, per chromosome arm. The URL for each file has the following structure:
gs://vo_afun_release_master_us_central1/v1.0/snp_haplotypes/{sample_set}/funestus/vcf/{sample_set}_funestus_{contig}_phased.vcf.gz
…where {contig} is one of “2RL”, “3RL”, “X”; and {sample_set} is one of the Af sample sets.
E.g., the VCF for sample set 1229-VO-GH-DADZIE-VMF00095, for chromosome arm 2RL, can be downloaded here:
gs://vo_afun_release_master_us_central1/v1.0/snp_haplotypes/1229-VO-GH-DADZIE-VMF00095/funestus/vcf/1229-VO-GH-DADZIE-VMF00095_funestus_2RL_phased.vcf.gz
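The URL structure above can be filled in programmatically when you need files for several contigs or sample sets. This is a small sketch that only does string formatting based on the template given above; the function name `sample_set_haplotype_uris` and its `release` parameter are illustrative, and the code does not check that the resulting objects exist.

```python
BUCKET = "gs://vo_afun_release_master_us_central1"
CONTIGS = ("2RL", "3RL", "X")


def sample_set_haplotype_uris(sample_set, release="v1.0", contigs=CONTIGS):
    """Build the phased VCF URI for each contig of one sample set."""
    return [
        f"{BUCKET}/{release}/snp_haplotypes/{sample_set}/funestus/vcf/"
        f"{sample_set}_funestus_{contig}_phased.vcf.gz"
        for contig in contigs
    ]


for uri in sample_set_haplotype_uris("1229-VO-GH-DADZIE-VMF00095"):
    print(uri)
```

Each printed URI can be passed to `gsutil cp` together with a local destination directory.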
If you’d like to download all of the VCF files for a given sample set, you could also use gsutil, e.g.:
# create a local directory to store the data
!mkdir -pv ~/vo_afun_release/v1.0/snp_haplotypes/1229-VO-GH-DADZIE-VMF00095/
# copy files from cloud to local file system
!gsutil -m rsync -r \
-x '.*/zarr/.*' \
gs://vo_afun_release_master_us_central1/v1.0/snp_haplotypes/1229-VO-GH-DADZIE-VMF00095/ \
~/vo_afun_release/v1.0/snp_haplotypes/1229-VO-GH-DADZIE-VMF00095/
Sample set haplotypes (Zarr format)#
These contain the haplotype data in Zarr format, with one Zarr hierarchy per sample set. The root zarr path for a given hierarchy has the following structure:
gs://vo_afun_release_master_us_central1/v1.0/snp_haplotypes/{sample_set}/funestus/zarr
Data can be downloaded with gsutil. E.g., download the Zarr data for sample set 1229-VO-GH-DADZIE-VMF00095. Note that the sites are stored in a separate hierarchy:
# create local directories to store the data
!mkdir -pv ~/vo_afun_release/v1.0/snp_haplotypes/sites/funestus/
!mkdir -pv ~/vo_afun_release/v1.0/snp_haplotypes/1229-VO-GH-DADZIE-VMF00095/funestus/
# copy haplotype data from cloud to local file system
!gsutil -m rsync -r \
-x '.*/vcf/.*' \
gs://vo_afun_release_master_us_central1/v1.0/snp_haplotypes/1229-VO-GH-DADZIE-VMF00095/funestus/ \
~/vo_afun_release/v1.0/snp_haplotypes/1229-VO-GH-DADZIE-VMF00095/funestus/
# copy phased sites data from cloud to local file system
!gsutil -m rsync -r \
gs://vo_afun_release_master_us_central1/v1.0/snp_haplotypes/sites/funestus/ \
~/vo_afun_release/v1.0/snp_haplotypes/sites/funestus/
Feedback and suggestions#
If there are particular analyses you would like to run, or if you have other suggestions for useful documentation we could add to this site, we would love to know. Please get in touch via the malariagen/vector-data GitHub discussion board.