Ag3.0 (Ag1000G phase 3)

The Anopheles gambiae 1000 Genomes Project (Ag1000G) is a collaborative project using whole-genome sequencing to study genetic variation and evolution in natural populations of mosquitoes in the Anopheles gambiae species complex.

This page provides an introduction to the open data resources released as part of the third phase of the Ag1000G project, known as Ag3.0 for short. We hope the data from Ag3.0 will be a valuable resource for research and surveillance of malaria vectors. If you have any questions about this guide or how to use the data, please start a new discussion on the malariagen/vector-data repo on GitHub. If you find any bugs, please raise an issue.

Terms of use

Data from Ag3.0 are released openly and can be downloaded and analysed for any purpose. The data have been released prior to publication by the Ag1000G Consortium, and are currently subject to a publication embargo described further in the Ag1000G terms of use. If you have any questions about the terms of use, please email data@malariagen.net.

Partner studies

The Ag1000G project is coordinated by a consortium of partners from a range of different research institutions and countries. This includes consortium members who are carrying out independent research studies in malaria-endemic regions, and who have contributed mosquito specimens or mosquito DNA samples collected in the course of their own research. In total, 26 studies contributed samples to Ag3.0, including wild-caught specimens from 19 countries. For further information about these contributing studies, the researchers involved, and the collection sites and methods, see the Ag1000G partner studies page.

Population sampling

Ag3.0 includes data from 3,081 individual mosquitoes, including 2,784 mosquitoes collected from natural populations in 19 countries. Three species are represented within the cohort: Anopheles gambiae, Anopheles coluzzii and Anopheles arabiensis. The map below provides an overview of the collection locations and the numbers of samples broken down by species.

[Figure: Ag3 map of sampling sites]

In addition to these wild-caught samples, a further 297 samples are included from 15 lab crosses.

Whole-genome sequencing and variant calling

All samples in Ag3.0 have been sequenced individually to high coverage using Illumina technology at the Wellcome Sanger Institute. These sequence data have then been analysed to identify genetic variants such as single nucleotide polymorphisms (SNPs). After variant calling, both the samples and the variants have been through a range of quality control analyses, to ensure the data are of high quality. Both the raw sequence data and the curated variant calls are openly available for download and analysis.

For further information about the sequencing and variant calling methods used, please see the methods page.

Data hosting

Data from Ag3.0 are hosted by several different services.

Raw sequence reads, sequence read alignments and SNP calls are available for download from the European Nucleotide Archive (ENA). Further information on how to find and download these data is provided in the data download guide.

The SNP data have also been uploaded to Google Cloud, and can be analysed directly within the cloud without having to download or copy any data, including via free interactive computing services such as MyBinder and Google Colab. Further information about analysing these data in the cloud is provided in the cloud data access guide.

Sample sets

The samples included in Ag3.0 have been organised into 28 sample sets. Each of these sample sets corresponds to a set of mosquito specimens from a contributing study. Depending on your objectives, you may want to access data from only specific sample sets, or all sample sets. Here is a list of the sample sets included:

sample_set    sample_count
AG1000G-AO              81
AG1000G-BF-A           181
AG1000G-BF-B           102
AG1000G-BF-C            13
AG1000G-CD              76
AG1000G-CF              73
AG1000G-CI              80
AG1000G-CM-A           303
AG1000G-CM-B            97
AG1000G-CM-C            44
AG1000G-FR              23
AG1000G-GA-A            69
AG1000G-GH             100
AG1000G-GM-A            74
AG1000G-GM-B            31
AG1000G-GM-C           174
AG1000G-GN-A            45
AG1000G-GN-B           185
AG1000G-GQ              10
AG1000G-GW             101
AG1000G-KE              86
AG1000G-ML-A            60
AG1000G-ML-B            71
AG1000G-MW              41
AG1000G-MZ              74
AG1000G-TZ             300
AG1000G-UG             290
AG1000G-X              297
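As a quick consistency check, the counts listed above can be summed in a few lines of Python. This is only a sketch: the dictionary is transcribed by hand from the list of sample sets, with 297 taken for AG1000G-X (the number of lab-cross samples given in the population sampling section), and the helper name `summarise_counts` is our own and not part of any MalariaGEN software.

```python
# Sample counts per sample set, transcribed from the list above.
# AG1000G-X holds the lab-cross samples (297, per the population sampling section).
SAMPLE_COUNTS = {
    "AG1000G-AO": 81, "AG1000G-BF-A": 181, "AG1000G-BF-B": 102,
    "AG1000G-BF-C": 13, "AG1000G-CD": 76, "AG1000G-CF": 73,
    "AG1000G-CI": 80, "AG1000G-CM-A": 303, "AG1000G-CM-B": 97,
    "AG1000G-CM-C": 44, "AG1000G-FR": 23, "AG1000G-GA-A": 69,
    "AG1000G-GH": 100, "AG1000G-GM-A": 74, "AG1000G-GM-B": 31,
    "AG1000G-GM-C": 174, "AG1000G-GN-A": 45, "AG1000G-GN-B": 185,
    "AG1000G-GQ": 10, "AG1000G-GW": 101, "AG1000G-KE": 86,
    "AG1000G-ML-A": 60, "AG1000G-ML-B": 71, "AG1000G-MW": 41,
    "AG1000G-MZ": 74, "AG1000G-TZ": 300, "AG1000G-UG": 290,
    "AG1000G-X": 297,
}


def summarise_counts(counts, cross_set="AG1000G-X"):
    """Return (n_sample_sets, n_wild_caught, n_crosses, n_total)."""
    n_crosses = counts.get(cross_set, 0)
    n_total = sum(counts.values())
    return len(counts), n_total - n_crosses, n_crosses, n_total


if __name__ == "__main__":
    n_sets, n_wild, n_crosses, n_total = summarise_counts(SAMPLE_COUNTS)
    print(f"{n_sets} sample sets: {n_wild} wild-caught + "
          f"{n_crosses} from lab crosses = {n_total} samples")
```

With these counts the totals agree with the figures quoted earlier on this page: 28 sample sets, 2,784 wild-caught mosquitoes and 297 lab-cross samples, giving 3,081 in all.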

The sample set identifiers all start with “AG1000G-” followed by the two-letter code of the country from which samples were collected (e.g., “AO” is Angola). Where there are multiple sample sets from the same country, these have been given alphabetical suffixes, e.g., “AG1000G-BF-A”, “AG1000G-BF-B” and “AG1000G-BF-C” are three sample sets from Burkina Faso. The one exception is “AG1000G-X”, which contains the samples from lab crosses rather than from a single country.

These country codes are just a convenience to help remember which sample sets contain which data; please see the sample metadata for more precise location information. Note also that sample set AG1000G-GN-B contains samples from both Guinea and Mali.
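The naming convention described above is regular enough to parse mechanically, for example to group sample sets by country code. The following is a minimal sketch under that assumption; the function names `parse_sample_set` and `group_by_code` and the regular expression are illustrative choices of ours, not part of any official tooling, and the code is only a mnemonic (AG1000G-GN-B, for instance, also contains samples from Mali).

```python
import re
from collections import defaultdict
from typing import NamedTuple, Optional

# "AG1000G-" + two-letter country code (or "X" for the lab crosses),
# optionally followed by "-" and a single-letter suffix such as "A".
_SAMPLE_SET_RE = re.compile(r"AG1000G-(?P<code>[A-Z]{2}|X)(?:-(?P<suffix>[A-Z]))?")


class SampleSetId(NamedTuple):
    code: str               # e.g. "BF", or "X" for lab crosses
    suffix: Optional[str]   # e.g. "A", or None when there is no suffix


def parse_sample_set(identifier: str) -> SampleSetId:
    """Split a sample set identifier into its code and optional suffix."""
    match = _SAMPLE_SET_RE.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a recognised sample set identifier: {identifier!r}")
    return SampleSetId(match.group("code"), match.group("suffix"))


def group_by_code(identifiers):
    """Group sample set identifiers by their country (or cross) code."""
    groups = defaultdict(list)
    for identifier in identifiers:
        groups[parse_sample_set(identifier).code].append(identifier)
    return dict(groups)


if __name__ == "__main__":
    ids = ["AG1000G-AO", "AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-C", "AG1000G-X"]
    for code, members in group_by_code(ids).items():
        print(code, members)
```

For precise collection locations, the sample metadata remains the authoritative source.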

Here is a more detailed breakdown of the samples contained within each sample set, summarised by country, year of collection, and species:

Number of samples by taxon:

study_id       sample_set    country                           year  arabiensis  coluzzii  gambiae  gcx1  gcx2  gcx3  unassigned
AG1000G-AO     AG1000G-AO    Angola                            2009           0        81        0     0     0     0           0
AG1000G-BF-1   AG1000G-BF-A  Burkina Faso                      2012           0        82       99     0     0     0           0
AG1000G-BF-1   AG1000G-BF-B  Burkina Faso                      2014           3        53       46     0     0     0           0
AG1000G-BF-2   AG1000G-BF-C  Burkina Faso                      2004           0         0       13     0     0     0           0
AG1000G-CD     AG1000G-CD    Democratic Republic of the Congo  2015           0         0       76     0     0     0           0
AG1000G-CF     AG1000G-CF    Central African Republic          1993           0         5        2     0     0     0           0
AG1000G-CF     AG1000G-CF    Central African Republic          1994           0        13       53     0     0     0           0
AG1000G-CI     AG1000G-CI    Cote d'Ivoire                     2012           0        80        0     0     0     0           0
AG1000G-CM-1   AG1000G-CM-A  Cameroon                          2009           0         0      303     0     0     0           0
AG1000G-CM-2   AG1000G-CM-B  Cameroon                          2005           0         7       90     0     0     0           0
AG1000G-CM-3   AG1000G-CM-C  Cameroon                          2013           2        19       23     0     0     0           0
AG1000G-FR     AG1000G-FR    Mayotte                           2011           0         0       23     0     0     0           0
AG1000G-GA-1   AG1000G-GA-A  Gabon                             2000           0         0       69     0     0     0           0
AG1000G-GH     AG1000G-GH    Ghana                             2012           0        64       36     0     0     0           0
AG1000G-GM-1   AG1000G-GM-A  Gambia, The                       2011           0         0        0    68     6     0           0
AG1000G-GM-2   AG1000G-GM-B  Gambia, The                       2006           0         0        0     9    22     0           0
AG1000G-GM-3   AG1000G-GM-C  Gambia, The                       2012           0         0        2     0   172     0           0
AG1000G-GN-ML  AG1000G-GN-A  Guinea                            2012           0         4       41     0     0     0           0
AG1000G-GN-ML  AG1000G-GN-B  Guinea                            2012           0         7       84     0     0     0           0
AG1000G-GN-ML  AG1000G-GN-B  Mali                              2012           0        27       65     0     0     0           2
AG1000G-GQ     AG1000G-GQ    Equatorial Guinea                 2002           0         0       10     0     0     0           0
AG1000G-GW     AG1000G-GW    Guinea-Bissau                     2010           0         0        8    93     0     0           0
AG1000G-KE     AG1000G-KE    Kenya                             2000           0         0       19     0     0     0           0
AG1000G-KE     AG1000G-KE    Kenya                             2007           3         0        0     0     0     0           0
AG1000G-KE     AG1000G-KE    Kenya                             2012          10         0        0     0     0    54           0
AG1000G-ML-1   AG1000G-ML-A  Mali                              2014           0        27       33     0     0     0           0
AG1000G-ML-2   AG1000G-ML-B  Mali                              2004           2        36       33     0     0     0           0
AG1000G-MW     AG1000G-MW    Malawi                            2015          41         0        0     0     0     0           0
AG1000G-MZ     AG1000G-MZ    Mozambique                        2003           0         0        3     0     0     0           0
AG1000G-MZ     AG1000G-MZ    Mozambique                        2004           0         0       71     0     0     0           0
AG1000G-TZ     AG1000G-TZ    Tanzania                          2012          87         0        0     0     0     0           0
AG1000G-TZ     AG1000G-TZ    Tanzania                          2013           1         0       32     0     0    10           0
AG1000G-TZ     AG1000G-TZ    Tanzania                          2015         137         0       32     0     0     1           0
AG1000G-UG     AG1000G-UG    Uganda                            2012          82         0      207     0     0     0           1
AG1000G-X      AG1000G-X     Lab Cross                           -1           0         0        0     0     0     0         297

Note that there are also multiple sampling sites represented within some sample sets.
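A breakdown of this kind amounts to counting samples per combination of sample set, country, year and taxon. The sketch below illustrates the idea with the standard library only. It assumes one record per sample with hypothetical field names (`sample_set`, `country`, `year`, `taxon`) that mirror the table headings, and, purely for illustration, it rebuilds those records from the AG1000G-CF rows of the breakdown above rather than reading the real sample metadata.

```python
from collections import Counter, defaultdict

# Taxon columns, in the order used in the breakdown table.
TAXA = ("arabiensis", "coluzzii", "gambiae", "gcx1", "gcx2", "gcx3", "unassigned")


def make_example_records():
    """Expand the AG1000G-CF rows of the table into one record per sample."""
    rows = [
        ("AG1000G-CF", "Central African Republic", 1993, "coluzzii", 5),
        ("AG1000G-CF", "Central African Republic", 1993, "gambiae", 2),
        ("AG1000G-CF", "Central African Republic", 1994, "coluzzii", 13),
        ("AG1000G-CF", "Central African Republic", 1994, "gambiae", 53),
    ]
    return [
        {"sample_set": s, "country": c, "year": y, "taxon": t}
        for s, c, y, t, n in rows
        for _ in range(n)
    ]


def breakdown(records):
    """Count samples per (sample_set, country, year), one column per taxon."""
    counts = Counter(
        (r["sample_set"], r["country"], r["year"], r["taxon"]) for r in records
    )
    # Every row starts with a zero for each taxon, as in the table.
    table = defaultdict(lambda: dict.fromkeys(TAXA, 0))
    for (sample_set, country, year, taxon), n in counts.items():
        table[(sample_set, country, year)][taxon] = n
    return dict(table)


if __name__ == "__main__":
    for key, row in sorted(breakdown(make_example_records()).items()):
        print(*key, *(row[t] for t in TAXA))
```

Adding a sampling site field to the grouping key would give a finer breakdown for sample sets that cover more than one site.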

Further reading

We hope this page has provided a useful introduction to the Ag3.0 data resource. To start working with these data, please visit the cloud data access guide or the data download guide, or continue browsing the other documentation on this site.

If you have any questions about the data and how to use them, please do get in touch by starting a new discussion on the malariagen/vector-data repo on GitHub.