Introduction to Ag3

The Anopheles gambiae 1000 Genomes Project (Ag1000G) is a collaborative project using whole-genome sequencing to study genetic variation and evolution in natural populations of mosquitoes in the Anopheles gambiae species complex.

This page provides an introduction to open data resources released as part of the third phase of the Ag1000G project, known as “Ag3” for short. We hope the data from Ag3 will be a valuable resource for research and surveillance of malaria vectors. If you have any questions about this guide or how to use the data, please start a new discussion on the malariagen/vector-data repo on GitHub. If you find any bugs, please raise an issue.

Terms of use

Data from Ag3 are released openly and can be downloaded and analysed for any purpose. The data have been released prior to publication by the Ag1000G Consortium, and are currently subject to a publication embargo described further in the Ag1000G terms of use. If you have any questions about the terms of use, please email data@malariagen.net.

Contributing studies

The Ag1000G project is coordinated by a consortium of partners from a range of different research institutions and countries. This includes consortium members who are carrying out independent research studies in malaria-endemic regions, and who have contributed mosquito specimens or mosquito DNA samples collected in the course of their own research. In total, 26 studies contributed samples to Ag3, including wild-caught specimens from 19 countries.

For further information about these contributing studies, the researchers involved, and the collection sites and methods, please see the contributing studies document.

Population sampling

Ag3 includes data from 3,081 individual mosquitoes, including 2,784 mosquitoes collected from natural populations in 19 countries. Three species are represented within the cohort: Anopheles gambiae, Anopheles coluzzii and Anopheles arabiensis. The map below provides an overview of the collection locations and the numbers of samples broken down by species.

[Figure: Ag3 map of sampling sites]

In addition to these wild-caught samples, a further 297 samples are included from 15 lab crosses.

Whole-genome sequencing and variant calling

All samples in Ag3 have been sequenced individually to high coverage using Illumina technology at the Wellcome Sanger Institute. These sequence data have then been analysed to identify genetic variants such as single nucleotide polymorphisms (SNPs). After variant calling, both the samples and the variants have been through a range of quality control analyses, to ensure the data are of high quality. Both the raw sequence data and the curated variant calls are openly available for download and analysis.

For further information about the sequencing and variant calling methods used, please see the SNP calling methods document.

Data hosting

Data from Ag3 are hosted by several different services.

Raw sequence reads, sequence read alignments and SNP calls are available for download from the European Nucleotide Archive (ENA). Further information on how to find and download these data is provided in the data download guide.

The SNP data have also been uploaded to Google Cloud, and can be analysed directly within the cloud without having to download or copy any data, including via free interactive computing services such as MyBinder and Google Colab. Further information about analysing these data in the cloud is provided in the cloud data access guide.

Sample sets

The samples included in Ag3 have been organised into 28 sample sets. Each of these sample sets corresponds to a set of mosquito specimens from a contributing study. Depending on your objectives, you may want to access data from only specific sample sets, or all sample sets. Here is a list of the sample sets included in Ag3:

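import malariagen_data

# Connect to the Ag3 data release hosted on Google Cloud Storage.
ag3 = malariagen_data.Ag3("gs://vo_agam_release/")

# Load the table of sample sets as a pandas DataFrame and display it.
df_sample_sets = ag3.sample_sets()
df_sample_sets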
    sample_set    sample_count  release
0   AG1000G-AO              81  v3
1   AG1000G-BF-A           181  v3
2   AG1000G-BF-B           102  v3
3   AG1000G-BF-C            13  v3
4   AG1000G-CD              76  v3
5   AG1000G-CF              73  v3
6   AG1000G-CI              80  v3
7   AG1000G-CM-A           303  v3
8   AG1000G-CM-B            97  v3
9   AG1000G-CM-C            44  v3
10  AG1000G-FR              23  v3
11  AG1000G-GA-A            69  v3
12  AG1000G-GH             100  v3
13  AG1000G-GM-A            74  v3
14  AG1000G-GM-B            31  v3
15  AG1000G-GM-C           174  v3
16  AG1000G-GN-A            45  v3
17  AG1000G-GN-B           185  v3
18  AG1000G-GQ              10  v3
19  AG1000G-GW             101  v3
20  AG1000G-KE              86  v3
21  AG1000G-ML-A            60  v3
22  AG1000G-ML-B            71  v3
23  AG1000G-MW              41  v3
24  AG1000G-MZ              74  v3
25  AG1000G-TZ             300  v3
26  AG1000G-UG             290  v3
27  AG1000G-X              297  v3
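
As a quick sanity check, the per-set counts listed above can be summed and compared with the totals quoted earlier on this page. The sketch below hard-codes the counts from the table rather than querying the data release. It also assumes that AG1000G-X is the sample set holding the lab-cross samples; this is an inference from its count of 297 matching the number of cross samples, not something stated on this page.

```python
# Sample counts copied by hand from the sample set table above.
sample_counts = {
    "AG1000G-AO": 81, "AG1000G-BF-A": 181, "AG1000G-BF-B": 102,
    "AG1000G-BF-C": 13, "AG1000G-CD": 76, "AG1000G-CF": 73,
    "AG1000G-CI": 80, "AG1000G-CM-A": 303, "AG1000G-CM-B": 97,
    "AG1000G-CM-C": 44, "AG1000G-FR": 23, "AG1000G-GA-A": 69,
    "AG1000G-GH": 100, "AG1000G-GM-A": 74, "AG1000G-GM-B": 31,
    "AG1000G-GM-C": 174, "AG1000G-GN-A": 45, "AG1000G-GN-B": 185,
    "AG1000G-GQ": 10, "AG1000G-GW": 101, "AG1000G-KE": 86,
    "AG1000G-ML-A": 60, "AG1000G-ML-B": 71, "AG1000G-MW": 41,
    "AG1000G-MZ": 74, "AG1000G-TZ": 300, "AG1000G-UG": 290,
    "AG1000G-X": 297,
}

# Assumption: AG1000G-X contains the lab-cross samples.
cross_samples = sample_counts["AG1000G-X"]
total_samples = sum(sample_counts.values())
wild_samples = total_samples - cross_samples

print(f"sample sets:   {len(sample_counts)}")
print(f"total samples: {total_samples}")
print(f"wild-caught:   {wild_samples}")
print(f"lab crosses:   {cross_samples}")
```

Under that assumption, the totals agree with the figures given in the population sampling section: 28 sample sets, 3,081 samples overall, of which 2,784 are wild-caught and 297 come from crosses.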

The sample set identifiers all start with “AG1000G-” followed by the two-letter code of the country from which samples were collected (e.g., “AO” is Angola). Where there are multiple sample sets from the same country, these have been given alphabetical suffixes, e.g., “AG1000G-BF-A”, “AG1000G-BF-B” and “AG1000G-BF-C” are three sample sets from Burkina Faso.

These country codes are just a convenience to help remember which sample sets contain which data. Please see the sample metadata for more precise location information. Note also that sample set AG1000G-GN-B contains samples from both Guinea and Mali.
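
The naming convention can be sketched as a small parser. The regular expression and the `parse_sample_set` helper below are illustrative and are not part of the `malariagen_data` package; the pattern is inferred only from the identifiers listed on this page.

```python
import re

# Inferred pattern: "AG1000G-", a two-letter code, then an optional
# "-" followed by a single-letter suffix.
SAMPLE_SET_RE = re.compile(r"^AG1000G-([A-Z]{2})(?:-([A-Z]))?$")


def parse_sample_set(sample_set):
    """Return (code, suffix), or None if the identifier does not fit the pattern.

    The suffix is None when the identifier has no alphabetical suffix.
    """
    match = SAMPLE_SET_RE.match(sample_set)
    if match is None:
        return None
    return match.group(1), match.group(2)


for name in ["AG1000G-AO", "AG1000G-BF-C", "AG1000G-X"]:
    print(name, parse_sample_set(name))
```

Identifiers that do not follow the two-letter pattern, such as AG1000G-X, return `None` here. The parsed code is only a label: AG1000G-FR, for example, holds samples whose country is recorded as Mayotte in the sample metadata, so location should always be read from the metadata rather than from the identifier.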

Here is a more detailed breakdown of the samples contained within each sample set, summarised by country, year of collection, and species:

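# Load the per-sample metadata as a pandas DataFrame.
df_samples = ag3.sample_metadata()

# Count samples of each species per sample set, country and collection year.
df_summary = df_samples.pivot_table(
    index=["sample_set", "country", "year"],
    columns=["species"],
    values="sample_id",
    aggfunc=len,
    fill_value=0,
)
df_summary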
species                                           arabiensis  coluzzii  gambiae  intermediate_arabiensis_gambiae  intermediate_gambiae_coluzzii
sample_set    country                       year
AG1000G-AO    Angola                        2009           0        81        0                                0                              0
AG1000G-BF-A  Burkina Faso                  2012           0        82       98                                0                              1
AG1000G-BF-B  Burkina Faso                  2014           3        53       46                                0                              0
AG1000G-BF-C  Burkina Faso                  2004           0         0       13                                0                              0
AG1000G-CD    Democratic Republic of Congo  2015           0         0       76                                0                              0
AG1000G-CF    Central African Republic      1993           0         5        2                                0                              0
                                            1994           0        13       53                                0                              0
AG1000G-CI    Cote d'Ivoire                 2012           0        80        0                                0                              0
AG1000G-CM-A  Cameroon                      2009           0         0      303                                0                              0
AG1000G-CM-B  Cameroon                      2005           0         7       90                                0                              0
AG1000G-CM-C  Cameroon                      2013           2        19       23                                0                              0
AG1000G-FR    Mayotte                       2011           0         0       23                                0                              0
AG1000G-GA-A  Gabon                         2000           0         0       69                                0                              0
AG1000G-GH    Ghana                         2012           0        64       36                                0                              0
AG1000G-GM-A  Gambia, The                   2011           0         5       58                                0                             11
AG1000G-GM-B  Gambia, The                   2012           0        16        9                                0                              6
AG1000G-GM-C  Gambia, The                   2012           0       148        2                                0                             24
AG1000G-GN-A  Guinea                        2012           0         4       40                                0                              1
AG1000G-GN-B  Guinea                        2012           0         7       83                                0                              1
              Mali                          2012           0        28       65                                0                              1
AG1000G-GQ    Equatorial Guinea             2002           0         0       10                                0                              0
AG1000G-GW    Guinea-Bissau                 2010           0         0       29                                0                             72
AG1000G-KE    Kenya                         2000           0         0       19                                0                              0
                                            2007           3         0        0                                0                              0
                                            2012          10         0        9                                0                             45
AG1000G-ML-A  Mali                          2014           0        27       33                                0                              0
AG1000G-ML-B  Mali                          2004           2        36       33                                0                              0
AG1000G-MW    Malawi                        2015          41         0        0                                0                              0
AG1000G-MZ    Mozambique                    2003           0         0        3                                0                              0
                                            2004           0         0       71                                0                              0
AG1000G-TZ    Tanzania                      2012          87         0        0                                0                              0
                                            2013           1         0       36                                0                              6
                                            2015         137         0       32                                0                              1
AG1000G-UG    Uganda                        2012          82         0      207                                1                              0

Note that there are also multiple sampling sites represented within some sample sets.
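
The breakdown above also shows that a sample set is not always a single country-and-year collection. As a rough illustration, the sketch below hard-codes a few (sample set, country, year) index rows copied from the table and reports which of those sets span more than one country or more than one collection year. The variable names are illustrative, and only a subset of rows is included.

```python
from collections import defaultdict

# A subset of index rows copied from the summary table above.
rows = [
    ("AG1000G-GN-A", "Guinea", 2012),
    ("AG1000G-GN-B", "Guinea", 2012),
    ("AG1000G-GN-B", "Mali", 2012),
    ("AG1000G-KE", "Kenya", 2000),
    ("AG1000G-KE", "Kenya", 2007),
    ("AG1000G-KE", "Kenya", 2012),
    ("AG1000G-MZ", "Mozambique", 2003),
    ("AG1000G-MZ", "Mozambique", 2004),
]

# Collect the distinct countries and years seen for each sample set.
countries = defaultdict(set)
years = defaultdict(set)
for sample_set, country, year in rows:
    countries[sample_set].add(country)
    years[sample_set].add(year)

multi_country = sorted(s for s, seen in countries.items() if len(seen) > 1)
multi_year = sorted(s for s, seen in years.items() if len(seen) > 1)

print("more than one country:", multi_country)
print("more than one year:   ", multi_year)
```

When an analysis depends on place or time, it is therefore safer to filter on the country, year and location fields in the sample metadata than to rely on the sample set alone.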

Further reading

We hope this page has provided a useful introduction to the Ag3 data resource. If you would like to start working with these data, please visit the cloud data access guide or the data download guide, or continue browsing the other documentation on this site.

If you have any questions about the data and how to use them, please do get in touch by starting a new discussion on the malariagen/vector-data repo on GitHub.