As1.0 (Anopheles stephensi Phase 1 Data Release)#
The As1.0 Anopheles stephensi data resource contains single nucleotide polymorphism (SNP) calls from whole-genome sequencing of 639 mosquitoes.
All of the samples were contributed and sequenced as part of the Controlling Emergent Anopheles stephensi in Sudan and Ethiopia (CEASE) project.
The focus of the project, Anopheles stephensi, is an invasive urban malaria mosquito currently expanding its range across sub-Saharan Africa. The objectives of the CEASE project were to identify the invasion route of An. stephensi and its current and potential future distribution, to evaluate its contribution to malaria transmission, and to evaluate multi-sectoral vector control strategies to combat its spread.
As part 1 of this project, partners from various countries across the native and invasive range of An. stephensi contributed mosquito samples to a genomic surveillance study, in which we identified the invasion source (South Asia), the invasion route (into Djibouti, seeding separate invasion fronts in Sudan, Ethiopia-Kenya, and Yemen), and the architecture of insecticide resistance (mainly metabolic). You can learn more about our findings in our preprint, which will be published shortly.
The mosquito samples sequenced as part of the CEASE project form the basis of the MalariaGEN Vector Observatory Anopheles stephensi Phase 1 Data Release, known as As1.0 for short. This release will underpin future genomic surveillance work in this species. We hope that these data will prove a valuable resource for the community for investigations into the biology, evolution and control of An. stephensi in the native and invasive range.
This page provides an introduction to open data resources released as part of As1.0.
If you have any questions about this guide or how to use the data, please start a new discussion on the malariagen/vector-open-data repo on GitHub. If you find any bugs, please raise an issue.
Terms of use#
Data from this project will be made publicly available before journal publication, subject to the following publication embargo: unless otherwise stated, analyses of project data are ongoing and publications are in preparation by project partners, and it is not permitted to use project data for publication (including any type of communication with the general public) without prior permission from the originating partner studies. The publication embargo will expire 24 months after the data is integrated into the Malaria Genome Vector Observatory data repository, or earlier, if the project partner agrees to remove the embargo before the expiry date.
Although malaria is generally an endemic rather than an epidemic disease, and the focus of this project is on surveillance of disease vectors rather than pathogens, our data terms of use build on MalariaGEN’s approach to data sharing, and adopt norms which have been established for rapid sharing of pathogen genomic data during disease outbreaks. The primary rationale for this approach is that malaria remains a public health emergency, where ethically appropriate and rapid sharing of genomic surveillance data can help to detect and respond to biological threats such as new forms of insecticide resistance, and to adapt malaria vector control strategies to different settings and changing circumstances.
The publication embargo for all data in this release will expire on the 5th of April 2028.
If you have any questions about the terms of use, please email support@malariagen.net.
Partner studies#
The samples were contributed by partner institutions from various countries. The name and primary institution of the lead principal investigator(s) contributing samples to the study, and the sample country of origin, are detailed below.
Enquiries about the samples and studies may be directed in the first instance to Tristan Dennis (tristan.dennis@lstmed.ac.uk), David Weetman (david.weetman@lstmed.ac.uk) or Martin Donnelly (martin.donnelly@lstmed.ac.uk).
1363-VO-ET-GADISA-VMF00316 (Ethiopia)#
Endalamaw Gadisa, Armauer Hansen Research Institute, Ethiopia.
1364-VO-SD-KAFY-VMF00317 (Sudan)#
Hmooda Toto Kafy, University of Khartoum, Sudan.
Elfatih Malik, University of Khartoum, Sudan.
1365-VO-DJ-ADBI-VMF00318 (Djibouti)#
Bouh Abdi Khaireh, Association Mutualis, Djibouti.
1366-VO-YE-ALLAN-VMF00319 (Yemen)#
Richard Allan, MENTOR Initiative, United Kingdom.
1367-VO-AF-DONNELLY-VMF00320 (Afghanistan)#
Martin Donnelly, Liverpool School of Tropical Medicine, United Kingdom.
1368-VO-PK-DONNELLY-VMF00321 (Pakistan)#
Martin Donnelly, Liverpool School of Tropical Medicine, United Kingdom.
1369-VO-SA-AL-NAZAWI-VMF00322 (Saudi Arabia)#
Ashwaq Al-Nazawi, Jazan University, Saudi Arabia.
1370-VO-IR-ENAYATI-VMF00323 (Iran)#
Ahmadali Enayati, Mazandaran University of Medical Sciences, Iran.
1385-VO-DJ-WEETMAN-VMF00338 (United Kingdom)#
David Weetman, Liverpool School of Tropical Medicine, United Kingdom.
N.B. These are colony mosquitoes derived from wild-collected samples in Djibouti.
1386-VO-KE-OCHOMO-VMF00339 (Kenya)#
Eric Ochomo, Kenya Medical Research Institute (KEMRI), Kenya.
1458-VO-ET-YEWHALAW-VMF00340 (Ethiopia)#
Delenasaw Yewhalaw, Jimma University, Ethiopia.
1459-VO-SD-AHMED-VMF00342 (Sudan)#
Ayman Ahmed, University of Khartoum, Sudan.
Literature sample sets#
This release also includes data from one study openly available in the literature:
thakare-2022#
Previously published data from Thakare et al., 2022.
Whole-genome sequencing and variant calling#
All samples in As1.0 have been sequenced individually to high coverage using Illumina technology by a commercial provider. These sequence data have then been analysed to identify genetic variants such as single nucleotide polymorphisms (SNPs). After variant calling, both the samples and the variants have been through a range of quality control analyses, to ensure the data are of high quality. Both the raw sequence data and the curated variant calls are openly available for download and analysis.
Data hosting#
As1.0 data are hosted by several different services.
Raw sequence reads in FASTQ format and sequence read alignments in BAM format are hosted by the European Nucleotide Archive (ENA). These can be accessed on the ENA portal.
SNP calls in VCF and Zarr formats are hosted on S3-compatible object storage.
Sample metadata in CSV format are hosted on Google Cloud Storage (GCS) in the vo_aste_release_master_us_central1 bucket, which is a multi-region bucket located in the United States. All data hosted on GCS are publicly accessible but require an authentication step; please see the details on the Vector Observatory Data Access page.
The SNP data have also been uploaded to Google Cloud, and can be analysed directly within the cloud without having to download or copy any data, including via free interactive computing services such as Google Colab. Further information about analysing these data in the cloud is provided in the cloud data access guide.
More information on accessing and downloading these data is available under download and cloud.
Sample sets#
The samples included in As1.0 have been organised into 13 sample sets.
Each sample set corresponds to a set of mosquito specimens from a contributing study. Study details can be found in the partner studies webpages listed above.
| study_id | sample_set | sample_count |
|---|---|---|
| 1363-VO-ET-GADISA | 1363-VO-ET-GADISA-VMF00316 | 111 |
| 1364-VO-SD-KAFY | 1364-VO-SD-KAFY-VMF00317 | 226 |
| 1365-VO-DJ-ADBI | 1365-VO-DJ-ADBI-VMF00318 | 21 |
| 1366-VO-YE-ALLAN | 1366-VO-YE-ALLAN-VMF00319 | 22 |
| 1367-VO-AF-DONNELLY | 1367-VO-AF-DONNELLY-VMF00320 | 24 |
| 1368-VO-PK-DONNELLY | 1368-VO-PK-DONNELLY-VMF00321 | 15 |
| 1369-VO-SA-AL-NAZAWI | 1369-VO-SA-AL-NAZAWI-VMF00322 | 42 |
| 1370-VO-IR-ENAYATI | 1370-VO-IR-ENAYATI-VMF00323 | 72 |
| 1385-VO-DJ-WEETMAN | 1385-VO-DJ-WEETMAN-VMF00338 | 14 |
| 1386-VO-KE-OCHOMO | 1386-VO-KE-OCHOMO-VMF00339 | 29 |
| 1458-VO-ET-YEWHALAW | 1458-VO-ET-YEWHALAW-VMF00340 | 23 |
| 1459-VO-SD-AHMED | 1459-VO-SD-AHMED-VMF00342 | 25 |
| thakare-2022 | thakare-2022 | 15 |
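As a quick consistency check, the per-set counts in the table above can be tallied against the 639 mosquitoes reported for the release. The sketch below uses only values transcribed from the table; it does not read the hosted metadata. The helper `study_id_of` is hypothetical and rests on an assumption inferred from the table alone: that a study ID is the sample set identifier with any trailing `-VMF…` component removed.

```python
# Sample counts transcribed from the table above (sample_set -> sample_count).
SAMPLE_COUNTS = {
    "1363-VO-ET-GADISA-VMF00316": 111,
    "1364-VO-SD-KAFY-VMF00317": 226,
    "1365-VO-DJ-ADBI-VMF00318": 21,
    "1366-VO-YE-ALLAN-VMF00319": 22,
    "1367-VO-AF-DONNELLY-VMF00320": 24,
    "1368-VO-PK-DONNELLY-VMF00321": 15,
    "1369-VO-SA-AL-NAZAWI-VMF00322": 42,
    "1370-VO-IR-ENAYATI-VMF00323": 72,
    "1385-VO-DJ-WEETMAN-VMF00338": 14,
    "1386-VO-KE-OCHOMO-VMF00339": 29,
    "1458-VO-ET-YEWHALAW-VMF00340": 23,
    "1459-VO-SD-AHMED-VMF00342": 25,
    "thakare-2022": 15,
}


def study_id_of(sample_set: str) -> str:
    """Hypothetical helper: drop a trailing '-VMF...' component, if present.

    Assumption (inferred from the table only): literature sample sets such
    as 'thakare-2022' have no such suffix and are returned unchanged.
    """
    head, sep, tail = sample_set.rpartition("-")
    return head if sep and tail.startswith("VMF") else sample_set


total = sum(SAMPLE_COUNTS.values())
print(f"{len(SAMPLE_COUNTS)} sample sets, {total} samples")
# prints: 13 sample sets, 639 samples
```

Splitting from the right keeps hyphenated names intact, so `1369-VO-SA-AL-NAZAWI-VMF00322` maps to `1369-VO-SA-AL-NAZAWI`.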
Here is a more detailed breakdown of the samples contained within these sample sets, summarised by country, year of collection, and species. Loading the sample metadata currently emits a UserWarning stating that "The surveillance flags data is missing" for each of the 13 sample sets. These warnings are a result of the surveillance flags not being set. This will be implemented in future versions.
| study_id | sample_set | country | year | stephensi (sample count) |
|---|---|---|---|---|
| 1363-VO-ET-GADISA | 1363-VO-ET-GADISA-VMF00316 | Ethiopia | 2022 | 10 |
| 1363-VO-ET-GADISA | 1363-VO-ET-GADISA-VMF00316 | Ethiopia | 2023 | 74 |
| 1363-VO-ET-GADISA | 1363-VO-ET-GADISA-VMF00316 | Ethiopia | 2024 | 27 |
| 1364-VO-SD-KAFY | 1364-VO-SD-KAFY-VMF00317 | Sudan | 2022 | 189 |
| 1364-VO-SD-KAFY | 1364-VO-SD-KAFY-VMF00317 | Sudan | 2023 | 37 |
| 1365-VO-DJ-ADBI | 1365-VO-DJ-ADBI-VMF00318 | Djibouti | 2023 | 21 |
| 1366-VO-YE-ALLAN | 1366-VO-YE-ALLAN-VMF00319 | Yemen | 2021 | 6 |
| 1366-VO-YE-ALLAN | 1366-VO-YE-ALLAN-VMF00319 | Yemen | 2023 | 16 |
| 1367-VO-AF-DONNELLY | 1367-VO-AF-DONNELLY-VMF00320 | Afghanistan | 2017 | 24 |
| 1368-VO-PK-DONNELLY | 1368-VO-PK-DONNELLY-VMF00321 | Pakistan | 2005 | 15 |
| 1369-VO-SA-AL-NAZAWI | 1369-VO-SA-AL-NAZAWI-VMF00322 | Saudi Arabia | 2023 | 42 |
| 1370-VO-IR-ENAYATI | 1370-VO-IR-ENAYATI-VMF00323 | Iran | 2023 | 72 |
| 1385-VO-DJ-WEETMAN | 1385-VO-DJ-WEETMAN-VMF00338 | Colony | 2025 | 14 |
| 1386-VO-KE-OCHOMO | 1386-VO-KE-OCHOMO-VMF00339 | Kenya | 2022 | 1 |
| 1386-VO-KE-OCHOMO | 1386-VO-KE-OCHOMO-VMF00339 | Kenya | 2024 | 28 |
| 1458-VO-ET-YEWHALAW | 1458-VO-ET-YEWHALAW-VMF00340 | Ethiopia | 2023 | 23 |
| 1459-VO-SD-AHMED | 1459-VO-SD-AHMED-VMF00342 | Sudan | 2018 | 25 |
| thakare-2022 | thakare-2022 | India | 2021 | 15 |
Note that there can be multiple sampling sites represented within the same sample set.
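The same kind of summary can be re-aggregated at a coarser level, for example per country or per collection year. The following is a minimal sketch using only rows transcribed from the table above and the Python standard library; it is not how the table on this page was generated, and the names `ROWS`, `by_country` and `by_year` are illustrative only.

```python
from collections import Counter

# (country, year, sample count) rows transcribed from the table above.
ROWS = [
    ("Ethiopia", 2022, 10), ("Ethiopia", 2023, 74), ("Ethiopia", 2024, 27),
    ("Sudan", 2022, 189), ("Sudan", 2023, 37),
    ("Djibouti", 2023, 21),
    ("Yemen", 2021, 6), ("Yemen", 2023, 16),
    ("Afghanistan", 2017, 24),
    ("Pakistan", 2005, 15),
    ("Saudi Arabia", 2023, 42),
    ("Iran", 2023, 72),
    ("Colony", 2025, 14),
    ("Kenya", 2022, 1), ("Kenya", 2024, 28),
    ("Ethiopia", 2023, 23),
    ("Sudan", 2018, 25),
    ("India", 2021, 15),
]

# Accumulate sample counts per country and per collection year.
by_country = Counter()
by_year = Counter()
for country, year, n in ROWS:
    by_country[country] += n
    by_year[year] += n

# Report the three countries contributing the most samples.
for country, n in by_country.most_common(3):
    print(country, n)
# prints:
# Sudan 251
# Ethiopia 134
# Iran 72
```

Note that "Colony" is kept as its own category here, exactly as it appears in the country column, rather than being assigned to Djibouti or the United Kingdom.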
Further reading#
We hope this page has provided a useful introduction to the As1.0 data resource. If you would like to start working with these data, please visit the cloud data access guide or the data download guide or continue browsing the other documentation on this site.