# Adir1.0 (Vector Observatory - Asia Project _Anopheles dirus_ Phase 1 Data Release)

The **[Adir1.0](adir1.0):  _Anopheles dirus_ data resource** contains single nucleotide polymorphism (SNP) calls from whole-genome sequencing of 540 mosquitoes. These data were generated as part of the [MalariaGEN Vector Observatory Asian Vector Genomic Surveillance Project](https://www.malariagen.net/project/vector-observatory-asia/).

Vector Observatory - Asia connects research groups that are investigating the population structure and diversity of malaria vectors in Asia. This centres on multiple vectors from the Greater Mekong Subregion in Southeast Asia, where drug-resistant malaria parasites are emerging and spreading. This research is expanding the range of mosquito species that are represented in our whole genome data.

More information about this release can be found on the [data resource website](https://www.malariagen.net/data_package/adir-anopheles-dirus-data-resource/).

This page provides an introduction to the open data resources released as part of the first phase of the Vector Observatory - Asia Genomic Surveillance Project. These releases are known as `Adir1.0` and `Amin1.0` for short. We hope the data from these releases will be a valuable resource for research and surveillance of malaria vectors. This page covers the `Adir1.0` _Anopheles dirus_ data release. For more information about `Amin1.0`, please see the [Amin1.0 data resource website](https://malariagen.github.io/vector-data/amin1/intro.html).

If you have any questions about this guide or how to use the data, please [start a new discussion](https://github.com/malariagen/vector-public-data/discussions/new) on the malariagen/vector-public-data repo on GitHub. If you find any bugs, please [raise an issue](https://github.com/malariagen/vector-public-data/issues/new/choose).

## Terms of use

Data from this project will be made publicly available before journal publication, subject to the following publication embargo: unless otherwise stated, analyses of project data are ongoing and publications are in preparation by project partners, and it is not permitted to use project data for publication (including any type of communication with the general public) without prior permission from the originating partner studies. The publication embargo will expire 24 months after the data is integrated into the Malaria Genome Vector Observatory data repository, or earlier, if the project partner agrees to remove the embargo before the expiry date.

Although malaria is generally an endemic rather than an epidemic disease, and the focus of this project is on surveillance of disease vectors rather than pathogens, our data terms of use build on MalariaGEN's approach to data sharing, and adopt norms which have been established for rapid sharing of pathogen genomic data during disease outbreaks. The primary rationale for this approach is that malaria remains a public health emergency, where ethically appropriate and rapid sharing of genomic surveillance data can help to detect and respond to biological threats such as new forms of insecticide resistance, and to adapt malaria vector control strategies to different settings and changing circumstances.

The publication embargo for all data in this release will expire on the **30th of November 2027**.

If you have any questions about the terms of use, please email [support@malariagen.net](mailto:support@malariagen.net).

## Partner studies

- [1278-VO-TH-KOBYLINSKI](https://www.malariagen.net/network/where-we-work/1278-VO-TH-KOBYLINSKI) - _Anopheles dirus_ vector surveillance in Thailand.

- [1277-VO-KH-WITKOWSKI](https://www.malariagen.net/network/where-we-work/1277-VO-KH-WITKOWSKI) - _Anopheles dirus_  vector surveillance in Cambodia.

- [1276-AD-BD-ALAM](https://www.malariagen.net/network/where-we-work/1276-AD-BD-ALAM) - _Anopheles dirus_  vector surveillance in Bangladesh.

## Whole-genome sequencing and variant calling

All samples in `Adir1.0` have been sequenced individually to high coverage using Illumina technology at the Wellcome Sanger Institute. These sequence data have then been analysed to identify genetic variants such as single nucleotide polymorphisms (SNPs). After variant calling, both the samples and the variants have been through a range of quality control analyses, to ensure the data are of high quality. Both the raw sequence data and the curated variant calls are openly available for download and analysis. 

## Quality control

### Coverage
For each sample, depth of coverage was computed at all genome positions. Samples were excluded if median coverage across all chromosomes was less than 10×, or if less than 50% of the reference genome was covered by at least 1×.
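
As a rough illustration of the two thresholds above, here is a minimal sketch in plain Python. It is not the pipeline's actual implementation: the function name `passes_coverage_qc`, the toy depth values, and the simplification of pooling all positions into a single list (rather than handling chromosomes separately) are assumptions made for this example.

```python
from statistics import median

MIN_MEDIAN_COVERAGE = 10    # from the text: median coverage of at least 10x
MIN_FRACTION_COVERED = 0.5  # from the text: at least 50% of the genome covered at >= 1x


def passes_coverage_qc(depths):
    """Return True if a sample's per-position depths meet both thresholds.

    Hypothetical helper: `depths` is a list of read depths, one per
    genome position, pooled across all chromosomes.
    """
    if not depths:
        return False
    median_cov = median(depths)
    frac_covered = sum(1 for d in depths if d >= 1) / len(depths)
    return median_cov >= MIN_MEDIAN_COVERAGE and frac_covered >= MIN_FRACTION_COVERED


# Toy, made-up depth vectors (not real Adir1.0 data).
toy_samples = {
    "sample_a": [30, 28, 31, 0, 29, 33, 27, 30],  # high coverage, one gap
    "sample_b": [4, 6, 5, 7, 3, 8, 5, 6],         # median below 10x
    "sample_c": [40, 0, 0, 0, 0, 0, 35, 38],      # under half the positions covered
}
qc = {name: passes_coverage_qc(d) for name, d in toy_samples.items()}
print(qc)
```

In this toy example only `sample_a` passes: `sample_b` fails the median threshold, and `sample_c` fails both.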

### Site filters
We implemented a static cutoff (sc) filter across sites to exclude variant sites where accessibility issues may impede our ability to call genotypes confidently. We computed various site statistics from the data of all samples passing sample QC. Our filter excluded sites:
- Where > 10% of individuals had a mapping quality (MQ) of < 10.
- With a mean genotype quality (GQ) of < 60.
- Where > 1 individual had no coverage.
- Where > 10% of individuals had less than half the modal coverage.
- Where > 10% of individuals had at least twice the modal coverage.
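
The five criteria above can be combined into a single pass/fail decision per site. The sketch below assumes the per-site statistics have already been computed; the `SiteStats` class, its field names, and the example values are hypothetical and chosen only for illustration. The two modal-coverage criteria are represented as precomputed fractions of flagged individuals, so the sketch does not commit to how those flags are derived.

```python
from dataclasses import dataclass


@dataclass
class SiteStats:
    """Per-site summary statistics (hypothetical field names)."""
    frac_low_mq: float         # fraction of individuals with MQ < 10
    mean_gq: float             # mean genotype quality across individuals
    n_no_coverage: int         # number of individuals with no coverage
    frac_low_coverage: float   # fraction flagged by the half-modal-coverage criterion
    frac_high_coverage: float  # fraction flagged by the twice-modal-coverage criterion


def passes_site_filter(s: SiteStats) -> bool:
    """Return True if the site is not excluded by any of the five criteria."""
    excluded = (
        s.frac_low_mq > 0.10
        or s.mean_gq < 60
        or s.n_no_coverage > 1
        or s.frac_low_coverage > 0.10
        or s.frac_high_coverage > 0.10
    )
    return not excluded


# Made-up example sites (not real Adir1.0 data).
sites = [
    SiteStats(0.01, 85.0, 0, 0.02, 0.01),  # clean site
    SiteStats(0.25, 85.0, 0, 0.02, 0.01),  # too many low-MQ individuals
    SiteStats(0.01, 45.0, 0, 0.02, 0.01),  # mean GQ too low
    SiteStats(0.01, 85.0, 2, 0.02, 0.01),  # two individuals with no coverage
]
site_mask = [passes_site_filter(s) for s in sites]
print(site_mask)
```

A boolean mask like `site_mask` is a common way to represent such a filter, since it can be applied to any per-site array of the same length.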

## Data hosting

Data from `Adir1.0` are hosted by several different services. 

The SNP data have been uploaded to Google Cloud, and can be analysed directly within the cloud without having to download or copy any data, including via free interactive computing services such as [Google Colab](https://colab.research.google.com/). Further information about analysing these data in the cloud is provided in the [cloud data access guide](cloud).

## Sample sets

The samples included in `Adir1.0` have been organised into 4 sample sets. 

Each sample set corresponds to a set of mosquito specimens from a contributing study. Study details can be found in the partner studies webpages listed above.

```python
%pip install -qq malariagen_data
```

```python
import malariagen_data
adir1 = malariagen_data.Adir1()
```

```python
df_sample_sets = adir1.sample_sets(release="1.0")
df_sample_sets[['study_id', 'sample_set', 'sample_count']].set_index('study_id')
```

| study_id | sample_set | sample_count |
|---|---|---|
| 1276-AD-BD-ALAM | 1276-AD-BD-ALAM-VMF00156 | 47 |
| 1277-VO-KH-WITKOWSKI | 1277-VO-KH-WITKOWSKI-VMF00151 | 26 |
| 1277-VO-KH-WITKOWSKI | 1277-VO-KH-WITKOWSKI-VMF00183 | 248 |
| 1278-VO-TH-KOBYLINSKI | 1278-VO-TH-KOBYLINSKI-VMF00153 | 219 |


Here is a more detailed breakdown of the samples contained within these sample sets, summarised by country, year of collection, and taxon:

```python
df_samples = adir1.sample_metadata(sample_sets="1.0")
df_summary = df_samples.pivot_table(
    index=["study_id", "sample_set", "country", "year"],
    columns=["taxon"],
    values="sample_id",
    aggfunc=len,
    fill_value=0,
)
df_summary
```


| study_id | sample_set | country | year | baimaii | dirus |
|---|---|---|---|---|---|
| 1276-AD-BD-ALAM | 1276-AD-BD-ALAM-VMF00156 | Bangladesh | 2018 | 47 | 0 |
| 1277-VO-KH-WITKOWSKI | 1277-VO-KH-WITKOWSKI-VMF00151 | Cambodia | 2017 | 0 | 12 |
| 1277-VO-KH-WITKOWSKI | 1277-VO-KH-WITKOWSKI-VMF00151 | Cambodia | 2018 | 0 | 14 |
| 1277-VO-KH-WITKOWSKI | 1277-VO-KH-WITKOWSKI-VMF00183 | Cambodia | 2019 | 0 | 41 |
| 1277-VO-KH-WITKOWSKI | 1277-VO-KH-WITKOWSKI-VMF00183 | Cambodia | 2020 | 0 | 207 |
| 1278-VO-TH-KOBYLINSKI | 1278-VO-TH-KOBYLINSKI-VMF00153 | Thailand | 2019 | 0 | 219 |


Note that there can be multiple sampling sites represented within the same sample set.
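
As a quick sanity check, the counts in the summary table above can be totalled by country and by taxon and compared against the 540 mosquitoes stated for this release. The sketch below uses plain Python with the counts transcribed by hand from the table; the variable names are illustrative only, and in a live session the same totals could instead be derived from `df_samples`.

```python
from collections import defaultdict

# (country, year, taxon, count) rows transcribed from the summary table above.
rows = [
    ("Bangladesh", 2018, "baimaii", 47),
    ("Cambodia", 2017, "dirus", 12),
    ("Cambodia", 2018, "dirus", 14),
    ("Cambodia", 2019, "dirus", 41),
    ("Cambodia", 2020, "dirus", 207),
    ("Thailand", 2019, "dirus", 219),
]

by_country = defaultdict(int)
by_taxon = defaultdict(int)
for country, _year, taxon, count in rows:
    by_country[country] += count
    by_taxon[taxon] += count

total = sum(count for *_, count in rows)
print(dict(by_country))  # {'Bangladesh': 47, 'Cambodia': 274, 'Thailand': 219}
print(dict(by_taxon))    # {'baimaii': 47, 'dirus': 493}
print(total)             # 540
```

The total of 540 matches both the sum of the `sample_count` column for the four sample sets and the number of mosquitoes quoted in the introduction.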

## Further reading

We hope this page has provided a useful introduction to the `Adir1.0` data resource. If you would like to start working with these data, please visit the [cloud data access guide](cloud) or the [data download guide](download), or continue browsing the other documentation on this site.

If you have any questions about the data and how to use them, please do get in touch by [starting a new discussion](https://github.com/malariagen/vector-data/discussions/new) on the malariagen/vector-data repository on GitHub.