vobs updates

Technical and scientific updates from the Malaria Vector Genome Observatory.

6 June 2024 | data

Sequence QC metadata

New columns containing sequence quality control (QC) metrics have been added to the sample metadata for the Ag3 and Af1 resources.

These include columns summarising depth of coverage, breadth of coverage and divergence from the reference genome:

  • mean_cov - Mean depth of coverage over the whole genome.
  • median_cov - Median depth of coverage over the whole genome.
  • modal_cov - Modal depth of coverage over the whole genome.
  • frac_gen_cov - Fraction of the genome covered with at least one read.
  • divergence - Fraction of aligned bases mismatching the reference genome.

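Per-contig versions of the coverage statistics are also available as mean_cov_{contig}, median_cov_{contig} and mode_cov_{contig}. Note that the per-contig mode uses the prefix mode_cov_ rather than modal_cov_, as the column listing for Ag3 below shows.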
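As a minimal sketch, the per-contig column names can be generated rather than typed out. The helper contig_cov_columns and the AG3_CONTIGS list are illustrative names and not part of malariagen_data; the contig names and the mode_cov_ prefix are taken from the Ag3 column listing shown in this post.

```python
# Contig names for Ag3, as they appear in the column listing in this post.
AG3_CONTIGS = ["2L", "2R", "3L", "3R", "X"]


def contig_cov_columns(contigs, stats=("mean", "median", "mode")):
    """Build per-contig coverage column names, e.g. 'median_cov_2L'.

    Hypothetical helper for illustration only.
    """
    return [f"{stat}_cov_{contig}" for contig in contigs for stat in stats]


cols = contig_cov_columns(AG3_CONTIGS)
print(cols)

# With sample metadata loaded as df_samples, the per-contig columns could
# then be selected with: df_samples[["sample_id"] + cols]
```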

Also included are summary statistics from cross-contamination estimation:

  • contam_pct - Estimated percentage cross-contamination.
  • contam_LLR - Log likelihood ratio for contamination estimation.

These columns are available via the sample_metadata() function. Here's an example showing some of these new columns for samples in the Ag3 resource:

import malariagen_data
ag3 = malariagen_data.Ag3()
# Load sample metadata.
df_samples = ag3.sample_metadata()

# Inspect all available columns.
df_samples.columns
Index(['sample_id', 'partner_sample_id', 'contributor', 'country', 'location',
       'year', 'month', 'latitude', 'longitude', 'sex_call', 'sample_set',
       'release', 'quarter', 'study_id', 'study_url',
       'terms_of_use_expiry_date', 'terms_of_use_url', 'unrestricted_use',
       'mean_cov', 'median_cov', 'modal_cov', 'mean_cov_2L', 'median_cov_2L',
       'mode_cov_2L', 'mean_cov_2R', 'median_cov_2R', 'mode_cov_2R',
       'mean_cov_3L', 'median_cov_3L', 'mode_cov_3L', 'mean_cov_3R',
       'median_cov_3R', 'mode_cov_3R', 'mean_cov_X', 'median_cov_X',
       'mode_cov_X', 'frac_gen_cov', 'divergence', 'contam_pct', 'contam_LLR',
       'aim_species_fraction_arab', 'aim_species_fraction_colu',
       'aim_species_fraction_colu_no2l', 'aim_species_gambcolu_arabiensis',
       'aim_species_gambiae_coluzzii', 'aim_species', 'country_iso',
       'admin1_name', 'admin1_iso', 'admin2_name', 'taxon',
       'cohort_admin1_year', 'cohort_admin1_month', 'cohort_admin1_quarter',
       'cohort_admin2_year', 'cohort_admin2_month', 'cohort_admin2_quarter'],
      dtype='object')
# View sample metadata for some of the new columns.
df_samples[[
    "sample_id",
    "sample_set",
    "mean_cov",
    "median_cov",
    "modal_cov",
    "frac_gen_cov",
    "divergence",
    "contam_pct",
]]
       sample_id                 sample_set                   mean_cov  median_cov  modal_cov  frac_gen_cov  divergence  contam_pct
0      VBS00256-4651STDY7017184  1177-VO-ML-LEHMANN-VMF00004  26.86     26          24         0.939         0.02061     3.572
1      VBS00257-4651STDY7017185  1177-VO-ML-LEHMANN-VMF00004  31.59     31          30         0.942         0.02058     3.337
2      VBS00259-4651STDY7017186  1177-VO-ML-LEHMANN-VMF00004  35.31     35          36         0.944         0.02014     2.29
3      VBS00262-4651STDY7017187  1177-VO-ML-LEHMANN-VMF00004  30.08     29          30         0.941         0.02034     3.502
4      VBS00277-4651STDY7017189  1177-VO-ML-LEHMANN-VMF00004  31.09     30          30         0.943         0.02028     3.127
...    ...                       ...                          ...       ...         ...        ...           ...         ...
19766  SAMN15222632              tennessen-2021               28.12     28          29         0.94          0.02014     1.177
19767  SAMN15222633              tennessen-2021               30.31     31          33         0.939         0.01997     1.229
19768  SAMN15222634              tennessen-2021               25.49     25          26         0.939         0.0201      1.02
19769  SAMN15222635              tennessen-2021               20.13     20          20         0.939         0.02026     0.939
19770  SAMN15222636              tennessen-2021               23.81     23          24         0.94          0.02004     1.113

19771 rows × 8 columns

As part of data curation, only samples which pass standard thresholds for these QC metrics are included in each released dataset in the vector observatory. The metrics are still useful if you want to examine QC values within a particular sample set, or apply stricter thresholds of your own.
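A stricter filter might be sketched as follows. The helper filter_by_qc and the threshold values are hypothetical and chosen only for illustration; they are not the thresholds used in data curation. The small demo table copies five rows from the output shown above so that the block runs without downloading any data.

```python
import pandas as pd


def filter_by_qc(df, min_median_cov=None, max_contam_pct=None,
                 min_frac_gen_cov=None):
    """Return the rows of df that pass every threshold that is not None.

    Hypothetical helper. Rows with a missing value in a tested column
    are excluded, because comparisons against NaN evaluate to False.
    """
    mask = pd.Series(True, index=df.index)
    if min_median_cov is not None:
        mask &= df["median_cov"] >= min_median_cov
    if max_contam_pct is not None:
        mask &= df["contam_pct"] <= max_contam_pct
    if min_frac_gen_cov is not None:
        mask &= df["frac_gen_cov"] >= min_frac_gen_cov
    return df[mask]


# Five rows copied from the sample metadata output shown in this post.
demo = pd.DataFrame({
    "sample_id": [
        "VBS00256-4651STDY7017184",
        "VBS00257-4651STDY7017185",
        "VBS00259-4651STDY7017186",
        "VBS00262-4651STDY7017187",
        "VBS00277-4651STDY7017189",
    ],
    "median_cov": [26, 31, 35, 29, 30],
    "frac_gen_cov": [0.939, 0.942, 0.944, 0.941, 0.943],
    "contam_pct": [3.572, 3.337, 2.29, 3.502, 3.127],
})

# Arbitrary example thresholds, for illustration only.
strict = filter_by_qc(demo, min_median_cov=30, max_contam_pct=3.4)
print(strict["sample_id"].tolist())
```

With the full metadata loaded, the same call would be made on df_samples instead of the demo table.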

For example, this code plots a histogram of median coverage for the AG1000G-AO sample set:

import plotly.express as px

df = df_samples.query("sample_set == 'AG1000G-AO'")
px.histogram(
    df, 
    x="median_cov",
    width=700,
    height=400,
    template="plotly_dark",
)

As another example, this query selects samples with median coverage greater than 20X:

df_samples.query("median_cov > 20")
sample_id partner_sample_id contributor country location year month latitude longitude sex_call ... admin1_name admin1_iso admin2_name taxon cohort_admin1_year cohort_admin1_month cohort_admin1_quarter cohort_admin2_year cohort_admin2_month cohort_admin2_quarter
0 VBS00256-4651STDY7017184 GP97 Tovi Lehmann Mali Dallowere 2012 6 13.616 -7.037 F ... Koulikouro ML-2 Banamba coluzzii ML-2_colu_2012 ML-2_colu_2012_06 ML-2_colu_2012_Q2 ML-2_Banamba_colu_2012 ML-2_Banamba_colu_2012_06 ML-2_Banamba_colu_2012_Q2
1 VBS00257-4651STDY7017185 GP98 Tovi Lehmann Mali Dallowere 2012 6 13.616 -7.037 F ... Koulikouro ML-2 Banamba coluzzii ML-2_colu_2012 ML-2_colu_2012_06 ML-2_colu_2012_Q2 ML-2_Banamba_colu_2012 ML-2_Banamba_colu_2012_06 ML-2_Banamba_colu_2012_Q2
2 VBS00259-4651STDY7017186 GP100 Tovi Lehmann Mali Dallowere 2012 6 13.616 -7.037 F ... Koulikouro ML-2 Banamba coluzzii ML-2_colu_2012 ML-2_colu_2012_06 ML-2_colu_2012_Q2 ML-2_Banamba_colu_2012 ML-2_Banamba_colu_2012_06 ML-2_Banamba_colu_2012_Q2
3 VBS00262-4651STDY7017187 GP103 Tovi Lehmann Mali Dallowere 2012 6 13.616 -7.037 F ... Koulikouro ML-2 Banamba coluzzii ML-2_colu_2012 ML-2_colu_2012_06 ML-2_colu_2012_Q2 ML-2_Banamba_colu_2012 ML-2_Banamba_colu_2012_06 ML-2_Banamba_colu_2012_Q2
4 VBS00277-4651STDY7017189 GP118 Tovi Lehmann Mali Dallowere 2012 6 13.616 -7.037 F ... Koulikouro ML-2 Banamba coluzzii ML-2_colu_2012 ML-2_colu_2012_06 ML-2_colu_2012_Q2 ML-2_Banamba_colu_2012 ML-2_Banamba_colu_2012_06 ML-2_Banamba_colu_2012_Q2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
19765 SAMN15222631 D341 Jacob Tennessen Burkina Faso Tengrela 2016 -1 10.700 -4.800 F ... Cascades BF-02 Comoe coluzzii BF-02_colu_2016 BF-02_colu_2016 BF-02_colu_2016 BF-02_Comoe_colu_2016 BF-02_Comoe_colu_2016 BF-02_Comoe_colu_2016
19766 SAMN15222632 D342 Jacob Tennessen Burkina Faso Tengrela 2016 -1 10.700 -4.800 F ... Cascades BF-02 Comoe coluzzii BF-02_colu_2016 BF-02_colu_2016 BF-02_colu_2016 BF-02_Comoe_colu_2016 BF-02_Comoe_colu_2016 BF-02_Comoe_colu_2016
19767 SAMN15222633 D343 Jacob Tennessen Burkina Faso Tengrela 2016 -1 10.700 -4.800 F ... Cascades BF-02 Comoe coluzzii BF-02_colu_2016 BF-02_colu_2016 BF-02_colu_2016 BF-02_Comoe_colu_2016 BF-02_Comoe_colu_2016 BF-02_Comoe_colu_2016
19768 SAMN15222634 D346 Jacob Tennessen Burkina Faso Tengrela 2016 -1 10.700 -4.800 F ... Cascades BF-02 Comoe coluzzii BF-02_colu_2016 BF-02_colu_2016 BF-02_colu_2016 BF-02_Comoe_colu_2016 BF-02_Comoe_colu_2016 BF-02_Comoe_colu_2016
19770 SAMN15222636 D348 Jacob Tennessen Burkina Faso Tengrela 2016 -1 10.700 -4.800 F ... Cascades BF-02 Comoe coluzzii BF-02_colu_2016 BF-02_colu_2016 BF-02_colu_2016 BF-02_Comoe_colu_2016 BF-02_Comoe_colu_2016 BF-02_Comoe_colu_2016

18676 rows × 57 columns
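To see how a threshold like this affects each sample set, one option is to compute the fraction of samples passing per sample set. This is a sketch only: frac_above is a hypothetical helper, and the demo table reuses the ten rows displayed earlier in this post rather than the full metadata.

```python
import pandas as pd


def frac_above(df, column, threshold, by="sample_set"):
    """Fraction of rows in each group where column exceeds threshold.

    Hypothetical helper for illustration only.
    """
    passed = df[column] > threshold
    # The mean of a boolean Series is the fraction of True values.
    return passed.groupby(df[by]).mean()


# Median coverage for the ten rows displayed earlier in this post.
demo = pd.DataFrame({
    "sample_set": ["1177-VO-ML-LEHMANN-VMF00004"] * 5 + ["tennessen-2021"] * 5,
    "median_cov": [26, 31, 35, 29, 30, 28, 31, 25, 20, 23],
})

frac = frac_above(demo, "median_cov", 20)
# Four of the five tennessen-2021 rows shown have median_cov above 20.
print(frac)
```

With the full metadata, frac_above(df_samples, "median_cov", 20) would give one value per sample set.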