Sequence QC metadata
New columns containing quality control (QC) metrics have been added to the sample metadata for the Ag3 and Af1 resources.
This includes columns containing depth of coverage summary statistics:
mean_cov - Mean depth of coverage over the whole genome.
median_cov - Median depth of coverage over the whole genome.
modal_cov - Modal depth of coverage over the whole genome.
frac_gen_cov - Fraction of the genome covered with at least one read.
divergence - Fraction of aligned bases mismatching the reference genome.
The columns mean_cov_{contig}, median_cov_{contig} and modal_cov_{contig} are also available for each contig.
Also included are summary statistics from cross-contamination estimation:
contam_pct - Estimated percentage cross-contamination.
contam_LLR - Log likelihood ratio for contamination estimation.
These columns are available via the sample_metadata() function. Here's an example showing some of these new columns for samples in the Ag3 resource:
import malariagen_data
ag3 = malariagen_data.Ag3()
# Load sample metadata.
df_samples = ag3.sample_metadata()
# Inspect all available columns.
df_samples.columns
# View sample metadata for some of the new columns.
df_samples[[
    "sample_id",
    "sample_set",
    "mean_cov",
    "median_cov",
    "modal_cov",
    "frac_gen_cov",
    "divergence",
    "contam_pct",
]]
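The per-contig columns can be selected in the same way by building their names from the mean_cov_{contig}, median_cov_{contig} and modal_cov_{contig} pattern. The following is a minimal sketch only: it uses a small made-up DataFrame in place of the real sample metadata, and the contig names "2L" and "X" and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for ag3.sample_metadata(); all values are made up.
df_samples = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3"],
    "mean_cov_2L": [31.2, 28.4, 35.0],
    "median_cov_2L": [31, 28, 35],
    "modal_cov_2L": [30, 28, 34],
    "mean_cov_X": [29.8, 14.1, 33.5],
    "median_cov_X": [30, 14, 33],
    "modal_cov_X": [30, 13, 33],
})

# Assumed contig names, chosen for illustration only.
contigs = ["2L", "X"]

# Build per-contig column names following the {stat}_cov_{contig} pattern.
cov_cols = [
    f"{stat}_cov_{contig}"
    for contig in contigs
    for stat in ("mean", "median", "modal")
]

# Select the sample identifier plus the per-contig coverage columns.
df_contig_cov = df_samples[["sample_id"] + cov_cols]
print(df_contig_cov)
```

With real data, df_samples would come from sample_metadata() and the list of contigs would need to match those of the resource in use.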
As part of data curation, only samples that pass standard thresholds for these QC metrics are included in each released dataset in the Vector Observatory. However, the metrics can still be useful if you want to inspect QC values within a particular sample set, or apply stricter thresholds of your own.
For example, here is a histogram of median coverage for the AG1000G-AO sample set:
import plotly.express as px
df = df_samples.query("sample_set == 'AG1000G-AO'")
px.histogram(
    df,
    x="median_cov",
    width=700,
    height=400,
    template="plotly_dark",
)
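Alongside a histogram, a numeric summary per sample set can be a quick way to compare QC values. The sketch below is one possible approach using a pandas groupby. It runs on a small made-up DataFrame rather than real sample metadata, and the sample set names "set-A" and "set-B" and all values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for ag3.sample_metadata(); all values are made up.
df_samples = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3", "S4"],
    "sample_set": ["set-A", "set-A", "set-B", "set-B"],
    "median_cov": [30, 34, 18, 22],
    "contam_pct": [0.1, 0.3, 1.2, 0.0],
})

# Summarise selected QC metrics within each sample set.
df_summary = (
    df_samples
    .groupby("sample_set")[["median_cov", "contam_pct"]]
    .agg(["min", "median", "max"])
)
print(df_summary)
```

The result has one row per sample set and a two-level column index of (metric, statistic), so a single value can be looked up with, for example, df_summary.loc["set-A", ("median_cov", "median")].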
To apply a stricter threshold, you can query for samples where median coverage is greater than 20X:
df_samples.query("median_cov > 20")
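Several QC metrics can be combined in a single query. The sketch below illustrates this on a small made-up DataFrame. The threshold values (min_median_cov and max_contam_pct) are hypothetical choices for illustration, not recommended or official cut-offs.

```python
import pandas as pd

# Hypothetical stand-in for ag3.sample_metadata(); all values are made up.
df_samples = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3", "S4"],
    "median_cov": [32, 18, 25, 41],
    "frac_gen_cov": [0.98, 0.95, 0.91, 0.99],
    "contam_pct": [0.2, 0.1, 3.5, 0.0],
})

# Hypothetical thresholds, chosen only for this example.
min_median_cov = 20
max_contam_pct = 1.0

# Keep samples that pass both the coverage and the contamination threshold.
df_strict = df_samples.query(
    f"median_cov > {min_median_cov} and contam_pct < {max_contam_pct}"
)
print(df_strict["sample_id"].tolist())
```

Here S2 is dropped for low coverage and S3 for high estimated contamination, leaving S1 and S4.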