Summarise genetic variant types¶

Introduction¶

This notebook creates a table summarising the quantities and types of genetic variants in the Pf8 dataset.

The table will be generated for two callsets:

Full genetic callset: Includes genomic positions with both SNP and non-SNP variants.
SNP-only callset: Excludes non-SNP variants.

This notebook should take approximately two minutes to run.

Setup¶

Install the malariagen Python package:

!pip install malariagen_data -q --no-warn-conflicts

  Installing build dependencies ... ?25l?25hdone
  Getting requirements to build wheel ... ?25l?25hdone
  Preparing metadata (pyproject.toml) ... ?25l?25hdone
  Preparing metadata (setup.py) ... ?25l?25hdone
  Preparing metadata (setup.py) ... ?25l?25hdone
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71.7/71.7 kB 3.1 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 765.4/765.4 kB 20.2 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 25.9/25.9 MB 35.2 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.7/8.7 MB 49.1 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 210.6/210.6 kB 12.9 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 60.2 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.3/3.3 MB 53.5 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.8/7.8 MB 63.5 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 39.2 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77.7/77.7 kB 6.0 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 101.7/101.7 kB 7.9 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.9/8.9 MB 44.7 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 228.0/228.0 kB 15.8 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43.3/43.3 kB 3.3 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.3/13.3 MB 64.1 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 51.3 MB/s eta 0:00:00
?25h  Building wheel for malariagen_data (pyproject.toml) ... ?25l?25hdone
  Building wheel for dash-cytoscape (setup.py) ... ?25l?25hdone
  Building wheel for asciitree (setup.py) ... ?25l?25hdone

Import the required libraries

import numpy as np
import dask
import dask.array as da
from dask.diagnostics.progress import ProgressBar
import malariagen_data
import pandas as pd
import collections
from google.colab import drive

Access Pf8 Data¶

release_data = malariagen_data.Pf8()
variant_data = release_data.variant_calls()
variant_data

<xarray.Dataset> Size: 7TB
Dimensions:              (variants: 12493205, alleles: 7, samples: 33325,
                          ploidy: 2)
Coordinates:
    variant_position     (variants) int32 50MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_chrom        (variants) object 100MB dask.array<chunksize=(131072,), meta=np.ndarray>
    sample_id            (samples) object 267kB dask.array<chunksize=(16663,), meta=np.ndarray>
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
    variant_allele       (variants, alleles) object 700MB dask.array<chunksize=(131072, 1), meta=np.ndarray>
    variant_filter_pass  (variants) bool 12MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_is_snp       (variants) bool 12MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_numalt       (variants) int32 50MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_CDS          (variants) bool 12MB dask.array<chunksize=(131072,), meta=np.ndarray>
    call_genotype        (variants, samples, ploidy) int8 833GB dask.array<chunksize=(131072, 100, 2), meta=np.ndarray>
    call_AD              (variants, samples, alleles) int16 6TB dask.array<chunksize=(131072, 100, 7), meta=np.ndarray>

This tells us we have 12,493,205 variants across 33,325 samples

The variants data include:

Information on where in the genome variants occur (variant_position, variant_chrom)
Which variants occur in which samples (sample_id)
Characteristics of the variants: - Did they pass QC? (variant_filter_pass) - Are they a single nucleotide polymorphism (variant_is_snp) - Do they occur in regions of the genome which encode genes? (variant_CDS)

Extract variables of interest for table¶

Here we define a dictionary which maps items from variant_data to new variables in a new dataframe ‘df_variants’

# Take data from the variant_data xarray object, process it, and create a new dataframe named df_variants containing variant information
variants_dict = collections.OrderedDict()
variants_dict['Type'] = pd.Series(variant_data['variant_is_snp'][:]).map({True: 'SNP', False: 'non-SNP'})
variants_dict['Multiallelic'] = pd.Series(variant_data['variant_numalt'][:] == 1).map({True: 'Bi-allelic', False: 'Multi-allelic'})
variants_dict['Coding'] = pd.Series(variant_data['variant_CDS'][:]).map({True: 'Coding', False: 'Non-coding'})
variants_dict['is_pass'] = variant_data['variant_filter_pass'][:]
variants_dict['num_alleles'] = variant_data['variant_numalt'][:] + 1
variants_dict['Chrom'] = pd.Series(variant_data['variant_chrom'][:])
df_variants = pd.DataFrame.from_dict(variants_dict)
df_variants.loc[df_variants['Type']=='non-SNP', 'Multiallelic'] = ''

print(df_variants.shape)

display(df_variants.head(3))

(12493205, 6)

	Type	Coding	is_pass	num_alleles	Chrom
0	non-SNP	Non-coding	False	2	Pf3D7_01_v3
1	non-SNP	Non-coding	False	2	Pf3D7_01_v3
2	non-SNP	Non-coding	False	3	Pf3D7_01_v3

Make a new dataframe for the final table¶

Here we summarise the quantities of each type of variant

# Aggregate the df_variants dataframe into a df of variant summaries
pd.DataFrame({
    "Discovered variant positions": df_variants.groupby(['Type', 'Multiallelic', 'Coding'])['Type'].count(),
    "Pass variant positions": df_variants.loc[df_variants['is_pass']].groupby(['Type', 'Multiallelic', 'Coding'])['Type'].count(),
    "% pass": (
            df_variants.loc[df_variants['is_pass']].groupby(['Type', 'Multiallelic', 'Coding'])['Type'].count() /
            df_variants.groupby(['Type', 'Multiallelic', 'Coding'])['Type'].count()
    ),
    "Alleles per pass position": df_variants.loc[df_variants['is_pass']].groupby(['Type', 'Multiallelic', 'Coding'])['num_alleles'].mean(),
}).loc[:, ["Discovered variant positions", "Pass variant positions", "% pass", "Alleles per pass position"]]

			Discovered variant positions	Pass variant positions	% pass	Alleles per pass position
Type	Multiallelic	Coding
SNP	Bi-allelic	Coding	2139117	1544381	0.721971	2.000000
	Bi-allelic	Non-coding	1146748	645645	0.563023	2.000000
	Multi-allelic	Coding	684441	565022	0.825523	3.122047
	Multi-allelic	Non-coding	441151	264886	0.600443	3.136893
non-SNP		Coding	3028425	1925558	0.635828	3.489727
non-SNP		Non-coding	5053323	3151964	0.623741	3.860688

Make the final table¶

Format the “% pass column” so that values look like percentages
Round the “Alleles per pass position” to one decimal place
Add a totals row

df_variant_summary = pd.concat([

        pd.DataFrame({
            "Discovered variant positions": df_variants.groupby(['Type', 'Coding', 'Multiallelic'])['Type'].count(),
            "Pass variant positions": df_variants.loc[df_variants['is_pass']].groupby(['Type', 'Coding', 'Multiallelic'])['Type'].count(),
            "% pass": (
                    (df_variants.loc[df_variants['is_pass']].groupby(['Type', 'Coding', 'Multiallelic'])['Type'].count() /
                    df_variants.groupby(['Type', 'Coding', 'Multiallelic'])['Type'].count())*100
            ).apply(lambda x: "{:.0f}%".format(x)),  # Formatting % pass with 0 decimal places and %,
            "Alleles per pass position": df_variants.loc[df_variants['is_pass']].groupby(['Type', 'Coding', 'Multiallelic'])['num_alleles'].mean().apply(lambda x: "{:.1f}".format(x)),  # Formatting Alleles per pass position with 1 decimal place,
        })
        .reset_index()
        ,
        pd.DataFrame({
            "Discovered variant positions": df_variants.groupby(lambda x: 'Total')['Type'].count(), # Adding a totals row
            "Pass variant positions": df_variants.loc[df_variants['is_pass']].groupby(lambda x: 'Total')['Type'].count(),
            "% pass": (
                    (df_variants.loc[df_variants['is_pass']].groupby(lambda x: 'Total')['Type'].count() /
                    df_variants.groupby(lambda x: 'Total')['Type'].count())*100
            ).apply(lambda x: "{:.0f}%".format(x)),
            "Alleles per pass position": df_variants.loc[df_variants['is_pass']].groupby(lambda x: 'Total')['num_alleles'].mean().apply(lambda x: "{:.1f}".format(x)),  # Formatting Alleles per pass position with 1 decimal place,
        })
        .reset_index()
        .rename(columns={'index': 'Type'}),
], sort=False).loc[:, ["Type", "Coding", "Multiallelic", "Discovered variant positions", "Pass variant positions", "% pass", "Alleles per pass position"]]

df_variant_summary = df_variant_summary.fillna("")

df_variant_summary = df_variant_summary.set_index(['Type', 'Coding', 'Multiallelic'])

df_variant_summary

			Discovered variant positions	Pass variant positions	% pass	Alleles per pass position
Type	Coding	Multiallelic
SNP	Coding	Bi-allelic	2139117	1544381	72%	2.0
	Coding	Multi-allelic	684441	565022	83%	3.1
	Non-coding	Bi-allelic	1146748	645645	56%	2.0
	Non-coding	Multi-allelic	441151	264886	60%	3.1
non-SNP	Coding		3028425	1925558	64%	3.5
non-SNP	Non-coding		5053323	3151964	62%	3.9
Total			12493205	8097456	65%	3.2

Summary of discovered variant positions. The table shows variant positions containing only single nucleotide polymorphisms (SNPs) and variant positions containing non-SNPs (indels or spanning deletions). Note that some non-SNP positions may also contain SNPs. We then further sub-divide each of these into those within exons (coding) and those in intronic or intergenic regions (non-coding). We further sub-divide SNPs into those containing only two alleles (bi-allelic) or those containing three or more alleles (multi-allelic). Discovered variant positions are unique positions in the reference genome where either SNP or indel variation was discovered by our analysis pipeline. Pass variant positions are the subset of discovered positions that passed our quality filters. Alleles per pass position shows the mean number of distinct alleles at each pass position; biallelic variants have two alleles by definition

SNP-only Callset¶

For convenience and to capture all existing SNPs in the callset, we also created a SNP-only callset. Please see Pf8 supplementary materials for details.

In order to generate the variant summary table for SNP-only callset, we simply adjust the dataset and run the rest of the code as is.

release_data = malariagen_data.Pf8('s3://pf8-release/snp-only/')
variant_data = release_data.variant_calls()
variant_data

<xarray.Dataset> Size: 6TB
Dimensions:              (variants: 10821552, alleles: 7, samples: 33325,
                          ploidy: 2)
Coordinates:
    variant_position     (variants) int32 43MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_chrom        (variants) object 87MB dask.array<chunksize=(131072,), meta=np.ndarray>
    sample_id            (samples) object 267kB dask.array<chunksize=(16663,), meta=np.ndarray>
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
    variant_allele       (variants, alleles) object 606MB dask.array<chunksize=(131072, 1), meta=np.ndarray>
    variant_filter_pass  (variants) bool 11MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_is_snp       (variants) bool 11MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_numalt       (variants) int32 43MB dask.array<chunksize=(131072,), meta=np.ndarray>
    variant_CDS          (variants) bool 11MB dask.array<chunksize=(131072,), meta=np.ndarray>
    call_genotype        (variants, samples, ploidy) int8 721GB dask.array<chunksize=(131072, 100, 2), meta=np.ndarray>
    call_AD              (variants, samples, alleles) int16 5TB dask.array<chunksize=(131072, 100, 7), meta=np.ndarray>

The SNP-only callset contains 10,821,552 variants, representing approximately a 13% reduction in the total number of variants. However, it also shows a significant increase of 6,410,095 in the number of SNP variants.

# Take data from the variant_data xarray object, process it, and create a new dataframe named df_variants containing variant information
variants_dict = collections.OrderedDict()
variants_dict['Type'] = pd.Series(variant_data['variant_is_snp'][:]).map({True: 'SNP', False: 'non-SNP'})
variants_dict['Multiallelic'] = pd.Series(variant_data['variant_numalt'][:] == 1).map({True: 'Bi-allelic', False: 'Multi-allelic'})
variants_dict['Coding'] = pd.Series(variant_data['variant_CDS'][:]).map({True: 'Coding', False: 'Non-coding'})
variants_dict['is_pass'] = variant_data['variant_filter_pass'][:]
variants_dict['num_alleles'] = variant_data['variant_numalt'][:] + 1
variants_dict['Chrom'] = pd.Series(variant_data['variant_chrom'][:])
df_variants = pd.DataFrame.from_dict(variants_dict)
df_variants.loc[df_variants['Type']=='non-SNP', 'Multiallelic'] = ''

print(df_variants.shape)

display(df_variants.head(3))

(10821552, 6)

	Type	Multiallelic	Coding	is_pass	num_alleles	Chrom
0	SNP	Bi-allelic	Non-coding	False	2	Pf3D7_01_v3
1	SNP	Bi-allelic	Non-coding	False	2	Pf3D7_01_v3
2	SNP	Bi-allelic	Non-coding	False	2	Pf3D7_01_v3

Now let’s generate the final table.

df_variant_summary_snp_only = pd.concat([

        pd.DataFrame({
            "Discovered variant positions": df_variants.groupby(['Type', 'Coding', 'Multiallelic'])['Type'].count(),
            "Pass variant positions": df_variants.loc[df_variants['is_pass']].groupby(['Type', 'Coding', 'Multiallelic'])['Type'].count(),
            "% pass": (
                    (df_variants.loc[df_variants['is_pass']].groupby(['Type', 'Coding', 'Multiallelic'])['Type'].count() /
                    df_variants.groupby(['Type', 'Coding', 'Multiallelic'])['Type'].count())*100
            ).apply(lambda x: "{:.0f}%".format(x)),  # Formatting % pass with 0 decimal places and %,
            "Alleles per pass position": df_variants.loc[df_variants['is_pass']].groupby(['Type', 'Coding', 'Multiallelic'])['num_alleles'].mean().apply(lambda x: "{:.1f}".format(x)),  # Formatting Alleles per pass position with 1 decimal place,
        })
        .reset_index()
        ,
        pd.DataFrame({
            "Discovered variant positions": df_variants.groupby(lambda x: 'Total')['Type'].count(), # Adding a totals row
            "Pass variant positions": df_variants.loc[df_variants['is_pass']].groupby(lambda x: 'Total')['Type'].count(),
            "% pass": (
                    (df_variants.loc[df_variants['is_pass']].groupby(lambda x: 'Total')['Type'].count() /
                    df_variants.groupby(lambda x: 'Total')['Type'].count())*100
            ).apply(lambda x: "{:.0f}%".format(x)),
            "Alleles per pass position": df_variants.loc[df_variants['is_pass']].groupby(lambda x: 'Total')['num_alleles'].mean().apply(lambda x: "{:.1f}".format(x)),  # Formatting Alleles per pass position with 1 decimal place,
        })
        .reset_index()
        .rename(columns={'index': 'Type'}),
], sort=False).loc[:, ["Type", "Coding", "Multiallelic", "Discovered variant positions", "Pass variant positions", "% pass", "Alleles per pass position"]]

df_variant_summary_snp_only = df_variant_summary_snp_only.fillna("")

df_variant_summary_snp_only = df_variant_summary_snp_only.set_index(['Type', 'Coding', 'Multiallelic'])

df_variant_summary_snp_only

			Discovered variant positions	Pass variant positions	% pass	Alleles per pass position
Type	Coding	Multiallelic
SNP	Coding	Bi-allelic	3724576	2565888	69%	2.0
	Coding	Multi-allelic	1558607	1032032	66%	3.1
	Non-coding	Bi-allelic	3727751	2277242	61%	2.0
	Non-coding	Multi-allelic	1810618	945464	52%	3.1
Total			10821552	6820626	63%	2.3

Summary of genomic variants within the SNP-only callset. The number of discovered SNPs are higher than in the corresponding categories for the full callset because SNPs in shared SNP-indel positions were being masked from the count of SNPs in the full callset. In the SNP-only callset, they are counted as indels are removed.

Write the table to a file¶

We can output this to a location in Google Drive

First we need to connect Google Drive by running the following:

# You will need to authorise Google Colab access to Google Drive
drive.mount('/content/drive')

Mounted at /content/drive

# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else
df_variant_summary.to_excel('/content/drive/My Drive/variant-summary-table.xlsx')

df_variant_summary_snp_only.to_excel('/content/drive/My Drive/snp-only-variant-summary-table.xlsx')

MalariaGEN parasite data user guide

Summarise genetic variant types

Contents