Summarise Genetic Variants¶

Introduction¶

This notebook recreates Supplementary Table 3 from the Pf7 paper.

This table summarises the quantities and types of genetic variants found in the Pf7 dataset.

This notebook should take approximately two minutes to run.

Setup¶

Install the malariagen Python package:

!pip install -q --no-warn-conflicts malariagen_data

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 133.2/133.2 kB 2.7 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 42.6 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.4/10.4 MB 94.4 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 66.3 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 302.5/302.5 kB 25.3 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 138.7/138.7 kB 16.8 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 66.4 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.1/8.1 MB 105.4 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 206.9/206.9 kB 17.2 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 233.6/233.6 kB 19.4 MB/s eta 0:00:00
?25h  Preparing metadata (setup.py) ... ?25l?25hdone
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.7/6.7 MB 76.8 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 59.9 MB/s eta 0:00:00
?25h  Building wheel for asciitree (setup.py) ... ?25l?25hdone

Import the required libraries

import numpy as np
import dask
import dask.array as da
from dask.diagnostics.progress import ProgressBar
import malariagen_data
import pandas as pd
import collections
from google.colab import drive

Access Pf7 Data¶

release_data = malariagen_data.Pf7()
variant_data = release_data.variant_calls()
variant_data

<xarray.Dataset>
Dimensions:              (variants: 10145661, alleles: 7, samples: 20864,
                          ploidy: 2)
Coordinates:
    variant_position     (variants) int32 dask.array<chunksize=(8388608,), meta=np.ndarray>
    variant_chrom        (variants) object dask.array<chunksize=(4194304,), meta=np.ndarray>
    sample_id            (samples) object dask.array<chunksize=(20864,), meta=np.ndarray>
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
    variant_allele       (variants, alleles) object dask.array<chunksize=(699051, 1), meta=np.ndarray>
    variant_filter_pass  (variants) bool dask.array<chunksize=(10145661,), meta=np.ndarray>
    variant_is_snp       (variants) bool dask.array<chunksize=(10145661,), meta=np.ndarray>
    variant_numalt       (variants) int32 dask.array<chunksize=(8388608,), meta=np.ndarray>
    variant_CDS          (variants) bool dask.array<chunksize=(10145661,), meta=np.ndarray>
    call_genotype        (variants, samples, ploidy) int8 dask.array<chunksize=(167773, 100, 2), meta=np.ndarray>
    call_AD              (variants, samples, alleles) int16 dask.array<chunksize=(23968, 100, 7), meta=np.ndarray>

This tells us we have 10,145,661 variants across 20,864 samples

The variants data include:

Information on where in the genome variants occur (variant_position, variant_chrom)
Which variants occur in which samples (sample_id)
Characteristics of the variants: - Did they pass QC? (variant_filter_pass) - Are they a single nucleotide polymorphism (variant_is_snp) - Do they occur in regions of the genome which encode genes? (variant_CDS)

Extract variables of interest for table¶

Here we define a dictionary which maps items from variant_data to new variables in a new dataframe ‘df_variants’

# Take data from the variant_data xarray object, process it, and create a new dataframe named df_variants containing variant information
variants_dict = collections.OrderedDict()
variants_dict['Type'] = pd.Series(variant_data['variant_is_snp'][:]).map({True: 'SNP', False: 'non-SNP'})
variants_dict['Multiallelic'] = pd.Series(variant_data['variant_numalt'][:] == 1).map({True: 'Bi-allelic', False: 'Multi-allelic'})
variants_dict['Coding'] = pd.Series(variant_data['variant_CDS'][:]).map({True: 'Coding', False: 'Non-coding'})
variants_dict['is_pass'] = variant_data['variant_filter_pass'][:]
variants_dict['num_alleles'] = variant_data['variant_numalt'][:] + 1
variants_dict['Chrom'] = pd.Series(variant_data['variant_chrom'][:])
df_variants = pd.DataFrame.from_dict(variants_dict)
df_variants.loc[df_variants['Type']=='non-SNP', 'Multiallelic'] = ''

print(df_variants.shape)

display(df_variants.head(3))

(10145661, 6)

	Type	Coding	is_pass	num_alleles	Chrom
0	non-SNP	Non-coding	False	2	Pf3D7_01_v3
1	non-SNP	Non-coding	False	4	Pf3D7_01_v3
2	non-SNP	Non-coding	False	3	Pf3D7_01_v3

Make a new dataframe for the final table¶

Here we summarise the quantities of each type of variant

# Aggregate the df_variants dataframe into a df of variant summaries
pd.DataFrame({
    "Discovered variant positions": df_variants.groupby(['Type', 'Multiallelic', 'Coding'])['Type'].count(),
    "Pass variant positions": df_variants.loc[df_variants['is_pass']].groupby(['Type', 'Multiallelic', 'Coding'])['Type'].count(),
    "% pass": (
            df_variants.loc[df_variants['is_pass']].groupby(['Type', 'Multiallelic', 'Coding'])['Type'].count() /
            df_variants.groupby(['Type', 'Multiallelic', 'Coding'])['Type'].count()
    ),
    "Alleles per pass position": df_variants.loc[df_variants['is_pass']].groupby(['Type', 'Multiallelic', 'Coding'])['num_alleles'].mean(),
}).loc[:, ["Discovered variant positions", "Pass variant positions", "% pass", "Alleles per pass position"]]

			Discovered variant positions	Pass variant positions	% pass	Alleles per pass position
Type	Multiallelic	Coding
SNP	Bi-allelic	Coding	2215203	1668246	0.753089	2.000000
	Bi-allelic	Non-coding	1360243	845642	0.621685	2.000000
	Multi-allelic	Coding	476423	395788	0.830749	3.094869
	Multi-allelic	Non-coding	345932	216045	0.624530	3.097526
non-SNP		Coding	1931286	798903	0.413664	3.451484
non-SNP		Non-coding	3816574	1944035	0.509367	3.790311

Make the final table¶

Format the “% pass column” so that values look like percentages
Round the “Alleles per pass position” to one decimal place
Add a totals row

df_variant_summary = pd.concat([

        pd.DataFrame({
            "Discovered variant positions": df_variants.groupby(['Type', 'Coding', 'Multiallelic'])['Type'].count(),
            "Pass variant positions": df_variants.loc[df_variants['is_pass']].groupby(['Type', 'Coding', 'Multiallelic'])['Type'].count(),
            "% pass": (
                    (df_variants.loc[df_variants['is_pass']].groupby(['Type', 'Coding', 'Multiallelic'])['Type'].count() /
                    df_variants.groupby(['Type', 'Coding', 'Multiallelic'])['Type'].count())*100
            ).apply(lambda x: "{:.0f}%".format(x)),  # Formatting % pass with 0 decimal places and %,
            "Alleles per pass position": df_variants.loc[df_variants['is_pass']].groupby(['Type', 'Coding', 'Multiallelic'])['num_alleles'].mean().apply(lambda x: "{:.1f}".format(x)),  # Formatting Alleles per pass position with 1 decimal place,
        })
        .reset_index()
        ,
        pd.DataFrame({
            "Discovered variant positions": df_variants.groupby(lambda x: 'Total')['Type'].count(), # Adding a totals row
            "Pass variant positions": df_variants.loc[df_variants['is_pass']].groupby(lambda x: 'Total')['Type'].count(),
            "% pass": (
                    (df_variants.loc[df_variants['is_pass']].groupby(lambda x: 'Total')['Type'].count() /
                    df_variants.groupby(lambda x: 'Total')['Type'].count())*100
            ).apply(lambda x: "{:.0f}%".format(x)),
            "Alleles per pass position": df_variants.loc[df_variants['is_pass']].groupby(lambda x: 'Total')['num_alleles'].mean().apply(lambda x: "{:.1f}".format(x)),  # Formatting Alleles per pass position with 1 decimal place,
        })
        .reset_index()
        .rename(columns={'index': 'Type'}),
], sort=False).loc[:, ["Type", "Coding", "Multiallelic", "Discovered variant positions", "Pass variant positions", "% pass", "Alleles per pass position"]]

df_variant_summary = df_variant_summary.fillna("")

df_variant_summary = df_variant_summary.set_index(['Type', 'Coding', 'Multiallelic'])

df_variant_summary

			Discovered variant positions	Pass variant positions	% pass	Alleles per pass position
Type	Coding	Multiallelic
SNP	Coding	Bi-allelic	2215203	1668246	75%	2.0
	Coding	Multi-allelic	476423	395788	83%	3.1
	Non-coding	Bi-allelic	1360243	845642	62%	2.0
	Non-coding	Multi-allelic	345932	216045	62%	3.1
non-SNP	Coding		1931286	798903	41%	3.5
non-SNP	Non-coding		3816574	1944035	51%	3.8
Total			10145661	5868659	58%	2.9

Summary of discovered variant positions. We divide variant positions into those containing single nucleotide polymorphisms (SNPs) and non-SNPs (indels and combinations of SNPs and indels at the same position). We then further sub-divide each of these into those within exons (coding) and those in intronic or intergenic regions (non-coding). We further sub-divide SNPs into those containing only two alleles (bi-allelic) or those containing three or more alleles (multi-allelic). Discovered variant positions are unique positions in the reference genome where either SNP or indel variation was discovered by our analysis pipeline. Pass variant positions are the subset of discovered positions that passed our quality filters. Alleles per pass position shows the mean number of distinct alleles at each pass position; biallelic variants have two alleles by definition

Write the table to a file¶

We can output this to a location in Google Drive

First we need to connect Google Drive by running the following:

# You will need to authorise Google Colab access to Google Drive
drive.mount('/content/drive')

Mounted at /content/drive

# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else
df_variant_summary.to_excel('/content/drive/My Drive/Pf7_supp_table3.xlsx')

MalariaGEN parasite data user guide

Summarise Genetic Variants

Contents