Summarise hrp2 and hrp3 Deletions

Introduction

This notebook will recreate two supplementary display items from the Pf7 paper relating to summarising deletions in hrp2 and hrp3:

hrp2 and hrp3 are genes located in subtelomeric regions of the genome with very high levels of natural variation. Deletion in those genes can cause failure of rapid diagnostic tests and is therefore important to monitor.

Deletion is a genetic event in which a segment of DNA is entirely removed or missing. In this context, ‘breakpoints’ denote specific locations on the chromosome where such deletions take place.

This notebook should take approximately 1 minute to run.

Setup

Install and import the malariagen Python package:

!pip install -q --no-warn-conflicts malariagen_data
import malariagen_data
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 148.5/148.5 kB 3.1 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 13.1 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52.8/52.8 MB 7.0 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.2/10.2 MB 43.8 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.0/4.0 MB 28.2 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 302.5/302.5 kB 14.3 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 138.7/138.7 kB 2.6 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 8.6 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.1/8.1 MB 65.3 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 206.9/206.9 kB 14.9 MB/s eta 0:00:00
?25h  Preparing metadata (setup.py) ... ?25l?25hdone
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.7/7.7 MB 22.8 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 67.2 MB/s eta 0:00:00
?25h  Building wheel for asciitree (setup.py) ... ?25l?25hdone

Import required python libraries that are installed at colab by default.

import numpy as np
import pandas as pd
import collections
from google.colab import drive

Access Pf7 Data

We use the malariagen data package to load the release data.

release_data = malariagen_data.Pf7()
df_samples = release_data.sample_metadata()

hrp2 and hrp3 Deletions

We also need to access Pf7 samples hrp2 and hrp3 deletion status (including breakpoint and deletion type).

# Fetch the details of hrp calls from this MalariaGEN.net file
hrp_calls_fn = pd.read_csv('https://www.malariagen.net/wp-content/uploads/2024/01/hrp_calls_pf7.tsv', sep='\t')

Let’s get a quick overview of this dataset.

# print the shape
print(hrp_calls_fn.shape)
# display first 5 rows of the dataset
hrp_calls_fn.head(5)
(16203, 7)
Sample HRP2 HRP3 HRP2_breakpoint HRP3_breakpoint HRP2_deletion_type HRP3_deletion_type
0 FP0008-C nodel nodel - - NaN NaN
1 FP0009-C nodel nodel - - NaN NaN
2 FP0010-CW uncallable uncallable - - NaN NaN
3 FP0011-CW uncallable uncallable - - NaN NaN
4 FP0012-CW nodel nodel - - NaN NaN

We will continue to explore the dataset in the next section.

Dataset Exploration

We begin by merging the deletion status information with sample metadata which will allow us to perform geospatial analyses, such as identifying countries or specific regions that exhibit a higher susceptibility to hrp2 and hrp3 deletions.

One thing to note is that deletion data is exclusively curated for samples that have passed the quality control (QC) criteria. Consequently, the resulting merged dataset will exclusively contain these high-quality samples with deletion information.

# Merge df_samples with hrp_calls_fn
df_samples_hrp =  df_samples.merge(hrp_calls_fn, on ='Sample')
print(df_samples_hrp.shape)
df_samples_hrp.head(3)
(16203, 23)
Sample Study Country Admin level 1 Country latitude Country longitude Admin level 1 latitude Admin level 1 longitude Year ENA ... QC pass Exclusion reason Sample type Sample was in Pf6 HRP2 HRP3 HRP2_breakpoint HRP3_breakpoint HRP2_deletion_type HRP3_deletion_type
0 FP0008-C 1147-PF-MR-CONWAY Mauritania Hodh el Gharbi 20.265149 -10.337093 16.565426 -9.832345 2014.0 ERR1081237 ... True Analysis_set gDNA True nodel nodel - - NaN NaN
1 FP0009-C 1147-PF-MR-CONWAY Mauritania Hodh el Gharbi 20.265149 -10.337093 16.565426 -9.832345 2014.0 ERR1081238 ... True Analysis_set gDNA True nodel nodel - - NaN NaN
2 FP0010-CW 1147-PF-MR-CONWAY Mauritania Hodh el Gharbi 20.265149 -10.337093 16.565426 -9.832345 2014.0 ERR2889621 ... True Analysis_set sWGA False uncallable uncallable - - NaN NaN

3 rows × 23 columns

What are the deletion types and breakpoints in the hrp2 gene, and how many samples exhibit these deletions?

# Count and sort deletion types and breakpoints in highest to lowest occurrence
df_samples_hrp[
 df_samples_hrp['HRP2'] == 'del'
].groupby(['HRP2_deletion_type', 'HRP2_breakpoint']).size().sort_values(ascending=False)
HRP2_deletion_type  HRP2_breakpoint    
Telomere healing    Pf3D7_08_v3:1374986    6
                    Pf3D7_08_v3:1374462    5
                    Pf3D7_08_v3:1374932    2
                    Pf3D7_08_v3:1373732    1
                    Pf3D7_08_v3:1374280    1
dtype: int64

What are the deletion types and breakpoints in the hrp3 gene, and how many samples exhibit these deletions?

# Count and sort deletion types and breakpoints in highest to lowest occurrence
pd.DataFrame(
df_samples_hrp[
df_samples_hrp['HRP3'] == 'del'
].groupby(['HRP3_deletion_type', 'HRP3_breakpoint']).size().sort_values(ascending=False).rename("Samples_with_deletion")
)
Samples_with_deletion
HRP3_deletion_type HRP3_breakpoint
Chromosome 11 recombination Pf3D7_13_v3:2800004-2807159 151
Chromosome 5 recombination Pf3D7_13_v3:2835587-2835612 21
Telomere healing Pf3D7_13_v3:2830952 7
Pf3D7_13_v3:2837145 7
Pf3D7_13_v3:2837392 3
Pf3D7_13_v3:2838654 2
Pf3D7_13_v3:2811525 1
Pf3D7_13_v3:2812344 1
Pf3D7_13_v3:2815249 1
Pf3D7_13_v3:2822480 1
Pf3D7_13_v3:2823645 1
Pf3D7_13_v3:2832080 1
Pf3D7_13_v3:2834604 1
Pf3D7_13_v3:2835532 1
Pf3D7_13_v3:2841024 1
Pf3D7_13_v3:2841120 1

Note that we can expand the previous table with more categories (population, country and admin-level 1).

# Group the HRP3 deletions by breakpoint, population, country, and admin level 1.
pd.DataFrame(
    df_samples_hrp[
   df_samples_hrp['HRP3'] == 'del'
    ].fillna('').groupby(['HRP3_deletion_type', 'HRP3_breakpoint', 'Population', 'Country', 'Admin level 1']).size().rename("Samples_with_deletion")
)
Samples_with_deletion
HRP3_deletion_type HRP3_breakpoint Population Country Admin level 1
Chromosome 11 recombination Pf3D7_13_v3:2800004-2807159 AF-NE Ethiopia Amhara 6
Oromia 3
Sudan Kassala 5
AF-W Gambia Upper River 1
Western 2
Ghana Upper East 1
Mali Kayes 1
Senegal Dakar 6
AS-S-FE Bangladesh Chittagong 1
AS-SE-E Cambodia Ratanakiri 3
Laos Champasak 1
Salavan 13
Vietnam Binh Phuoc 2
AS-SE-W Thailand Tak 1
OC-NG Indonesia Papua 40
SA Colombia Cauca 50
Peru Loreto 15
Chromosome 5 recombination Pf3D7_13_v3:2835587-2835612 AS-SE-E Cambodia Battambang 4
Pailin 3
Pursat 13
Vietnam Binh Phuoc 1
Telomere healing Pf3D7_13_v3:2811525 AS-S-E India West Bengal 1
Pf3D7_13_v3:2812344 AF-NE Sudan Kassala 1
Pf3D7_13_v3:2815249 AF-E Tanzania Tanga 1
Pf3D7_13_v3:2822480 AF-W Ghana Brong Ahafo 1
Pf3D7_13_v3:2823645 AF-E Kenya Kilifi 1
Pf3D7_13_v3:2830952 AS-SE-E Cambodia Battambang 7
Pf3D7_13_v3:2832080 AF-C Democratic Republic of the Congo Kinshasa 1
Pf3D7_13_v3:2834604 AS-SE-E Vietnam Binh Phuoc 1
Pf3D7_13_v3:2835532 AS-SE-W Thailand Tak 1
Pf3D7_13_v3:2837145 AS-SE-E Vietnam Binh Phuoc 7
Pf3D7_13_v3:2837392 AS-SE-E Cambodia Ratanakiri 2
Laos Attapeu 1
Pf3D7_13_v3:2838654 OC-NG Indonesia Papua 2
Pf3D7_13_v3:2841024 AS-SE-W Thailand Tak 1
Pf3D7_13_v3:2841120 OC-NG Indonesia Papua 1

We can note that breakpoints in Chromosome 11 and 5 recombination have been observed in multiple in-country sampling locations.

In the following sections, we will create various summary tables.

Summary by deletion types table

The initial summary table will show the number of distinct breakpoints, countries and samples for distinct deletion types within the hrp2 and hrp3 genes.

To achieve this, we need a function to count number of distinct unique breakpoints, countries and samples for each deletion type.

def hrp2_agg(x):
    """
    This function counts number of distinct unique breakpoints, countries and samples for each deletion type in hrp2.
    """
    # Create an ordered dictionary to store the information.
    summary = collections.OrderedDict()

    # Calculate the number of distinct breakpoints.
    summary['Distinct Breakpoints'] = len(x['HRP2_breakpoint'].unique())

    # Calculate the number of unique countries.
    summary['Countries'] = len(x['Country'].unique())

    # Calculate the total number of samples with deletions.
    summary['Samples with Deletion'] = len(x)

    # Return the summary information as a Pandas Series.
    return pd.Series(summary)

The hrp2_agg function can be adapted for hrp3 gene deletions as well.

def hrp3_agg(x):
    """
    This function counts number of distinct unique breakpoints, countries and samples for each deletion type in hrp3.
    """
    # Create an ordered dictionary to store the information.
    summary = collections.OrderedDict()

    # Calculate the number of distinct breakpoints.
    summary['Distinct Breakpoints'] = len(x['HRP3_breakpoint'].unique())

    # Calculate the number of unique countries.
    summary['Countries'] = len(x['Country'].unique())

    # Calculate the total number of samples with deletions.
    summary['Samples with Deletion'] = len(x)

    # Return the summary information as a Pandas Series.
    return pd.Series(summary)
# Create a summary of deletion types for hrp2 and hrp3 genes

# Filter samples with hrp2 deletions
# Group by deletion type, also include gene name
# Apply hrp2_agg function to find numbers
hrp2_agg_df = (
    df_samples_hrp[df_samples_hrp['HRP2'] == 'del']
    .assign(Gene='hrp2')
    .rename(columns={'HRP2_deletion_type': 'Deletion type'})
    .groupby(['Gene', 'Deletion type'])
    .apply(hrp2_agg)
)

# Filter samples with hrp3 deletions
# Group by deletion type, also include gene name
# Apply hrp2_agg function to find numbers
hrp3_agg_df = (
    df_samples_hrp[
        df_samples_hrp['QC pass'] & (df_samples_hrp['HRP3'] == 'del')
    ]
    .assign(Gene='hrp3')
    .rename(columns={'HRP3_deletion_type': 'Deletion type'})
    .groupby(['Gene', 'Deletion type'])
    .apply(hrp3_agg)
)

# Concatenate the results to create a summary of deletion types
df_deletion_types_summary = pd.concat([hrp2_agg_df, hrp3_agg_df])

# Display the summary
df_deletion_types_summary
Distinct Breakpoints Countries Samples with Deletion
Gene Deletion type
hrp2 Telomere healing 5 4 15
hrp3 Chromosome 11 recombination 1 14 151
Chromosome 5 recombination 1 2 21
Telomere healing 14 11 29

Save the table

You will need to authorise Google Colab access to Google Drive.

drive.mount('/content/drive')
Mounted at /content/drive
# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else

df_deletion_types_summary.to_excel('/content/drive/My Drive/HRP_deletion_types_summary_unformatted.xlsx')

Summary by breakpoints

The second summary table will feature breakpoints, associated countries, and sample counts for distinct deletion types within the hrp2 and hrp3 genes.

This requires a function to aggregate countries and sample counts for each unique deletion type at specific breakpoints.

def breakpoint_agg(x):
    """
    This function aggregates countries and sample counts for each unique deletion type at specific breakpoints.
    """
    names = collections.OrderedDict()
    names['Countries'] = ', '.join((x['Country'].unique()))
    names['Samples with deletion'] = len(x)
    return pd.Series(names)
# Calculate summary statistics for hrp2
# Group samples with hrp3 deletions by gene name, deletion type, and breakpoint
# Apply breakpoint_agg function

hrp2_breakpoints_summary = (
    df_samples_hrp[df_samples_hrp['HRP2'] == 'del']
    .assign(Gene='hrp2')
    .rename(columns={'HRP2_deletion_type': 'Deletion type'})
    .groupby(['Gene', 'Deletion type', 'HRP2_breakpoint'])
    .apply(breakpoint_agg)
)

# Calculate summary statistics for hrp3
# Group samples with hrp3 deletions by gene name, deletion type, and breakpoint
# Apply breakpoint_agg function

hrp3_breakpoints_summary = (
    df_samples_hrp[df_samples_hrp['HRP3'] == 'del']
    .assign(Gene='hrp3')
    .rename(columns={'HRP3_deletion_type': 'Deletion type'})
    .groupby(['Gene', 'Deletion type', 'HRP3_breakpoint'])
    .apply(breakpoint_agg)
)

# Combine the summary statistics for HRP2 and HRP3 genes
df_breakpoints_summary = pd.concat([hrp2_breakpoints_summary, hrp3_breakpoints_summary])

# Display the combined summary
df_breakpoints_summary
Countries Samples with deletion
Gene Deletion type HRP2_breakpoint
hrp2 Telomere healing Pf3D7_08_v3:1373732 Cambodia 1
Pf3D7_08_v3:1374280 Sudan 1
Pf3D7_08_v3:1374462 Indonesia 5
Pf3D7_08_v3:1374932 Peru 2
Pf3D7_08_v3:1374986 Peru 6
hrp3 Chromosome 11 recombination Pf3D7_13_v3:2800004-2807159 Thailand, Ghana, Indonesia, Peru, Bangladesh, ... 151
Chromosome 5 recombination Pf3D7_13_v3:2835587-2835612 Cambodia, Vietnam 21
Telomere healing Pf3D7_13_v3:2811525 India 1
Pf3D7_13_v3:2812344 Sudan 1
Pf3D7_13_v3:2815249 Tanzania 1
Pf3D7_13_v3:2822480 Ghana 1
Pf3D7_13_v3:2823645 Kenya 1
Pf3D7_13_v3:2830952 Cambodia 7
Pf3D7_13_v3:2832080 Democratic Republic of the Congo 1
Pf3D7_13_v3:2834604 Vietnam 1
Pf3D7_13_v3:2835532 Thailand 1
Pf3D7_13_v3:2837145 Vietnam 7
Pf3D7_13_v3:2837392 Cambodia, Laos 3
Pf3D7_13_v3:2838654 Indonesia 2
Pf3D7_13_v3:2841024 Thailand 1
Pf3D7_13_v3:2841120 Indonesia 1

Table Legend. Summary of hrp2 and hrp3 deletion breakpoints. Telomere healing refers to the process whereby the end of a chromosome is deleted and a telomere repeat sequence attached to the breakpoint. Chromosome 11 recombination refers to a new hybrid chromosome being created by a recombination between chromosome 13 and 11 at a cluster of rRNA genes that appear to have orthologous copies on both chromosomes. Chromosome 5 recombination refers to a recombination between chromosome 13 and an inverted section of the middle of chromosome 5 containing the gene mdr1. For telomere healing an exact breakpoint position is given but for recombination events it is only possible to give a region in which the recombination has occurred.

Save the table

# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else

df_breakpoints_summary.to_excel('/content/drive/My Drive/HRP_breakpoints_summary_unformatted.xlsx')

Summary by country

The final summary table will display the exact number and incidence of deletions within hrp2 and hrp3 in each country.

# Define an aggregation function to calculate proportions and counts for HRP2 and HRP3 genes.
def proportion_agg(x):
    """
    This function counts the number of deletions, the incidence of deletions, and more for both genes.
    """
    names = collections.OrderedDict()

    # Calculate the number of HRP2 deletions
    names['hrp2 calls'] = np.count_nonzero(x['HRP2'] != 'uncallable')

    # Calculate the incidence of HRP2 deletions
    names['% hrp2 deletions'] = round((np.count_nonzero(x['HRP2'] == 'del') / (np.count_nonzero(x['HRP2'] != 'uncallable'))*100))

    # Calculate the number of HRP3 deletions
    names['hrp3 calls'] = np.count_nonzero(x['HRP3'] != 'uncallable')

    # Calculate the percentage of HRP3 deletions
    names['% hrp3 deletions'] = round(np.count_nonzero(x['HRP3'] == 'del') / (np.count_nonzero(x['HRP3'] != 'uncallable'))*100)

    # Calculate the number of deletions for both HRP2 and HRP3
    names['hrp2 and hrp3 calls'] = np.count_nonzero((x['HRP2'] != 'uncallable') & (x['HRP3'] != 'uncallable'))

    # Calculate the incidence of deletions for both HRP2 and HRP3
    names['% hrp2 and hrp3 deletions'] = round(np.count_nonzero(
        (x['HRP2'] == 'del') & (x['HRP3'] == 'del')
    ) / (np.count_nonzero(
        (x['HRP2'] != 'uncallable') & (x['HRP3'] != 'uncallable')
    ))*100)
    # Convert the values to integers
    names = {key: int(value) for key, value in names.items()}

    return pd.Series(names)
# Create a table to summarize HRP2 and HRP3 gene information by country

# Calculate proportions and counts for HRP2 and HRP3 genes by country
df_hrp_by_country_table = (
    df_samples_hrp.groupby('Country')  # Group the data by country
    .apply(proportion_agg)  # Apply the proportion aggregation function to calculate statistics
    .transpose()  # Transpose the table for better readability
)

# Calculate sample counts for each country
hrp_by_country_n = (
    df_samples_hrp.groupby('Country')  # Group the data by country
    .size()  # Calculate the number of samples in each country
)

# Rename columns to include sample counts next to country names
for country in df_samples_hrp.loc[df_samples_hrp['QC pass'], 'Country'].unique():
    new_column_name = f'{country} (n={hrp_by_country_n[country]:,})'
    df_hrp_by_country_table.rename(columns={country: new_column_name}, inplace=True)

# Transpose the table to have countries as rows and statistics as columns
df_hrp_by_country_table = df_hrp_by_country_table.T

# Display the table with summarized information
df_hrp_by_country_table
hrp2 calls % hrp2 deletions hrp3 calls % hrp3 deletions hrp2 and hrp3 calls % hrp2 and hrp3 deletions
Country
Bangladesh (n=1,310) 939 0 850 0 819 0
Benin (n=150) 110 0 104 0 100 0
Burkina Faso (n=57) 43 0 32 0 32 0
Cambodia (n=1,267) 1109 0 1091 3 1064 0
Cameroon (n=264) 244 0 240 0 235 0
Colombia (n=135) 124 0 123 41 118 0
Côte d'Ivoire (n=71) 70 0 71 0 70 0
Democratic Republic of the Congo (n=520) 413 0 392 0 385 0
Ethiopia (n=21) 20 0 20 45 20 0
Gabon (n=55) 34 0 38 0 32 0
Gambia (n=863) 517 0 467 1 460 0
Ghana (n=3,131) 1529 0 1448 0 1343 0
Guinea (n=151) 121 0 119 0 119 0
India (n=300) 75 0 70 1 68 0
Indonesia (n=121) 117 4 117 37 115 2
Kenya (n=690) 660 0 647 0 645 0
Laos (n=991) 773 0 717 2 669 0
Madagascar (n=24) 24 0 22 0 22 0
Malawi (n=265) 265 0 264 0 264 0
Mali (n=1,167) 709 0 691 0 669 0
Mauritania (n=92) 79 0 81 0 79 0
Mozambique (n=34) 10 0 10 0 6 0
Myanmar (n=985) 645 0 606 0 585 0
Nigeria (n=110) 34 0 30 0 30 0
Papua New Guinea (n=221) 118 0 106 0 106 0
Peru (n=21) 21 38 20 75 20 30
Senegal (n=150) 142 0 141 4 138 0
Sudan (n=76) 7 14 7 86 7 14
Tanzania (n=589) 452 0 470 0 440 0
Thailand (n=954) 855 0 846 0 823 0
Uganda (n=12) 12 0 12 0 12 0
Venezuela (n=2) 2 0 2 0 2 0
Vietnam (n=1,404) 762 0 740 1 670 0

Table Legend. Frequency of HRP2 and HRP3 deletions by country. n=number of QC pass samples. Calls columns show number of samples for which an unambiguous deletion genotype (deleted or non-deleted) could be assigned.

Save the table

# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else

df_hrp_by_country_table.to_excel('/content/drive/My Drive/HRP_country_summary_unformatted.xlsx')