Summarise hrp2 and hrp3 Deletions¶

Introduction¶

This notebook will recreate two supplementary display items from the Pf7 paper relating to summarising deletions in hrp2 and hrp3:

Supplementary Table 6: frequency of hrp2 and hrp3 deletions by country
Supplementary Table 7: summary of hrp2 and hrp3 deletion breakpoints

hrp2 and hrp3 are genes located in subtelomeric regions of the genome, which have very high levels of natural variation. Deletions in these genes can cause rapid diagnostic tests to fail, so they are important to monitor.

A deletion is a genetic event in which a segment of DNA is removed entirely. In this context, ‘breakpoints’ are the specific locations on the chromosome where such deletions occur.

This notebook should take approximately 1 minute to run.

Setup¶

Install and import the malariagen Python package:

!pip install -q --no-warn-conflicts malariagen_data
import malariagen_data

Import the required Python libraries, which are installed on Colab by default.

import numpy as np
import pandas as pd
import collections
from google.colab import drive

Access Pf7 Data¶

We use the malariagen data package to load the release data.

release_data = malariagen_data.Pf7()
df_samples = release_data.sample_metadata()

hrp2 and hrp3 Deletions¶

We also need the hrp2 and hrp3 deletion status of each Pf7 sample, including the breakpoint and deletion type.

# Fetch the details of hrp calls from this MalariaGEN.net file
hrp_calls_fn = pd.read_csv('https://www.malariagen.net/wp-content/uploads/2024/01/hrp_calls_pf7.tsv', sep='\t')

Let’s get a quick overview of this dataset.

# print the shape
print(hrp_calls_fn.shape)
# display first 5 rows of the dataset
hrp_calls_fn.head(5)

(16203, 7)

	Sample	HRP2	HRP3	HRP2_breakpoint	HRP3_breakpoint	HRP2_deletion_type	HRP3_deletion_type
0	FP0008-C	nodel	nodel	-	-	NaN	NaN
1	FP0009-C	nodel	nodel	-	-	NaN	NaN
2	FP0010-CW	uncallable	uncallable	-	-	NaN	NaN
3	FP0011-CW	uncallable	uncallable	-	-	NaN	NaN
4	FP0012-CW	nodel	nodel	-	-	NaN	NaN

We will continue to explore the dataset in the next section.

Dataset Exploration¶

We begin by merging the deletion status information with the sample metadata. This allows us to perform geospatial analyses, such as identifying countries or regions with a higher frequency of hrp2 and hrp3 deletions.

Note that deletion calls are provided only for samples that passed the quality control (QC) criteria. Consequently, the merged dataset contains only these QC-pass samples.

# Merge df_samples with hrp_calls_fn
df_samples_hrp = df_samples.merge(hrp_calls_fn, on='Sample')
print(df_samples_hrp.shape)
df_samples_hrp.head(3)

(16203, 23)

	Sample	Study	Country	Admin level 1	Country latitude	Country longitude	Admin level 1 latitude	Admin level 1 longitude	Year	ENA	...	QC pass	Exclusion reason	Sample type	Sample was in Pf6	HRP2	HRP3	HRP2_breakpoint	HRP3_breakpoint	HRP2_deletion_type	HRP3_deletion_type
0	FP0008-C	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR1081237	...	True	Analysis_set	gDNA	True	nodel	nodel	-	-	NaN	NaN
1	FP0009-C	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR1081238	...	True	Analysis_set	gDNA	True	nodel	nodel	-	-	NaN	NaN
2	FP0010-CW	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR2889621	...	True	Analysis_set	sWGA	False	uncallable	uncallable	-	-	NaN	NaN

3 rows × 23 columns
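The merge above is an inner join on the Sample column, so samples without deletion calls drop out silently. As a minimal sketch of how this could be checked, the example below uses small invented stand-ins for df_samples and hrp_calls_fn (the sample IDs and values are made up) together with the `validate` and `indicator` options of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for df_samples and hrp_calls_fn (all values are invented).
samples = pd.DataFrame({
    'Sample': ['S1', 'S2', 'S3', 'S4'],
    'Country': ['Peru', 'Peru', 'Sudan', 'Ghana'],
    'QC pass': [True, True, True, False],
})
calls = pd.DataFrame({
    'Sample': ['S1', 'S2', 'S3'],
    'HRP2': ['del', 'nodel', 'uncallable'],
})

# A left join keeps every sample. validate raises an error if a sample ID is
# duplicated on either side; indicator adds a '_merge' column showing where
# each row was found.
merged = samples.merge(calls, on='Sample', how='left',
                       validate='one_to_one', indicator=True)

# Samples present in the metadata but missing from the calls table.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Sample'].tolist()
print(unmatched)  # ['S4']

# The default inner join, as used in the notebook, keeps only matched samples.
inner = samples.merge(calls, on='Sample')
print(len(inner))  # 3
```

In this toy example the only unmatched sample is the one that failed QC, which is the pattern described above for the real data.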

What are the deletion types and breakpoints in the hrp2 gene, and how many samples exhibit these deletions?

# Count hrp2 deletions by deletion type and breakpoint, sorted from most to least frequent
df_samples_hrp[
    df_samples_hrp['HRP2'] == 'del'
].groupby(['HRP2_deletion_type', 'HRP2_breakpoint']).size().sort_values(ascending=False)

HRP2_deletion_type  HRP2_breakpoint    
Telomere healing    Pf3D7_08_v3:1374986    6
                    Pf3D7_08_v3:1374462    5
                    Pf3D7_08_v3:1374932    2
                    Pf3D7_08_v3:1373732    1
                    Pf3D7_08_v3:1374280    1
dtype: int64

What are the deletion types and breakpoints in the hrp3 gene, and how many samples exhibit these deletions?

# Count hrp3 deletions by deletion type and breakpoint, sorted from most to least frequent
pd.DataFrame(
    df_samples_hrp[
        df_samples_hrp['HRP3'] == 'del'
    ].groupby(['HRP3_deletion_type', 'HRP3_breakpoint']).size().sort_values(ascending=False).rename("Samples_with_deletion")
)

		Samples_with_deletion
HRP3_deletion_type	HRP3_breakpoint
Chromosome 11 recombination	Pf3D7_13_v3:2800004-2807159	151
Chromosome 5 recombination	Pf3D7_13_v3:2835587-2835612	21
Telomere healing	Pf3D7_13_v3:2830952	7
	Pf3D7_13_v3:2837145	7
	Pf3D7_13_v3:2837392	3
	Pf3D7_13_v3:2838654	2
	Pf3D7_13_v3:2811525	1
	Pf3D7_13_v3:2812344	1
	Pf3D7_13_v3:2815249	1
	Pf3D7_13_v3:2822480	1
	Pf3D7_13_v3:2823645	1
	Pf3D7_13_v3:2832080	1
	Pf3D7_13_v3:2834604	1
	Pf3D7_13_v3:2835532	1
	Pf3D7_13_v3:2841024	1
	Pf3D7_13_v3:2841120	1

We can break the previous table down further by population, country, and first-level administrative division (admin level 1).

# Group the hrp3 deletions by deletion type, breakpoint, population, country, and admin level 1.
# Missing values are filled with '' so that groupby does not drop those rows.
pd.DataFrame(
    df_samples_hrp[
        df_samples_hrp['HRP3'] == 'del'
    ].fillna('').groupby(['HRP3_deletion_type', 'HRP3_breakpoint', 'Population', 'Country', 'Admin level 1']).size().rename("Samples_with_deletion")
)

					Samples_with_deletion
HRP3_deletion_type	HRP3_breakpoint	Population	Country	Admin level 1
Chromosome 11 recombination	Pf3D7_13_v3:2800004-2807159	AF-NE	Ethiopia	Amhara	6
			Ethiopia	Oromia	3
			Sudan	Kassala	5
		AF-W	Gambia	Upper River	1
			Gambia	Western	2
			Ghana	Upper East	1
			Mali	Kayes	1
			Senegal	Dakar	6
		AS-S-FE	Bangladesh	Chittagong	1
		AS-SE-E	Cambodia	Ratanakiri	3
			Laos	Champasak	1
			Laos	Salavan	13
			Vietnam	Binh Phuoc	2
		AS-SE-W	Thailand	Tak	1
		OC-NG	Indonesia	Papua	40
		SA	Colombia	Cauca	50
		SA	Peru	Loreto	15
Chromosome 5 recombination	Pf3D7_13_v3:2835587-2835612	AS-SE-E	Cambodia	Battambang	4
				Pailin	3
				Pursat	13
			Vietnam	Binh Phuoc	1
Telomere healing	Pf3D7_13_v3:2811525	AS-S-E	India	West Bengal	1
	Pf3D7_13_v3:2812344	AF-NE	Sudan	Kassala	1
	Pf3D7_13_v3:2815249	AF-E	Tanzania	Tanga	1
	Pf3D7_13_v3:2822480	AF-W	Ghana	Brong Ahafo	1
	Pf3D7_13_v3:2823645	AF-E	Kenya	Kilifi	1
	Pf3D7_13_v3:2830952	AS-SE-E	Cambodia	Battambang	7
	Pf3D7_13_v3:2832080	AF-C	Democratic Republic of the Congo	Kinshasa	1
	Pf3D7_13_v3:2834604	AS-SE-E	Vietnam	Binh Phuoc	1
	Pf3D7_13_v3:2835532	AS-SE-W	Thailand	Tak	1
	Pf3D7_13_v3:2837145	AS-SE-E	Vietnam	Binh Phuoc	7
	Pf3D7_13_v3:2837392	AS-SE-E	Cambodia	Ratanakiri	2
	Pf3D7_13_v3:2837392	AS-SE-E	Laos	Attapeu	1
	Pf3D7_13_v3:2838654	OC-NG	Indonesia	Papua	2
	Pf3D7_13_v3:2841024	AS-SE-W	Thailand	Tak	1
	Pf3D7_13_v3:2841120	OC-NG	Indonesia	Papua	1

Note that the breakpoints for the chromosome 11 and chromosome 5 recombination events have each been observed at multiple sampling locations, in some cases within the same country.
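The observation about sampling locations can also be checked programmatically rather than by reading the table. The sketch below counts distinct (country, admin level 1) pairs per breakpoint on a small invented table; the breakpoint labels `bp_A` and `bp_B` are placeholders, not real breakpoints.

```python
import pandas as pd

# Toy table of samples with hrp3 deletions (values are invented).
dels = pd.DataFrame({
    'HRP3_breakpoint': ['bp_A', 'bp_A', 'bp_A', 'bp_B', 'bp_B'],
    'Country': ['Laos', 'Laos', 'Peru', 'Kenya', 'Kenya'],
    'Admin level 1': ['Salavan', 'Champasak', 'Loreto', 'Kilifi', 'Kilifi'],
})

# Keep one row per (breakpoint, country, admin level 1) combination,
# then count how many distinct locations each breakpoint was seen in.
locations_per_breakpoint = (
    dels.drop_duplicates(['HRP3_breakpoint', 'Country', 'Admin level 1'])
    .groupby('HRP3_breakpoint')
    .size()
    .rename('Sampling locations')
)

# Breakpoints observed at more than one sampling location.
multi = locations_per_breakpoint[locations_per_breakpoint > 1]
print(list(multi.index))  # ['bp_A']
```

With the real data, missing admin level 1 values would likely need filling first (as the notebook does with `fillna('')`) so that those rows are not lost.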

In the following sections, we will create various summary tables.

Summary by deletion types table¶

The first summary table shows the number of distinct breakpoints, countries, and samples for each deletion type in the hrp2 and hrp3 genes.

To build it, we need a function that counts the distinct breakpoints, countries, and samples for each deletion type.

def hrp2_agg(x):
    """
    This function counts the number of distinct breakpoints, countries, and samples for each deletion type in hrp2.
    """
    # Create an ordered dictionary to store the information.
    summary = collections.OrderedDict()

    # Calculate the number of distinct breakpoints.
    summary['Distinct Breakpoints'] = len(x['HRP2_breakpoint'].unique())

    # Calculate the number of unique countries.
    summary['Countries'] = len(x['Country'].unique())

    # Calculate the total number of samples with deletions.
    summary['Samples with Deletion'] = len(x)

    # Return the summary information as a Pandas Series.
    return pd.Series(summary)

The hrp2_agg function can be adapted for hrp3 deletions by switching to the hrp3 breakpoint column.

def hrp3_agg(x):
    """
    This function counts the number of distinct breakpoints, countries, and samples for each deletion type in hrp3.
    """
    # Create an ordered dictionary to store the information.
    summary = collections.OrderedDict()

    # Calculate the number of distinct breakpoints.
    summary['Distinct Breakpoints'] = len(x['HRP3_breakpoint'].unique())

    # Calculate the number of unique countries.
    summary['Countries'] = len(x['Country'].unique())

    # Calculate the total number of samples with deletions.
    summary['Samples with Deletion'] = len(x)

    # Return the summary information as a Pandas Series.
    return pd.Series(summary)

# Create a summary of deletion types for hrp2 and hrp3 genes

# Filter samples with hrp2 deletions
# Group by deletion type, also include gene name
# Apply hrp2_agg function to find numbers
hrp2_agg_df = (
    df_samples_hrp[df_samples_hrp['HRP2'] == 'del']
    .assign(Gene='hrp2')
    .rename(columns={'HRP2_deletion_type': 'Deletion type'})
    .groupby(['Gene', 'Deletion type'])
    .apply(hrp2_agg)
)

# Filter samples with hrp3 deletions
# Group by deletion type, also include gene name
# Apply hrp3_agg function to find numbers
hrp3_agg_df = (
    df_samples_hrp[
        df_samples_hrp['QC pass'] & (df_samples_hrp['HRP3'] == 'del')
    ]
    .assign(Gene='hrp3')
    .rename(columns={'HRP3_deletion_type': 'Deletion type'})
    .groupby(['Gene', 'Deletion type'])
    .apply(hrp3_agg)
)

# Concatenate the results to create a summary of deletion types
df_deletion_types_summary = pd.concat([hrp2_agg_df, hrp3_agg_df])

# Display the summary
df_deletion_types_summary

		Distinct Breakpoints	Countries	Samples with Deletion
Gene	Deletion type
hrp2	Telomere healing	5	4	15
hrp3	Chromosome 11 recombination	1	14	151
	Chromosome 5 recombination	1	2	21
	Telomere healing	14	11	29
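Because hrp2_agg and hrp3_agg differ only in the breakpoint column they read, the two could be folded into a single helper. The following is a sketch under that assumption, using pandas named aggregation; the helper name `deletion_type_summary` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def deletion_type_summary(df, gene):
    """Summarise deletions by deletion type for one gene ('hrp2' or 'hrp3')."""
    status_col = gene.upper()                      # e.g. 'HRP2'
    breakpoint_col = f'{status_col}_breakpoint'    # e.g. 'HRP2_breakpoint'
    type_col = f'{status_col}_deletion_type'       # e.g. 'HRP2_deletion_type'
    return (
        df[df[status_col] == 'del']
        .assign(Gene=gene)
        .rename(columns={type_col: 'Deletion type'})
        .groupby(['Gene', 'Deletion type'])
        # Named aggregation: output column -> (input column, function)
        .agg(**{
            'Distinct Breakpoints': (breakpoint_col, 'nunique'),
            'Countries': ('Country', 'nunique'),
            'Samples with Deletion': ('Sample', 'count'),
        })
    )

# Toy data with the same column names as df_samples_hrp (values are invented).
toy = pd.DataFrame({
    'Sample': ['S1', 'S2', 'S3', 'S4'],
    'Country': ['Peru', 'Peru', 'Sudan', 'Ghana'],
    'HRP2': ['del', 'del', 'del', 'nodel'],
    'HRP2_breakpoint': ['bp1', 'bp1', 'bp2', '-'],
    'HRP2_deletion_type': ['Telomere healing'] * 3 + [None],
    'HRP3': ['nodel', 'del', 'nodel', 'nodel'],
    'HRP3_breakpoint': ['-', 'bp9', '-', '-'],
    'HRP3_deletion_type': [None, 'Chromosome 11 recombination', None, None],
})

summary = pd.concat([deletion_type_summary(toy, g) for g in ('hrp2', 'hrp3')])
print(summary)
```

Passing the gene name as a parameter avoids maintaining two near-identical functions, at the cost of relying on the column naming pattern staying consistent.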

Save the table¶

You will need to authorise Google Colab access to Google Drive.

drive.mount('/content/drive')

Mounted at /content/drive

# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else

df_deletion_types_summary.to_excel('/content/drive/My Drive/HRP_deletion_types_summary_unformatted.xlsx')
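When running outside Google Colab, where Drive mounting is not available, the table could instead be written to a local file. The sketch below assumes a CSV file in a temporary directory is acceptable; the toy table and the `.csv` file name are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy summary table with the same index structure as df_deletion_types_summary.
table = pd.DataFrame(
    {'Samples with Deletion': [3, 1]},
    index=pd.MultiIndex.from_tuples(
        [('hrp2', 'Telomere healing'), ('hrp3', 'Telomere healing')],
        names=['Gene', 'Deletion type'],
    ),
)

# Write to a temporary directory; replace with any local folder you prefer.
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / 'HRP_deletion_types_summary_unformatted.csv'
table.to_csv(out_path)

# Read the file back, restoring the two index levels.
restored = pd.read_csv(out_path, index_col=['Gene', 'Deletion type'])
print(restored)
```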

Summary by breakpoints¶

The second summary table will feature breakpoints, associated countries, and sample counts for distinct deletion types within the hrp2 and hrp3 genes.

This requires a function to aggregate countries and sample counts for each unique deletion type at specific breakpoints.

def breakpoint_agg(x):
    """
    This function aggregates countries and sample counts for each unique deletion type at specific breakpoints.
    """
    names = collections.OrderedDict()
    names['Countries'] = ', '.join((x['Country'].unique()))
    names['Samples with deletion'] = len(x)
    return pd.Series(names)

# Calculate summary statistics for hrp2
# Group samples with hrp2 deletions by gene name, deletion type, and breakpoint
# Apply breakpoint_agg function

hrp2_breakpoints_summary = (
    df_samples_hrp[df_samples_hrp['HRP2'] == 'del']
    .assign(Gene='hrp2')
    .rename(columns={'HRP2_deletion_type': 'Deletion type'})
    .groupby(['Gene', 'Deletion type', 'HRP2_breakpoint'])
    .apply(breakpoint_agg)
)

# Calculate summary statistics for hrp3
# Group samples with hrp3 deletions by gene name, deletion type, and breakpoint
# Apply breakpoint_agg function

hrp3_breakpoints_summary = (
    df_samples_hrp[df_samples_hrp['HRP3'] == 'del']
    .assign(Gene='hrp3')
    .rename(columns={'HRP3_deletion_type': 'Deletion type'})
    .groupby(['Gene', 'Deletion type', 'HRP3_breakpoint'])
    .apply(breakpoint_agg)
)

# Combine the summary statistics for HRP2 and HRP3 genes
df_breakpoints_summary = pd.concat([hrp2_breakpoints_summary, hrp3_breakpoints_summary])

# Display the combined summary
df_breakpoints_summary

			Countries	Samples with deletion
Gene	Deletion type	HRP2_breakpoint
hrp2	Telomere healing	Pf3D7_08_v3:1373732	Cambodia	1
		Pf3D7_08_v3:1374280	Sudan	1
		Pf3D7_08_v3:1374462	Indonesia	5
		Pf3D7_08_v3:1374932	Peru	2
		Pf3D7_08_v3:1374986	Peru	6
hrp3	Chromosome 11 recombination	Pf3D7_13_v3:2800004-2807159	Thailand, Ghana, Indonesia, Peru, Bangladesh, ...	151
	Chromosome 5 recombination	Pf3D7_13_v3:2835587-2835612	Cambodia, Vietnam	21
	Telomere healing	Pf3D7_13_v3:2811525	India	1
		Pf3D7_13_v3:2812344	Sudan	1
		Pf3D7_13_v3:2815249	Tanzania	1
		Pf3D7_13_v3:2822480	Ghana	1
		Pf3D7_13_v3:2823645	Kenya	1
		Pf3D7_13_v3:2830952	Cambodia	7
		Pf3D7_13_v3:2832080	Democratic Republic of the Congo	1
		Pf3D7_13_v3:2834604	Vietnam	1
		Pf3D7_13_v3:2835532	Thailand	1
		Pf3D7_13_v3:2837145	Vietnam	7
		Pf3D7_13_v3:2837392	Cambodia, Laos	3
		Pf3D7_13_v3:2838654	Indonesia	2
		Pf3D7_13_v3:2841024	Thailand	1
		Pf3D7_13_v3:2841120	Indonesia	1

Table Legend. Summary of hrp2 and hrp3 deletion breakpoints. Telomere healing refers to the process whereby the end of a chromosome is deleted and a telomere repeat sequence attached to the breakpoint. Chromosome 11 recombination refers to a new hybrid chromosome being created by a recombination between chromosome 13 and 11 at a cluster of rRNA genes that appear to have orthologous copies on both chromosomes. Chromosome 5 recombination refers to a recombination between chromosome 13 and an inverted section of the middle of chromosome 5 containing the gene mdr1. For telomere healing an exact breakpoint position is given but for recombination events it is only possible to give a region in which the recombination has occurred.
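As the legend notes, a breakpoint is either an exact position (telomere healing) or a region (recombination). A small helper can turn both forms of breakpoint string into chromosome and coordinates. This is a sketch only: the `Breakpoint` type and `parse_breakpoint` function are hypothetical names, and the format is inferred from the values shown in the tables above, with `-` meaning no breakpoint.

```python
from typing import NamedTuple, Optional

class Breakpoint(NamedTuple):
    chrom: str
    start: int
    end: int  # equal to start when an exact position is given

def parse_breakpoint(text: str) -> Optional[Breakpoint]:
    """Parse 'chrom:pos' or 'chrom:start-end'; return None for '-'."""
    if text == '-':
        return None
    # Split on the last ':' so the chromosome name is kept intact.
    chrom, _, coords = text.rpartition(':')
    # A region has 'start-end'; an exact position has no '-'.
    start, _, end = coords.partition('-')
    return Breakpoint(chrom, int(start), int(end) if end else int(start))

exact = parse_breakpoint('Pf3D7_08_v3:1374986')
region = parse_breakpoint('Pf3D7_13_v3:2800004-2807159')

print(exact)                      # Breakpoint(chrom='Pf3D7_08_v3', start=1374986, end=1374986)
print(region.end - region.start)  # 7155
print(parse_breakpoint('-'))      # None
```

Having numeric coordinates makes it straightforward to sort breakpoints by position or to compute the width of a recombination region.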

Save the table¶

# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else

df_breakpoints_summary.to_excel('/content/drive/My Drive/HRP_breakpoints_summary_unformatted.xlsx')

Summary by country¶

The final summary table shows, for each country, the number of samples with a callable genotype and the percentage of those samples carrying a deletion in hrp2, in hrp3, and in both genes.

# Define an aggregation function to calculate call counts and deletion percentages for hrp2 and hrp3.
def proportion_agg(x):
    """
    This function counts the samples with a callable genotype and calculates the percentage of those samples with a deletion, for hrp2, hrp3, and both genes together.
    """
    names = collections.OrderedDict()

    # Count the samples with a callable hrp2 genotype
    names['hrp2 calls'] = np.count_nonzero(x['HRP2'] != 'uncallable')

    # Calculate the percentage of callable samples with an hrp2 deletion
    names['% hrp2 deletions'] = round(np.count_nonzero(x['HRP2'] == 'del') / np.count_nonzero(x['HRP2'] != 'uncallable') * 100)

    # Count the samples with a callable hrp3 genotype
    names['hrp3 calls'] = np.count_nonzero(x['HRP3'] != 'uncallable')

    # Calculate the percentage of callable samples with an hrp3 deletion
    names['% hrp3 deletions'] = round(np.count_nonzero(x['HRP3'] == 'del') / np.count_nonzero(x['HRP3'] != 'uncallable') * 100)

    # Count the samples with a callable genotype for both hrp2 and hrp3
    names['hrp2 and hrp3 calls'] = np.count_nonzero((x['HRP2'] != 'uncallable') & (x['HRP3'] != 'uncallable'))

    # Calculate the percentage of samples callable for both genes that have both deleted
    names['% hrp2 and hrp3 deletions'] = round(np.count_nonzero(
        (x['HRP2'] == 'del') & (x['HRP3'] == 'del')
    ) / np.count_nonzero(
        (x['HRP2'] != 'uncallable') & (x['HRP3'] != 'uncallable')
    ) * 100)

    # Convert the values to plain Python integers
    names = {key: int(value) for key, value in names.items()}

    return pd.Series(names)

# Create a table to summarize HRP2 and HRP3 gene information by country

# Calculate call counts and deletion percentages by country (one row per country)
df_hrp_by_country_table = (
    df_samples_hrp.groupby('Country')  # Group the data by country
    .apply(proportion_agg)  # Apply the aggregation function to each country
)

# Calculate sample counts for each country
hrp_by_country_n = (
    df_samples_hrp.groupby('Country')  # Group the data by country
    .size()  # Calculate the number of samples in each country
)

# Relabel the rows to include sample counts next to country names
df_hrp_by_country_table = df_hrp_by_country_table.rename(
    index=lambda country: f'{country} (n={hrp_by_country_n[country]:,})'
)

# Display the table with summarized information
df_hrp_by_country_table

	hrp2 calls	% hrp2 deletions	hrp3 calls	% hrp3 deletions	hrp2 and hrp3 calls	% hrp2 and hrp3 deletions
Country
Bangladesh (n=1,310)	939	0	850	0	819	0
Benin (n=150)	110	0	104	0	100	0
Burkina Faso (n=57)	43	0	32	0	32	0
Cambodia (n=1,267)	1109	0	1091	3	1064	0
Cameroon (n=264)	244	0	240	0	235	0
Colombia (n=135)	124	0	123	41	118	0
Côte d'Ivoire (n=71)	70	0	71	0	70	0
Democratic Republic of the Congo (n=520)	413	0	392	0	385	0
Ethiopia (n=21)	20	0	20	45	20	0
Gabon (n=55)	34	0	38	0	32	0
Gambia (n=863)	517	0	467	1	460	0
Ghana (n=3,131)	1529	0	1448	0	1343	0
Guinea (n=151)	121	0	119	0	119	0
India (n=300)	75	0	70	1	68	0
Indonesia (n=121)	117	4	117	37	115	2
Kenya (n=690)	660	0	647	0	645	0
Laos (n=991)	773	0	717	2	669	0
Madagascar (n=24)	24	0	22	0	22	0
Malawi (n=265)	265	0	264	0	264	0
Mali (n=1,167)	709	0	691	0	669	0
Mauritania (n=92)	79	0	81	0	79	0
Mozambique (n=34)	10	0	10	0	6	0
Myanmar (n=985)	645	0	606	0	585	0
Nigeria (n=110)	34	0	30	0	30	0
Papua New Guinea (n=221)	118	0	106	0	106	0
Peru (n=21)	21	38	20	75	20	30
Senegal (n=150)	142	0	141	4	138	0
Sudan (n=76)	7	14	7	86	7	14
Tanzania (n=589)	452	0	470	0	440	0
Thailand (n=954)	855	0	846	0	823	0
Uganda (n=12)	12	0	12	0	12	0
Venezuela (n=2)	2	0	2	0	2	0
Vietnam (n=1,404)	762	0	740	1	670	0

Table Legend. Frequency of HRP2 and HRP3 deletions by country. n=number of QC pass samples. Calls columns show number of samples for which an unambiguous deletion genotype (deleted or non-deleted) could be assigned.
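Each percentage in the table is the number of deleted samples divided by the number of callable samples, rounded to the nearest whole number. The sketch below reproduces that arithmetic in plain Python and adds a guard for groups with no callable samples, which would otherwise cause a division by zero. The function name `percent_deleted` is hypothetical, and the toy list is shaped like the Sudan row (76 samples, 7 callable for hrp2, 1 with a deletion).

```python
def percent_deleted(statuses):
    """Percentage of callable samples with a deletion, or None if none are callable."""
    callable_calls = [s for s in statuses if s != 'uncallable']
    if not callable_calls:
        # No unambiguous genotypes, so no percentage can be reported.
        return None
    n_del = sum(s == 'del' for s in callable_calls)
    return round(100 * n_del / len(callable_calls))

# 1 deletion, 6 non-deleted, 69 uncallable: 1 / 7 callable samples.
sudan_like = ['del'] + ['nodel'] * 6 + ['uncallable'] * 69

print(len(sudan_like))                                 # 76
print(percent_deleted(sudan_like))                     # 14
print(percent_deleted(['uncallable', 'uncallable']))   # None
```

Percentages based on very few callable samples, as in this example, should be interpreted with caution.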

Save the table¶

# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else

df_hrp_by_country_table.to_excel('/content/drive/My Drive/HRP_country_summary_unformatted.xlsx')
