Visualise hrp2 and hrp3 Deletions

Introduction

This notebook will create a public notebook for the generation of Figure 5 from the main Pf7 paper - a figure of HRP deletion breakpoints.

hrp2 and hrp3 are genes located in subtelomeric regions of the genome with very high levels of natural variation. Deletion in those genes can cause failure of rapid diagnostic tests and is therefore important to monitor.

Deletion is a genetic event in which a segment of DNA is entirely removed or missing. In this context, ‘breakpoints’ denote specific locations on the chromosome where such deletions take place.

This notebook should take approximately two minutes to run.

Setup

Install and import the malariagen Python package:

!pip install -q --no-warn-conflicts malariagen_data
import malariagen_data

Import required python libraries that are installed at colab by default.

import numpy as np
import pandas as pd
import collections
import matplotlib.pyplot as plt
from google.colab import drive

Access Pf7 Data

We use the malariagen data package to load the release data.

release_data = malariagen_data.Pf7()
df_samples = release_data.sample_metadata()

hrp2 & hrp3 Deletions

We additionally require list of deletion and breakpoint locations within the histidine-rich protein genes II and III (HRP-2 and -3) across 16,203 QC-pass samples. We can access this data from MalariaGEN web page.

# Fetch details of hrp calls from this MalariaGEN.net file
hrp_calls_fn = pd.read_csv('https://www.malariagen.net/wp-content/uploads/2024/01/hrp_calls_pf7.tsv', sep='\t')

# Print the shape and first rows
print(hrp_calls_fn.shape)
hrp_calls_fn.head()
(16203, 7)
Sample HRP2 HRP3 HRP2_breakpoint HRP3_breakpoint HRP2_deletion_type HRP3_deletion_type
0 FP0008-C nodel nodel - - NaN NaN
1 FP0009-C nodel nodel - - NaN NaN
2 FP0010-CW uncallable uncallable - - NaN NaN
3 FP0011-CW uncallable uncallable - - NaN NaN
4 FP0012-CW nodel nodel - - NaN NaN

Now, let’s merge hrp_calls_fn with df_samples which contains various metadata of Pf7 samples.

# Merge df_samples with hrp_calls_fn
df_samples =  df_samples.merge(hrp_calls_fn, on ='Sample')

3D7 Reference Annotation

Another relevant dataset in this notebook is the 3D7 reference annotation which is used for genome annotations in Pf7. It can be accessed at the following FTP link: ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/Pfalciparum_replace_Pf3D7_MIT_v3_with_Pf_M76611.gff.

We will use the wget package to download this file. After running the following cell, the file should appear in the list of files on the left.

!wget ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/Pfalciparum_replace_Pf3D7_MIT_v3_with_Pf_M76611.gff
--2024-02-08 11:27:43--  ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/Pfalciparum_replace_Pf3D7_MIT_v3_with_Pf_M76611.gff
           => ‘Pfalciparum_replace_Pf3D7_MIT_v3_with_Pf_M76611.gff.1’
Resolving ngs.sanger.ac.uk (ngs.sanger.ac.uk)... 193.62.203.221
Connecting to ngs.sanger.ac.uk (ngs.sanger.ac.uk)|193.62.203.221|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /production/malaria/Resource/34 ... done.
==> SIZE Pfalciparum_replace_Pf3D7_MIT_v3_with_Pf_M76611.gff ... 21035550
==> PASV ... done.    ==> RETR Pfalciparum_replace_Pf3D7_MIT_v3_with_Pf_M76611.gff ... done.
Length: 21035550 (20M) (unauthoritative)

Pfalciparum_replace 100%[===================>]  20.06M  18.2MB/s    in 1.1s    

2024-02-08 11:27:45 (18.2 MB/s) - ‘Pfalciparum_replace_Pf3D7_MIT_v3_with_Pf_M76611.gff.1’ saved [21035550]

This file is in GFF file format defined by Ensembl.

Since the file is tab-separated, we can proceed to read the data into a Pandas dataframe once again.

# Read the file column names.
df_gff = pd.read_csv('/content/Pfalciparum_replace_Pf3D7_MIT_v3_with_Pf_M76611.gff', names=['chrom', 'db', 'type', 'start', 'end', 'u1', 'strand', 'u2', 'id'], comment='#', sep='\t')
# Print the shape (rows and columns) of df_gff
print(df_gff.shape)
# Show the first 5 rows
df_gff.head()
(40713, 9)
chrom db type start end u1 strand u2 id
0 Pf3D7_01_v3 chado repeat_region 1 360 . + . ID=Pfalciparum_REP_20;comment=telomeric repeat...
1 Pf3D7_01_v3 chado repeat_region 361 1418 . + . ID=Pfalciparum_REP_15;comment=14bp repeat
2 Pf3D7_01_v3 chado repeat_region 2160 3858 . + . ID=Pfalciparum_REP_35;comment=65bp repeat
3 Pf3D7_01_v3 chado repeat_region 8856 9021 . + . ID=Pfalciparum_REP_5;comment=25bp repeat
4 Pf3D7_01_v3 chado repeat_region 9313 9529 . + . ID=Pfalciparum_REP_25;comment=26bp repeat

Figure Preparation

We need to find the start and end positions of chromosomes to draw gene annotations in the figure that we are going to create.

# Find start and end positions for each chromosome by grouping chromosome coordinates
df_chroms = df_gff.groupby('chrom').agg({'start': 'min', 'end': 'max'}).reset_index()
# Set 'chrom' as the index
df_chroms.set_index('chrom', inplace=True)
df_chroms
start end
chrom
Pf3D7_01_v3 1 640851
Pf3D7_02_v3 1 947102
Pf3D7_03_v3 1 1067971
Pf3D7_04_v3 1 1200490
Pf3D7_05_v3 1 1343557
Pf3D7_06_v3 1 1418242
Pf3D7_07_v3 1 1445207
Pf3D7_08_v3 1 1472805
Pf3D7_09_v3 1 1541735
Pf3D7_10_v3 1 1687656
Pf3D7_11_v3 1 2038340
Pf3D7_12_v3 1 2260569
Pf3D7_13_v3 1 2925236
Pf3D7_14_v3 1 3291936
Pf3D7_API_v3 1 34225
Pf_M76611 3 5954

The next question is: How many samples have deletions in each country?

# This function returns samples with deletion for each country.

def breakpoint_agg(x):
    names = collections.OrderedDict()
    names['Countries'] = ''
    countries = []
    # Loop over each country
    # Count non-zero samples (with deletion)
    for country in x['Country'].unique():
        countries.append(f"{country} ({np.count_nonzero(x['Country'] == country)})")
    # Join together country name and number
    names['Countries'] = ', '.join(countries)
    names['Samples with deletion'] = len(x)
    return pd.Series(names)

We will apply breakpoint_agg function to df_samples separately for hrp2 and hrp3.

Additionally, we will find genomic coordinates at the edges for mapping in the figure.

# Group samples by Deletion type and HRP3_breakpoint
# Apply breakpoint_agg to count samples with deletion in each country
df_hrp2 = (
    df_samples[
        df_samples['QC pass']
        & ( df_samples['HRP2'] == 'del' )
]
    .assign(Gene='$hrp2$')
    .rename(columns={'HRP2_deletion_type': 'Deletion type'})
    .groupby(['Deletion type', 'HRP2_breakpoint'])
    .apply(breakpoint_agg)
    .reset_index()
)
# Seperate coordinate value from chromosome
df_hrp2['breakpoint'] = df_hrp2['HRP2_breakpoint'].apply(lambda x: int(x.split(':')[1]))

# Print min and max coordinates
print(f"HRP2 min breakpoint = {df_hrp2['breakpoint'].min()}")
print(f"HRP2 max breakpoint = {df_hrp2['breakpoint'].max()}")
df_hrp2
HRP2 min breakpoint = 1373732
HRP2 max breakpoint = 1374986
Deletion type HRP2_breakpoint Countries Samples with deletion breakpoint
0 Telomere healing Pf3D7_08_v3:1373732 Cambodia (1) 1 1373732
1 Telomere healing Pf3D7_08_v3:1374280 Sudan (1) 1 1374280
2 Telomere healing Pf3D7_08_v3:1374462 Indonesia (5) 5 1374462
3 Telomere healing Pf3D7_08_v3:1374932 Peru (2) 2 1374932
4 Telomere healing Pf3D7_08_v3:1374986 Peru (6) 6 1374986

A repeat of the same look-up for hrp3.

# Group samples by Deletion type and HRP3_breakpoint
# Apply breakpoint_agg to count samples with deletion in each country
df_hrp3 = (
    df_samples[
        df_samples['QC pass']
        & ( df_samples['HRP3'] == 'del' )
    ]
    .rename(columns={'HRP3_deletion_type': 'Deletion type'})
    .groupby(['Deletion type', 'HRP3_breakpoint'])
    .apply(breakpoint_agg)
    .reset_index()
)
# Seperate coordinate value from chromosome
df_hrp3['breakpoint'] = df_hrp3['HRP3_breakpoint'].apply(lambda x: x.split(':')[1])

# Print min and max coordinates
print(f"HRP3 min breakpoint = {df_hrp3['breakpoint'].min()}")
print(f"HRP3 max breakpoint = {df_hrp3['breakpoint'].max()}")
df_hrp3
HRP3 min breakpoint = 2800004-2807159
HRP3 max breakpoint = 2841120
Deletion type HRP3_breakpoint Countries Samples with deletion breakpoint
0 Chromosome 11 recombination Pf3D7_13_v3:2800004-2807159 Thailand (1), Ghana (1), Indonesia (40), Peru ... 151 2800004-2807159
1 Chromosome 5 recombination Pf3D7_13_v3:2835587-2835612 Cambodia (20), Vietnam (1) 21 2835587-2835612
2 Telomere healing Pf3D7_13_v3:2811525 India (1) 1 2811525
3 Telomere healing Pf3D7_13_v3:2812344 Sudan (1) 1 2812344
4 Telomere healing Pf3D7_13_v3:2815249 Tanzania (1) 1 2815249
5 Telomere healing Pf3D7_13_v3:2822480 Ghana (1) 1 2822480
6 Telomere healing Pf3D7_13_v3:2823645 Kenya (1) 1 2823645
7 Telomere healing Pf3D7_13_v3:2830952 Cambodia (7) 7 2830952
8 Telomere healing Pf3D7_13_v3:2832080 Democratic Republic of the Congo (1) 1 2832080
9 Telomere healing Pf3D7_13_v3:2834604 Vietnam (1) 1 2834604
10 Telomere healing Pf3D7_13_v3:2835532 Thailand (1) 1 2835532
11 Telomere healing Pf3D7_13_v3:2837145 Vietnam (7) 7 2837145
12 Telomere healing Pf3D7_13_v3:2837392 Cambodia (2), Laos (1) 3 2837392
13 Telomere healing Pf3D7_13_v3:2838654 Indonesia (2) 2 2838654
14 Telomere healing Pf3D7_13_v3:2841024 Thailand (1) 1 2841024
15 Telomere healing Pf3D7_13_v3:2841120 Indonesia (1) 1 2841120

It seems that Chromosome 11 recombination breakpoints are observed in multiple countries which motivates us to print the full list of countries.

# Locate the first row by using the index
df_hrp3.iloc[0]['Countries']
'Thailand (1), Ghana (1), Indonesia (40), Peru (15), Bangladesh (1), Vietnam (2), Colombia (50), Ethiopia (9), Senegal (6), Laos (14), Cambodia (3), Sudan (5), Mali (1), Gambia (3)'

Given the fact that many of these events result in the deletion of other genes in addition to hrp2 and hrp3.

We could have a look at which genes are present within the range of breakpoints before including them in the plot.

# Genes in hrp2 breakpoints
df_gff.loc[
    ( df_gff['chrom'] == 'Pf3D7_08_v3' )
    & ( df_gff['start'] <= 1375500 )
    & ( df_gff['end'] >= 1364000 )
]
chrom db type start end u1 strand u2 id
15534 Pf3D7_08_v3 chado three_prime_UTR 1364640 1365466 . - . ID=PF3D7_0831700.1:3UTR;Parent=PF3D7_0831700.1
15535 Pf3D7_08_v3 chado gene 1364640 1369862 . - . ID=PF3D7_0831700;Name=HSP70x;previous_systemat...
15536 Pf3D7_08_v3 chado mRNA 1364640 1369862 . - . ID=PF3D7_0831700.1;comment=HSP70-x can be knoc...
15537 Pf3D7_08_v3 chado CDS 1365467 1367506 . - 0 ID=PF3D7_0831700.1:exon:1;Parent=PF3D7_0831700.1
15538 Pf3D7_08_v3 chado polypeptide 1365467 1367506 . - . ID=PF3D7_0831700.1:pep;Derives_from=PF3D7_0831...
15539 Pf3D7_08_v3 chado five_prime_UTR 1367507 1367640 . - . ID=PF3D7_0831700.1:5UTR:1;Parent=PF3D7_0831700.1
15540 Pf3D7_08_v3 chado five_prime_UTR 1368649 1369862 . - . ID=PF3D7_0831700.1:5UTR;Parent=PF3D7_0831700.1
15541 Pf3D7_08_v3 chado pseudogenic_exon 1371847 1372100 . + . ID=PF3D7_0831750.1:exon:2;Parent=PF3D7_0831750.1
15542 Pf3D7_08_v3 chado pseudogene 1371847 1372720 . + . ID=PF3D7_0831750
15543 Pf3D7_08_v3 chado pseudogenic_transcript 1371847 1372720 . + . ID=PF3D7_0831750.1;Parent=PF3D7_0831750;Dbxref...
15544 Pf3D7_08_v3 chado polypeptide 1371847 1372720 . + . ID=PF3D7_0831750.1:pep;Derives_from=PF3D7_0831...
15545 Pf3D7_08_v3 chado pseudogenic_exon 1372103 1372223 . + . ID=PF3D7_0831750.1:exon:3;Parent=PF3D7_0831750.1
15546 Pf3D7_08_v3 chado pseudogenic_exon 1372225 1372291 . + . ID=PF3D7_0831750.1:exon:4;Parent=PF3D7_0831750.1
15547 Pf3D7_08_v3 chado pseudogenic_exon 1372294 1372577 . + . ID=PF3D7_0831750.1:exon:5;Parent=PF3D7_0831750.1
15548 Pf3D7_08_v3 chado pseudogenic_exon 1372579 1372667 . + . ID=PF3D7_0831750.1:exon:6;Parent=PF3D7_0831750.1
15549 Pf3D7_08_v3 chado pseudogenic_exon 1372669 1372720 . + . ID=PF3D7_0831750.1:exon:7;Parent=PF3D7_0831750.1
15550 Pf3D7_08_v3 chado three_prime_UTR 1373212 1374235 . - . ID=PF3D7_0831800.1:3UTR;Parent=PF3D7_0831800.1
15551 Pf3D7_08_v3 chado gene 1373212 1376988 . - . ID=PF3D7_0831800;Name=HRP2;previous_systematic...
15552 Pf3D7_08_v3 chado mRNA 1373212 1376988 . - . ID=PF3D7_0831800.1;comment=HRP2%2C which is on...
15553 Pf3D7_08_v3 chado CDS 1374236 1375084 . - 0 ID=PF3D7_0831800.1:exon:3;Parent=PF3D7_0831800.1
15554 Pf3D7_08_v3 chado polypeptide 1374236 1375299 . - . ID=PF3D7_0831800.1:pep;Derives_from=PF3D7_0831...
15555 Pf3D7_08_v3 chado CDS 1375231 1375299 . - 0 ID=PF3D7_0831800.1:exon:2;Parent=PF3D7_0831800.1
15556 Pf3D7_08_v3 chado five_prime_UTR 1375300 1376988 . - . ID=PF3D7_0831800.1:5UTR;Parent=PF3D7_0831800.1
# Genes in hrp3 breakpoints
pd.options.display.max_rows = 100
df_gff.loc[
    ( df_gff['chrom'] == 'Pf3D7_13_v3' )
    & ( df_gff['start'] <= 2845000 )
    & ( df_gff['end'] >= 2795000 )
]
chrom db type start end u1 strand u2 id
34303 Pf3D7_13_v3 chado gene 2796119 2797144 . + . ID=PF3D7_1370800;previous_systematic_id=PF13TR...
34304 Pf3D7_13_v3 chado ncRNA 2796119 2797144 . + . ID=PF3D7_1370800.1;comment=Barrell%2C March 20...
34305 Pf3D7_13_v3 chado CDS 2796119 2797144 . + 0 ID=PF3D7_1370800.1:exon:1;Parent=PF3D7_1370800.1
34306 Pf3D7_13_v3 chado gene 2797507 2798103 . + . ID=PF3D7_1370900;previous_systematic_id=PF13TR...
34307 Pf3D7_13_v3 chado ncRNA 2797507 2798103 . + . ID=PF3D7_1370900.1;comment=Barrell%2C March 20...
34308 Pf3D7_13_v3 chado CDS 2797507 2798103 . + 0 ID=PF3D7_1370900.1:exon:1;Parent=PF3D7_1370900.1
34309 Pf3D7_13_v3 chado gene 2800004 2802154 . + . ID=PF3D7_1371000;previous_systematic_id=MAL13_...
34310 Pf3D7_13_v3 chado rRNA 2800004 2802154 . + . ID=PF3D7_1371000.1;literature=PMID:17901154;Pa...
34311 Pf3D7_13_v3 chado CDS 2800004 2802154 . + 0 ID=PF3D7_1371000.1:exon:3;Parent=PF3D7_1371000.1
34312 Pf3D7_13_v3 chado gene 2802527 2802686 . + . ID=PF3D7_1371200;previous_systematic_id=MAL13_...
34313 Pf3D7_13_v3 chado rRNA 2802527 2802686 . + . ID=PF3D7_1371200.1;literature=PMID:17901154;Pa...
34314 Pf3D7_13_v3 chado CDS 2802527 2802686 . + 0 ID=PF3D7_1371200.1:exon:1;Parent=PF3D7_1371200.1
34315 Pf3D7_13_v3 chado gene 2802945 2807159 . + . ID=PF3D7_1371300;previous_systematic_id=MAL13_...
34316 Pf3D7_13_v3 chado rRNA 2802945 2807159 . + . ID=PF3D7_1371300.1;literature=PMID:17901154;Pa...
34317 Pf3D7_13_v3 chado CDS 2802945 2807159 . + 0 ID=PF3D7_1371300.1:exon:3;Parent=PF3D7_1371300.1
34318 Pf3D7_13_v3 chado three_prime_UTR 2808200 2808562 . - . ID=PF3D7_1371500.1:3UTR;Parent=PF3D7_1371500.1
34319 Pf3D7_13_v3 chado gene 2808200 2810256 . - . ID=PF3D7_1371500;previous_systematic_id=MAL13P...
34320 Pf3D7_13_v3 chado mRNA 2808200 2810256 . - . ID=PF3D7_1371500.1;Parent=PF3D7_1371500;Dbxref...
34321 Pf3D7_13_v3 chado CDS 2808563 2809000 . - 0 ID=PF3D7_1371500.1:exon:1;Parent=PF3D7_1371500.1
34322 Pf3D7_13_v3 chado polypeptide 2808563 2809222 . - . ID=PF3D7_1371500.1:1:pep;Derives_from=PF3D7_13...
34323 Pf3D7_13_v3 chado CDS 2809154 2809222 . - 0 ID=PF3D7_1371500.1:exon:2;Parent=PF3D7_1371500.1
34324 Pf3D7_13_v3 chado five_prime_UTR 2809223 2810256 . - . ID=PF3D7_1371500.1:5UTR;Parent=PF3D7_1371500.1
34325 Pf3D7_13_v3 chado five_prime_UTR 2811277 2821077 . + . ID=PF3D7_1371700.1:5UTR;Parent=PF3D7_1371700.1
34326 Pf3D7_13_v3 chado gene 2811277 2824446 . + . ID=PF3D7_1371700;previous_systematic_id=MAL13P...
34327 Pf3D7_13_v3 chado mRNA 2811277 2824446 . + . ID=PF3D7_1371700.1;literature=PMID:15752424,PM...
34328 Pf3D7_13_v3 chado pseudogenic_exon 2811706 2812263 . + . ID=PF3D7_1371600.1:exon:2;Parent=PF3D7_1371600.1
34329 Pf3D7_13_v3 chado pseudogene 2811706 2820270 . + . ID=PF3D7_1371600;Name=EBL1;previous_systematic...
34330 Pf3D7_13_v3 chado pseudogenic_transcript 2811706 2820270 . + . ID=PF3D7_1371600.1;literature=PMID:15760663,PM...
34331 Pf3D7_13_v3 chado polypeptide 2811706 2820270 . + . ID=PF3D7_1371600.1:pep;Derives_from=PF3D7_1371...
34332 Pf3D7_13_v3 chado pseudogenic_exon 2812266 2819628 . + . ID=PF3D7_1371600.1:exon:3;Parent=PF3D7_1371600.1
34333 Pf3D7_13_v3 chado pseudogenic_exon 2819764 2819851 . + . ID=PF3D7_1371600.1:exon:1;Parent=PF3D7_1371600.1
34334 Pf3D7_13_v3 chado pseudogenic_exon 2820015 2820088 . + . ID=PF3D7_1371600.1:exon:4;Parent=PF3D7_1371600.1
34335 Pf3D7_13_v3 chado pseudogenic_exon 2820227 2820270 . + . ID=PF3D7_1371600.1:exon:5;Parent=PF3D7_1371600.1
34336 Pf3D7_13_v3 chado CDS 2821078 2821173 . + 0 ID=PF3D7_1371700.1:exon:2;Parent=PF3D7_1371700.1
34337 Pf3D7_13_v3 chado polypeptide 2821078 2823292 . + . ID=PF3D7_1371700.1:pep;Derives_from=PF3D7_1371...
34338 Pf3D7_13_v3 chado CDS 2821278 2822786 . + 0 ID=PF3D7_1371700.1:exon:3;Parent=PF3D7_1371700.1
34339 Pf3D7_13_v3 chado CDS 2823212 2823292 . + 0 ID=PF3D7_1371700.1:exon:4;Parent=PF3D7_1371700.1
34340 Pf3D7_13_v3 chado three_prime_UTR 2823251 2824301 . - . ID=PF3D7_1371800.1:3UTR;Parent=PF3D7_1371800.1
34341 Pf3D7_13_v3 chado gene 2823251 2825967 . - . ID=PF3D7_1371800;previous_systematic_id=MAL13P...
34342 Pf3D7_13_v3 chado mRNA 2823251 2825967 . - . ID=PF3D7_1371800.1;Parent=PF3D7_1371800;Dbxref...
34343 Pf3D7_13_v3 chado three_prime_UTR 2823293 2824446 . + . ID=PF3D7_1371700.1:3UTR;Parent=PF3D7_1371700.1
34344 Pf3D7_13_v3 chado CDS 2824302 2825552 . - 0 ID=PF3D7_1371800.1:exon:3;Parent=PF3D7_1371800.1
34345 Pf3D7_13_v3 chado polypeptide 2824302 2825852 . - . ID=PF3D7_1371800.1:pep;Derives_from=PF3D7_1371...
34346 Pf3D7_13_v3 chado CDS 2825781 2825852 . - 0 ID=PF3D7_1371800.1:exon:2;Parent=PF3D7_1371800.1
34347 Pf3D7_13_v3 chado five_prime_UTR 2825853 2825967 . - . ID=PF3D7_1371800.1:5UTR;Parent=PF3D7_1371800.1
34348 Pf3D7_13_v3 chado five_prime_UTR 2829530 2829855 . + . ID=PF3D7_1371900.1:5UTR;Parent=PF3D7_1371900.1
34349 Pf3D7_13_v3 chado gene 2829530 2830830 . + . ID=PF3D7_1371900;previous_systematic_id=MAL13P...
34350 Pf3D7_13_v3 chado mRNA 2829530 2830830 . + . ID=PF3D7_1371900.1;Parent=PF3D7_1371900;Dbxref...
34351 Pf3D7_13_v3 chado CDS 2829856 2829927 . + 0 ID=PF3D7_1371900.1:exon:1;Parent=PF3D7_1371900.1
34352 Pf3D7_13_v3 chado polypeptide 2829856 2830669 . + . ID=PF3D7_1371900.1:pep;Derives_from=PF3D7_1371...
34353 Pf3D7_13_v3 chado CDS 2830109 2830669 . + 0 ID=PF3D7_1371900.1:exon:2;Parent=PF3D7_1371900.1
34354 Pf3D7_13_v3 chado three_prime_UTR 2830670 2830830 . + . ID=PF3D7_1371900.1:3UTR;Parent=PF3D7_1371900.1
34355 Pf3D7_13_v3 chado five_prime_UTR 2832623 2832951 . + . ID=PF3D7_1372000.1:5UTR;Parent=PF3D7_1372000.1
34356 Pf3D7_13_v3 chado gene 2832623 2835439 . + . ID=PF3D7_1372000;previous_systematic_id=MAL13P...
34357 Pf3D7_13_v3 chado mRNA 2832623 2835439 . + . ID=PF3D7_1372000.1;comment=PHIST domain protei...
34358 Pf3D7_13_v3 chado CDS 2832952 2833086 . + 0 ID=PF3D7_1372000.1:exon:1;Parent=PF3D7_1372000.1
34359 Pf3D7_13_v3 chado polypeptide 2832952 2834322 . + . ID=PF3D7_1372000.1:pep;Derives_from=PF3D7_1372...
34360 Pf3D7_13_v3 chado CDS 2833204 2834322 . + 0 ID=PF3D7_1372000.1:exon:2;Parent=PF3D7_1372000.1
34361 Pf3D7_13_v3 chado three_prime_UTR 2834323 2835439 . + . ID=PF3D7_1372000.1:3UTR;Parent=PF3D7_1372000.1
34362 Pf3D7_13_v3 chado five_prime_UTR 2835756 2837052 . + . ID=PF3D7_1372100.1:5UTR;Parent=PF3D7_1372100.1
34363 Pf3D7_13_v3 chado gene 2835756 2839580 . + . ID=PF3D7_1372100;Name=GEXP04;previous_systemat...
34364 Pf3D7_13_v3 chado mRNA 2835756 2839580 . + . ID=PF3D7_1372100.1;comment=Barrell%2C October ...
34365 Pf3D7_13_v3 chado CDS 2837053 2837136 . + 0 ID=PF3D7_1372100.1:exon:1;Parent=PF3D7_1372100.1
34366 Pf3D7_13_v3 chado polypeptide 2837053 2839058 . + . ID=PF3D7_1372100.1:pep;Derives_from=PF3D7_1372...
34367 Pf3D7_13_v3 chado CDS 2837313 2839058 . + 0 ID=PF3D7_1372100.1:exon:2;Parent=PF3D7_1372100.1
34368 Pf3D7_13_v3 chado three_prime_UTR 2839059 2839580 . + . ID=PF3D7_1372100.1:3UTR;Parent=PF3D7_1372100.1
34369 Pf3D7_13_v3 chado three_prime_UTR 2840236 2840726 . - . ID=PF3D7_1372200.1:3UTR;Parent=PF3D7_1372200.1
34370 Pf3D7_13_v3 chado gene 2840236 2842840 . - . ID=PF3D7_1372200;Name=HRPIII;previous_systemat...
34371 Pf3D7_13_v3 chado mRNA 2840236 2842840 . - . ID=PF3D7_1372200.1;comment=sub-telomeric gene ...
34372 Pf3D7_13_v3 chado CDS 2840727 2841485 . - 0 ID=PF3D7_1372200.1:exon:1;Parent=PF3D7_1372200.1
34373 Pf3D7_13_v3 chado polypeptide 2840727 2841703 . - . ID=PF3D7_1372200.1:pep;Derives_from=PF3D7_1372...
34374 Pf3D7_13_v3 chado CDS 2841635 2841703 . - 0 ID=PF3D7_1372200.1:exon:2;Parent=PF3D7_1372200.1
34375 Pf3D7_13_v3 chado five_prime_UTR 2841704 2842840 . - . ID=PF3D7_1372200.1:5UTR;Parent=PF3D7_1372200.1
34376 Pf3D7_13_v3 chado five_prime_UTR 2843157 2845766 . + . ID=PF3D7_1372300.1:5UTR;Parent=PF3D7_1372300.1
34377 Pf3D7_13_v3 chado gene 2843157 2847557 . + . ID=PF3D7_1372300
34378 Pf3D7_13_v3 chado mRNA 2843157 2847557 . + . ID=PF3D7_1372300.1;Parent=PF3D7_1372300;Dbxref...

Create figure

This intricate figure serves to map deletion breakpoints in hrp2 and hrp3 across various countries. The x-axis of the figure displays the deletion breakpoints, while the y-axis shows the countries along with the number of breakpoints they exhibit.

The figure comprises two sections: hrp2 (positioned at the top, subplots 1-5) and hrp3 (located at the bottom, subplots 6-11), consisting of a total of 11 subplots.

We start by defining distinct colour codes for each genomic region that we will annotate.

# Create a dictionary with distinct colour codes
figure_colours = collections.OrderedDict()
figure_colours['chr_8_13'] = '#377eb8'
figure_colours['chr_11'] = '#984ea3'
figure_colours['chr_5'] = '#ff7f00'
figure_colours['similar_sequence'] = '#ffff33'
figure_colours['hrp_genes'] = '#e41a1c'
figure_colours['rrna_genes'] = '#4daf4a'
figure_colours['other_genes'] = 'black'
figure_colours['pseudogenes'] = 'grey'
# Full figure

fig, axs = plt.subplots(11, 1, figsize=(9, 9), gridspec_kw={'height_ratios': [1, 7, 10, 6, 12, 1, 7, 32, 18, 12, 12], 'hspace': 0})

### HRP2

# Set the minimum and maximum positions for hrp2
min_pos = 1371500
max_pos = 1375500

## Subplot 1: Title and Chromosome Region

# Set the title
axs[0].set_title('Pf3D7_08_v3')
# Configure the x-axis properties
axs[0].set_xticks([])
axs[0].set_xlabel(None)
# Configure the y-axis properties
axs[0].set_yticks([])
# Set the x-axis limits to match the genomic region of 'Pf3D7_08_v3'
axs[0].set_xlim(int(df_chroms.loc['Pf3D7_08_v3', 'start']), int(df_chroms.loc['Pf3D7_08_v3', 'end']))
# Define the background colour of the subplot
axs[0].set_facecolor(figure_colours['chr_8_13'])
# Add a vertical span between 'min_pos' and 'max_pos' with a white background
axs[0].axvspan(min_pos, max_pos, color='white')

## Subplot 2: Chromosome Boundaries

# Configure x and y axis ticks
axs[1].set_xticks([])
axs[1].set_yticks([])
# Set the x-axis limits to match the genomic region of 'Pf3D7_08_v3'
axs[1].set_xlim(int(df_chroms.loc['Pf3D7_08_v3', 'start']), int(df_chroms.loc['Pf3D7_08_v3', 'end']))
# Define the y-axis limits
axs[1].set_ylim(0, 1)
# Plot horizontal lines marking the start and end positions of 'Pf3D7_08_v3'
axs[1].plot([int(df_chroms.loc['Pf3D7_08_v3', 'start']), min_pos], [0, 1], '-', color='black')
axs[1].plot([int(df_chroms.loc['Pf3D7_08_v3', 'end']), max_pos], [0, 1], '-', color='black')
# Hide the left and right spines for a cleaner appearance
axs[1].spines['left'].set_visible(False)
axs[1].spines['right'].set_visible(False)

# Subplot 3: Deletion Breakpoints by Country

# Create lists for y-axis labels and positions
ylabels = []
yposes = []
# Loop through the sorted df_hrp2
for ypos, row in df_hrp2.sort_values('breakpoint', ascending=False).iterrows():
    yposes.append(ypos)
    # Plot horizontal lines connecting min_pos to each hrp2 deletion breakpoint
    axs[2].plot((min_pos, row['breakpoint']), (ypos, ypos), linewidth=6, solid_capstyle='butt', color=figure_colours['chr_8_13'])
    # Plot circular markers for each breakpoint position
    axs[2].plot(row['breakpoint'] + np.arange(20*2, 101*2, 40*2), [ypos]*3, 'o', color=figure_colours['chr_8_13'])
    # Add country names to the y-labels list
    ylabels.append(f"{row['Countries']}")
# Set y-axis ticks and labels to correspond to country names
axs[2].set_yticks(yposes)
axs[2].set_yticklabels(ylabels)
# Set x-axis limits to cover the specified region between min_pos and max_pos
axs[2].set_xlim(min_pos, max_pos)
# Adjust y-axis limits to include space for data visualization
axs[2].set_ylim(min(yposes) - 0.5, max(yposes) + 0.5)
# Hide x-axis ticks for a cleaner appearance
axs[2].set_xticks([])

## Subplot 4: Genes on the x-axis

# Set the x-axis limits to cover the specified genomic region between min_pos and max_pos
axs[3].set_xlim(min_pos, max_pos)
# Set the y-axis limits to display a region between 0 and 1
axs[3].set_ylim(0, 1)
# Iterate through df_gff to process different gene annotations
for ix, row in df_gff.loc[
    (df_gff['chrom'] == 'Pf3D7_08_v3')
    & (df_gff['start'] <= max_pos)
    & (df_gff['end'] >= min_pos)
    & (df_gff['type'] == 'CDS')
].iterrows():
    # Determine the color based on the genomic position
    if row['start'] >= 1373212 and row['end'] <= 1376988:
        color = figure_colours['hrp_genes']
    else:
        color = figure_colours['other_genes']
    # Plot CDS features with specified line properties and color
    axs[3].plot((row['start'], row['end']), (0.75, 0.75), linewidth=6, solid_capstyle='butt', color=color)

# Iterate through polypeptide
for ix, row in df_gff.loc[
    (df_gff['chrom'] == 'Pf3D7_08_v3')
    & (df_gff['start'] <= max_pos)
    & (df_gff['end'] >= min_pos)
    & (df_gff['type'] == 'polypeptide')
].iterrows():
    if row['start'] >= 1373212 and row['end'] <= 1376988:
        color = figure_colours['hrp_genes']
    else:
        color = figure_colours['other_genes']
    # Plot polypeptide features with different line properties and color
    axs[3].plot((row['start'], row['end']), (0.75, 0.75), linewidth=2, solid_capstyle='butt', color=color)

# Iterate through pseudogene
for ix, row in df_gff.loc[
    (df_gff['chrom'] == 'Pf3D7_08_v3')
    & (df_gff['start'] <= max_pos)
    & (df_gff['end'] >= min_pos)
    & (df_gff['type'] == 'pseudogene')
].iterrows():
    # Plot pseudogene features with specified line properties and color
    axs[3].plot((row['start'], row['end']), (0.75, 0.75), linewidth=6, solid_capstyle='butt', color=figure_colours['pseudogenes'])

# Add text labels for specific genomic regions
axs[3].text((1375299 + 1374236) / 2, 0.25, '${hrp2}$', va='center', ha='center', size=8)
axs[3].text((1371847 + 1372720) / 2, 0.25, 'PF3D7_0831750 (pseudogene)', va='center', ha='center', size=8)

# Set y-axis ticks, labels, and x-axis ticks to maintain an organized representation
axs[3].set_yticks([])
axs[3].set_xticks([])
axs[3].set_ylabel('Genes', rotation=0, ha='right', va='center')

## Subplot 5: Ticks for Genomic Coordinates
# Set the x-axis limits to cover the specified genomic region between 'min_pos' and 'max_pos'
axs[4].set_xlim(min_pos, max_pos)
# Hide the left and right spines to create a cleaner appearance
axs[4].spines['left'].set_visible(False)
axs[4].spines['right'].set_visible(False)
# Configure the x-axis ticks at specific positions
axs[4].set_xticks([1372000, 1373000, 1374000, 1375000])
# Label the x-axis ticks with corresponding values
axs[4].set_xticklabels(["1.372 Mbp", "1.373 Mbp", "1.374 Mbp", "1.375 Mbp"])
# Remove y-axis ticks to maintain a clean look
axs[4].set_yticks([])
# Position x-axis ticks at the top
axs[4].xaxis.tick_top()
# Adjust the direction of x-axis ticks and add padding
axs[4].tick_params(axis="x", direction="in", pad=-20)

### HRP3

# Set the minimum and maximum positions for hrp3
min_pos = 2795000
max_pos = 2845000

## Subplot 6: Title and Chromosome Region

# Set the title
axs[5].set_title('Pf3D7_13_v3')
# Hide the x-axis ticks and labels
axs[5].set_xticks([])
axs[5].set_xlabel(None)
# Hide the y-axis ticks
axs[5].set_yticks([])
# Set the x-axis limits to cover the specified genomic region between the start and end positions of chromosome
axs[5].set_xlim(int(df_chroms.loc['Pf3D7_13_v3', 'start']), int(df_chroms.loc['Pf3D7_13_v3', 'end']))
# Customize the background color of the subplot
axs[5].set_facecolor(figure_colours['chr_8_13'])
# Create a white span on the plot to mark the genomic region of interest
axs[5].axvspan(min_pos, max_pos, color='white')

## Subplot 7: Chromosome Boundaries

# Hide the x-axis ticks and labels
axs[6].set_xticks([])
axs[6].set_yticks([])
# Set the x-axis limits to cover the specified genomic region between the start and end positions of 'Pf3D7_13_v3'
axs[6].set_xlim(int(df_chroms.loc['Pf3D7_13_v3', 'start']), int(df_chroms.loc['Pf3D7_13_v3', 'end']))
# Set the y-axis limits to cover a range from 0 to 1
axs[6].set_ylim(0, 1)
# Create lines to mark the start and end positions of the chromosome region with black color
axs[6].plot([int(df_chroms.loc['Pf3D7_13_v3', 'start']), min_pos], [0, 1], '-', color='black')
axs[6].plot([int(df_chroms.loc['Pf3D7_13_v3', 'end']), max_pos], [0, 1], '-', color='black')
# Hide the left and right spines to create a cleaner appearance
axs[6].spines['left'].set_visible(False)
axs[6].spines['right'].set_visible(False)

## Subplot 8: Deletion Breakpoints by Country

# Initialize lists for labels and positions
ylabels = []
yposes = []
# Iterate through df_hrp3 sorted by breakpoints
for ypos, row in df_hrp3.sort_values('breakpoint', ascending=False).iterrows():
    yposes.append(ypos)

    # Check the deletion type and apply different plotting styles accordingly
    if row['Deletion type'] == 'Telomere healing':
        breakpoint = int(row['breakpoint'])

        # Plot the deletion breakpoint and add circular markers
        axs[7].plot((min_pos, breakpoint), (ypos, ypos), linewidth=6, solid_capstyle='butt', color=figure_colours['chr_8_13'])
        axs[7].plot(int(row['breakpoint']) + np.arange(20*25, 101*25, 40*25), [ypos]*3, 'o', color=figure_colours['chr_8_13'])
        ylabels.append(f"{row['Countries']}")

    if row['Deletion type'] == 'Chromosome 5 recombination':
        breakpoint_start, breakpoint_end = [int(x) for x in row['breakpoint'].split('-')]

        # Plot the breakpoints for Chromosome 5 recombination
        axs[7].plot((min_pos, breakpoint_start), (ypos, ypos), linewidth=6, solid_capstyle='butt', color=figure_colours['chr_8_13'])
        axs[7].plot((breakpoint_end, max_pos), (ypos, ypos), linewidth=6, solid_capstyle='butt', color=figure_colours['chr_5'])
        ylabels.append(f"{row['Countries']}")

    if row['Deletion type'] == 'Chromosome 11 recombination':
        breakpoint_start, breakpoint_end = [int(x) for x in row['breakpoint'].split('-')]

        # Plot the breakpoints for Chromosome 11 recombination
        axs[7].plot((min_pos, breakpoint_start), (ypos, ypos), linewidth=6, solid_capstyle='butt', color=figure_colours['chr_8_13'])
        axs[7].plot((breakpoint_end, max_pos), (ypos, ypos), linewidth=6, solid_capstyle='butt', color=figure_colours['chr_11'])
        axs[7].plot((breakpoint_start, breakpoint_end), (ypos, ypos), linewidth=6, solid_capstyle='butt', color=figure_colours['similar_sequence'])
        ylabels.append(f"14 countries (151)")

# Set the y-axis ticks and labels
axs[7].set_yticks(yposes)
axs[7].set_yticklabels(ylabels)
# Set x-axis limits, y-axis limits, and remove x-axis ticks
axs[7].set_xlim(min_pos, max_pos)
axs[7].set_ylim(min(yposes)-0.5, max(yposes)+0.5)
axs[7].set_xticks([])

## Subplot 9: Genes on the x-axis

bar_pos = 0.9
text_pos = 0.8

# Set the x-axis and y-axis limits
axs[8].set_xlim(min_pos, max_pos)
axs[8].set_ylim(0, 1)
# Iterate through GFF data filtering for relevant features
# Plot CDS features in distinct colors, depending on their location
for ix, row in df_gff.loc[
    ( df_gff['chrom'] == 'Pf3D7_13_v3' )
    & ( df_gff['start'] <= max_pos )
    & ( df_gff['end'] >= min_pos )
    & ( df_gff['type'] == 'CDS' )
].iterrows():
    if row['start'] >= 2840727 and row['end'] <= 2841703:
        color = figure_colours['hrp_genes']
    else:
        color = figure_colours['other_genes']
    axs[8].plot((row['start'], row['end']), (bar_pos, bar_pos), linewidth=6, solid_capstyle='butt', color=color)

# Plot polypeptide features with similar style, with specific colors
for ix, row in df_gff.loc[
    ( df_gff['chrom'] == 'Pf3D7_13_v3' )
    & ( df_gff['start'] <= max_pos )
    & ( df_gff['end'] >= min_pos )
    & ( df_gff['type'] == 'polypeptide' )
].iterrows():
    if row['start'] >= 2840727 and row['end'] <= 2841703:
        color = figure_colours['hrp_genes']
    else:
        color = figure_colours['other_genes']
    axs[8].plot((row['start'], row['end']), (bar_pos, bar_pos), linewidth=2, solid_capstyle='butt', color=color)

# Plot rRNA and pseudogene features
for ix, row in df_gff.loc[
    ( df_gff['chrom'] == 'Pf3D7_13_v3' )
    & ( df_gff['start'] <= max_pos )
    & ( df_gff['end'] >= min_pos )
    & ( df_gff['type'] == 'rRNA' )
].iterrows():
    axs[8].plot((row['start'], row['end']), (bar_pos, bar_pos), linewidth=6, solid_capstyle='butt', color=figure_colours['rrna_genes'])
# Annotate specific gene positions with labels
axs[8].text((2796119 + 2797144) / 2, text_pos, 'PF3D7_1370800', va='top', ha='center', size=8, rotation=90)
axs[8].text((2797507 + 2798103) / 2, text_pos, 'PF3D7_1370900', va='top', ha='center', size=8, rotation=90)
axs[8].text((2800004 + 2802154) / 2, text_pos, '18S rRNA', va='top', ha='center', size=8, rotation=90)
axs[8].text((2802527 + 2802686) / 2, text_pos, '5.8S rRNA', va='top', ha='center', size=8, rotation=90)
axs[8].text((2802945 + 2807159) / 2, text_pos, '28S rRNA', va='top', ha='center', size=8, rotation=90)
axs[8].text((2808200 + 2810256) / 2, text_pos, 'PF3D7_1371500', va='top', ha='center', size=8, rotation=90)
axs[8].text((2811706 + 2820270) / 2, text_pos, '${ebl1}$\n(pseudogene)', va='top', ha='center', size=8, rotation=90)
axs[8].text((2821078 + 2823292) / 2, text_pos, 'PF3D7_1371700', va='top', ha='center', size=8, rotation=90)
axs[8].text((2824302 + 2825852) / 2, text_pos, 'PF3D7_1371800', va='top', ha='center', size=8, rotation=90)
axs[8].text((2829856 + 2830669) / 2, text_pos, 'PF3D7_1371900', va='top', ha='center', size=8, rotation=90)
axs[8].text((2832952 + 2834322) / 2, text_pos, 'PF3D7_1372000', va='top', ha='center', size=8, rotation=90)
axs[8].text((2837053 + 2839058) / 2, text_pos, '${gexp04}$', va='top', ha='center', size=8, rotation=90)
axs[8].text((2840727 + 2841703) / 2, text_pos, '${hrp3}$', va='top', ha='center', size=8, rotation=90)
# Set y-axis ticks, customize x-axis labels, and add x-axis label and y-axis label
axs[8].set_yticks([])
axs[8].set_xticklabels(["", "2.80 Mbp", "2.81 Mbp", "2.82 Mbp", "2.83 Mbp", "2.84 Mbp"])
axs[8].set_xlabel('Genomic coordinates')
axs[8].set_ylabel('Genes', rotation=0, ha='right', va='center')

## Subplot 10: Ticks for Genomic Coordinates

# Set the x-axis limits
axs[9].set_xlim(min_pos, max_pos)
# Hide left and right spines to make the plot clean
axs[9].spines['left'].set_visible(False)
axs[9].spines['right'].set_visible(False)
# Define custom tick positions on the x-axis
axs[9].set_xticks([2800000, 2810000, 2820000, 2830000, 2840000])
# Set custom tick labels corresponding to genomic positions
axs[9].set_xticklabels(["2.80 Mbp", "2.81 Mbp", "2.82 Mbp", "2.83 Mbp", "2.84 Mbp"])
# Hide the y-axis ticks
axs[9].set_yticks([])
# Move the x-axis ticks to the top
axs[9].xaxis.tick_top()
# Adjust the direction and padding of x-axis ticks
axs[9].tick_params(axis="x", direction="in", pad=-20)
# Add a label to the x-axis indicating 'Genomic coordinates' at the center of the x-axis
axs[9].text((min_pos + max_pos) / 2, 0.5, 'Genomic coordinates', ha='center')

## Subplot 11: Legend

# Hide left, right, and bottom spines
axs[10].spines['left'].set_visible(False)
axs[10].spines['right'].set_visible(False)
axs[10].spines['bottom'].set_visible(False)
# Hide both x and y-axis ticks
axs[10].set_xticks([])
axs[10].set_yticks([])
# Create legend elements with color bars and labels
axs[10].plot((0, 0.1), (0.5, 0.5), linewidth=6, solid_capstyle='butt', color=figure_colours['chr_5'])
axs[10].text(0.12, 0.5, 'Chr 5 sequence', va='center')
axs[10].plot((0, 0.1), (0.25, 0.25), linewidth=6, solid_capstyle='butt', color=figure_colours['chr_11'])
axs[10].text(0.12, 0.25, 'Chr 11 sequence', va='center')
axs[10].plot((0.38, 0.48), (0.5, 0.5), linewidth=6, solid_capstyle='butt', color=figure_colours['similar_sequence'])
axs[10].text(0.50, 0.5, 'Highly similar sequence on chr 13 and chr 11', va='center')
axs[10].plot(np.arange(0.405, 0.456, 0.025), [0.25]*3, 'o', color=figure_colours['chr_8_13'])
axs[10].text(0.50, 0.25, 'Telomeric repeat sequence', va='center')
# Set the limits for this subplot
axs[10].set_xlim(0, 1)
axs[10].set_ylim(0, 1)

# Ensure the figure layout is tidy
fig.tight_layout()
<ipython-input-16-28c0075a88af>:278: UserWarning: FixedFormatter should only be used together with FixedLocator
  axs[8].set_xticklabels(["", "2.80 Mbp", "2.81 Mbp", "2.82 Mbp", "2.83 Mbp", "2.84 Mbp"])
../../_images/visualise_hrp2_hrp3_deletions_39_1.png

Figure Legend: HRP deletion breakpoints. We see five different breakpoints resulting in the deletion of hrp2. Four of these are within exon 2 of the gene whereas the fifth is found between hrp2 and the pseudogene PF3D7_0831750. For all five events we see evidence of telomeric healing from reads that contain part Pf3D7_08_v3 sequence and part telomeric repeat sequence (GGGTTCA/GGGTTTA). We see 16 different breakpoints resulting in the deletion of hrp3. For fourteen of these we see evidence of telomeric healing. Note that many of these events result in the deletion of other genes in addition to hrp3. For twenty samples from Cambodia and a single sample from Vietnam we see evidence of a recombination with chromosome 5 which results in a hybrid chromosome comprising mostly chromosome 13 sequence but a small inverted section of an internal portion of chromosome 5 containing the gene mdr1. We also see evidence of a recombination with chromosome 11 which results in a hybrid chromosome comprising mostly chromosome 13 sequence but also a section of the 3’ end of chromosome 11. This is the most common deletion type, being seen in 151 samples from 14 different countries. Because the recombination occurs between highly similar sequences of a set of three orthologous ribosomal RNA genes found on both chromosomes, it is not possible to identify the exact breakpoints.

Save Figure

# You will need to authorise Google Colab access to Google Drive
drive.mount('/content/drive')
# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else

fig.savefig('/content/drive/My Drive/HRP_Deletions_Figure.pdf')
fig.savefig('/content/drive/My Drive/HRP_Deletions_Figure.png', dpi=480) # increase the dpi for higher resolution