Explore hrp2 and hrp3 deletion breakpoints

Introduction

This notebook will create a figure of histidine-rich protein (hrp) 2 and hrp3 deletion breakpoints.

hrp2 and hrp3 are genes located in subtelomeric regions of the genome with very high levels of natural variation. Deletion in those genes can cause failure of rapid diagnostic tests and is therefore important to monitor.

Deletion is a genetic event in which a segment of DNA is entirely removed or missing. In this context, ‘breakpoints’ denote specific locations on the chromosome where such deletions take place.

This notebook should take approximately two minutes to run.

Setup

Install and import the malariagen Python package:

!pip install malariagen_data -q --no-warn-conflicts
import malariagen_data
  Installing build dependencies ... ?25l?25hdone
  Getting requirements to build wheel ... ?25l?25hdone
  Preparing metadata (pyproject.toml) ... ?25l?25hdone
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.0/4.0 MB 33.3 MB/s eta 0:00:00
?25h  Preparing metadata (setup.py) ... ?25l?25hdone
  Preparing metadata (setup.py) ... ?25l?25hdone
  Preparing metadata (setup.py) ... ?25l?25hdone
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71.7/71.7 kB 5.1 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 775.9/775.9 kB 33.6 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 25.9/25.9 MB 66.8 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.7/8.7 MB 93.8 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 210.6/210.6 kB 14.3 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 80.2 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.3/3.3 MB 64.3 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.0/8.0 MB 86.1 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.3/78.3 kB 5.4 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 101.7/101.7 kB 7.6 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.9/8.9 MB 93.8 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 228.0/228.0 kB 16.4 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.4/13.4 MB 92.7 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 60.1 MB/s eta 0:00:00
?25h  Building wheel for malariagen_data (pyproject.toml) ... ?25l?25hdone
  Building wheel for dash-cytoscape (setup.py) ... ?25l?25hdone
  Building wheel for stringcase (setup.py) ... ?25l?25hdone
  Building wheel for asciitree (setup.py) ... ?25l?25hdone

Import required python libraries that are installed at colab by default.

import numpy as np
import pandas as pd
import collections
import matplotlib.pyplot as plt
from google.colab import drive

Access Pf8 Data

We use the malariagen data package to load the release data.

release_data = malariagen_data.Pf8()
df_samples = release_data.sample_metadata()

hrp2 & hrp3 Deletions

We additionally require list of deletion and breakpoint locations within the hrp2 and hrp3 across 24,409 QC-pass samples. We can access this data along with other copy-number variation (CNV) calls from Sanger cloud storage.

# Read data directly from url
hrp_calls_fn = pd.read_csv('https://pf8-release.cog.sanger.ac.uk/Pf8_cnv_calls.tsv', sep='\t') 

# Print the shape and first rows
print(hrp_calls_fn.shape)
hrp_calls_fn.head()
(24409, 29)
Sample CRT_uncurated_coverage_only CRT_curated_coverage_only CRT_breakpoint CRT_faceaway_only CRT_final_amplification_call GCH1_uncurated_coverage_only GCH1_curated_coverage_only GCH1_breakpoint GCH1_faceaway_only ... PM2_PM3_faceaway_only PM2_PM3_final_amplification_call HRP2_uncurated_coverage_only HRP2_breakpoint HRP2_deletion_type HRP2_final_deletion_call HRP3_uncurated_coverage_only HRP3_breakpoint HRP3_deletion_type HRP3_final_deletion_call
0 FP0008-C 0 0 - -1 0 -1 -1 - 0 ... 0 0 0 - - 0 0 - - 0
1 FP0009-C 0 0 - 0 0 0 0 - 0 ... 0 0 0 - - 0 0 - - 0
2 FP0010-CW -1 -1 - 0 0 -1 -1 - 0 ... 0 0 -1 - - -1 -1 - - -1
3 FP0011-CW -1 -1 - -1 -1 -1 -1 - 0 ... 0 0 -1 - - -1 -1 - - -1
4 FP0012-CW -1 -1 - 0 0 0 0 - 0 ... 0 0 -1 - - -1 0 - - 0

5 rows × 29 columns

Now, let’s merge hrp_calls_fn with df_samples which contains various metadata of Pf8 samples.

# Merge df_samples with hrp_calls_fn
df_samples =  df_samples.merge(hrp_calls_fn, on ='Sample')

3D7 Reference Genome Annotation

We would like to know where breakpoints occur in the genome, such as whether they fall within protein-coding regions or mRNA regions.

To facilitate this, we will use the 3D7 reference genome annotation. This data is in a tabular format where each row specifies a genomic feature, such as an exon, mRNA, or protein-coding gene, along with its corresponding coordinates (start and end columns). For more information about the annotation data format, refer to this wiki page.

This dataset is available through the malariagen_data package.

df_gff= release_data.genome_features()

# print first rows
df_gff.head()
contig source type start end score strand phase ID Parent Name
0 Pf3D7_13_v3 VEuPathDB protein_coding_gene 624510 626292 NaN + NaN PF3D7_1314600 NaN LipL1
1 Pf3D7_13_v3 VEuPathDB mRNA 624510 626292 NaN + NaN PF3D7_1314600.1 PF3D7_1314600 NaN
2 Pf3D7_13_v3 VEuPathDB exon 624510 626292 NaN + NaN exon_PF3D7_1314600.1-E1 PF3D7_1314600.1 NaN
3 Pf3D7_13_v3 VEuPathDB CDS 624785 626011 NaN + 0.0 PF3D7_1314600.1-p1-CDS1 PF3D7_1314600.1 NaN
4 Pf3D7_13_v3 VEuPathDB five_prime_UTR 624510 624784 NaN + NaN utr_PF3D7_1314600.1_1 PF3D7_1314600.1 NaN

Figure Preparation

We need to find the start and end positions of chromosomes (‘contig’ column in the dataframe) to draw gene annotations in the figure that we are going to create.

# Find start and end positions for each chromosome by grouping chromosome coordinates
df_chroms = df_gff.groupby('contig').agg({'start': 'min', 'end': 'max'}).reset_index()
# Set 'chrom' as the index
df_chroms.set_index('contig', inplace=True)
df_chroms
start end
contig
Pf3D7_01_v3 29510 614893
Pf3D7_02_v3 25232 923648
Pf3D7_03_v3 36965 1038254
Pf3D7_04_v3 28706 1180226
Pf3D7_05_v3 20929 1342964
Pf3D7_06_v3 653 1382627
Pf3D7_07_v3 20307 1426234
Pf3D7_08_v3 21361 1443449
Pf3D7_09_v3 20080 1503336
Pf3D7_10_v3 28490 1649948
Pf3D7_11_v3 24160 2035886
Pf3D7_12_v3 16973 2248962
Pf3D7_13_v3 21364 2892340
Pf3D7_14_v3 1393 3291501
Pf3D7_API_v3 1 34225
Pf3D7_MIT_v3 3 5954

The next question is: How many samples have deletions in each country?

# This function returns samples with deletion for each country.

def breakpoint_agg(x):
    names = collections.OrderedDict()
    names['Countries'] = ''
    countries = []
    # Loop over each country
    # Count non-zero samples (with deletion)
    for country in x['Country'].unique():
        countries.append(f"{country} ({np.count_nonzero(x['Country'] == country)})")
    # Join together country name and number
    names['Countries'] = ', '.join(countries)
    names['Samples with deletion'] = len(x)
    return pd.Series(names)

We will apply breakpoint_agg function to df_samples separately for hrp2 and hrp3.

Additionally, we will find genomic coordinates at the edges for mapping in the figure.

# Group samples by Deletion type and HRP3_breakpoint
# Apply breakpoint_agg to count samples with deletion in each country
df_hrp2 = (
    df_samples[
        df_samples['QC pass']
        & ( df_samples['HRP2_final_deletion_call'] == 1 )
    ]
    .assign(Gene='$hrp2$')
    .rename(columns={'HRP2_deletion_type': 'Deletion type'})
    .groupby(['Deletion type', 'HRP2_breakpoint'])
    .apply(breakpoint_agg, include_groups=False)
    .reset_index()
)
# Seperate coordinate value from chromosome
df_hrp2['breakpoint'] = df_hrp2['HRP2_breakpoint'].apply(lambda x: int(x.split(':')[1]))

# Print min and max coordinates
print(f"HRP2 min breakpoint = {df_hrp2['breakpoint'].min()}")
print(f"HRP2 max breakpoint = {df_hrp2['breakpoint'].max()}")
df_hrp2
HRP2 min breakpoint = 1373732
HRP2 max breakpoint = 1374986
Deletion type HRP2_breakpoint Countries Samples with deletion breakpoint
0 Telomere healing Pf3D7_08_v3:1373732 Cambodia (1) 1 1373732
1 Telomere healing Pf3D7_08_v3:1374280 Sudan (1) 1 1374280
2 Telomere healing Pf3D7_08_v3:1374462 Indonesia (5) 5 1374462
3 Telomere healing Pf3D7_08_v3:1374932 Peru (2) 2 1374932
4 Telomere healing Pf3D7_08_v3:1374986 Peru (6) 6 1374986

A repeat of the same look-up for hrp3.

# Group samples by Deletion type and HRP3_breakpoint
# Apply breakpoint_agg to count samples with deletion in each country
df_hrp3 = (
    df_samples[
        df_samples['QC pass']
        & ( df_samples['HRP3_final_deletion_call'] == 1 )
    ]
    .rename(columns={'HRP3_deletion_type': 'Deletion type'})
    .groupby(['Deletion type', 'HRP3_breakpoint'])
    .apply(breakpoint_agg, include_groups=False)
    .reset_index()
)
# Seperate coordinate value from chromosome
df_hrp3['breakpoint'] = df_hrp3['HRP3_breakpoint'].apply(lambda x: x.split(':')[1])

# Print min and max coordinates
print(f"HRP3 min breakpoint = {df_hrp3['breakpoint'].min()}")
print(f"HRP3 max breakpoint = {df_hrp3['breakpoint'].max()}")
df_hrp3
HRP3 min breakpoint = 2800004-2807159
HRP3 max breakpoint = 2841120
Deletion type HRP3_breakpoint Countries Samples with deletion breakpoint
0 Chromosome 11 recombination Pf3D7_13_v3:2800004-2807159 Thailand (1), Ghana (3), Indonesia (38), Peru ... 162 2800004-2807159
1 Chromosome 5 recombination Pf3D7_13_v3:2835587-2835612 Cambodia (20), Vietnam (1) 21 2835587-2835612
2 Telomere healing Pf3D7_13_v3:2811525 India (1) 1 2811525
3 Telomere healing Pf3D7_13_v3:2812344 Sudan (1) 1 2812344
4 Telomere healing Pf3D7_13_v3:2815249 Tanzania (1) 1 2815249
5 Telomere healing Pf3D7_13_v3:2822480 Ghana (1) 1 2822480
6 Telomere healing Pf3D7_13_v3:2823645 Kenya (1) 1 2823645
7 Telomere healing Pf3D7_13_v3:2830952 Cambodia (7) 7 2830952
8 Telomere healing Pf3D7_13_v3:2834604 Vietnam (1) 1 2834604
9 Telomere healing Pf3D7_13_v3:2835532 Thailand (1) 1 2835532
10 Telomere healing Pf3D7_13_v3:2835899 Ghana (1) 1 2835899
11 Telomere healing Pf3D7_13_v3:2837144 Vietnam (8) 8 2837144
12 Telomere healing Pf3D7_13_v3:2837392 Cambodia (2), Laos (1) 3 2837392
13 Telomere healing Pf3D7_13_v3:2838654 Indonesia (2) 2 2838654
14 Telomere healing Pf3D7_13_v3:2840859 Gambia (2) 2 2840859
15 Telomere healing Pf3D7_13_v3:2841024 Thailand (1) 1 2841024
16 Telomere healing Pf3D7_13_v3:2841120 Indonesia (1) 1 2841120

It seems that Chromosome 11 recombination breakpoints are observed in multiple countries which motivates us to print the full list of countries.

# Locate the first row by using the index
df_hrp3.iloc[0]['Countries']
'Thailand (1), Ghana (3), Indonesia (38), Peru (15), Bangladesh (1), Vietnam (2), Colombia (50), Ethiopia (9), Senegal (6), Laos (19), Cambodia (3), Sudan (6), Mali (1), Gambia (7), Kenya (1)'

Given the fact that many of these events result in the deletion of other genes in addition to hrp2 and hrp3.

We could have a look at which genes are present within the range of breakpoints before including them in the plot.

# Genes in hrp2 breakpoints
df_gff.loc[
    ( df_gff['contig'] == 'Pf3D7_08_v3' )
    & ( df_gff['start'] <= 1375500 )
    & ( df_gff['end'] >= 1364000 )
]
contig source type start end score strand phase ID Parent Name
17090 Pf3D7_08_v3 VEuPathDB protein_coding_gene 1373212 1376988 NaN - NaN PF3D7_0831800 NaN HRP2
17091 Pf3D7_08_v3 VEuPathDB mRNA 1373212 1376988 NaN - NaN PF3D7_0831800.1 PF3D7_0831800 NaN
17092 Pf3D7_08_v3 VEuPathDB exon 1373212 1375084 NaN - NaN exon_PF3D7_0831800.1-E2 PF3D7_0831800.1 NaN
17093 Pf3D7_08_v3 VEuPathDB exon 1375231 1376988 NaN - NaN exon_PF3D7_0831800.1-E1 PF3D7_0831800.1 NaN
17094 Pf3D7_08_v3 VEuPathDB CDS 1374236 1375084 NaN - 0.0 PF3D7_0831800.1-p1-CDS2 PF3D7_0831800.1 NaN
17095 Pf3D7_08_v3 VEuPathDB CDS 1375231 1375299 NaN - 0.0 PF3D7_0831800.1-p1-CDS1 PF3D7_0831800.1 NaN
17096 Pf3D7_08_v3 VEuPathDB three_prime_UTR 1373212 1374235 NaN - NaN utr_PF3D7_0831800.1_1 PF3D7_0831800.1 NaN
17097 Pf3D7_08_v3 VEuPathDB five_prime_UTR 1375300 1376988 NaN - NaN utr_PF3D7_0831800.1_2 PF3D7_0831800.1 NaN
27865 Pf3D7_08_v3 VEuPathDB protein_coding_gene 1364640 1369862 NaN - NaN PF3D7_0831700 NaN HSP70x
27866 Pf3D7_08_v3 VEuPathDB mRNA 1364640 1369862 NaN - NaN PF3D7_0831700.1 PF3D7_0831700 NaN
27867 Pf3D7_08_v3 VEuPathDB exon 1364640 1367640 NaN - NaN exon_PF3D7_0831700.1-E2 PF3D7_0831700.1 NaN
27868 Pf3D7_08_v3 VEuPathDB exon 1368649 1369862 NaN - NaN exon_PF3D7_0831700.1-E1 PF3D7_0831700.1 NaN
27869 Pf3D7_08_v3 VEuPathDB CDS 1365467 1367506 NaN - 0.0 PF3D7_0831700.1-p1-CDS1 PF3D7_0831700.1 NaN
27870 Pf3D7_08_v3 VEuPathDB three_prime_UTR 1364640 1365466 NaN - NaN utr_PF3D7_0831700.1_1 PF3D7_0831700.1 NaN
27871 Pf3D7_08_v3 VEuPathDB five_prime_UTR 1367507 1367640 NaN - NaN utr_PF3D7_0831700.1_2 PF3D7_0831700.1 NaN
27872 Pf3D7_08_v3 VEuPathDB five_prime_UTR 1368649 1369862 NaN - NaN utr_PF3D7_0831700.1_3 PF3D7_0831700.1 NaN
28984 Pf3D7_08_v3 VEuPathDB pseudogene 1371847 1372720 NaN + NaN PF3D7_0831750 NaN NaN
28985 Pf3D7_08_v3 VEuPathDB pseudogenic_transcript 1371847 1372720 NaN + NaN PF3D7_0831750.1 PF3D7_0831750 NaN
28986 Pf3D7_08_v3 VEuPathDB exon 1371847 1372100 NaN + NaN exon_PF3D7_0831750.1-E1 PF3D7_0831750.1 NaN
28987 Pf3D7_08_v3 VEuPathDB exon 1372103 1372223 NaN + NaN exon_PF3D7_0831750.1-E2 PF3D7_0831750.1 NaN
28988 Pf3D7_08_v3 VEuPathDB exon 1372225 1372291 NaN + NaN exon_PF3D7_0831750.1-E3 PF3D7_0831750.1 NaN
28989 Pf3D7_08_v3 VEuPathDB exon 1372294 1372577 NaN + NaN exon_PF3D7_0831750.1-E4 PF3D7_0831750.1 NaN
28990 Pf3D7_08_v3 VEuPathDB exon 1372579 1372667 NaN + NaN exon_PF3D7_0831750.1-E5 PF3D7_0831750.1 NaN
28991 Pf3D7_08_v3 VEuPathDB exon 1372669 1372720 NaN + NaN exon_PF3D7_0831750.1-E6 PF3D7_0831750.1 NaN
28992 Pf3D7_08_v3 VEuPathDB CDS 1371847 1372100 NaN + 0.0 PF3D7_0831750.1-p1-CDS1 PF3D7_0831750.1 NaN
28993 Pf3D7_08_v3 VEuPathDB CDS 1372103 1372223 NaN + 1.0 PF3D7_0831750.1-p1-CDS2 PF3D7_0831750.1 NaN
28994 Pf3D7_08_v3 VEuPathDB CDS 1372225 1372291 NaN + 0.0 PF3D7_0831750.1-p1-CDS3 PF3D7_0831750.1 NaN
28995 Pf3D7_08_v3 VEuPathDB CDS 1372294 1372403 NaN + 2.0 PF3D7_0831750.1-p1-CDS4 PF3D7_0831750.1 NaN
28996 Pf3D7_08_v3 VEuPathDB three_prime_UTR 1372404 1372577 NaN + NaN utr_PF3D7_0831750.1_1 PF3D7_0831750.1 NaN
28997 Pf3D7_08_v3 VEuPathDB three_prime_UTR 1372579 1372667 NaN + NaN utr_PF3D7_0831750.1_2 PF3D7_0831750.1 NaN
28998 Pf3D7_08_v3 VEuPathDB three_prime_UTR 1372669 1372720 NaN + NaN utr_PF3D7_0831750.1_3 PF3D7_0831750.1 NaN
# Genes in hrp3 breakpoints
pd.options.display.max_rows = 100
df_gff.loc[
    ( df_gff['contig'] == 'Pf3D7_13_v3' )
    & ( df_gff['start'] <= 2845000 )
    & ( df_gff['end'] >= 2795000 )
]
contig source type start end score strand phase ID Parent Name
17986 Pf3D7_13_v3 VEuPathDB pseudogene 2811706 2820270 NaN + NaN PF3D7_1371600 NaN EBL1
17987 Pf3D7_13_v3 VEuPathDB pseudogenic_transcript 2811706 2820270 NaN + NaN PF3D7_1371600.1 PF3D7_1371600 NaN
17988 Pf3D7_13_v3 VEuPathDB exon 2811706 2812263 NaN + NaN exon_PF3D7_1371600.1-E1 PF3D7_1371600.1 NaN
17989 Pf3D7_13_v3 VEuPathDB exon 2812266 2819628 NaN + NaN exon_PF3D7_1371600.1-E2 PF3D7_1371600.1 NaN
17990 Pf3D7_13_v3 VEuPathDB exon 2819764 2819851 NaN + NaN exon_PF3D7_1371600.1-E3 PF3D7_1371600.1 NaN
17991 Pf3D7_13_v3 VEuPathDB exon 2820015 2820088 NaN + NaN exon_PF3D7_1371600.1-E4 PF3D7_1371600.1 NaN
17992 Pf3D7_13_v3 VEuPathDB exon 2820227 2820270 NaN + NaN exon_PF3D7_1371600.1-E5 PF3D7_1371600.1 NaN
17993 Pf3D7_13_v3 VEuPathDB CDS 2811706 2812263 NaN + 0.0 PF3D7_1371600.1-p1-CDS1 PF3D7_1371600.1 NaN
17994 Pf3D7_13_v3 VEuPathDB CDS 2812266 2815526 NaN + 0.0 PF3D7_1371600.1-p1-CDS2 PF3D7_1371600.1 NaN
17995 Pf3D7_13_v3 VEuPathDB three_prime_UTR 2815527 2819628 NaN + NaN utr_PF3D7_1371600.1_1 PF3D7_1371600.1 NaN
17996 Pf3D7_13_v3 VEuPathDB three_prime_UTR 2819764 2819851 NaN + NaN utr_PF3D7_1371600.1_2 PF3D7_1371600.1 NaN
17997 Pf3D7_13_v3 VEuPathDB three_prime_UTR 2820015 2820088 NaN + NaN utr_PF3D7_1371600.1_3 PF3D7_1371600.1 NaN
17998 Pf3D7_13_v3 VEuPathDB three_prime_UTR 2820227 2820270 NaN + NaN utr_PF3D7_1371600.1_4 PF3D7_1371600.1 NaN
28226 Pf3D7_13_v3 VEuPathDB protein_coding_gene 2835756 2839580 NaN + NaN PF3D7_1372100 NaN GEXP04
28227 Pf3D7_13_v3 VEuPathDB mRNA 2835756 2839580 NaN + NaN PF3D7_1372100.1 PF3D7_1372100 NaN
28228 Pf3D7_13_v3 VEuPathDB exon 2835756 2837136 NaN + NaN exon_PF3D7_1372100.1-E1 PF3D7_1372100.1 NaN
28229 Pf3D7_13_v3 VEuPathDB exon 2837313 2839580 NaN + NaN exon_PF3D7_1372100.1-E2 PF3D7_1372100.1 NaN
28230 Pf3D7_13_v3 VEuPathDB CDS 2837053 2837136 NaN + 0.0 PF3D7_1372100.1-p1-CDS1 PF3D7_1372100.1 NaN
28231 Pf3D7_13_v3 VEuPathDB CDS 2837313 2839058 NaN + 0.0 PF3D7_1372100.1-p1-CDS2 PF3D7_1372100.1 NaN
28232 Pf3D7_13_v3 VEuPathDB five_prime_UTR 2835756 2837052 NaN + NaN utr_PF3D7_1372100.1_1 PF3D7_1372100.1 NaN
28233 Pf3D7_13_v3 VEuPathDB three_prime_UTR 2839059 2839580 NaN + NaN utr_PF3D7_1372100.1_2 PF3D7_1372100.1 NaN
30629 Pf3D7_13_v3 VEuPathDB ncRNA_gene 2796119 2797144 NaN + NaN PF3D7_1370800 NaN NaN
30630 Pf3D7_13_v3 VEuPathDB ncRNA 2796119 2797144 NaN + NaN PF3D7_1370800.1 PF3D7_1370800 NaN
30631 Pf3D7_13_v3 VEuPathDB exon 2796119 2797144 NaN + NaN exon_PF3D7_1370800.1-E1 PF3D7_1370800.1 NaN
31628 Pf3D7_13_v3 VEuPathDB ncRNA_gene 2797507 2798103 NaN + NaN PF3D7_1370900 NaN NaN
31629 Pf3D7_13_v3 VEuPathDB ncRNA 2797507 2798103 NaN + NaN PF3D7_1370900.1 PF3D7_1370900 NaN
31630 Pf3D7_13_v3 VEuPathDB exon 2797507 2798103 NaN + NaN exon_PF3D7_1370900.1-E1 PF3D7_1370900.1 NaN
34729 Pf3D7_13_v3 VEuPathDB protein_coding_gene 2808200 2810256 NaN - NaN PF3D7_1371500 NaN NaN
34730 Pf3D7_13_v3 VEuPathDB mRNA 2808200 2810256 NaN - NaN PF3D7_1371500.1 PF3D7_1371500 NaN
34731 Pf3D7_13_v3 VEuPathDB exon 2808200 2809000 NaN - NaN exon_PF3D7_1371500.1-E2 PF3D7_1371500.1 NaN
34732 Pf3D7_13_v3 VEuPathDB exon 2809154 2810256 NaN - NaN exon_PF3D7_1371500.1-E1 PF3D7_1371500.1 NaN
34733 Pf3D7_13_v3 VEuPathDB CDS 2808563 2809000 NaN - 0.0 PF3D7_1371500.1-p1-CDS2 PF3D7_1371500.1 NaN
34734 Pf3D7_13_v3 VEuPathDB CDS 2809154 2809222 NaN - 0.0 PF3D7_1371500.1-p1-CDS1 PF3D7_1371500.1 NaN
34735 Pf3D7_13_v3 VEuPathDB three_prime_UTR 2808200 2808562 NaN - NaN utr_PF3D7_1371500.1_1 PF3D7_1371500.1 NaN
34736 Pf3D7_13_v3 VEuPathDB five_prime_UTR 2809223 2810256 NaN - NaN utr_PF3D7_1371500.1_2 PF3D7_1371500.1 NaN
35981 Pf3D7_13_v3 VEuPathDB protein_coding_gene 2840236 2842840 NaN - NaN PF3D7_1372200 NaN HRPIII
35982 Pf3D7_13_v3 VEuPathDB mRNA 2840236 2842840 NaN - NaN PF3D7_1372200.1 PF3D7_1372200 NaN
35983 Pf3D7_13_v3 VEuPathDB exon 2840236 2841485 NaN - NaN exon_PF3D7_1372200.1-E3 PF3D7_1372200.1 NaN
35984 Pf3D7_13_v3 VEuPathDB exon 2841635 2841716 NaN - NaN exon_PF3D7_1372200.1-E2 PF3D7_1372200.1 NaN
35985 Pf3D7_13_v3 VEuPathDB exon 2842024 2842840 NaN - NaN exon_PF3D7_1372200.1-E1 PF3D7_1372200.1 NaN
35986 Pf3D7_13_v3 VEuPathDB CDS 2840727 2841485 NaN - 0.0 PF3D7_1372200.1-p1-CDS2 PF3D7_1372200.1 NaN
35987 Pf3D7_13_v3 VEuPathDB CDS 2841635 2841703 NaN - 0.0 PF3D7_1372200.1-p1-CDS1 PF3D7_1372200.1 NaN
35988 Pf3D7_13_v3 VEuPathDB three_prime_UTR 2840236 2840726 NaN - NaN utr_PF3D7_1372200.1_1 PF3D7_1372200.1 NaN
35989 Pf3D7_13_v3 VEuPathDB five_prime_UTR 2841704 2841716 NaN - NaN utr_PF3D7_1372200.1_2 PF3D7_1372200.1 NaN
35990 Pf3D7_13_v3 VEuPathDB five_prime_UTR 2842024 2842840 NaN - NaN utr_PF3D7_1372200.1_3 PF3D7_1372200.1 NaN
36301 Pf3D7_13_v3 VEuPathDB ncRNA_gene 2802945 2807159 NaN + NaN PF3D7_1371300 NaN NaN
36302 Pf3D7_13_v3 VEuPathDB rRNA 2802945 2807159 NaN + NaN PF3D7_1371300.1 PF3D7_1371300 NaN
36303 Pf3D7_13_v3 VEuPathDB exon 2802945 2807159 NaN + NaN exon_PF3D7_1371300.1-E1 PF3D7_1371300.1 NaN
37550 Pf3D7_13_v3 VEuPathDB protein_coding_gene 2823251 2825967 NaN - NaN PF3D7_1371800 NaN NaN
37551 Pf3D7_13_v3 VEuPathDB mRNA 2823251 2825967 NaN - NaN PF3D7_1371800.1 PF3D7_1371800 NaN
37552 Pf3D7_13_v3 VEuPathDB exon 2823251 2825552 NaN - NaN exon_PF3D7_1371800.1-E2 PF3D7_1371800.1 NaN
37553 Pf3D7_13_v3 VEuPathDB exon 2825781 2825967 NaN - NaN exon_PF3D7_1371800.1-E1 PF3D7_1371800.1 NaN
37554 Pf3D7_13_v3 VEuPathDB CDS 2824302 2825552 NaN - 0.0 PF3D7_1371800.1-p1-CDS2 PF3D7_1371800.1 NaN
37555 Pf3D7_13_v3 VEuPathDB CDS 2825781 2825852 NaN - 0.0 PF3D7_1371800.1-p1-CDS1 PF3D7_1371800.1 NaN
37556 Pf3D7_13_v3 VEuPathDB three_prime_UTR 2823251 2824301 NaN - NaN utr_PF3D7_1371800.1_1 PF3D7_1371800.1 NaN
37557 Pf3D7_13_v3 VEuPathDB five_prime_UTR 2825853 2825967 NaN - NaN utr_PF3D7_1371800.1_2 PF3D7_1371800.1 NaN
37701 Pf3D7_13_v3 VEuPathDB protein_coding_gene 2832623 2835439 NaN + NaN PF3D7_1372000 NaN NaN
37702 Pf3D7_13_v3 VEuPathDB mRNA 2832623 2835439 NaN + NaN PF3D7_1372000.1 PF3D7_1372000 NaN
37703 Pf3D7_13_v3 VEuPathDB exon 2832623 2833086 NaN + NaN exon_PF3D7_1372000.1-E1 PF3D7_1372000.1 NaN
37704 Pf3D7_13_v3 VEuPathDB exon 2833204 2835439 NaN + NaN exon_PF3D7_1372000.1-E2 PF3D7_1372000.1 NaN
37705 Pf3D7_13_v3 VEuPathDB CDS 2832952 2833086 NaN + 0.0 PF3D7_1372000.1-p1-CDS1 PF3D7_1372000.1 NaN
37706 Pf3D7_13_v3 VEuPathDB CDS 2833204 2834322 NaN + 0.0 PF3D7_1372000.1-p1-CDS2 PF3D7_1372000.1 NaN
37707 Pf3D7_13_v3 VEuPathDB five_prime_UTR 2832623 2832951 NaN + NaN utr_PF3D7_1372000.1_1 PF3D7_1372000.1 NaN
37708 Pf3D7_13_v3 VEuPathDB three_prime_UTR 2834323 2835439 NaN + NaN utr_PF3D7_1372000.1_2 PF3D7_1372000.1 NaN
41004 Pf3D7_13_v3 VEuPathDB protein_coding_gene 2821078 2824446 NaN + NaN PF3D7_1371700 NaN FIKK13
41005 Pf3D7_13_v3 VEuPathDB mRNA 2821078 2824446 NaN + NaN PF3D7_1371700.1 PF3D7_1371700 NaN
41006 Pf3D7_13_v3 VEuPathDB exon 2821078 2821173 NaN + NaN exon_PF3D7_1371700.1-E1 PF3D7_1371700.1 NaN
41007 Pf3D7_13_v3 VEuPathDB exon 2821278 2822786 NaN + NaN exon_PF3D7_1371700.1-E2 PF3D7_1371700.1 NaN
41008 Pf3D7_13_v3 VEuPathDB exon 2823212 2824446 NaN + NaN exon_PF3D7_1371700.1-E3 PF3D7_1371700.1 NaN
41009 Pf3D7_13_v3 VEuPathDB CDS 2821078 2821173 NaN + 0.0 PF3D7_1371700.1-p1-CDS1 PF3D7_1371700.1 NaN
41010 Pf3D7_13_v3 VEuPathDB CDS 2821278 2822786 NaN + 0.0 PF3D7_1371700.1-p1-CDS2 PF3D7_1371700.1 NaN
41011 Pf3D7_13_v3 VEuPathDB CDS 2823212 2823292 NaN + 0.0 PF3D7_1371700.1-p1-CDS3 PF3D7_1371700.1 NaN
41012 Pf3D7_13_v3 VEuPathDB three_prime_UTR 2823293 2824446 NaN + NaN utr_PF3D7_1371700.1_1 PF3D7_1371700.1 NaN
41067 Pf3D7_13_v3 VEuPathDB ncRNA_gene 2800004 2802154 NaN + NaN PF3D7_1371000 NaN NaN
41068 Pf3D7_13_v3 VEuPathDB rRNA 2800004 2802154 NaN + NaN PF3D7_1371000.1 PF3D7_1371000 NaN
41069 Pf3D7_13_v3 VEuPathDB exon 2800004 2802154 NaN + NaN exon_PF3D7_1371000.1-E1 PF3D7_1371000.1 NaN
42330 Pf3D7_13_v3 VEuPathDB protein_coding_gene 2843157 2847557 NaN + NaN PF3D7_1372300 NaN NaN
42331 Pf3D7_13_v3 VEuPathDB mRNA 2843157 2847557 NaN + NaN PF3D7_1372300.1 PF3D7_1372300 NaN
42332 Pf3D7_13_v3 VEuPathDB exon 2843157 2845850 NaN + NaN exon_PF3D7_1372300.1-E1 PF3D7_1372300.1 NaN
42336 Pf3D7_13_v3 VEuPathDB five_prime_UTR 2843157 2845766 NaN + NaN utr_PF3D7_1372300.1_1 PF3D7_1372300.1 NaN
44509 Pf3D7_13_v3 VEuPathDB protein_coding_gene 2829530 2830830 NaN + NaN PF3D7_1371900 NaN NaN
44510 Pf3D7_13_v3 VEuPathDB mRNA 2829530 2830830 NaN + NaN PF3D7_1371900.1 PF3D7_1371900 NaN
44511 Pf3D7_13_v3 VEuPathDB exon 2829530 2829927 NaN + NaN exon_PF3D7_1371900.1-E1 PF3D7_1371900.1 NaN
44512 Pf3D7_13_v3 VEuPathDB exon 2830109 2830830 NaN + NaN exon_PF3D7_1371900.1-E2 PF3D7_1371900.1 NaN
44513 Pf3D7_13_v3 VEuPathDB CDS 2829856 2829927 NaN + 0.0 PF3D7_1371900.1-p1-CDS1 PF3D7_1371900.1 NaN
44514 Pf3D7_13_v3 VEuPathDB CDS 2830109 2830669 NaN + 0.0 PF3D7_1371900.1-p1-CDS2 PF3D7_1371900.1 NaN
44515 Pf3D7_13_v3 VEuPathDB five_prime_UTR 2829530 2829855 NaN + NaN utr_PF3D7_1371900.1_1 PF3D7_1371900.1 NaN
44516 Pf3D7_13_v3 VEuPathDB three_prime_UTR 2830670 2830830 NaN + NaN utr_PF3D7_1371900.1_2 PF3D7_1371900.1 NaN
46143 Pf3D7_13_v3 VEuPathDB ncRNA_gene 2802527 2802686 NaN + NaN PF3D7_1371200 NaN NaN
46144 Pf3D7_13_v3 VEuPathDB rRNA 2802527 2802686 NaN + NaN PF3D7_1371200.1 PF3D7_1371200 NaN
46145 Pf3D7_13_v3 VEuPathDB exon 2802527 2802686 NaN + NaN exon_PF3D7_1371200.1-E1 PF3D7_1371200.1 NaN

Create Figure

This intricate figure serves to map deletion breakpoints in hrp2 and hrp3 across various countries. The x-axis of the figure displays the deletion breakpoints, while the y-axis shows the countries along with the number of breakpoints they exhibit.

The figure comprises two sections: hrp2 (positioned at the top, subplots 1-5) and hrp3 (located at the bottom, subplots 6-12), consisting of a total of 12 subplots.

We start by defining distinct colour codes for each genomic region that we will annotate.

# Create a dictionary with distinct colour codes
figure_colours = collections.OrderedDict()
figure_colours['chr_8_13'] = '#377eb8'
figure_colours['chr_11'] = '#984ea3'
figure_colours['chr_5'] = '#ff7f00'
figure_colours['similar_sequence'] = '#ffff33'
figure_colours['hrp_genes'] = '#e41a1c'
figure_colours['rrna_genes'] = '#4daf4a'
figure_colours['other_genes'] = 'black'
figure_colours['pseudogenes'] = 'grey'
# Full figure

# The figure will consist of 12 subplots, each with varying sizes.
fig, axs = plt.subplots(12, 1, figsize=(13, 10), gridspec_kw={'height_ratios': [1, 7, 20, 6, 12, 1, 7, 10, 60, 24, 12, 12], 'hspace': 0})

### HRP2 (upper panel)

# Set the minimum and maximum positions for hrp2

min_pos = 1371500
max_pos = 1375500

## Subplot 1: Chromosome Title and Region

# Set the title
axs[0].set_title('Pf3D7_08_v3')
# Don't display other properties here
axs[0].set_xticks([])
axs[0].set_xlabel(None)
axs[0].set_yticks([])
axs[0].set_xlim(int(df_chroms.loc['Pf3D7_08_v3', 'start']), int(df_chroms.loc['Pf3D7_08_v3', 'end']))
axs[0].set_facecolor(figure_colours['chr_8_13'])
axs[0].axvspan(min_pos, max_pos, color='white')

## Subplot 2: Chromosome Boundaries

axs[1].set_xticks([])
axs[1].set_yticks([])
axs[1].set_xlim(int(df_chroms.loc['Pf3D7_08_v3', 'start']), int(df_chroms.loc['Pf3D7_08_v3', 'end']))
axs[1].set_ylim(0, 1)
axs[1].plot([int(df_chroms.loc['Pf3D7_08_v3', 'start']), min_pos], [0, 1], '-', color='black')
axs[1].plot([int(df_chroms.loc['Pf3D7_08_v3', 'end']), max_pos], [0, 1], '-', color='black')
axs[1].spines['left'].set_visible(False)
axs[1].spines['right'].set_visible(False)

# Subplot 3: Deletion Breakpoints by Country

# Create lists for y-axis labels and positions
ylabels = []
yposes = []
for ypos, row in df_hrp2.sort_values('breakpoint', ascending=False).iterrows():
    yposes.append(ypos)
    axs[2].plot((min_pos, row['breakpoint']), (ypos, ypos), linewidth=6, solid_capstyle='butt', color=figure_colours['chr_8_13'])
    axs[2].text(row['breakpoint'] + 50, ypos , '$(GGGTT[T/C]A)_n$', fontsize=6, ha='left', bbox=dict(facecolor='white', edgecolor='black', boxstyle='round,pad=0.3'))
    ylabels.append(f"{row['Countries']}")

axs[2].set_yticks(yposes)
axs[2].set_yticklabels(ylabels)
axs[2].set_xlim(min_pos, max_pos)
axs[2].set_ylim(min(yposes)-0.5, max(yposes)+0.5)
axs[2].set_xticks([])

## Subplot 4: Genes on the x-axis

axs[3].set_xlim(min_pos, max_pos)
axs[3].set_ylim(0, 1)

for ix, row in df_gff.loc[
    ( df_gff['contig'] == 'Pf3D7_08_v3' )
    & ( df_gff['start'] <= max_pos )
    & ( df_gff['end'] >= min_pos )
    & ( df_gff['type'] == 'CDS' )
].iterrows():
    if row['start'] >= 1373212 and row['end'] <= 1376988:
        color=figure_colours['hrp_genes']
    else:
        color=figure_colours['other_genes']
    axs[3].plot((row['start'], row['end']), (0.75, 0.75), linewidth=6, solid_capstyle='butt', color=color)

for ix, row in df_gff.loc[
    ( df_gff['contig'] == 'Pf3D7_08_v3' )
    & ( df_gff['start'] <= max_pos )
    & ( df_gff['end'] >= min_pos )
    & ( df_gff['type'] == 'pseudogene' )
].iterrows():
    axs[3].plot((row['start'], row['end']), (0.75, 0.75), linewidth=6, solid_capstyle='butt', color=figure_colours['pseudogenes'])

# Add gene names for hrp2 panel
axs[3].text(( 1375299 + 1374236 ) / 2, 0.25, '${hrp2}$', va='center', ha='center', size=8)
axs[3].text(( 1371847 + 1372720 ) / 2, 0.25, 'PF3D7_0831750 (pseudogene)', va='center', ha='center', size=8)
axs[3].set_yticks([])
axs[3].set_xticks([])
axs[3].set_ylabel('Genes', rotation=0, ha='right', va='center')
# Add a connection line for the gene
axs[3].plot((1_375_084, 1_375_231), (0.75, 0.75), c = figure_colours['hrp_genes'])

## Subplot 5: Ticks for Genomic Coordinates
# Set the x-axis limits to cover the specified genomic region between 'min_pos' and 'max_pos'
axs[4].set_xlim(min_pos, max_pos)
# Hide the left and right spines to create a cleaner appearance
axs[4].spines['left'].set_visible(False)
axs[4].spines['right'].set_visible(False)
# Configure the x-axis ticks at specific positions
axs[4].set_xticks([1372000, 1373000, 1374000, 1375000])
# Label the x-axis ticks with corresponding values
axs[4].set_xticklabels(["1.372 Mbp", "1.373 Mbp", "1.374 Mbp", "1.375 Mbp"])
# Remove y-axis ticks to maintain a clean look
axs[4].set_yticks([])
# Position x-axis ticks at the top
axs[4].xaxis.tick_top()
# Adjust the direction of x-axis ticks and add padding
axs[4].tick_params(axis="x", direction="in", pad=-20)

### HRP3

# Set the minimum and maximum positions for hrp3
min_pos = 2797000
max_pos = 2845000

## Subplot 6: Chromosome Name as Title and
# Set the title
axs[5].set_title('Pf3D7_13_v3')
axs[5].set_xticks([])
axs[5].set_xlabel(None)
axs[5].set_yticks([])
axs[5].set_xlim(int(df_chroms.loc['Pf3D7_13_v3', 'start']), int(df_chroms.loc['Pf3D7_13_v3', 'end']))
axs[5].set_facecolor(figure_colours['chr_8_13'])
axs[5].axvspan(min_pos, max_pos, color='white')

## Subplot 7: Chromosome Boundaries
axs[6].set_xticks([])
axs[6].set_yticks([])
# Set the x-axis limits to cover the specified genomic region between the start and end positions of 'Pf3D7_13_v3'
axs[6].set_xlim(int(df_chroms.loc['Pf3D7_13_v3', 'start']), int(df_chroms.loc['Pf3D7_13_v3', 'end']))
axs[6].set_ylim(0, 1) # y-axis will only have one annotation
# Create lines to mark the start and end positions of the chromosome region with black color
axs[6].plot([int(df_chroms.loc['Pf3D7_13_v3', 'start']), min_pos], [0, 1], '-', color='black')
axs[6].plot([int(df_chroms.loc['Pf3D7_13_v3', 'end']), max_pos], [0, 1], '-', color='black')
# Hide the left and right spines to create a cleaner appearance
axs[6].spines['left'].set_visible(False)
axs[6].spines['right'].set_visible(False)


## Subplot 8 and 9: Deletion Breakpoints by Country

# Initialize lists for labels and positions
ylabels_7 = []
yposes_7 = []
ylabels_8 = []
yposes_8 = []

# How many countries have Chromosome 11 recombination breakpoints?
multicountry_label = f'{len(df_hrp3.iloc[0]["Countries"].split(","))} countries ({df_hrp3.iloc[0]["Samples with deletion"]})'

# Iterate through df_hrp3 sorted by breakpoints
for ypos, row in df_hrp3.sort_values('breakpoint', ascending=True).iterrows():

    # Check the deletion type and apply different plotting styles accordingly
    # Telomere healing
    if row['Deletion type'] == 'Telomere healing':
        breakpoint = int(row['breakpoint'])
        axs[8].plot((min_pos, breakpoint), (ypos, ypos), linewidth=6, solid_capstyle='butt', color=figure_colours['chr_8_13'])
        axs[8].text(breakpoint + 600, ypos , '$(GGGTT[T/C]A)_n$', fontsize=6, ha='left', bbox=dict(facecolor='white', edgecolor='black', boxstyle='round,pad=0.3'))
        ylabels_8.append(f"{row['Countries']}")
        yposes_8.append(ypos)

    # Chromosome 5 recombination
    if row['Deletion type'] == 'Chromosome 5 recombination':
        breakpoint_start, breakpoint_end = [int(x) for x in row['breakpoint'].split('-')]
        axs[7].plot((min_pos, breakpoint_start), (ypos, ypos), linewidth=6, solid_capstyle='butt', color=figure_colours['chr_8_13'])
        axs[7].plot((breakpoint_end, max_pos), (ypos, ypos), linewidth=6, solid_capstyle='butt', color=figure_colours['chr_5'])
        ylabels_7.append(f"{row['Countries']}")
        yposes_7.append(ypos)

    # Chromosome 11 recombination
    if row['Deletion type'] == 'Chromosome 11 recombination':
        breakpoint_start, breakpoint_end = [int(x) for x in row['breakpoint'].split('-')]
        axs[7].plot((min_pos, breakpoint_start), (ypos, ypos), linewidth=6, solid_capstyle='butt', color=figure_colours['chr_8_13'])
        axs[7].plot((breakpoint_end, max_pos), (ypos, ypos), linewidth=6, solid_capstyle='butt', color=figure_colours['chr_11'])
        axs[7].plot((breakpoint_start, breakpoint_end), (ypos, ypos), linewidth=6, solid_capstyle='butt', color=figure_colours['similar_sequence'])
        yposes_7.append(ypos)
        ylabels_7.append(multicountry_label)

# Add a line to seperate Chrom 5 and 11, this may need to be adjusted manually in the future
axs[7].axhline(y=max(yposes_7)-0.5, color='black', linestyle='-', linewidth=1)

# Set the ticks and labels for Subplot 8
axs[7].set_yticks(yposes_7)
axs[7].set_yticklabels(ylabels_7)
axs[7].set_ylim(min(yposes_7)-0.5, max(yposes_7)+0.5)
axs[7].set_xlim(min_pos, max_pos)
axs[7].set_xticks([])

# Set the ticks and labels for Subplot 9
axs[8].set_yticks(yposes_8)  # Ensure yposes_8 has the same length as ylabels_8
axs[8].set_yticklabels(ylabels_8)  # Ensure the number of labels matches the number of ticks
axs[8].set_xlim(min_pos, max_pos)
axs[8].set_ylim(min(yposes_8)-0.5, max(yposes_8)+0.5)  # Use yposes_8 for ylim
axs[8].set_xticks([])

## Subplot 10: Gene annotations on the x-axis

bar_pos = 0.9
text_pos = 0.8

# Set the x-axis and y-axis limits
axs[9].set_xlim(min_pos, max_pos)
axs[9].set_ylim(0, 1)

# Group CDS regions and draw connection lines by using proximity of breakpoints.
# Previously we were using polypeptide type to draw connection lines, but this does not exist in the new GFF
prev_end = None
connections = []
for ix, row in df_gff.loc[
    ( df_gff['contig'] == 'Pf3D7_13_v3' )
    & ( df_gff['start'] <= max_pos )
    & ( df_gff['end'] >= min_pos )
    & ( df_gff['type'] == 'CDS' )
].iterrows():
    if row['start'] >= 2840727 and row['end'] <= 2841703:
        color=figure_colours['hrp_genes']
    else:
        color=figure_colours['other_genes']
    #print(row.id)
    axs[9].plot((row['start'], row['end']), (bar_pos, bar_pos), linewidth=6, solid_capstyle='butt', color=color)

    if prev_end and (row['start'] - prev_end) <= 500 and (row['start'] - prev_end) > 0:
        connections.append((prev_end, row['end'], color))
    prev_end = row['end']

# Plot connection lines between breakpoints that are in same CDS region and within 500 base-diameter
# These lines will be thinner, so lw=2
for start, end, color in connections:
    axs[9].plot((start, end), (bar_pos, bar_pos), linewidth=2, color=color, solid_capstyle='butt')

# Plot rRNA annotations
for ix, row in df_gff.loc[
    ( df_gff['contig'] == 'Pf3D7_13_v3' )
    & ( df_gff['start'] <= max_pos )
    & ( df_gff['end'] >= min_pos )
    & ( df_gff['type'] == 'rRNA' )
].iterrows():
    axs[9].plot((row['start'], row['end']), (bar_pos, bar_pos), linewidth=6, solid_capstyle='butt', color=figure_colours['rrna_genes'])

# Plot pseudogene annotations
for ix, row in df_gff.loc[
    ( df_gff['contig'] == 'Pf3D7_13_v3' )
    & ( df_gff['start'] <= max_pos )
    & ( df_gff['end'] >= min_pos )
    & ( df_gff['type'] == 'pseudogene' )
].iterrows():
    axs[9].plot((row['start'], row['end']), (bar_pos, bar_pos), linewidth=6, solid_capstyle='butt', color=figure_colours['pseudogenes'])

# Annotate specific gene positions with labels
axs[9].text(( 2800004 + 2802154 ) / 2, text_pos, '18S rRNA', va='top', ha='center', size=8, rotation=90)
axs[9].text(( 2802527 + 2802686 ) / 2, text_pos, '5.8S rRNA', va='top', ha='center', size=8, rotation=90)
axs[9].text(( 2802945 + 2807159 ) / 2, text_pos, '28S rRNA', va='top', ha='center', size=8, rotation=90)
axs[9].text(( 2808200 + 2809700 ) / 2, text_pos, 'PF3D7_1371500', va='top', ha='center', size=8, rotation=90)
axs[9].text(( 2811706 + 2820270 ) / 2, text_pos, '${ebl1}$\n(pseudogene)', va='top', ha='center', size=8, rotation=90)
axs[9].text(( 2821078 + 2823292 ) / 2, text_pos, '${fikk13}$', va='top', ha='center', size=8, rotation=90)
axs[9].text(( 2824302 + 2825852 ) / 2, text_pos, 'PF3D7_1371800', va='top', ha='center', size=8, rotation=90)
axs[9].text(( 2829856 + 2830669 ) / 2, text_pos, 'PF3D7_1371900', va='top', ha='center', size=8, rotation=90)
axs[9].text(( 2832952 + 2834322 ) / 2, text_pos, 'PF3D7_1372000', va='top', ha='center', size=8, rotation=90)
axs[9].text(( 2837053 + 2839058 ) / 2, text_pos, '${gexp04}$', va='top', ha='center', size=8, rotation=90)
axs[9].text(( 2840727 + 2841703 ) / 2, text_pos, '${hrp3}$', va='top', ha='center', size=8, rotation=90)
# Set y-axis ticks, customize x-axis labels, and add x-axis label and y-axis label
axs[9].set_xlabel('Genomic coordinates')
axs[9].set_ylabel('Genes', rotation=0, ha='right', va='center')
axs[9].set_yticks([])

## Subplot 11: Ticks (x-axis) for Genomic Coordinates
# Hide left, right, and bottom spines
axs[10].set_xlim(min_pos, max_pos)
axs[10].spines['left'].set_visible(False)
axs[10].spines['right'].set_visible(False)
axs[10].set_xticks([2800000, 2810000, 2820000, 2830000, 2840000])
axs[10].set_xticklabels(["2.80 Mbp", "2.81 Mbp", "2.82 Mbp", "2.83 Mbp", "2.84 Mbp"])
axs[10].set_yticks([])
axs[10].xaxis.tick_top()
axs[10].tick_params(axis="x", direction="in", pad=-20)
axs[10].text((min_pos + max_pos) / 2, 0.4, 'Genomic coordinates', ha='center')

## Subplot 12: Legend
# Hide left, right, and bottom spines
axs[11].spines['left'].set_visible(False)
axs[11].spines['right'].set_visible(False)
axs[11].spines['bottom'].set_visible(False)
# Hide both x and y-axis ticks
axs[11].set_xticks([])
axs[11].set_yticks([])
# Create legend elements with color bars and labels
axs[11].plot((0, 0.1), (0.5, 0.5), linewidth=6, solid_capstyle='butt', color=figure_colours['chr_5'])
axs[11].text(0.12, 0.5, 'Chr 5 sequence', va='center')
axs[11].plot((0, 0.1), (0.25, 0.25), linewidth=6, solid_capstyle='butt', color=figure_colours['chr_11'])
axs[11].text(0.12, 0.25, 'Chr 11 sequence', va='center')
axs[11].plot((0.38, 0.48), (0.5, 0.5), linewidth=6, solid_capstyle='butt', color=figure_colours['similar_sequence'])
axs[11].text(0.50, 0.5, 'Highly similar sequence on chr 13 and chr 11', va='center')
axs[11].text(.405, .206, '$(GGGTT[T/C]A)_n$', fontsize=6, ha='left', bbox=dict(facecolor='white', edgecolor='black', boxstyle='round,pad=0.3'))
axs[11].text(0.50, 0.25, 'Telomeric repeat sequence', va='center')
# Set the limits for this subplot
axs[11].set_xlim(0, 1)
axs[11].set_ylim(0, 1)

# Ensure the figure layout is tidy
fig.tight_layout()
../../_images/Figure_HRPDeletion_37_0.png

Figure Legend: hrp2 and hrp3 deletion breakpoints. We see five different breakpoints resulting in the deletion of hrp2. Four of these are within exon 2 of the gene whereas the fifth is found between hrp2 and the pseudogene PF3D7_0831750. For all five events we see evidence of telomeric healing from reads that contain part Pf3D7_08_v3 sequence and part telomeric repeat sequence (GGGTTCA/GGGTTTA). We see 17 different breakpoints resulting in the deletion of hrp3. For 15 of these we see evidence of telomeric healing. Note that many of these events result in the deletion of other genes in addition to hrp3. For twenty samples from Cambodia and a single sample from Vietnam we see evidence of a recombination with chromosome 5 which results in a hybrid chromosome comprising mostly chromosome 13 sequence but a small inverted section of an internal portion of chromosome 5 containing the gene mdr1. We also see evidence of a recombination with chromosome 11 which results in a hybrid chromosome comprising mostly chromosome 13 sequence but also a section of the 3’ end of chromosome 11. This is the most common deletion type, being seen in 162 samples from 15 different countries. Because the recombination occurs between highly similar sequences of a set of three orthologous ribosomal RNA genes found on both chromosomes, it is not possible to identify the exact breakpoints.

Save Figure

# You will need to authorise Google Colab access to Google Drive
drive.mount('/content/drive')
Mounted at /content/drive
# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else

fig.savefig('/content/drive/My Drive/HRP_Deletions_Figure.pdf')
fig.savefig('/content/drive/My Drive/HRP_Deletions_Figure.png', dpi=480) # increase the dpi for higher resolution