Analyse csp C-Terminal Variation¶

Introduction¶

The c-terminal domain of the Plasmodium falciparum circumsporozoite protein (csp) has been targeted by the RTS,S and R21 vaccines, underscoring the importance of monitoring the variability of this region.

This notebook recreates the following figures from the Pf7 paper:

Figure 4A: frequency of csp c-terminal haplotypes and global mean distances
Figure 4B: histograms of amino acid differences between subpopulations and 3D7 for csp c-terminal
Supplementary Figure 10: Barplot of non-reference allele proportions in different sub-populations.

This notebook takes 5 minutes to run.

Setup¶

Install and import the malariagen Python package:

!pip install -q --no-warn-conflicts malariagen_data
import malariagen_data

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 146.2/146.2 kB 1.4 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 11.0 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52.8/52.8 MB 7.6 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.2/10.2 MB 19.8 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 31.9 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 302.5/302.5 kB 11.9 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 138.7/138.7 kB 1.5 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 21.3 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.1/8.1 MB 33.5 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 206.9/206.9 kB 9.4 MB/s eta 0:00:00
?25h  Preparing metadata (setup.py) ... ?25l?25hdone
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.7/7.7 MB 24.9 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 22.2 MB/s eta 0:00:00
?25h  Building wheel for asciitree (setup.py) ... ?25l?25hdone

Import required python libraries that are installed at colab by default.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import collections
import re
import scipy
from google.colab import drive

Access Pf7 Data¶

We use the malariagen data package to load the release data.

release_data = malariagen_data.Pf7()
sample_metadata = release_data.sample_metadata()
# take a glance at the metadata dataframe
sample_metadata.head(3)

	Sample	Study	Country	Admin level 1	Country latitude	Country longitude	Admin level 1 latitude	Admin level 1 longitude	Year	ENA	All samples same case	Population	% callable	QC pass	Exclusion reason	Sample type	Sample was in Pf6
0	FP0008-C	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR1081237	FP0008-C	AF-W	82.16	True	Analysis_set	gDNA	True
1	FP0009-C	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR1081238	FP0009-C	AF-W	88.85	True	Analysis_set	gDNA	True
2	FP0010-CW	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR2889621	FP0010-CW	AF-W	86.46	True	Analysis_set	sWGA	False

Access the csp c-terminal Haplotypes Data¶

We can download the csp c-terminal region haplotypes for 16,203 samples from the MalariaGEN web resources using wget command.

!wget ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/Pf7_csp_c_terminal_haplotypes.txt

--2024-01-09 16:44:02--  ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/Pf7_csp_c_terminal_haplotypes.txt
           => ‘Pf7_csp_c_terminal_haplotypes.txt’
Resolving ngs.sanger.ac.uk (ngs.sanger.ac.uk)... 193.62.203.79
Connecting to ngs.sanger.ac.uk (ngs.sanger.ac.uk)|193.62.203.79|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /production/malaria/Resource/34 ... done.
==> SIZE Pf7_csp_c_terminal_haplotypes.txt ... 10196635
==> PASV ... done.    ==> RETR Pf7_csp_c_terminal_haplotypes.txt ... done.
Length: 10196635 (9.7M) (unauthoritative)

Pf7_csp_c_terminal_ 100%[===================>]   9.72M  7.16MB/s    in 1.4s    

2024-01-09 16:44:05 (7.16 MB/s) - ‘Pf7_csp_c_terminal_haplotypes.txt’ saved [10196635]

# Read the data into a dataset
df_csp_haplotypes = pd.read_csv('Pf7_csp_c_terminal_haplotypes.txt', sep='\t', low_memory=False)

# Rename first column as 'Samples' in accordance with sample_metadata
df_csp_haplotypes = df_csp_haplotypes.rename(columns={df_csp_haplotypes.columns[0]: 'Sample'})

# Print first rows
df_csp_haplotypes.head()

	Sample	csp_277-397_aa_haplotype	csp_277-397_nucleotide_haplotype	csp_277-397_ns_changes
0	FP0008-C	QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSDQHIEKYLKT...	ctaattaaggaacaagaaggataataccattattaatcctattgaa...	S301N/K314Q/K317E/E318K/N321K/K322T/Q324K/E357Q
1	FP0009-C	QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSDQHIEKYLKT...	ctaattaaggaacaagaaggataataccattattaatcctattgaa...	S301N/K314Q/K317E/E318K/N321K/K322T/Q324K/E357Q
2	FP0010-CW	QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSDQHIEKYLKT...	ctaattaaggaacaagaaggataataccattattaatcctattgaa...	s301n/k314q/k317e/e318k/n321k/k322t/q324k/e357...
3	FP0011-CW	QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSDKHIEQYLKT...	ctaattaaggaacaagaaggataataccattattaatcctattgaa...	S301N/K317E/E318Q/N321K/K322T/Q324K/D359N/A361E
4	FP0012-CW	QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSDQHIEKYLKT...	ctaattaaggaacaagaaggataataccattattaatcctattgaa...	S301N/K314Q/K317E/E318K/N321K/K322T/Q324K/E357Q

Let’s merge the metadata and haplotypes dataframes, sample_metadata and df_csp_haplotypes, using Sample IDs.

df_samples = sample_metadata.merge(df_csp_haplotypes, on='Sample')
df_samples.head(3)

	Sample	Study	Country	Admin level 1	Country latitude	Country longitude	Admin level 1 latitude	Admin level 1 longitude	Year	ENA	All samples same case	Population	% callable	QC pass	Exclusion reason	Sample type	Sample was in Pf6	csp_277-397_aa_haplotype	csp_277-397_nucleotide_haplotype	csp_277-397_ns_changes
0	FP0008-C	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR1081237	FP0008-C	AF-W	82.16	True	Analysis_set	gDNA	True	QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSDQHIEKYLKT...	ctaattaaggaacaagaaggataataccattattaatcctattgaa...	S301N/K314Q/K317E/E318K/N321K/K322T/Q324K/E357Q
1	FP0009-C	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR1081238	FP0009-C	AF-W	88.85	True	Analysis_set	gDNA	True	QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSDQHIEKYLKT...	ctaattaaggaacaagaaggataataccattattaatcctattgaa...	S301N/K314Q/K317E/E318K/N321K/K322T/Q324K/E357Q
2	FP0010-CW	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR2889621	FP0010-CW	AF-W	86.46	True	Analysis_set	sWGA	False	QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSDQHIEKYLKT...	ctaattaaggaacaagaaggataataccattattaatcctattgaa...	s301n/k314q/k317e/e318k/n321k/k322t/q324k/e357...

As we examine a region comprising 121 amino acids, haplotypes exceeding this length are considered heterozygous calls, indicative of mixed infections. Let’s find out the percentage of samples that comprise 121 amino acids.

df_samples['csp_277-397_aa_haplotype'].apply(len).value_counts() / 16203*100

  69.493304
  20.551750
     6.801210
   3.178424
   0.006172
   0.006172
Name: csp_277-397_aa_haplotype, dtype: float64

We will only use homozygous calls for our subsequent analyses.

# Classify samples with expected haplotype length as homozygous
df_samples['c_terminal_hap_is_hom'] = ( df_samples['csp_277-397_aa_haplotype'].apply(len) == 121 )

Frequency of csp c-terminal haplotypes and global mean distances¶

We will start with finding out the frequency of haplotypes.

The csp_277-397_ns_changes column provides information about nucleotide sequence changes in each sample. An empty value in this column indicates no change in the nucleotide sequence, and such haplotypes are referred to as 3D7, representing the reference sample in Pf7.

df_samples['csp_277-397_ns_changes'] = df_samples['csp_277-397_ns_changes'].fillna('3D7')

We can group the samples with identical nucleotide changes and count them using a simple function.

def my_agg(x):
    names = collections.OrderedDict()
    names['csp_277-397_aa_haplotype'] = x['csp_277-397_aa_haplotype']
    names['number_of_samples'] = len(x)
    return(pd.Series(names))

We will only apply the function to samples with homozygous calls, and lab strains will be excluded from the analysis.

df_haplotypes = pd.DataFrame(
    df_samples.loc[
        # Select only samples with homozygous calls
        ( df_samples['csp_277-397_aa_haplotype'].apply(len) == 121 ) &
        # Exclude the lab strains
        ( df_samples['Study'] != '1153-PF-Pf3KLAB-KWIATKOWSKI' )
        # Group samples by ns changes and apply the function to count samples
    ]).groupby('csp_277-397_ns_changes').apply(my_agg)

print(df_haplotypes.shape)
df_haplotypes.head()

(248, 2)

	csp_277-397_aa_haplotype	number_of_samples
csp_277-397_ns_changes
3D7	5 QGNGQGHNMPNDPNRNVDENANANSAVKNNNNEEPSD...	368
A297D/S301N/K314Q/K317E/E318K/N321K/K322T/Q324K/E357Q	59 QGNGQGHNMPNDPNRNVDENDNANNAVKNNNNEEPSD...	2
A297S/S301N/K317E/E318Q/E357Q	11584 QGNGQGHNMPNDPNRNVDENSNANNAVKNNNNEEPSD...	1
A299D/S301N/K314Q/K317E/E318K/N321K/E357Q	505 QGNGQGHNMPNDPNRNVDENANDNNAVKNNNNEEPSD...	14
A299G/L327I/E357Q	1837 QGNGQGHNMPNDPNRNVDENANGNSAVKNNNNEEPSD...	8

Next, we will create a distance matrix based on pairwise differences to compute the global mean distance for each sample. The diagonal of the matrix is filled with NaN to exclude self-comparisons.

# Initialize distance matrix
haplotype_distance_matrix = np.zeros((len(df_haplotypes), len(df_haplotypes)))

# Populate distance matrix
for i in range(len(df_haplotypes)):
    for j in range(len(df_haplotypes)):
        # Select the first copy of the haplotype for comparison
        hap1 = df_haplotypes.iloc[i, 0].values[0]
        hap2 = df_haplotypes.iloc[j, 0].values[0]
        haplotype_distance_matrix[i, j] = np.count_nonzero(np.array(list(hap1)) != np.array(list(hap2)))

# Fill diagonal with NaN
np.fill_diagonal(haplotype_distance_matrix, np.nan)

# Print the shape and the distance matrix
print(haplotype_distance_matrix.shape)
print(haplotype_distance_matrix)

(248, 248)
[[nan  9.  5. ...  8.  8.  8.]
 [ 9. nan  6. ...  4.  6.  4.]
 [ 5.  6. nan ...  6.  5.  6.]
 ...
 [ 8.  4.  6. ... nan  5.  2.]
 [ 8.  6.  5. ...  5. nan  6.]
 [ 8.  4.  6. ...  2.  6. nan]]

After creating the distance matrix, each row in the matrix represents the global mean distance for each sample. Finally, the samples are sorted based on their global mean distance for visual purposes of the figure.

df_haplotypes['global_mean_distance'] = np.nanmean(haplotype_distance_matrix, axis=0)
df_haplotypes.sort_values('global_mean_distance', inplace=True)
df_haplotypes

	csp_277-397_aa_haplotype	number_of_samples	global_mean_distance
csp_277-397_ns_changes
S301N/K317E/E318Q/N321K/E357Q/A361E	10856 QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSD...	3	4.647773
S301N/K317E/E318Q/N321K/A361E	712 QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSD...	2760	4.708502
S301N/K317E/N321K/A361E	6736 QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSDK...	1	4.805668
S301N/K317E/E318Q/N321K/K322T/E357Q/A361E	4581 QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSD...	6	4.951417
S301N/K314Q/K317E/E318K/N321K/E357Q/A361E	15085 QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSD...	1	5.032389
...	...	...	...
S301N/L327I/N352G/D356E/D359N/A361E	13069 QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSD...	1	8.570850
K317T/N321K/N352G/P354S/D356N/A361E	15402 QGNGQGHNMPNDPNRNVDENANANSAVKNNNNEEPSD...	1	8.655870
D294N/S301N/N321T/L327I/N352G/D359N/A361E	8343 QGNGQGHNMPNDPNRNVNENANANNAVKNNNNEEPSDK...	1	8.724696
L327I/N352G/D356N/A361E	1905 QGNGQGHNMPNDPNRNVDENANANSAVKNNNNEEPSDK...	3	8.740891
D294N/S301N/K317T/N321K/N352G/P354S/K355R/A361E	2364 QGNGQGHNMPNDPNRNVNENANANNAVKNNNNEEPSD...	9	8.870445

248 rows × 3 columns

We would like to show haplotypes found in some lab strains. We haplotype-match these lab strains and store them in a new dataframe.

df_haplotypes_lab = df_haplotypes.loc[df_haplotypes.index.isin(df_samples.loc[df_samples['Study'] == '1153-PF-Pf3KLAB-KWIATKOWSKI', 'csp_277-397_ns_changes'])]

# List of labels
labels = ['Dd2', 'IT', '7G8', 'GB4', '3D7', 'HB3']

# Create 'labels' column
df_haplotypes_lab= df_haplotypes_lab.copy()
df_haplotypes_lab.loc[:, 'labels'] = labels
df_haplotypes_lab

	csp_277-397_aa_haplotype	number_of_samples	global_mean_distance	labels
csp_277-397_ns_changes
S301N/K317E/E318Q/N321K/A361E	712 QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSD...	2760	4.708502	Dd2
S301N/K317E/E318Q/N321K/N352D/E357Q/A361E	3090 QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSD...	80	5.259109	IT
S301N/K317E/E318Q/N321K/Q324K/L327I/A361E	23 QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSD...	125	5.558704	7G8
N298K/S301N/N321K/K322I	2174 QGNGQGHNMPNDPNRNVDENAKANNAVKNNNNEEPSD...	4	6.736842	GB4
3D7	5 QGNGQGHNMPNDPNRNVDENANANSAVKNNNNEEPSD...	368	6.882591	3D7
D288N/S301N/K317E/E318Q/N321K/N352G/P354S/A361E	92 QGNGQGHNMPNNPNRNVDENANANNAVKNNNNEEPSD...	47	7.210526	HB3

Plotting¶

We are ready to create the first plot which will have 2 subplots - one for showing the haplotype frequencies and the other one for showing the global mean distance. Then, we will paint lab strains with red colour using df_haplotypes_lab dataset.

fig, axs = plt.subplots(2, 1, figsize=(12, 6))
axs[0].bar(df_haplotypes.index, df_haplotypes['number_of_samples'], log='y', color='grey')
axs[0].bar(df_haplotypes_lab.index, df_haplotypes_lab['number_of_samples'], log='y', color='red')
axs[0].set_yticks([1, 10, 100, 1000])
axs[0].set_yticklabels([1, 10, 100, 1000])
axs[0].set_xticks(df_haplotypes_lab.index)
axs[0].set_xticklabels(df_haplotypes_lab.labels)
axs[0].set_ylabel('Number of samples (log scale)')
# axs[0].set_ylim([0.8, 3000])

axs[1].bar(df_haplotypes.index, df_haplotypes['global_mean_distance'], color='grey')
axs[1].bar(df_haplotypes_lab.index, df_haplotypes_lab['global_mean_distance'], color='red')
# axs[1].set_xticks([])
axs[1].set_xticks(df_haplotypes_lab.labels)
axs[1].set_xticklabels([])
axs[1].set_ylabel('Global mean distance')
axs[1].invert_yaxis()
axs[1].xaxis.tick_top()

fig.tight_layout()

../../_images/csp-cterm-analysis_38_0.png

Figure Legend: Analysis of c-terminal of csp. Upper panel - frequency of different haplotypes of c-terminal of csp. Haplotypes found in some lab strains are named and highlighted in red. Haplotypes are ordered as per lower panel. Lower panel - global mean distance (number of amino acid differences) to all other haplotypes.

Save the plot¶

We can save the plot by running:

# You will need to authorise Google Colab access to Google Drive
drive.mount('/content/drive')

Mounted at /content/drive

# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else

fig.savefig('/content/drive/My Drive/figure_csp_c_terminal_all_haplotypes.pdf', dpi=480, bbox_inches="tight") # increase the dpi for higher resolution

Histograms of Amino Acid Differences Between Subpopulations for csp c-terminal¶

Before creating this figure, we need to count the number of nucleotide changes within samples.

Count amino acid differences to non-3D7 haplotypes¶

Short way of counting substitions is using the csp_277-397_ns_changes column in the dataframe. Since different substutions are concatanated using ‘/’ (i.e. D288N/S301N), we will first split the string and then find the count.

def determine_num_aa_changes(x):
    if x == '' or x is None:
        return(0)
    else:
        return(len(x.split(',')[0].split('/')))


df_samples_hom = df_samples.loc[df_samples['c_terminal_hap_is_hom']]

df_samples.loc[df_samples['c_terminal_hap_is_hom'] == True, 'csp_277-397_ns_changes'].apply(determine_num_aa_changes).value_counts()

   3008
   2087
   1894
   1876
   1704
    374
    222
    45
     33
     17
Name: csp_277-397_ns_changes, dtype: int64

The long way of counting the amino acid differences is by cross-matching the haplotype sequence with the reference and counting the differences.

def determine_num_aa_changes_from_haplotype(
    x,
    # 3D7 haplotype
    haplotype='QGNGQGHNMPNDPNRNVDENANANSAVKNNNNEEPSDKHIKEYLNKIQNSLSTEWSPCSVTCGNGIQVRIKPGSANKPKDELDYANDIEKKICKMEKCSSVFNVVNSSIGLIMVLSFLFLN'
):
    num_aa_changes = 0
    if len(x) != len(haplotype):
        return(0)
#         raise ValueError(f"Length of x ({len(x)}) different to that of haplotype ({len(haplotype)})")
    for i in np.arange(len(x)):
        if x[i] != haplotype[i]:
            num_aa_changes += 1
    return(num_aa_changes)

(
    df_samples_hom['csp_277-397_aa_haplotype']
    .fillna('')
    .apply(
        lambda x: determine_num_aa_changes_from_haplotype(
            x,
            # 3D7
            'QGNGQGHNMPNDPNRNVDENANANSAVKNNNNEEPSDKHIKEYLNKIQNSLSTEWSPCSVTCGNGIQVRIKPGSANKPKDELDYANDIEKKICKMEKCSSVFNVVNSSIGLIMVLSFLFLN'
        )
    )
    .value_counts()
)

   3008
   2087
   1894
   1876
   1704
    369
    222
    45
     33
     17
      5
Name: csp_277-397_aa_haplotype, dtype: int64

Both methods gave the same results.

Count numbers of amino acid differences from each haplotype for each sample¶

We already found the amino acid differences to 3D7, as a bonus step, we can utilize the same function to count amino acid differences between each haplotype.

# Df to store haplotype counts in samples
num_aa_changes =  df_samples_hom.copy()

# Ignore the dirty performance warning
import warnings
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)

# Iterate over each haplotype
for haplotype in df_samples_hom['csp_277-397_aa_haplotype'].unique():
    print('.', end='')

    # Pick the corresponding substitutions (aa_changes) for the haplotype
    aa_changes = df_samples_hom.loc[df_samples_hom['csp_277-397_aa_haplotype'] == haplotype, 'csp_277-397_ns_changes'].values[0]

    # Create a column for aa_changes
    # Apply function to find number of changes
    num_aa_changes.loc[:,aa_changes] = (
        df_samples['csp_277-397_aa_haplotype']
        .apply(
            lambda x: determine_num_aa_changes_from_haplotype(
                x,
                haplotype
            )
        )
    )

........................................................................................................................................................................................................................................................

Create plot showing differences between 3D7 and Dd2 haplotypes¶

Let’s define a colour to each subpopulation to be used in the figure.

population_colours = collections.OrderedDict()
population_colours['SA']= "#4daf4a"
population_colours['AF-W']= "#e31a1c"
population_colours['AF-C'] = "#fd8d3c"
population_colours['AF-NE'] = "#bb8129"
population_colours['AF-E'] = "#fecc5c"
population_colours['AS-S-E'] = "#dfc0eb"
population_colours['AS-S-FE'] = "#984ea3"
population_colours['AS-SE-W'] = "#9ecae1"
population_colours['AS-SE-E'] = "#3182bd"
population_colours['OC-NG'] = "#f781bf"

In this plot:

Each subplot represents samples from a distinct subpopulation.
The x-axis denotes the number of amino acid differences, ranging from 1 to 10.
The y-axis indicates the corresponding number of samples.
In addition to the 3D7 reference, substitutions to the Dd2 reference will also be featured in the analysis.

# Create a 5x5 grid of subplots with specified dimensions and spacing
fig, axs = plt.subplots(5, 5, figsize=(15, 8), gridspec_kw={'height_ratios': [1, 1, 0.5, 1, 1], 'hspace': 0.25})

# Define an ordered dictionary to hold references and associated labels for each subplot
plots = collections.OrderedDict()
plots['3D7'] = '3D7'
plots['S301N/K317E/E318Q/N321K/A361E'] = 'Dd2'

# Loop through each reference in the plots dictionary
for plot_num, aa_changes in enumerate(plots):
    # Loop through each subpopulation and enumerate its index
    for pop_num, population in enumerate(population_colours):
        # Calculate the subplot indices based on the current reference and subpopulation
        plot_index_row = (pop_num // 5) * 3 + (plot_num)
        plot_index_col = pop_num % 5

        # Extract the number of amino acid changes for the current subpopulation and reference
        num_csp_ns_mutations_this_pop = num_aa_changes.loc[
            (num_aa_changes['Population'] == population),
            aa_changes
        ].values

        # Create a histogram for the amino acid changes in the current subplot
        axs[plot_index_row, plot_index_col].hist(num_csp_ns_mutations_this_pop, bins=np.arange(0, 12),
                                                 color=population_colours[population])

        # Set x-axis ticks and labels for better readability
        axs[plot_index_row, plot_index_col].set_xticks(np.arange(0.5, 11))

        # Set subplot title for the first row and x-axis labels for the second row
        if plot_num == 0:
            axs[plot_index_row, plot_index_col].set_title(population, fontsize=12)
        if plot_num == 1:
            axs[plot_index_row, plot_index_col].set_xticklabels(np.arange(0, 11))
        else:
            axs[plot_index_row, plot_index_col].set_xticklabels([])

        # Set y-axis label for each subplot
        if plot_index_col == 0:
            axs[plot_index_row, plot_index_col].set_ylabel(plots[aa_changes])

# Turn off the middle row of subplots for aesthetic reasons
for column_num in np.arange(5):
    axs[2, column_num].axis('off')

# Add x-axis and y-axis labels to the entire figure
fig.text(0.5, 0.04, '$csp$ c-terminal number of amino acid differences', ha='center', fontsize=12)
fig.text(0.05, 0.5, 'Frequency (number of samples with this number of\namino acid differences from given haplotype)',
         va='center', ha='center', fontsize=12, rotation=90)

Text(0.05, 0.5, 'Frequency (number of samples with this number of\namino acid differences from given haplotype)')

../../_images/csp-cterm-analysis_59_1.png

Figure Legend: Analysis of c-terminal of csp. Histograms of number of amino acid differences between samples in each major sub-population and the 3D7 haplotype (upper plot) and Dd2 haplotype (lower plot).

Save the plot¶

# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else

fig.savefig('/content/drive/My Drive/figure_csp_c_terminal_histogram.pdf', dpi=480, bbox_inches="tight") # increase the dpi for higher resolution

Barplot of Non-reference Allele Proportions in Different Sub-populations¶

Before proceeding with plotting, we need to determine the allele proportions, a process that involves the following two steps:

Extracting mutations from the substitution column.
Calculating the count of each mutation within a subpopulation.

Extracting mutations: Format and tidy-up substitutions column¶

Let’s examine a value in the substitutions column to determine the appropriate method for extraction.

df_samples_hom['csp_277-397_ns_changes'][0]

'S301N/K314Q/K317E/E318K/N321K/K322T/Q324K/E357Q'

Since all mutations are split by the “/” character from each other, we can write a simple function to seperate them.

def tidylist(x):

    # Split mutations
    x = (x.split('/'))

    # Create list to store extractions
    tidy_list = []

    # Loop through item list
    for item in x:
        item=item.upper() # Convert all to caps
        item=item.replace("*", "") # Remove unknown characters
        item=item.replace("!", "") # Remove unknown characters
        tidy_list.append(item)
    return tidy_list

# Apply function and store the results in a new column
df_samples_hom= df_samples_hom.copy()
df_samples_hom['list'] = df_samples_hom['csp_277-397_ns_changes'].apply(tidylist)

Extracting mutations: Unique mutations sorted by positon number¶

We will find unique mutations and sort them by positions in ascending order which will be useful for the final figure.

# Create array of all mutation types
# Find all unique mutations
unique_array = df_samples_hom['list'].explode().unique()

# Sort by positions (second character of string)
unique_array=sorted(unique_array,key=lambda t: t[1:])

Now, we will flag samples for the presence of these mutations, assigning True if they exist and False if they do not.

# Create boolean data frame, showing True or False values for each mutation
def boolean_df(item_lists, unique_items):
# Create empty dict
    bool_dict = {}
    # Loop through all the tags
    for item in unique_items:
        # Apply boolean mask
        bool_dict[item] = item_lists.apply(lambda x: item in x)
    # Return the results as a dataframe
    return pd.DataFrame(bool_dict)

ns_bool = boolean_df(df_samples_hom['list'], unique_array)
ns_bool.head()

	H283R	D288G	D288N	V293A	D294N	E295A	E295D	E295K	N296D	N296S	...	D356N	E357K	E357Q	D359N	A361E	A361I	A361K	A361S	L391F	3D7
0	False	False	False	False	False	False	False	False	False	False	...	False	False	True	False	False	False	False	False	False	False
1	False	False	False	False	False	False	False	False	False	False	...	False	False	True	False	False	False	False	False	False	False
3	False	False	False	False	False	False	False	False	False	False	...	False	False	False	True	True	False	False	False	False	False
4	False	False	False	False	False	False	False	False	False	False	...	False	False	True	False	False	False	False	False	False	False
5	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	True

5 rows × 68 columns

We will use the index of the samples to match sample populations from the df_samples since we have not resetted index so far.

# Make dictionary of sample and population
dictionary = dict(zip(df_samples_hom.index,df_samples_hom.Population))
# Map population to boolean dataframe using dictionary and set as index
ns_bool['population']= ns_bool.index.map(dictionary)
# Change sample names to population
ns_bool=ns_bool.set_index(['population'])
ns_bool

	H283R	D288G	D288N	V293A	D294N	E295A	E295D	E295K	N296D	N296S	...	D356N	E357K	E357Q	D359N	A361E	A361I	A361K	A361S	L391F	3D7
population
AF-W	False	False	False	False	False	False	False	False	False	False	...	False	False	True	False	False	False	False	False	False	False
AF-W	False	False	False	False	False	False	False	False	False	False	...	False	False	True	False	False	False	False	False	False	False
AF-W	False	False	False	False	False	False	False	False	False	False	...	False	False	False	True	True	False	False	False	False	False
AF-W	False	False	False	False	False	False	False	False	False	False	...	False	False	True	False	False	False	False	False	False	False
AF-W	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	True
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
AF-E	False	False	False	False	False	False	False	False	False	False	...	False	False	True	True	True	False	False	False	False	False
AF-E	False	False	False	False	False	False	False	False	False	False	...	True	False	False	False	True	False	False	False	False	False
AF-E	False	False	False	False	False	False	False	False	False	False	...	False	False	True	False	True	False	False	False	False	False
AF-E	False	False	False	False	False	False	False	False	False	False	...	False	False	True	False	True	False	False	False	False	False
AF-E	False	False	False	False	False	False	False	False	False	False	...	False	False	True	False	False	False	False	False	False	False

11260 rows × 68 columns

Formatting: Add columns for aa position, and amino acids before and after¶

Before calculating allele frequencies, some data formatting is required. We will transverse the dataframe to extract: aa_positions, wild-type and mutation.

#Transverse dataframe to calculate bool counts for each amino acid position change
ns_bool_T = ns_bool.T
ns_bool_T.head()

#Copy this data frame to modify later on when selecting my population
ns_bool_T_original = ns_bool_T.copy()

# Convert index to column
ns_bool_T = ns_bool.T.reset_index().rename(columns={'index': 'aa_change'})

# Drop last row - 3D7
ns_bool_T = ns_bool_T.drop(ns_bool_T.index[-1])

# Split aa_change column into the position, original amino acid, and mutated amino acid
ns_bool_T['aa_position']=ns_bool_T['aa_change'].str[1:4].astype(int)
ns_bool_T['OG_aa']=ns_bool_T['aa_change'].str[0]
ns_bool_T['mut_aa']=ns_bool_T['aa_change'].str[-1]
ns_bool_T.head()

# Make copy to use later
pop_bool = ns_bool_T.copy()

We can easily count how many mutations within each sample and find the frequency by dividing overall population size.

# Add count columnns
ns_bool_T['count']=ns_bool_T[['AF-W','AS-SE-E','AS-SE-W','AS-S-FE','AF-C','OC-NG','AF-NE', 'AS-S-E','SA','AF-E']].sum(axis=1) #create column for count

# Add count percentage column
df_samples.shape
pop_size=len(df_samples_hom)

# Normalise by population.
ns_bool_T['count_percentage']=ns_bool_T['count'].apply(lambda x: (x/pop_size)*100)

# Add ranking column
ns_bool_T=ns_bool_T.reset_index(drop=True).sort_values(by=['aa_position', 'count_percentage'], ascending=[True, False])
ns_bool_T['alt_acid_number'] = ns_bool_T.groupby(['aa_position']).cumcount()+1; ns_bool_T

# Make dictionary of alt_acid mumher and aa_change to map later on when looking at populations
rank_dict = dict(zip(ns_bool_T['aa_change'].to_list(), ns_bool_T['alt_acid_number'].to_list()))

In order to create stacked bars in the plot, we will create multi-level index.

#set index
ns_bool_T = ns_bool_T.set_index(['aa_position', 'OG_aa','alt_acid_number'])

pd.set_option('display.max_rows', 100)
ns_bool_T

		population	aa_change	AF-W	AF-W	AF-W	AF-W	AF-W	AF-W	AF-W	AF-W	AF-W	...	AF-E	AF-E	AF-E	AF-E	AF-E	AF-E	AF-E	mut_aa	count	count_percentage
aa_position	OG_aa	alt_acid_number
283	H	1	H283R	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	R	1	0.008881
288	D	1	D288N	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	N	72	0.639432
288	D	2	D288G	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	G	3	0.026643
293	V	1	V293A	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	A	1	0.008881
294	D	1	D294N	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	N	137	1.216696
295	E	1	E295K	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	K	31	0.275311
		2	E295A	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	A	7	0.062167
		3	E295D	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	D	2	0.017762
296	N	1	N296D	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	D	30	0.266430
296	N	2	N296S	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	S	2	0.017762
297	A	1	A297D	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	D	2	0.017762
297	A	2	A297S	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	S	1	0.008881
298	N	1	N298K	False	False	False	False	False	False	True	False	False	...	False	False	False	False	True	True	False	K	768	6.820604
299	A	1	A299G	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	G	73	0.648313
		2	A299S	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	S	21	0.186501
		3	A299D	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	D	14	0.124334
301	S	1	S301N	True	True	True	True	False	True	True	True	True	...	True	True	True	True	True	True	True	N	10761	95.568384
301	S	2	S301D	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	D	25	0.222025
302	A	1	A302D	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	D	1655	14.698046
		2	A302G	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	G	38	0.337478
		3	A302P	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	P	15	0.133215
		4	A302V	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	V	6	0.053286
303	V	1	V303A	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	A	3	0.026643
		2	V303E	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	E	3	0.026643
		3	V303G	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	G	1	0.008881
305	N	1	N305K	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	K	1	0.008881
310	E	1	E310V	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	V	1	0.008881
314	K	1	K314Q	True	True	False	True	False	True	False	False	False	...	False	False	True	False	False	False	True	Q	2269	20.150977
317	K	1	K317E	True	True	True	True	False	True	True	True	True	...	True	True	True	False	True	True	True	E	9745	86.545293
317	K	2	K317T	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	T	722	6.412078
318	E	1	E318Q	False	False	True	False	False	False	True	True	True	...	True	True	False	False	True	True	False	Q	6889	61.181172
		2	E318K	True	True	False	True	False	True	False	False	False	...	False	False	True	False	False	False	True	K	2456	21.811723
		3	E318D	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	D	7	0.062167
320	L	1	L320I	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	I	45	0.399645
321	N	1	N321K	True	True	True	True	False	True	False	False	True	...	False	True	False	False	True	True	True	K	9690	86.056838
		2	N321Q	False	False	False	False	False	False	False	False	False	...	False	False	True	False	False	False	False	Q	365	3.241563
		3	N321T	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	T	221	1.962700
		4	N321P	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	P	1	0.008881
322	K	1	K322T	True	True	True	True	False	True	False	False	True	...	False	False	False	False	True	True	True	T	2012	17.868561
		2	K322N	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	N	770	6.838366
		3	K322I	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	I	524	4.653641
		4	K322R	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	R	391	3.472469
		5	K322E	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	E	194	1.722913
		6	K322L	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	L	4	0.035524
323	I	1	I323M	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	M	68	0.603908
324	Q	1	Q324K	True	True	True	True	False	False	True	True	True	...	True	True	False	False	False	False	False	K	1972	17.513321
324	Q	2	Q324R	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	R	149	1.323268
325	N	1	N325Y	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	Y	1334	11.847247
326	S	1	S326A	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	A	34	0.301954
327	L	1	L327I	False	False	False	False	False	False	True	True	False	...	True	True	False	True	False	False	False	I	982	8.721137
349	G	1	G349D	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	D	18	0.159858
349	G	2	G349S	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	S	2	0.017762
352	N	1	N352D	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	D	1696	15.062167
352	N	2	N352G	False	False	False	False	False	False	False	False	False	...	False	False	False	True	False	False	False	G	894	7.939609
354	P	1	P354S	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	S	574	5.097691
354	P	2	P354L	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	L	1	0.008881
355	K	1	K355R	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	R	9	0.079929
356	D	1	D356N	False	False	False	False	False	False	False	False	False	...	False	False	False	True	False	False	False	N	332	2.948490
356	D	2	D356E	False	False	False	False	False	False	False	False	True	...	False	False	False	False	False	False	False	E	63	0.559503
357	E	1	E357Q	True	True	False	True	False	True	False	False	False	...	False	False	True	False	True	True	True	Q	4887	43.401421
357	E	2	E357K	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	K	1	0.008881
359	D	1	D359N	False	False	True	False	False	False	True	False	False	...	True	False	True	False	False	False	False	N	1345	11.944938
361	A	1	A361E	False	False	True	False	False	False	True	False	True	...	True	True	True	True	True	True	False	E	8580	76.198934
		2	A361I	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	I	72	0.639432
		3	A361K	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	K	1	0.008881
		4	A361S	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	S	1	0.008881
391	L	1	L391F	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	F	3	0.026643

67 rows × 11264 columns

Figure preparation¶

The last step is doing unstacking and grouping operations to prepare our data in a format suitable for creating a stacked bar plot.

UNSTACK - for count percentage values¶

# Unstack index by count percentage and groupby amino acid position
count_unstacked = ns_bool_T['count_percentage'].unstack().groupby(['aa_position','OG_aa']).first()
count_unstacked

	alt_acid_number	1	2	3	4	5	6
aa_position	OG_aa
283	H	0.008881	NaN	NaN	NaN	NaN	NaN
288	D	0.639432	0.026643	NaN	NaN	NaN	NaN
293	V	0.008881	NaN	NaN	NaN	NaN	NaN
294	D	1.216696	NaN	NaN	NaN	NaN	NaN
295	E	0.275311	0.062167	0.017762	NaN	NaN	NaN
296	N	0.266430	0.017762	NaN	NaN	NaN	NaN
297	A	0.017762	0.008881	NaN	NaN	NaN	NaN
298	N	6.820604	NaN	NaN	NaN	NaN	NaN
299	A	0.648313	0.186501	0.124334	NaN	NaN	NaN
301	S	95.568384	0.222025	NaN	NaN	NaN	NaN
302	A	14.698046	0.337478	0.133215	0.053286	NaN	NaN
303	V	0.026643	0.026643	0.008881	NaN	NaN	NaN
305	N	0.008881	NaN	NaN	NaN	NaN	NaN
310	E	0.008881	NaN	NaN	NaN	NaN	NaN
314	K	20.150977	NaN	NaN	NaN	NaN	NaN
317	K	86.545293	6.412078	NaN	NaN	NaN	NaN
318	E	61.181172	21.811723	0.062167	NaN	NaN	NaN
320	L	0.399645	NaN	NaN	NaN	NaN	NaN
321	N	86.056838	3.241563	1.962700	0.008881	NaN	NaN
322	K	17.868561	6.838366	4.653641	3.472469	1.722913	0.035524
323	I	0.603908	NaN	NaN	NaN	NaN	NaN
324	Q	17.513321	1.323268	NaN	NaN	NaN	NaN
325	N	11.847247	NaN	NaN	NaN	NaN	NaN
326	S	0.301954	NaN	NaN	NaN	NaN	NaN
327	L	8.721137	NaN	NaN	NaN	NaN	NaN
349	G	0.159858	0.017762	NaN	NaN	NaN	NaN
352	N	15.062167	7.939609	NaN	NaN	NaN	NaN
354	P	5.097691	0.008881	NaN	NaN	NaN	NaN
355	K	0.079929	NaN	NaN	NaN	NaN	NaN
356	D	2.948490	0.559503	NaN	NaN	NaN	NaN
357	E	43.401421	0.008881	NaN	NaN	NaN	NaN
359	D	11.944938	NaN	NaN	NaN	NaN	NaN
361	A	76.198934	0.639432	0.008881	0.008881	NaN	NaN
391	L	0.026643	NaN	NaN	NaN	NaN	NaN

In order to add non-substutional positions in the figure, we will create a dataframe with empty calls and merge with count_unstacked dataset to fill the above percentages.

#Create new data frame with empty calls and merge with this one!
aa_position=np.arange(277,363)
og_aa=list("QGNGQGHNMPNDPNRNVDENANANSAVKNNNNEEPSDKHIKEYLNKIQNSLSTEWSPCSVTCGNGIQVRIKPGSANKPKDELDYAN")
tuples=list(zip(aa_position,og_aa))

cols = count_unstacked.columns
index = pd.MultiIndex.from_tuples(tuples, names=['aa_position', 'OG_aa'])
empty_df = pd.DataFrame(np.nan,index, cols)
count_unstacked
empty_df

	alt_acid_number	1	2	3	4	5	6
aa_position	OG_aa
277	Q	NaN	NaN	NaN	NaN	NaN	NaN
278	G	NaN	NaN	NaN	NaN	NaN	NaN
279	N	NaN	NaN	NaN	NaN	NaN	NaN
280	G	NaN	NaN	NaN	NaN	NaN	NaN
281	Q	NaN	NaN	NaN	NaN	NaN	NaN
282	G	NaN	NaN	NaN	NaN	NaN	NaN
283	H	NaN	NaN	NaN	NaN	NaN	NaN
284	N	NaN	NaN	NaN	NaN	NaN	NaN
285	M	NaN	NaN	NaN	NaN	NaN	NaN
286	P	NaN	NaN	NaN	NaN	NaN	NaN
287	N	NaN	NaN	NaN	NaN	NaN	NaN
288	D	NaN	NaN	NaN	NaN	NaN	NaN
289	P	NaN	NaN	NaN	NaN	NaN	NaN
290	N	NaN	NaN	NaN	NaN	NaN	NaN
291	R	NaN	NaN	NaN	NaN	NaN	NaN
292	N	NaN	NaN	NaN	NaN	NaN	NaN
293	V	NaN	NaN	NaN	NaN	NaN	NaN
294	D	NaN	NaN	NaN	NaN	NaN	NaN
295	E	NaN	NaN	NaN	NaN	NaN	NaN
296	N	NaN	NaN	NaN	NaN	NaN	NaN
297	A	NaN	NaN	NaN	NaN	NaN	NaN
298	N	NaN	NaN	NaN	NaN	NaN	NaN
299	A	NaN	NaN	NaN	NaN	NaN	NaN
300	N	NaN	NaN	NaN	NaN	NaN	NaN
301	S	NaN	NaN	NaN	NaN	NaN	NaN
302	A	NaN	NaN	NaN	NaN	NaN	NaN
303	V	NaN	NaN	NaN	NaN	NaN	NaN
304	K	NaN	NaN	NaN	NaN	NaN	NaN
305	N	NaN	NaN	NaN	NaN	NaN	NaN
306	N	NaN	NaN	NaN	NaN	NaN	NaN
307	N	NaN	NaN	NaN	NaN	NaN	NaN
308	N	NaN	NaN	NaN	NaN	NaN	NaN
309	E	NaN	NaN	NaN	NaN	NaN	NaN
310	E	NaN	NaN	NaN	NaN	NaN	NaN
311	P	NaN	NaN	NaN	NaN	NaN	NaN
312	S	NaN	NaN	NaN	NaN	NaN	NaN
313	D	NaN	NaN	NaN	NaN	NaN	NaN
314	K	NaN	NaN	NaN	NaN	NaN	NaN
315	H	NaN	NaN	NaN	NaN	NaN	NaN
316	I	NaN	NaN	NaN	NaN	NaN	NaN
317	K	NaN	NaN	NaN	NaN	NaN	NaN
318	E	NaN	NaN	NaN	NaN	NaN	NaN
319	Y	NaN	NaN	NaN	NaN	NaN	NaN
320	L	NaN	NaN	NaN	NaN	NaN	NaN
321	N	NaN	NaN	NaN	NaN	NaN	NaN
322	K	NaN	NaN	NaN	NaN	NaN	NaN
323	I	NaN	NaN	NaN	NaN	NaN	NaN
324	Q	NaN	NaN	NaN	NaN	NaN	NaN
325	N	NaN	NaN	NaN	NaN	NaN	NaN
326	S	NaN	NaN	NaN	NaN	NaN	NaN
327	L	NaN	NaN	NaN	NaN	NaN	NaN
328	S	NaN	NaN	NaN	NaN	NaN	NaN
329	T	NaN	NaN	NaN	NaN	NaN	NaN
330	E	NaN	NaN	NaN	NaN	NaN	NaN
331	W	NaN	NaN	NaN	NaN	NaN	NaN
332	S	NaN	NaN	NaN	NaN	NaN	NaN
333	P	NaN	NaN	NaN	NaN	NaN	NaN
334	C	NaN	NaN	NaN	NaN	NaN	NaN
335	S	NaN	NaN	NaN	NaN	NaN	NaN
336	V	NaN	NaN	NaN	NaN	NaN	NaN
337	T	NaN	NaN	NaN	NaN	NaN	NaN
338	C	NaN	NaN	NaN	NaN	NaN	NaN
339	G	NaN	NaN	NaN	NaN	NaN	NaN
340	N	NaN	NaN	NaN	NaN	NaN	NaN
341	G	NaN	NaN	NaN	NaN	NaN	NaN
342	I	NaN	NaN	NaN	NaN	NaN	NaN
343	Q	NaN	NaN	NaN	NaN	NaN	NaN
344	V	NaN	NaN	NaN	NaN	NaN	NaN
345	R	NaN	NaN	NaN	NaN	NaN	NaN
346	I	NaN	NaN	NaN	NaN	NaN	NaN
347	K	NaN	NaN	NaN	NaN	NaN	NaN
348	P	NaN	NaN	NaN	NaN	NaN	NaN
349	G	NaN	NaN	NaN	NaN	NaN	NaN
350	S	NaN	NaN	NaN	NaN	NaN	NaN
351	A	NaN	NaN	NaN	NaN	NaN	NaN
352	N	NaN	NaN	NaN	NaN	NaN	NaN
353	K	NaN	NaN	NaN	NaN	NaN	NaN
354	P	NaN	NaN	NaN	NaN	NaN	NaN
355	K	NaN	NaN	NaN	NaN	NaN	NaN
356	D	NaN	NaN	NaN	NaN	NaN	NaN
357	E	NaN	NaN	NaN	NaN	NaN	NaN
358	L	NaN	NaN	NaN	NaN	NaN	NaN
359	D	NaN	NaN	NaN	NaN	NaN	NaN
360	Y	NaN	NaN	NaN	NaN	NaN	NaN
361	A	NaN	NaN	NaN	NaN	NaN	NaN
362	N	NaN	NaN	NaN	NaN	NaN	NaN

#fill empty data frame with data to plot values ---- complete! & fill nan's with a zero int value so that the bars will plot
data_to_plot = empty_df.fillna(count_unstacked).fillna(0)

UNSTACK for amino acid lists¶

Second and last dataframe will convey amino acids instead of percentages.

#Unstack index by count percentage and groupby amino acid position
mut_aa_unstacked = ns_bool_T['mut_aa'].unstack().groupby(['aa_position','OG_aa']).first()

mut_aa_test = empty_df.fillna(mut_aa_unstacked).fillna(" ")

one=mut_aa_test[1].to_list()
two=mut_aa_test[2].to_list()
three=mut_aa_test[3].to_list()
four=mut_aa_test[4].to_list()
five=mut_aa_test[5].to_list()
six=mut_aa_test[6].to_list()

mut_aa_test

	alt_acid_number	1	2	3	4	5	6
aa_position	OG_aa
277	Q
278	G
279	N
280	G
281	Q
282	G
283	H	R
284	N
285	M
286	P
287	N
288	D	N	G
289	P
290	N
291	R
292	N
293	V	A
294	D	N
295	E	K	A	D
296	N	D	S
297	A	D	S
298	N	K
299	A	G	S	D
300	N
301	S	N	D
302	A	D	G	P	V
303	V	A	E	G
304	K
305	N	K
306	N
307	N
308	N
309	E
310	E	V
311	P
312	S
313	D
314	K	Q
315	H
316	I
317	K	E	T
318	E	Q	K	D
319	Y
320	L	I
321	N	K	Q	T	P
322	K	T	N	I	R	E	L
323	I	M
324	Q	K	R
325	N	Y
326	S	A
327	L	I
328	S
329	T
330	E
331	W
332	S
333	P
334	C
335	S
336	V
337	T
338	C
339	G
340	N
341	G
342	I
343	Q
344	V
345	R
346	I
347	K
348	P
349	G	D	S
350	S
351	A
352	N	D	G
353	K
354	P	S	L
355	K	R
356	D	N	E
357	E	Q	K
358	L
359	D	N
360	Y
361	A	E	I	K	S
362	N

Insert empty rows for all amino acids in sequence.

#Use 'test.reset_index().index' to set in plotting to set x_ticks
test=data_to_plot.copy()
test.reset_index().index

RangeIndex(start=0, stop=86, step=1)

Plotting¶

There are three versions of this plot that we will go through one by one:

Showing allele proportions for Pf7
Showing allele proportions for one subpopulation
Showing allele proportions for Pf7 by subpopulations

Showing allele proportions for Pf7¶

We can assign a color to each rank of the alternate amino acid (1-6). We will display the corresponding colors in the legend.

# Make color dictionary
column_list=[1,2,3,4,5,6,]
color_list=['#0a2530', '#0796a9', '#d5e369', '#dc9922', '#943722', '#4e1b0f',]
columns_and_colors = (zip(column_list, color_list))
colors=dict(columns_and_colors)

We will create a stacked bar graph illustrating the percentage of samples with alternative amino acids. The primary axis (ax1) will display the stacked bars for each position, while secondary axes (ax2) underneath will depict alternative amino acid labels.

# Set figure
fig= plt.figure()
# Create first axis
ax1 = fig.add_axes((0.1,0.5,0.8,0.6))
# Plot bar graph on ax 1
ax1 = data_to_plot.plot(kind='bar',
                    stacked=True,
                    y=[1,2,3,4,5,6],
                    figsize=(50,10),
                    width=0.9,
                    color=[colors[i] for i in column_list],
                    linewidth = 0,
                    grid=True,
                    ax=ax1,
                    sharex = True,
                    legend=False)


#--------------------------------------------------------------------------------------------------------------------
#BAR GRAPH
#--------------------------------------------------------------------------------------------------------------------
#y axis of bar graph
ax1.set_ylabel("% of samples with alternative amino acid ", fontsize=22)
ax1.set_yticks([0,20,40,60,80,100])
ax1.set_yticklabels(["0","20","40","60","80","100"], fontweight='bold',fontsize=20)
#ax1.set_yscale('log')

#--------------------------------------------------------------------------------------------------------------------
#x axis of bar graph
ax1.set_xlim(-1,85)
ax1.set_xticks(test.reset_index().index)
ax1.set_xlabel(' ')
ax1.set_xticklabels(aa_position, fontweight='bold', fontsize=18)

#--------------------------------------------------------------------------------------------------------------------
# create secondary axis underneath
#--------------------------------------------------------------------------------------------------------------------
#Create OG aa x axis
ax2 = fig.add_axes((0.1,0.42,0.8,0.0))
ax2.yaxis.set_visible(False) # hide the yaxis
ax2.set_xlim(-1,85)
ax2.set_xticks(test.reset_index().index)
og_aa_plot=list("QGNGQGHNMPNDPNRNVDENANANSAVKNNNNEEPSDKHIKEYLNKIQNSLSTEWSPCSVTCGNGIQVRIKPGSANKPKDELDYAN")
ax2.set_xticklabels(og_aa_plot,fontweight='bold',fontsize=25)

#--------------------------------------------------------------------------------------------------------------------
#Create 1st alt x axis
ax2 = fig.add_axes((0.1,0.35,0.8,0.0), frameon=False)
ax2.yaxis.set_visible(False) # hide the yaxis
ax2.set_xlim(-1,85)
ax2.set_xticks(test.reset_index().index)
ax2.tick_params(length=0)
ax2.set_xticklabels(one, fontsize=25)


#--------------------------------------------------------------------------------------------------------------------
#Create 2nd  alt x axis
ax2 = fig.add_axes((0.1,0.32,0.8,0.0), frameon=False)
ax2.yaxis.set_visible(False) # hide the yaxis
ax2.set_xlim(-1,85)
ax2.set_xticks(test.reset_index().index)
ax2.tick_params(length=0)
ax2.set_xticklabels(two,fontsize=25)


#--------------------------------------------------------------------------------------------------------------------
#Create 3rd alt x axis
ax2 = fig.add_axes((0.1,0.29,0.8,0.0), frameon=False)
ax2.yaxis.set_visible(False) # hide the yaxis
ax2.set_xlim(-1,85)
ax2.set_xticks(test.reset_index().index)
ax2.tick_params(length=0)
ax2.set_xticklabels(three,fontsize=25)


#--------------------------------------------------------------------------------------------------------------------
#Create 4rd alt x axis
ax2 = fig.add_axes((0.1,0.26,0.8,0.0), frameon=False)
ax2.yaxis.set_visible(False) # hide the yaxis
ax2.set_xlim(-1,85)
ax2.set_xticks(test.reset_index().index)
ax2.tick_params(length=0)
ax2.set_xticklabels(four,fontsize=25)


#--------------------------------------------------------------------------------------------------------------------
#Create 5rd alt x axis
ax2 = fig.add_axes((0.1,0.23,0.8,0.0), frameon=False)
ax2.yaxis.set_visible(False) # hide the yaxis
ax2.set_xlim(-1,85)
ax2.set_xticks(test.reset_index().index)
ax2.tick_params(length=0)
ax2.set_xticklabels(five,fontsize=25)


#--------------------------------------------------------------------------------------------------------------------
#Create 6rd alt x axis
ax2 = fig.add_axes((0.1,0.20,0.8,0.0), frameon=False)
ax2.yaxis.set_visible(False) # hide the yaxis
ax2.set_xlim(-1,85)
ax2.set_xticks(test.reset_index().index)
ax2.tick_params(length=0)
ax2.set_xticklabels(six,fontsize=25)

# Legend
# Add text boxes for alternative amino acid labelling

fig.text(0.09,0.38, " REF",   bbox=dict(facecolor='white'  ), fontsize=18, fontweight='bold')
fig.text(0.09,0.32, "   1   ", bbox=dict(facecolor=colors[1]), fontsize=18, color='white' )
fig.text(0.09,0.29, "   2   ", bbox=dict(facecolor=colors[2]), fontsize=18, color='white')
fig.text(0.09,0.26, "   3   ", bbox=dict(facecolor=colors[3]), fontsize=18 )
fig.text(0.09,0.23, "   4   ", bbox=dict(facecolor=colors[4]), fontsize=18 )
fig.text(0.09,0.20, "   5   ", bbox=dict(facecolor=colors[5]), fontsize=18,color='white' )
fig.text(0.09,0.17, "   6   ", bbox=dict(facecolor=colors[6]), fontsize=18,color='white' )

fig.text(0.081,0.10, "Alternate Amino Acid",rotation=90, fontsize=20)

Text(0.081, 0.1, 'Alternate Amino Acid')

../../_images/csp-cterm-analysis_105_1.png

Figure Legend: Variation in c-terminal of csp in Pf7 samples.

Showing allele proportions for one subpopulation¶

To generate the depicted plot for a specific population, we need to filter the data corresponding to that population and normalize the allele proportions based on the population size. We will create a function that performs these two steps, and the remaining part of the plot remains the same as above.

def make_pop_plots(pop_bool, pop, pop_size):

    ns_bool_T = pop_bool[[pop, 'aa_change', 'aa_position', 'OG_aa', 'mut_aa']]

    # Add count columnns
    ns_bool_T=ns_bool_T.copy()
    ns_bool_T.loc[:, 'count'] = ns_bool_T[[pop]].sum(axis=1)  # create column for count

    # Add count percentage
    ns_bool_T.loc[:, 'count_percentage'] = ns_bool_T['count'].apply(lambda x: (x/pop_size)*100)  # Normalize by population

    # Add ranking column
    ns_bool_T=ns_bool_T.reset_index(drop=True).sort_values(by=['aa_position', 'count_percentage'], ascending=[True, False])
    ns_bool_T['alt_acid_number'] = ns_bool_T['aa_change'].map(rank_dict)

    # Set index
    ns_bool_T = ns_bool_T.set_index(['aa_position', 'OG_aa','alt_acid_number'])

    # Unstack index by count percentage and groupby amino acid position
    count_unstacked = ns_bool_T['count_percentage'].unstack().groupby(['aa_position','OG_aa']).first()

    # Create new data frame with empty calls and merge with this one!
    aa_position=np.arange(277,363)
    og_aa=list("QGNGQGHNMPNDPNRNVDENANANSAVKNNNNEEPSDKHIKEYLNKIQNSLSTEWSPCSVTCGNGIQVRIKPGSANKPKDELDYANDIEKKICKMEKCSSVFNVVNSSIGLIMVL")
    tuples=list(zip(aa_position,og_aa))

    cols = count_unstacked.columns
    index = pd.MultiIndex.from_tuples(tuples, names=['aa_position', 'OG_aa'])
    empty_df = pd.DataFrame(np.nan,index, cols)
    count_unstacked
    empty_df

    # Fill empty data frame with data to plot values ---- complete! & fill nan's with a zero int value so that the bars will plot
    pd.set_option('display.max_rows', 115)
    data_to_plot = empty_df.fillna(count_unstacked).fillna(0)

    # Turn count into an int and only select rows with count >0
    ns_bool_T['count']=ns_bool_T['count'].apply(lambda x: int(x))
    ns_bool_T.loc[ns_bool_T['count']> 0] #this is diff from all populations, need to select only those with count > 0

    # Unstack index by count percentage and groupby amino acid position
    mut_aa_unstacked = ns_bool_T.loc[ns_bool_T['count']> 0]['mut_aa'].unstack().groupby(['aa_position','OG_aa']).first()

    # Create alternative amino acid axis context
    mut_aa_test = empty_df.fillna(mut_aa_unstacked).fillna(" ")
    mut_aa_test

    one=mut_aa_test[1].to_list()
    two=mut_aa_test[2].to_list()
    three=mut_aa_test[3].to_list()
    four=mut_aa_test[4].to_list()
    five=mut_aa_test[5].to_list()
    six=mut_aa_test[6].to_list()

    # Make index range for column lenght in df.plot
    test=data_to_plot.copy()
    test.reset_index().index

    og_aa_plot=list("QGNGQGHNMPNDPNRNVDENANANSAVKNNNNEEPSDKHIKEYLNKIQNSLSTEWSPCSVTCGNGIQVRIKPGSANKPKDELDYAN")

    #Set figure
    fig= plt.figure()
    #Create first axis
    ax1 = fig.add_axes((0.1,0.5,0.8,0.6))
    #Plot bar graph on ax 1
    ax1 = data_to_plot.plot(kind='bar',
                        stacked=True,
                        y=[1,2,3,4,5,6],
                        figsize=(50,10),
                        width=1,
                        color=[colors[i] for i in data_to_plot.columns],
                        linewidth = 0,
                        grid=True,
                        ax=ax1,
                        sharex = True,
                        legend=False)

    ax1.set_title(pop, fontsize=30)
    #--------------------------------------------------------------------------------------------------------------------
    #BAR GRAPH
    #--------------------------------------------------------------------------------------------------------------------
    #y axis of bar graph
    ax1.set_ylabel("% of samples with alternative amino acid ", fontsize=22)
    ax1.set_yticks([0,20,40,60,80,100])
    ax1.set_yticklabels(["0","20","40","60","80","100"], fontweight='bold',fontsize=20)
    #ax1.set_yscale('log')

    #--------------------------------------------------------------------------------------------------------------------
    #x axis of bar graph
    ax1.set_xlim(-1,85)
    ax1.set_xticks(test.reset_index().index)
    ax1.set_xlabel(' ')
    ax1.set_xticklabels(aa_position, fontweight='bold', fontsize=18)

    #--------------------------------------------------------------------------------------------------------------------
    # create secondary axis underneath
    #--------------------------------------------------------------------------------------------------------------------
    #Create OG aa x axis
    ax2 = fig.add_axes((0.1,0.42,0.8,0.0))
    ax2.yaxis.set_visible(False) # hide the yaxis
    ax2.set_xlim(-1,85)
    ax2.set_xticks(test.reset_index().index)
    ax2.set_xticklabels(og_aa_plot,fontweight='bold',fontsize=25)

    #--------------------------------------------------------------------------------------------------------------------
    #Create 1st alt x axis
    ax2 = fig.add_axes((0.1,0.35,0.8,0.0), frameon=False)
    ax2.yaxis.set_visible(False) # hide the yaxis
    ax2.set_xlim(-1,85)
    ax2.set_xticks(test.reset_index().index)
    ax2.tick_params(length=0)
    ax2.set_xticklabels(one, fontsize=25)


    #--------------------------------------------------------------------------------------------------------------------
    #Create 2nd  alt x axis
    ax2 = fig.add_axes((0.1,0.32,0.8,0.0), frameon=False)
    ax2.yaxis.set_visible(False) # hide the yaxis
    ax2.set_xlim(-1,85)
    ax2.set_xticks(test.reset_index().index)
    ax2.tick_params(length=0)
    ax2.set_xticklabels(two,fontsize=25)




    #--------------------------------------------------------------------------------------------------------------------
    #Create 3rd alt x axis
    ax2 = fig.add_axes((0.1,0.29,0.8,0.0), frameon=False)
    ax2.yaxis.set_visible(False) # hide the yaxis
    ax2.set_xlim(-1,85)
    ax2.set_xticks(test.reset_index().index)
    ax2.tick_params(length=0)
    ax2.set_xticklabels(three,fontsize=25)


    #--------------------------------------------------------------------------------------------------------------------
    #Create 4rd alt x axis
    ax2 = fig.add_axes((0.1,0.26,0.8,0.0), frameon=False)
    ax2.yaxis.set_visible(False) # hide the yaxis
    ax2.set_xlim(-1,85)
    ax2.set_xticks(test.reset_index().index)
    ax2.tick_params(length=0)
    ax2.set_xticklabels(four,fontsize=25)


    #--------------------------------------------------------------------------------------------------------------------
    #Create 5rd alt x axis
    ax2 = fig.add_axes((0.1,0.23,0.8,0.0), frameon=False)
    ax2.yaxis.set_visible(False) # hide the yaxis
    ax2.set_xlim(-1,85)
    ax2.set_xticks(test.reset_index().index)
    ax2.tick_params(length=0)
    ax2.set_xticklabels(five,fontsize=25)


    #--------------------------------------------------------------------------------------------------------------------
    #Create 6rd alt x axis
    ax2 = fig.add_axes((0.1,0.20,0.8,0.0), frameon=False)
    ax2.yaxis.set_visible(False) # hide the yaxis
    ax2.set_xlim(-1,85)
    ax2.set_xticks(test.reset_index().index)
    ax2.tick_params(length=0)
    ax2.set_xticklabels(six,fontsize=25)

    #--------------------------------------------------------------------------------------------------------------------
    #--------------------------------------------------------------------------------------------------------------------
    #Add text boxes for alternative amino acid labelling
    fig.text(0.09,0.38, " REF",   bbox=dict(facecolor='white'  ), fontsize=18, fontweight='bold')
    fig.text(0.09,0.32, "   1   ", bbox=dict(facecolor='#1d3557'), fontsize=18, color='white' )
    fig.text(0.09,0.29, "   2   ", bbox=dict(facecolor='#457b9d'), fontsize=18, color='white')
    fig.text(0.09,0.26, "   3   ", bbox=dict(facecolor='#a8dadc'), fontsize=18 )
    fig.text(0.09,0.23, "   4   ", bbox=dict(facecolor='#f0f3bd'), fontsize=18 )
    fig.text(0.09,0.20, "   5   ", bbox=dict(facecolor='#ec9a9a'), fontsize=18 )
    fig.text(0.09,0.17, "   6   ", bbox=dict(facecolor='#e63946'), fontsize=18 )

    fig.text(0.081,0.10, "Alternate Amino Acid",rotation=90, fontsize=20)
    return fig

Let’s create a plot for the West African subpopulation.

fig_AF_W = make_pop_plots(pop_bool, 'AF-W', 3450)

../../_images/csp-cterm-analysis_111_0.png

Figure Legend: Variation in c-terminal of csp in samples from West Africa.

Showing allele proportions for Pf7 by subpopulations¶

The final figure will display each subpopulation stacked on top of each other with a shared x-axis. We will update the make_pop_plots function to generate a subplot.

def make_pop_plots2(pop_bool, pop, pop_size, subplot_num):
    """
    Generate a stacked bar graph representing the percentage of samples with alternative amino acids
    for a given population.

    Parameters:
        pop_bool (pd.DataFrame): DataFrame containing boolean values for alternative amino acid presence.
        pop (str): Name of the population.
        pop_size (int): Size of the population.
        subplot_num (AxesSubplot): Subplot to use for plotting.

    Returns:
        AxesSubplot: The subplot containing the generated stacked bar graph.
    """

    # Extract relevant columns from the DataFrame
    ns_bool_T = pop_bool[[pop, 'aa_change', 'aa_position', 'OG_aa', 'mut_aa']]

    # Create a copy of the DataFrame for modification
    ns_bool_T = ns_bool_T.copy()

    # Add columns for count and count percentage
    ns_bool_T.loc[:, 'count'] = ns_bool_T[[pop]].sum(axis=1)
    ns_bool_T.loc[:, 'count_percentage'] = ns_bool_T['count'].apply(lambda x: (x / pop_size) * 100)

    # Add a ranking column and set the index
    ns_bool_T = ns_bool_T.reset_index(drop=True).sort_values(by=['aa_position', 'count_percentage'], ascending=[True, False])
    ns_bool_T.loc[:, 'alt_acid_number'] = ns_bool_T['aa_change'].map(rank_dict)
    ns_bool_T = ns_bool_T.set_index(['aa_position', 'OG_aa', 'alt_acid_number'])

    # Unstack the index by count percentage and group by amino acid position
    count_unstacked = ns_bool_T['count_percentage'].unstack().groupby(['aa_position', 'OG_aa']).first()

    # Create a new DataFrame with empty cells and merge it with the existing one
    aa_position = np.arange(277, 363)
    og_aa = list("QGNGQGHNMPNDPNRNVDENANANSAVKNNNNEEPSDKHIKEYLNKIQNSLSTEWSPCSVTCGNGIQVRIKPGSANKPKDELDYANDIEKKICKMEKCSSVFNVVNSSIGLIMVL")
    tuples = list(zip(aa_position, og_aa))

    cols = count_unstacked.columns
    index = pd.MultiIndex.from_tuples(tuples, names=['aa_position', 'OG_aa'])
    empty_df = pd.DataFrame(np.nan, index, cols)
    data_to_plot = empty_df.fillna(count_unstacked).fillna(0)

    # Turn count into an int and select rows with count > 0
    ns_bool_T['count'] = ns_bool_T['count'].apply(lambda x: int(x))
    ns_bool_T.loc[ns_bool_T['count'] > 0]

    # Unstack the index by count percentage and group by amino acid position
    mut_aa_unstacked = ns_bool_T.loc[ns_bool_T['count'] > 0]['mut_aa'].unstack().groupby(['aa_position', 'OG_aa']).first()

    # Plot the stacked bar graph on the specified subplot
    subplot_num = data_to_plot.plot(kind='bar',
                                    stacked=True,
                                    y=[1, 2, 3, 4, 5, 6],
                                    figsize=(15, 10),
                                    width=0.9,
                                    color=[colors[i] for i in column_list],
                                    linewidth=1,
                                    ax=subplot_num,
                                    sharex=True,
                                    legend=False)

    # Create a twin axis for additional annotations
    twin = subplot_num.twinx()
    twin.set_ylabel(pop, fontsize=8)
    twin.set_yticks([])
    twin.set_yticklabels([])

    # Customize y-axis and x-axis of the bar graph
    subplot_num.set_yticks([0, 20, 40, 60, 80, 100])
    subplot_num.set_yticklabels(["0", "20", "40", "60", "80", "100"], fontsize=10)
    subplot_num.set_xlim(-1, 85)
    subplot_num.set_xticks(data_to_plot.reset_index().index)
    subplot_num.set_xlabel(' ')
    subplot_num.set_xticklabels(aa_position, fontweight='bold', fontsize=10)

    return subplot_num

Now, we will invoke the function for each subpopulation, establish the x-axis, and incorporate a legend.

# Set up a 10-subplot figure with shared x and y axes
fig, (ax1, ax2, ax3, ax4, ax5, ax6, ax7, ax8, ax9, ax10) = plt.subplots(10, sharex=True, sharey=True, figsize=(50, 200))

# Plot stacked bar graphs for different populations using the make_pop_plots2 function
make_pop_plots2(pop_bool, 'SA', 152, ax1)
make_pop_plots2(pop_bool, 'AF-W', 3450, ax2)
make_pop_plots2(pop_bool, 'AF-C', 291, ax3)
make_pop_plots2(pop_bool, 'AF-NE', 83, ax4)
make_pop_plots2(pop_bool, 'AF-E', 927, ax5)
make_pop_plots2(pop_bool, 'AS-S-E', 135, ax6)
make_pop_plots2(pop_bool, 'AS-S-FE', 1066, ax7)
make_pop_plots2(pop_bool, 'AS-SE-W', 1710, ax8)
make_pop_plots2(pop_bool, 'AS-SE-E', 3166, ax9)
make_pop_plots2(pop_bool, 'OC-NG', 274, ax10)

# Create secondary axes underneath for alternative amino acid labels

# Create OG aa x-axis
ax11 = fig.add_axes((0.1, 0.07, 0.8, 0.0), frameon=False)
ax11.yaxis.set_visible(False)
ax11.set_xlim(-4, 85)
ax11.set_xticks(test.reset_index().index)
ax11.tick_params(length=0)
ax11.set_xticklabels(og_aa_plot, fontweight='bold', fontsize=10)

# Create alternative x-axes for different amino acid labels
for i, labels in enumerate([one, two, three, four, five, six]):
    ax11 = fig.add_axes((0.1, 0.07 - (i + 1) * 0.01, 0.8, 0.0), frameon=False)
    ax11.yaxis.set_visible(False)
    ax11.set_xlim(-4, 85)
    ax11.set_xticks(test.reset_index().index)
    ax11.tick_params(length=0)
    ax11.set_xticklabels(labels, fontsize=10)

# Add text boxes for alternative amino acid labeling
fig.text(0.1, 0.064, " REF", bbox=dict(facecolor='white'), fontsize=8, fontweight='bold')
for i, color in enumerate(colors.values(), start=1):
    fig.text(0.1, 0.048 - i * 0.008, f"   {i}   ", bbox=dict(facecolor=color, edgecolor='black', boxstyle='round,pad=0.3'), fontsize=5, color='white')

# Add text for the alternative amino acid axis
fig.text(0.09, 0.001, "Alt. Amino Acid", rotation=90, fontsize=7)

Text(0.09, 0.001, 'Alt. Amino Acid')

../../_images/csp-cterm-analysis_117_1.png

Figure Legend: Variation in c-terminal of csp. The x-axis shows amino acid positions and the y-axis the proportion of non-reference alleles in different sub-populations. Different alternative amino acids are represented by different colours as represented by the legend below the x-axis.

Save the plot¶

# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else

fig.savefig('/content/drive/My Drive/figure_csp_c_terminal_barstacked.pdf', dpi=480, bbox_inches="tight") # increase the dpi for higher resolution

MalariaGEN parasite data user guide

Analyse csp C-Terminal Variation

Contents

Analyse csp C-Terminal Variation¶

Introduction¶

Setup¶

Access Pf7 Data¶

Access the csp c-terminal Haplotypes Data¶

Frequency of csp c-terminal haplotypes and global mean distances¶

Plotting¶

Save the plot¶

Histograms of Amino Acid Differences Between Subpopulations for csp c-terminal¶

Count amino acid differences to non-3D7 haplotypes¶

Count numbers of amino acid differences from each haplotype for each sample¶

Create plot showing differences between 3D7 and Dd2 haplotypes¶

Save the plot¶

Barplot of Non-reference Allele Proportions in Different Sub-populations¶

Extracting mutations: Format and tidy-up substitutions column¶

Extracting mutations: Unique mutations sorted by positon number¶

Formatting: Add columns for aa position, and amino acids before and after¶

Figure preparation¶

UNSTACK - for count percentage values¶

UNSTACK for amino acid lists¶

Plotting¶

Showing allele proportions for Pf7¶

Showing allele proportions for one subpopulation¶

Showing allele proportions for Pf7 by subpopulations¶

Save the plot¶