Analyse csp C-Terminal Variation
Contents
Analyse csp C-Terminal Variation¶
Introduction¶
The c-terminal domain of the Plasmodium falciparum circumsporozoite protein (csp) has been targeted by the RTS,S and R21 vaccines, underscoring the importance of monitoring the variability of this region.
This notebook recreates the following figures from the Pf7 paper:
Figure 4A: frequency of csp c-terminal haplotypes and global mean distances
Figure 4B: histograms of amino acid differences between subpopulations and 3D7 for csp c-terminal
Supplementary Figure 10: Barplot of non-reference allele proportions in different sub-populations.
This notebook takes 5 minutes to run.
Setup¶
Install and import the malariagen Python package:
!pip install -q --no-warn-conflicts malariagen_data
import malariagen_data
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 146.2/146.2 kB 1.4 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 11.0 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52.8/52.8 MB 7.6 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.2/10.2 MB 19.8 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 31.9 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 302.5/302.5 kB 11.9 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 138.7/138.7 kB 1.5 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 21.3 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.1/8.1 MB 33.5 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 206.9/206.9 kB 9.4 MB/s eta 0:00:00
?25h Preparing metadata (setup.py) ... ?25l?25hdone
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.7/7.7 MB 24.9 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 22.2 MB/s eta 0:00:00
?25h Building wheel for asciitree (setup.py) ... ?25l?25hdone
Import required python libraries that are installed at colab by default.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import collections
import re
import scipy
from google.colab import drive
Access Pf7 Data¶
We use the malariagen data package to load the release data.
release_data = malariagen_data.Pf7()
sample_metadata = release_data.sample_metadata()
# take a glance at the metadata dataframe
sample_metadata.head(3)
Sample | Study | Country | Admin level 1 | Country latitude | Country longitude | Admin level 1 latitude | Admin level 1 longitude | Year | ENA | All samples same case | Population | % callable | QC pass | Exclusion reason | Sample type | Sample was in Pf6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | FP0008-C | 1147-PF-MR-CONWAY | Mauritania | Hodh el Gharbi | 20.265149 | -10.337093 | 16.565426 | -9.832345 | 2014.0 | ERR1081237 | FP0008-C | AF-W | 82.16 | True | Analysis_set | gDNA | True |
1 | FP0009-C | 1147-PF-MR-CONWAY | Mauritania | Hodh el Gharbi | 20.265149 | -10.337093 | 16.565426 | -9.832345 | 2014.0 | ERR1081238 | FP0009-C | AF-W | 88.85 | True | Analysis_set | gDNA | True |
2 | FP0010-CW | 1147-PF-MR-CONWAY | Mauritania | Hodh el Gharbi | 20.265149 | -10.337093 | 16.565426 | -9.832345 | 2014.0 | ERR2889621 | FP0010-CW | AF-W | 86.46 | True | Analysis_set | sWGA | False |
Access the csp c-terminal Haplotypes Data¶
We can download the csp c-terminal region haplotypes for 16,203 samples from the MalariaGEN web resources using wget
command.
!wget ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/Pf7_csp_c_terminal_haplotypes.txt
--2024-01-09 16:44:02-- ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/Pf7_csp_c_terminal_haplotypes.txt
=> ‘Pf7_csp_c_terminal_haplotypes.txt’
Resolving ngs.sanger.ac.uk (ngs.sanger.ac.uk)... 193.62.203.79
Connecting to ngs.sanger.ac.uk (ngs.sanger.ac.uk)|193.62.203.79|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /production/malaria/Resource/34 ... done.
==> SIZE Pf7_csp_c_terminal_haplotypes.txt ... 10196635
==> PASV ... done. ==> RETR Pf7_csp_c_terminal_haplotypes.txt ... done.
Length: 10196635 (9.7M) (unauthoritative)
Pf7_csp_c_terminal_ 100%[===================>] 9.72M 7.16MB/s in 1.4s
2024-01-09 16:44:05 (7.16 MB/s) - ‘Pf7_csp_c_terminal_haplotypes.txt’ saved [10196635]
# Read the data into a dataset
df_csp_haplotypes = pd.read_csv('Pf7_csp_c_terminal_haplotypes.txt', sep='\t', low_memory=False)
# Rename first column as 'Samples' in accordance with sample_metadata
df_csp_haplotypes = df_csp_haplotypes.rename(columns={df_csp_haplotypes.columns[0]: 'Sample'})
# Print first rows
df_csp_haplotypes.head()
Sample | csp_277-397_aa_haplotype | csp_277-397_nucleotide_haplotype | csp_277-397_ns_changes | |
---|---|---|---|---|
0 | FP0008-C | QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSDQHIEKYLKT... | ctaattaaggaacaagaaggataataccattattaatcctattgaa... | S301N/K314Q/K317E/E318K/N321K/K322T/Q324K/E357Q |
1 | FP0009-C | QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSDQHIEKYLKT... | ctaattaaggaacaagaaggataataccattattaatcctattgaa... | S301N/K314Q/K317E/E318K/N321K/K322T/Q324K/E357Q |
2 | FP0010-CW | QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSDQHIEKYLKT... | ctaattaaggaacaagaaggataataccattattaatcctattgaa... | s301n/k314q/k317e/e318k/n321k/k322t/q324k/e357... |
3 | FP0011-CW | QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSDKHIEQYLKT... | ctaattaaggaacaagaaggataataccattattaatcctattgaa... | S301N/K317E/E318Q/N321K/K322T/Q324K/D359N/A361E |
4 | FP0012-CW | QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSDQHIEKYLKT... | ctaattaaggaacaagaaggataataccattattaatcctattgaa... | S301N/K314Q/K317E/E318K/N321K/K322T/Q324K/E357Q |
Let’s merge the metadata and haplotypes dataframes, sample_metadata
and df_csp_haplotypes
, using Sample
IDs.
df_samples = sample_metadata.merge(df_csp_haplotypes, on='Sample')
df_samples.head(3)
Sample | Study | Country | Admin level 1 | Country latitude | Country longitude | Admin level 1 latitude | Admin level 1 longitude | Year | ENA | All samples same case | Population | % callable | QC pass | Exclusion reason | Sample type | Sample was in Pf6 | csp_277-397_aa_haplotype | csp_277-397_nucleotide_haplotype | csp_277-397_ns_changes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | FP0008-C | 1147-PF-MR-CONWAY | Mauritania | Hodh el Gharbi | 20.265149 | -10.337093 | 16.565426 | -9.832345 | 2014.0 | ERR1081237 | FP0008-C | AF-W | 82.16 | True | Analysis_set | gDNA | True | QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSDQHIEKYLKT... | ctaattaaggaacaagaaggataataccattattaatcctattgaa... | S301N/K314Q/K317E/E318K/N321K/K322T/Q324K/E357Q |
1 | FP0009-C | 1147-PF-MR-CONWAY | Mauritania | Hodh el Gharbi | 20.265149 | -10.337093 | 16.565426 | -9.832345 | 2014.0 | ERR1081238 | FP0009-C | AF-W | 88.85 | True | Analysis_set | gDNA | True | QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSDQHIEKYLKT... | ctaattaaggaacaagaaggataataccattattaatcctattgaa... | S301N/K314Q/K317E/E318K/N321K/K322T/Q324K/E357Q |
2 | FP0010-CW | 1147-PF-MR-CONWAY | Mauritania | Hodh el Gharbi | 20.265149 | -10.337093 | 16.565426 | -9.832345 | 2014.0 | ERR2889621 | FP0010-CW | AF-W | 86.46 | True | Analysis_set | sWGA | False | QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSDQHIEKYLKT... | ctaattaaggaacaagaaggataataccattattaatcctattgaa... | s301n/k314q/k317e/e318k/n321k/k322t/q324k/e357... |
As we examine a region comprising 121 amino acids, haplotypes exceeding this length are considered heterozygous calls, indicative of mixed infections. Let’s find out the percentage of samples that comprise 121 amino acids.
df_samples['csp_277-397_aa_haplotype'].apply(len).value_counts() / 16203*100
121 69.493304
244 20.551750
1 6.801210
243 3.178424
202 0.006172
177 0.006172
Name: csp_277-397_aa_haplotype, dtype: float64
We will only use homozygous calls for our subsequent analyses.
# Classify samples with expected haplotype length as homozygous
df_samples['c_terminal_hap_is_hom'] = ( df_samples['csp_277-397_aa_haplotype'].apply(len) == 121 )
Frequency of csp c-terminal haplotypes and global mean distances¶
We will start with finding out the frequency of haplotypes.
The csp_277-397_ns_changes column provides information about nucleotide sequence changes in each sample. An empty value in this column indicates no change in the nucleotide sequence, and such haplotypes are referred to as 3D7, representing the reference sample in Pf7.
df_samples['csp_277-397_ns_changes'] = df_samples['csp_277-397_ns_changes'].fillna('3D7')
We can group the samples with identical nucleotide changes and count them using a simple function.
def my_agg(x):
names = collections.OrderedDict()
names['csp_277-397_aa_haplotype'] = x['csp_277-397_aa_haplotype']
names['number_of_samples'] = len(x)
return(pd.Series(names))
We will only apply the function to samples with homozygous calls, and lab strains will be excluded from the analysis.
df_haplotypes = pd.DataFrame(
df_samples.loc[
# Select only samples with homozygous calls
( df_samples['csp_277-397_aa_haplotype'].apply(len) == 121 ) &
# Exclude the lab strains
( df_samples['Study'] != '1153-PF-Pf3KLAB-KWIATKOWSKI' )
# Group samples by ns changes and apply the function to count samples
]).groupby('csp_277-397_ns_changes').apply(my_agg)
print(df_haplotypes.shape)
df_haplotypes.head()
(248, 2)
csp_277-397_aa_haplotype | number_of_samples | |
---|---|---|
csp_277-397_ns_changes | ||
3D7 | 5 QGNGQGHNMPNDPNRNVDENANANSAVKNNNNEEPSD... | 368 |
A297D/S301N/K314Q/K317E/E318K/N321K/K322T/Q324K/E357Q | 59 QGNGQGHNMPNDPNRNVDENDNANNAVKNNNNEEPSD... | 2 |
A297S/S301N/K317E/E318Q/E357Q | 11584 QGNGQGHNMPNDPNRNVDENSNANNAVKNNNNEEPSD... | 1 |
A299D/S301N/K314Q/K317E/E318K/N321K/E357Q | 505 QGNGQGHNMPNDPNRNVDENANDNNAVKNNNNEEPSD... | 14 |
A299G/L327I/E357Q | 1837 QGNGQGHNMPNDPNRNVDENANGNSAVKNNNNEEPSD... | 8 |
Next, we will create a distance matrix based on pairwise differences to compute the global mean distance for each sample. The diagonal of the matrix is filled with NaN to exclude self-comparisons.
# Initialize distance matrix
haplotype_distance_matrix = np.zeros((len(df_haplotypes), len(df_haplotypes)))
# Populate distance matrix
for i in range(len(df_haplotypes)):
for j in range(len(df_haplotypes)):
# Select the first copy of the haplotype for comparison
hap1 = df_haplotypes.iloc[i, 0].values[0]
hap2 = df_haplotypes.iloc[j, 0].values[0]
haplotype_distance_matrix[i, j] = np.count_nonzero(np.array(list(hap1)) != np.array(list(hap2)))
# Fill diagonal with NaN
np.fill_diagonal(haplotype_distance_matrix, np.nan)
# Print the shape and the distance matrix
print(haplotype_distance_matrix.shape)
print(haplotype_distance_matrix)
(248, 248)
[[nan 9. 5. ... 8. 8. 8.]
[ 9. nan 6. ... 4. 6. 4.]
[ 5. 6. nan ... 6. 5. 6.]
...
[ 8. 4. 6. ... nan 5. 2.]
[ 8. 6. 5. ... 5. nan 6.]
[ 8. 4. 6. ... 2. 6. nan]]
After creating the distance matrix, each row in the matrix represents the global mean distance for each sample. Finally, the samples are sorted based on their global mean distance for visual purposes of the figure.
df_haplotypes['global_mean_distance'] = np.nanmean(haplotype_distance_matrix, axis=0)
df_haplotypes.sort_values('global_mean_distance', inplace=True)
df_haplotypes
csp_277-397_aa_haplotype | number_of_samples | global_mean_distance | |
---|---|---|---|
csp_277-397_ns_changes | |||
S301N/K317E/E318Q/N321K/E357Q/A361E | 10856 QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSD... | 3 | 4.647773 |
S301N/K317E/E318Q/N321K/A361E | 712 QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSD... | 2760 | 4.708502 |
S301N/K317E/N321K/A361E | 6736 QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSDK... | 1 | 4.805668 |
S301N/K317E/E318Q/N321K/K322T/E357Q/A361E | 4581 QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSD... | 6 | 4.951417 |
S301N/K314Q/K317E/E318K/N321K/E357Q/A361E | 15085 QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSD... | 1 | 5.032389 |
... | ... | ... | ... |
S301N/L327I/N352G/D356E/D359N/A361E | 13069 QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSD... | 1 | 8.570850 |
K317T/N321K/N352G/P354S/D356N/A361E | 15402 QGNGQGHNMPNDPNRNVDENANANSAVKNNNNEEPSD... | 1 | 8.655870 |
D294N/S301N/N321T/L327I/N352G/D359N/A361E | 8343 QGNGQGHNMPNDPNRNVNENANANNAVKNNNNEEPSDK... | 1 | 8.724696 |
L327I/N352G/D356N/A361E | 1905 QGNGQGHNMPNDPNRNVDENANANSAVKNNNNEEPSDK... | 3 | 8.740891 |
D294N/S301N/K317T/N321K/N352G/P354S/K355R/A361E | 2364 QGNGQGHNMPNDPNRNVNENANANNAVKNNNNEEPSD... | 9 | 8.870445 |
248 rows × 3 columns
We would like to show haplotypes found in some lab strains. We haplotype-match these lab strains and store them in a new dataframe.
df_haplotypes_lab = df_haplotypes.loc[df_haplotypes.index.isin(df_samples.loc[df_samples['Study'] == '1153-PF-Pf3KLAB-KWIATKOWSKI', 'csp_277-397_ns_changes'])]
# List of labels
labels = ['Dd2', 'IT', '7G8', 'GB4', '3D7', 'HB3']
# Create 'labels' column
df_haplotypes_lab= df_haplotypes_lab.copy()
df_haplotypes_lab.loc[:, 'labels'] = labels
df_haplotypes_lab
csp_277-397_aa_haplotype | number_of_samples | global_mean_distance | labels | |
---|---|---|---|---|
csp_277-397_ns_changes | ||||
S301N/K317E/E318Q/N321K/A361E | 712 QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSD... | 2760 | 4.708502 | Dd2 |
S301N/K317E/E318Q/N321K/N352D/E357Q/A361E | 3090 QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSD... | 80 | 5.259109 | IT |
S301N/K317E/E318Q/N321K/Q324K/L327I/A361E | 23 QGNGQGHNMPNDPNRNVDENANANNAVKNNNNEEPSD... | 125 | 5.558704 | 7G8 |
N298K/S301N/N321K/K322I | 2174 QGNGQGHNMPNDPNRNVDENAKANNAVKNNNNEEPSD... | 4 | 6.736842 | GB4 |
3D7 | 5 QGNGQGHNMPNDPNRNVDENANANSAVKNNNNEEPSD... | 368 | 6.882591 | 3D7 |
D288N/S301N/K317E/E318Q/N321K/N352G/P354S/A361E | 92 QGNGQGHNMPNNPNRNVDENANANNAVKNNNNEEPSD... | 47 | 7.210526 | HB3 |
Plotting¶
We are ready to create the first plot which will have 2 subplots - one for showing the haplotype frequencies and the other one for showing the global mean distance. Then, we will paint lab strains with red colour using df_haplotypes_lab
dataset.
fig, axs = plt.subplots(2, 1, figsize=(12, 6))
axs[0].bar(df_haplotypes.index, df_haplotypes['number_of_samples'], log='y', color='grey')
axs[0].bar(df_haplotypes_lab.index, df_haplotypes_lab['number_of_samples'], log='y', color='red')
axs[0].set_yticks([1, 10, 100, 1000])
axs[0].set_yticklabels([1, 10, 100, 1000])
axs[0].set_xticks(df_haplotypes_lab.index)
axs[0].set_xticklabels(df_haplotypes_lab.labels)
axs[0].set_ylabel('Number of samples (log scale)')
# axs[0].set_ylim([0.8, 3000])
axs[1].bar(df_haplotypes.index, df_haplotypes['global_mean_distance'], color='grey')
axs[1].bar(df_haplotypes_lab.index, df_haplotypes_lab['global_mean_distance'], color='red')
# axs[1].set_xticks([])
axs[1].set_xticks(df_haplotypes_lab.labels)
axs[1].set_xticklabels([])
axs[1].set_ylabel('Global mean distance')
axs[1].invert_yaxis()
axs[1].xaxis.tick_top()
fig.tight_layout()
Figure Legend: Analysis of c-terminal of csp. Upper panel - frequency of different haplotypes of c-terminal of csp. Haplotypes found in some lab strains are named and highlighted in red. Haplotypes are ordered as per lower panel. Lower panel - global mean distance (number of amino acid differences) to all other haplotypes.
Save the plot¶
We can save the plot by running:
# You will need to authorise Google Colab access to Google Drive
drive.mount('/content/drive')
Mounted at /content/drive
# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else
fig.savefig('/content/drive/My Drive/figure_csp_c_terminal_all_haplotypes.pdf', dpi=480, bbox_inches="tight") # increase the dpi for higher resolution
Histograms of Amino Acid Differences Between Subpopulations for csp c-terminal¶
Before creating this figure, we need to count the number of nucleotide changes within samples.
Count amino acid differences to non-3D7 haplotypes¶
Short way of counting substitions is using the csp_277-397_ns_changes
column in the dataframe. Since different substutions are concatanated using ‘/’ (i.e. D288N/S301N), we will first split the string and then find the count.
def determine_num_aa_changes(x):
if x == '' or x is None:
return(0)
else:
return(len(x.split(',')[0].split('/')))
df_samples_hom = df_samples.loc[df_samples['c_terminal_hap_is_hom']]
df_samples.loc[df_samples['c_terminal_hap_is_hom'] == True, 'csp_277-397_ns_changes'].apply(determine_num_aa_changes).value_counts()
5 3008
8 2087
6 1894
7 1876
9 1704
1 374
4 222
10 45
2 33
3 17
Name: csp_277-397_ns_changes, dtype: int64
The long way of counting the amino acid differences is by cross-matching the haplotype sequence with the reference and counting the differences.
def determine_num_aa_changes_from_haplotype(
x,
# 3D7 haplotype
haplotype='QGNGQGHNMPNDPNRNVDENANANSAVKNNNNEEPSDKHIKEYLNKIQNSLSTEWSPCSVTCGNGIQVRIKPGSANKPKDELDYANDIEKKICKMEKCSSVFNVVNSSIGLIMVLSFLFLN'
):
num_aa_changes = 0
if len(x) != len(haplotype):
return(0)
# raise ValueError(f"Length of x ({len(x)}) different to that of haplotype ({len(haplotype)})")
for i in np.arange(len(x)):
if x[i] != haplotype[i]:
num_aa_changes += 1
return(num_aa_changes)
(
df_samples_hom['csp_277-397_aa_haplotype']
.fillna('')
.apply(
lambda x: determine_num_aa_changes_from_haplotype(
x,
# 3D7
'QGNGQGHNMPNDPNRNVDENANANSAVKNNNNEEPSDKHIKEYLNKIQNSLSTEWSPCSVTCGNGIQVRIKPGSANKPKDELDYANDIEKKICKMEKCSSVFNVVNSSIGLIMVLSFLFLN'
)
)
.value_counts()
)
5 3008
8 2087
6 1894
7 1876
9 1704
0 369
4 222
10 45
2 33
3 17
1 5
Name: csp_277-397_aa_haplotype, dtype: int64
Both methods gave the same results.
Count numbers of amino acid differences from each haplotype for each sample¶
We already found the amino acid differences to 3D7, as a bonus step, we can utilize the same function to count amino acid differences between each haplotype.
# Df to store haplotype counts in samples
num_aa_changes = df_samples_hom.copy()
# Ignore the dirty performance warning
import warnings
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
# Iterate over each haplotype
for haplotype in df_samples_hom['csp_277-397_aa_haplotype'].unique():
print('.', end='')
# Pick the corresponding substitutions (aa_changes) for the haplotype
aa_changes = df_samples_hom.loc[df_samples_hom['csp_277-397_aa_haplotype'] == haplotype, 'csp_277-397_ns_changes'].values[0]
# Create a column for aa_changes
# Apply function to find number of changes
num_aa_changes.loc[:,aa_changes] = (
df_samples['csp_277-397_aa_haplotype']
.apply(
lambda x: determine_num_aa_changes_from_haplotype(
x,
haplotype
)
)
)
........................................................................................................................................................................................................................................................
Create plot showing differences between 3D7 and Dd2 haplotypes¶
Let’s define a colour to each subpopulation to be used in the figure.
population_colours = collections.OrderedDict()
population_colours['SA']= "#4daf4a"
population_colours['AF-W']= "#e31a1c"
population_colours['AF-C'] = "#fd8d3c"
population_colours['AF-NE'] = "#bb8129"
population_colours['AF-E'] = "#fecc5c"
population_colours['AS-S-E'] = "#dfc0eb"
population_colours['AS-S-FE'] = "#984ea3"
population_colours['AS-SE-W'] = "#9ecae1"
population_colours['AS-SE-E'] = "#3182bd"
population_colours['OC-NG'] = "#f781bf"
In this plot:
Each subplot represents samples from a distinct subpopulation.
The x-axis denotes the number of amino acid differences, ranging from 1 to 10.
The y-axis indicates the corresponding number of samples.
In addition to the 3D7 reference, substitutions to the Dd2 reference will also be featured in the analysis.
# Create a 5x5 grid of subplots with specified dimensions and spacing
fig, axs = plt.subplots(5, 5, figsize=(15, 8), gridspec_kw={'height_ratios': [1, 1, 0.5, 1, 1], 'hspace': 0.25})
# Define an ordered dictionary to hold references and associated labels for each subplot
plots = collections.OrderedDict()
plots['3D7'] = '3D7'
plots['S301N/K317E/E318Q/N321K/A361E'] = 'Dd2'
# Loop through each reference in the plots dictionary
for plot_num, aa_changes in enumerate(plots):
# Loop through each subpopulation and enumerate its index
for pop_num, population in enumerate(population_colours):
# Calculate the subplot indices based on the current reference and subpopulation
plot_index_row = (pop_num // 5) * 3 + (plot_num)
plot_index_col = pop_num % 5
# Extract the number of amino acid changes for the current subpopulation and reference
num_csp_ns_mutations_this_pop = num_aa_changes.loc[
(num_aa_changes['Population'] == population),
aa_changes
].values
# Create a histogram for the amino acid changes in the current subplot
axs[plot_index_row, plot_index_col].hist(num_csp_ns_mutations_this_pop, bins=np.arange(0, 12),
color=population_colours[population])
# Set x-axis ticks and labels for better readability
axs[plot_index_row, plot_index_col].set_xticks(np.arange(0.5, 11))
# Set subplot title for the first row and x-axis labels for the second row
if plot_num == 0:
axs[plot_index_row, plot_index_col].set_title(population, fontsize=12)
if plot_num == 1:
axs[plot_index_row, plot_index_col].set_xticklabels(np.arange(0, 11))
else:
axs[plot_index_row, plot_index_col].set_xticklabels([])
# Set y-axis label for each subplot
if plot_index_col == 0:
axs[plot_index_row, plot_index_col].set_ylabel(plots[aa_changes])
# Turn off the middle row of subplots for aesthetic reasons
for column_num in np.arange(5):
axs[2, column_num].axis('off')
# Add x-axis and y-axis labels to the entire figure
fig.text(0.5, 0.04, '$csp$ c-terminal number of amino acid differences', ha='center', fontsize=12)
fig.text(0.05, 0.5, 'Frequency (number of samples with this number of\namino acid differences from given haplotype)',
va='center', ha='center', fontsize=12, rotation=90)
Text(0.05, 0.5, 'Frequency (number of samples with this number of\namino acid differences from given haplotype)')
Figure Legend: Analysis of c-terminal of csp. Histograms of number of amino acid differences between samples in each major sub-population and the 3D7 haplotype (upper plot) and Dd2 haplotype (lower plot).
Save the plot¶
# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else
fig.savefig('/content/drive/My Drive/figure_csp_c_terminal_histogram.pdf', dpi=480, bbox_inches="tight") # increase the dpi for higher resolution
Barplot of Non-reference Allele Proportions in Different Sub-populations¶
Before proceeding with plotting, we need to determine the allele proportions, a process that involves the following two steps:
Extracting mutations from the substitution column.
Calculating the count of each mutation within a subpopulation.
Extracting mutations: Format and tidy-up substitutions column¶
Let’s examine a value in the substitutions column to determine the appropriate method for extraction.
df_samples_hom['csp_277-397_ns_changes'][0]
'S301N/K314Q/K317E/E318K/N321K/K322T/Q324K/E357Q'
Since all mutations are split by the “/” character from each other, we can write a simple function to seperate them.
def tidylist(x):
# Split mutations
x = (x.split('/'))
# Create list to store extractions
tidy_list = []
# Loop through item list
for item in x:
item=item.upper() # Convert all to caps
item=item.replace("*", "") # Remove unknown characters
item=item.replace("!", "") # Remove unknown characters
tidy_list.append(item)
return tidy_list
# Apply function and store the results in a new column
df_samples_hom= df_samples_hom.copy()
df_samples_hom['list'] = df_samples_hom['csp_277-397_ns_changes'].apply(tidylist)
Extracting mutations: Unique mutations sorted by positon number¶
We will find unique mutations and sort them by positions in ascending order which will be useful for the final figure.
# Create array of all mutation types
# Find all unique mutations
unique_array = df_samples_hom['list'].explode().unique()
# Sort by positions (second character of string)
unique_array=sorted(unique_array,key=lambda t: t[1:])
Now, we will flag samples for the presence of these mutations, assigning True
if they exist and False
if they do not.
# Create boolean data frame, showing True or False values for each mutation
def boolean_df(item_lists, unique_items):
# Create empty dict
bool_dict = {}
# Loop through all the tags
for item in unique_items:
# Apply boolean mask
bool_dict[item] = item_lists.apply(lambda x: item in x)
# Return the results as a dataframe
return pd.DataFrame(bool_dict)
ns_bool = boolean_df(df_samples_hom['list'], unique_array)
ns_bool.head()
H283R | D288G | D288N | V293A | D294N | E295A | E295D | E295K | N296D | N296S | ... | D356N | E357K | E357Q | D359N | A361E | A361I | A361K | A361S | L391F | 3D7 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | True | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | True | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | True | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | True | False | False | False | False | False | False | False |
5 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
5 rows × 68 columns
We will use the index of the samples to match sample populations from the df_samples
since we have not resetted index so far.
# Make dictionary of sample and population
dictionary = dict(zip(df_samples_hom.index,df_samples_hom.Population))
# Map population to boolean dataframe using dictionary and set as index
ns_bool['population']= ns_bool.index.map(dictionary)
# Change sample names to population
ns_bool=ns_bool.set_index(['population'])
ns_bool
H283R | D288G | D288N | V293A | D294N | E295A | E295D | E295K | N296D | N296S | ... | D356N | E357K | E357Q | D359N | A361E | A361I | A361K | A361S | L391F | 3D7 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
population | |||||||||||||||||||||
AF-W | False | False | False | False | False | False | False | False | False | False | ... | False | False | True | False | False | False | False | False | False | False |
AF-W | False | False | False | False | False | False | False | False | False | False | ... | False | False | True | False | False | False | False | False | False | False |
AF-W | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | True | False | False | False | False | False |
AF-W | False | False | False | False | False | False | False | False | False | False | ... | False | False | True | False | False | False | False | False | False | False |
AF-W | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
AF-E | False | False | False | False | False | False | False | False | False | False | ... | False | False | True | True | True | False | False | False | False | False |
AF-E | False | False | False | False | False | False | False | False | False | False | ... | True | False | False | False | True | False | False | False | False | False |
AF-E | False | False | False | False | False | False | False | False | False | False | ... | False | False | True | False | True | False | False | False | False | False |
AF-E | False | False | False | False | False | False | False | False | False | False | ... | False | False | True | False | True | False | False | False | False | False |
AF-E | False | False | False | False | False | False | False | False | False | False | ... | False | False | True | False | False | False | False | False | False | False |
11260 rows × 68 columns
Formatting: Add columns for aa position, and amino acids before and after¶
Before calculating allele frequencies, some data formatting is required. We will transverse the dataframe to extract: aa_positions, wild-type and mutation.
#Transverse dataframe to calculate bool counts for each amino acid position change
ns_bool_T = ns_bool.T
ns_bool_T.head()
#Copy this data frame to modify later on when selecting my population
ns_bool_T_original = ns_bool_T.copy()
# Convert index to column
ns_bool_T = ns_bool.T.reset_index().rename(columns={'index': 'aa_change'})
# Drop last row - 3D7
ns_bool_T = ns_bool_T.drop(ns_bool_T.index[-1])
# Split aa_change column into the position, original amino acid, and mutated amino acid
ns_bool_T['aa_position']=ns_bool_T['aa_change'].str[1:4].astype(int)
ns_bool_T['OG_aa']=ns_bool_T['aa_change'].str[0]
ns_bool_T['mut_aa']=ns_bool_T['aa_change'].str[-1]
ns_bool_T.head()
# Make copy to use later
pop_bool = ns_bool_T.copy()
We can easily count how many mutations within each sample and find the frequency by dividing overall population size.
# Add count columnns
ns_bool_T['count']=ns_bool_T[['AF-W','AS-SE-E','AS-SE-W','AS-S-FE','AF-C','OC-NG','AF-NE', 'AS-S-E','SA','AF-E']].sum(axis=1) #create column for count
# Add count percentage column
df_samples.shape
pop_size=len(df_samples_hom)
# Normalise by population.
ns_bool_T['count_percentage']=ns_bool_T['count'].apply(lambda x: (x/pop_size)*100)
# Add ranking column
ns_bool_T=ns_bool_T.reset_index(drop=True).sort_values(by=['aa_position', 'count_percentage'], ascending=[True, False])
ns_bool_T['alt_acid_number'] = ns_bool_T.groupby(['aa_position']).cumcount()+1; ns_bool_T
# Make dictionary of alt_acid mumher and aa_change to map later on when looking at populations
rank_dict = dict(zip(ns_bool_T['aa_change'].to_list(), ns_bool_T['alt_acid_number'].to_list()))
In order to create stacked bars in the plot, we will create multi-level index.
#set index
ns_bool_T = ns_bool_T.set_index(['aa_position', 'OG_aa','alt_acid_number'])
pd.set_option('display.max_rows', 100)
ns_bool_T
population | aa_change | AF-W | AF-W | AF-W | AF-W | AF-W | AF-W | AF-W | AF-W | AF-W | ... | AF-E | AF-E | AF-E | AF-E | AF-E | AF-E | AF-E | mut_aa | count | count_percentage | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
aa_position | OG_aa | alt_acid_number | |||||||||||||||||||||
283 | H | 1 | H283R | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | R | 1 | 0.008881 |
288 | D | 1 | D288N | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | N | 72 | 0.639432 |
2 | D288G | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | G | 3 | 0.026643 | ||
293 | V | 1 | V293A | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | A | 1 | 0.008881 |
294 | D | 1 | D294N | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | N | 137 | 1.216696 |
295 | E | 1 | E295K | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | K | 31 | 0.275311 |
2 | E295A | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | A | 7 | 0.062167 | ||
3 | E295D | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | D | 2 | 0.017762 | ||
296 | N | 1 | N296D | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | D | 30 | 0.266430 |
2 | N296S | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | S | 2 | 0.017762 | ||
297 | A | 1 | A297D | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | D | 2 | 0.017762 |
2 | A297S | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | S | 1 | 0.008881 | ||
298 | N | 1 | N298K | False | False | False | False | False | False | True | False | False | ... | False | False | False | False | True | True | False | K | 768 | 6.820604 |
299 | A | 1 | A299G | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | G | 73 | 0.648313 |
2 | A299S | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | S | 21 | 0.186501 | ||
3 | A299D | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | D | 14 | 0.124334 | ||
301 | S | 1 | S301N | True | True | True | True | False | True | True | True | True | ... | True | True | True | True | True | True | True | N | 10761 | 95.568384 |
2 | S301D | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | D | 25 | 0.222025 | ||
302 | A | 1 | A302D | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | D | 1655 | 14.698046 |
2 | A302G | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | G | 38 | 0.337478 | ||
3 | A302P | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | P | 15 | 0.133215 | ||
4 | A302V | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | V | 6 | 0.053286 | ||
303 | V | 1 | V303A | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | A | 3 | 0.026643 |
2 | V303E | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | E | 3 | 0.026643 | ||
3 | V303G | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | G | 1 | 0.008881 | ||
305 | N | 1 | N305K | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | K | 1 | 0.008881 |
310 | E | 1 | E310V | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | V | 1 | 0.008881 |
314 | K | 1 | K314Q | True | True | False | True | False | True | False | False | False | ... | False | False | True | False | False | False | True | Q | 2269 | 20.150977 |
317 | K | 1 | K317E | True | True | True | True | False | True | True | True | True | ... | True | True | True | False | True | True | True | E | 9745 | 86.545293 |
2 | K317T | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | T | 722 | 6.412078 | ||
318 | E | 1 | E318Q | False | False | True | False | False | False | True | True | True | ... | True | True | False | False | True | True | False | Q | 6889 | 61.181172 |
2 | E318K | True | True | False | True | False | True | False | False | False | ... | False | False | True | False | False | False | True | K | 2456 | 21.811723 | ||
3 | E318D | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | D | 7 | 0.062167 | ||
320 | L | 1 | L320I | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | I | 45 | 0.399645 |
321 | N | 1 | N321K | True | True | True | True | False | True | False | False | True | ... | False | True | False | False | True | True | True | K | 9690 | 86.056838 |
2 | N321Q | False | False | False | False | False | False | False | False | False | ... | False | False | True | False | False | False | False | Q | 365 | 3.241563 | ||
3 | N321T | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | T | 221 | 1.962700 | ||
4 | N321P | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | P | 1 | 0.008881 | ||
322 | K | 1 | K322T | True | True | True | True | False | True | False | False | True | ... | False | False | False | False | True | True | True | T | 2012 | 17.868561 |
2 | K322N | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | N | 770 | 6.838366 | ||
3 | K322I | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | I | 524 | 4.653641 | ||
4 | K322R | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | R | 391 | 3.472469 | ||
5 | K322E | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | E | 194 | 1.722913 | ||
6 | K322L | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | L | 4 | 0.035524 | ||
323 | I | 1 | I323M | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | M | 68 | 0.603908 |
324 | Q | 1 | Q324K | True | True | True | True | False | False | True | True | True | ... | True | True | False | False | False | False | False | K | 1972 | 17.513321 |
2 | Q324R | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | R | 149 | 1.323268 | ||
325 | N | 1 | N325Y | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | Y | 1334 | 11.847247 |
326 | S | 1 | S326A | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | A | 34 | 0.301954 |
327 | L | 1 | L327I | False | False | False | False | False | False | True | True | False | ... | True | True | False | True | False | False | False | I | 982 | 8.721137 |
349 | G | 1 | G349D | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | D | 18 | 0.159858 |
2 | G349S | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | S | 2 | 0.017762 | ||
352 | N | 1 | N352D | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | D | 1696 | 15.062167 |
2 | N352G | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | G | 894 | 7.939609 | ||
354 | P | 1 | P354S | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | S | 574 | 5.097691 |
2 | P354L | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | L | 1 | 0.008881 | ||
355 | K | 1 | K355R | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | R | 9 | 0.079929 |
356 | D | 1 | D356N | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | N | 332 | 2.948490 |
2 | D356E | False | False | False | False | False | False | False | False | True | ... | False | False | False | False | False | False | False | E | 63 | 0.559503 | ||
357 | E | 1 | E357Q | True | True | False | True | False | True | False | False | False | ... | False | False | True | False | True | True | True | Q | 4887 | 43.401421 |
2 | E357K | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | K | 1 | 0.008881 | ||
359 | D | 1 | D359N | False | False | True | False | False | False | True | False | False | ... | True | False | True | False | False | False | False | N | 1345 | 11.944938 |
361 | A | 1 | A361E | False | False | True | False | False | False | True | False | True | ... | True | True | True | True | True | True | False | E | 8580 | 76.198934 |
2 | A361I | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | I | 72 | 0.639432 | ||
3 | A361K | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | K | 1 | 0.008881 | ||
4 | A361S | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | S | 1 | 0.008881 | ||
391 | L | 1 | L391F | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | F | 3 | 0.026643 |
67 rows × 11264 columns
Figure preparation¶
The last step is doing unstacking and grouping operations to prepare our data in a format suitable for creating a stacked bar plot.
UNSTACK - for count percentage values¶
# Unstack index by count percentage and groupby amino acid position
count_unstacked = ns_bool_T['count_percentage'].unstack().groupby(['aa_position','OG_aa']).first()
count_unstacked
alt_acid_number | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
aa_position | OG_aa | ||||||
283 | H | 0.008881 | NaN | NaN | NaN | NaN | NaN |
288 | D | 0.639432 | 0.026643 | NaN | NaN | NaN | NaN |
293 | V | 0.008881 | NaN | NaN | NaN | NaN | NaN |
294 | D | 1.216696 | NaN | NaN | NaN | NaN | NaN |
295 | E | 0.275311 | 0.062167 | 0.017762 | NaN | NaN | NaN |
296 | N | 0.266430 | 0.017762 | NaN | NaN | NaN | NaN |
297 | A | 0.017762 | 0.008881 | NaN | NaN | NaN | NaN |
298 | N | 6.820604 | NaN | NaN | NaN | NaN | NaN |
299 | A | 0.648313 | 0.186501 | 0.124334 | NaN | NaN | NaN |
301 | S | 95.568384 | 0.222025 | NaN | NaN | NaN | NaN |
302 | A | 14.698046 | 0.337478 | 0.133215 | 0.053286 | NaN | NaN |
303 | V | 0.026643 | 0.026643 | 0.008881 | NaN | NaN | NaN |
305 | N | 0.008881 | NaN | NaN | NaN | NaN | NaN |
310 | E | 0.008881 | NaN | NaN | NaN | NaN | NaN |
314 | K | 20.150977 | NaN | NaN | NaN | NaN | NaN |
317 | K | 86.545293 | 6.412078 | NaN | NaN | NaN | NaN |
318 | E | 61.181172 | 21.811723 | 0.062167 | NaN | NaN | NaN |
320 | L | 0.399645 | NaN | NaN | NaN | NaN | NaN |
321 | N | 86.056838 | 3.241563 | 1.962700 | 0.008881 | NaN | NaN |
322 | K | 17.868561 | 6.838366 | 4.653641 | 3.472469 | 1.722913 | 0.035524 |
323 | I | 0.603908 | NaN | NaN | NaN | NaN | NaN |
324 | Q | 17.513321 | 1.323268 | NaN | NaN | NaN | NaN |
325 | N | 11.847247 | NaN | NaN | NaN | NaN | NaN |
326 | S | 0.301954 | NaN | NaN | NaN | NaN | NaN |
327 | L | 8.721137 | NaN | NaN | NaN | NaN | NaN |
349 | G | 0.159858 | 0.017762 | NaN | NaN | NaN | NaN |
352 | N | 15.062167 | 7.939609 | NaN | NaN | NaN | NaN |
354 | P | 5.097691 | 0.008881 | NaN | NaN | NaN | NaN |
355 | K | 0.079929 | NaN | NaN | NaN | NaN | NaN |
356 | D | 2.948490 | 0.559503 | NaN | NaN | NaN | NaN |
357 | E | 43.401421 | 0.008881 | NaN | NaN | NaN | NaN |
359 | D | 11.944938 | NaN | NaN | NaN | NaN | NaN |
361 | A | 76.198934 | 0.639432 | 0.008881 | 0.008881 | NaN | NaN |
391 | L | 0.026643 | NaN | NaN | NaN | NaN | NaN |
In order to add non-substutional positions in the figure, we will create a dataframe with empty calls and merge with count_unstacked
dataset to fill the above percentages.
#Create new data frame with empty calls and merge with this one!
aa_position=np.arange(277,363)
og_aa=list("QGNGQGHNMPNDPNRNVDENANANSAVKNNNNEEPSDKHIKEYLNKIQNSLSTEWSPCSVTCGNGIQVRIKPGSANKPKDELDYAN")
tuples=list(zip(aa_position,og_aa))
cols = count_unstacked.columns
index = pd.MultiIndex.from_tuples(tuples, names=['aa_position', 'OG_aa'])
empty_df = pd.DataFrame(np.nan,index, cols)
count_unstacked
empty_df
alt_acid_number | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
aa_position | OG_aa | ||||||
277 | Q | NaN | NaN | NaN | NaN | NaN | NaN |
278 | G | NaN | NaN | NaN | NaN | NaN | NaN |
279 | N | NaN | NaN | NaN | NaN | NaN | NaN |
280 | G | NaN | NaN | NaN | NaN | NaN | NaN |
281 | Q | NaN | NaN | NaN | NaN | NaN | NaN |
282 | G | NaN | NaN | NaN | NaN | NaN | NaN |
283 | H | NaN | NaN | NaN | NaN | NaN | NaN |
284 | N | NaN | NaN | NaN | NaN | NaN | NaN |
285 | M | NaN | NaN | NaN | NaN | NaN | NaN |
286 | P | NaN | NaN | NaN | NaN | NaN | NaN |
287 | N | NaN | NaN | NaN | NaN | NaN | NaN |
288 | D | NaN | NaN | NaN | NaN | NaN | NaN |
289 | P | NaN | NaN | NaN | NaN | NaN | NaN |
290 | N | NaN | NaN | NaN | NaN | NaN | NaN |
291 | R | NaN | NaN | NaN | NaN | NaN | NaN |
292 | N | NaN | NaN | NaN | NaN | NaN | NaN |
293 | V | NaN | NaN | NaN | NaN | NaN | NaN |
294 | D | NaN | NaN | NaN | NaN | NaN | NaN |
295 | E | NaN | NaN | NaN | NaN | NaN | NaN |
296 | N | NaN | NaN | NaN | NaN | NaN | NaN |
297 | A | NaN | NaN | NaN | NaN | NaN | NaN |
298 | N | NaN | NaN | NaN | NaN | NaN | NaN |
299 | A | NaN | NaN | NaN | NaN | NaN | NaN |
300 | N | NaN | NaN | NaN | NaN | NaN | NaN |
301 | S | NaN | NaN | NaN | NaN | NaN | NaN |
302 | A | NaN | NaN | NaN | NaN | NaN | NaN |
303 | V | NaN | NaN | NaN | NaN | NaN | NaN |
304 | K | NaN | NaN | NaN | NaN | NaN | NaN |
305 | N | NaN | NaN | NaN | NaN | NaN | NaN |
306 | N | NaN | NaN | NaN | NaN | NaN | NaN |
307 | N | NaN | NaN | NaN | NaN | NaN | NaN |
308 | N | NaN | NaN | NaN | NaN | NaN | NaN |
309 | E | NaN | NaN | NaN | NaN | NaN | NaN |
310 | E | NaN | NaN | NaN | NaN | NaN | NaN |
311 | P | NaN | NaN | NaN | NaN | NaN | NaN |
312 | S | NaN | NaN | NaN | NaN | NaN | NaN |
313 | D | NaN | NaN | NaN | NaN | NaN | NaN |
314 | K | NaN | NaN | NaN | NaN | NaN | NaN |
315 | H | NaN | NaN | NaN | NaN | NaN | NaN |
316 | I | NaN | NaN | NaN | NaN | NaN | NaN |
317 | K | NaN | NaN | NaN | NaN | NaN | NaN |
318 | E | NaN | NaN | NaN | NaN | NaN | NaN |
319 | Y | NaN | NaN | NaN | NaN | NaN | NaN |
320 | L | NaN | NaN | NaN | NaN | NaN | NaN |
321 | N | NaN | NaN | NaN | NaN | NaN | NaN |
322 | K | NaN | NaN | NaN | NaN | NaN | NaN |
323 | I | NaN | NaN | NaN | NaN | NaN | NaN |
324 | Q | NaN | NaN | NaN | NaN | NaN | NaN |
325 | N | NaN | NaN | NaN | NaN | NaN | NaN |
326 | S | NaN | NaN | NaN | NaN | NaN | NaN |
327 | L | NaN | NaN | NaN | NaN | NaN | NaN |
328 | S | NaN | NaN | NaN | NaN | NaN | NaN |
329 | T | NaN | NaN | NaN | NaN | NaN | NaN |
330 | E | NaN | NaN | NaN | NaN | NaN | NaN |
331 | W | NaN | NaN | NaN | NaN | NaN | NaN |
332 | S | NaN | NaN | NaN | NaN | NaN | NaN |
333 | P | NaN | NaN | NaN | NaN | NaN | NaN |
334 | C | NaN | NaN | NaN | NaN | NaN | NaN |
335 | S | NaN | NaN | NaN | NaN | NaN | NaN |
336 | V | NaN | NaN | NaN | NaN | NaN | NaN |
337 | T | NaN | NaN | NaN | NaN | NaN | NaN |
338 | C | NaN | NaN | NaN | NaN | NaN | NaN |
339 | G | NaN | NaN | NaN | NaN | NaN | NaN |
340 | N | NaN | NaN | NaN | NaN | NaN | NaN |
341 | G | NaN | NaN | NaN | NaN | NaN | NaN |
342 | I | NaN | NaN | NaN | NaN | NaN | NaN |
343 | Q | NaN | NaN | NaN | NaN | NaN | NaN |
344 | V | NaN | NaN | NaN | NaN | NaN | NaN |
345 | R | NaN | NaN | NaN | NaN | NaN | NaN |
346 | I | NaN | NaN | NaN | NaN | NaN | NaN |
347 | K | NaN | NaN | NaN | NaN | NaN | NaN |
348 | P | NaN | NaN | NaN | NaN | NaN | NaN |
349 | G | NaN | NaN | NaN | NaN | NaN | NaN |
350 | S | NaN | NaN | NaN | NaN | NaN | NaN |
351 | A | NaN | NaN | NaN | NaN | NaN | NaN |
352 | N | NaN | NaN | NaN | NaN | NaN | NaN |
353 | K | NaN | NaN | NaN | NaN | NaN | NaN |
354 | P | NaN | NaN | NaN | NaN | NaN | NaN |
355 | K | NaN | NaN | NaN | NaN | NaN | NaN |
356 | D | NaN | NaN | NaN | NaN | NaN | NaN |
357 | E | NaN | NaN | NaN | NaN | NaN | NaN |
358 | L | NaN | NaN | NaN | NaN | NaN | NaN |
359 | D | NaN | NaN | NaN | NaN | NaN | NaN |
360 | Y | NaN | NaN | NaN | NaN | NaN | NaN |
361 | A | NaN | NaN | NaN | NaN | NaN | NaN |
362 | N | NaN | NaN | NaN | NaN | NaN | NaN |
#fill empty data frame with data to plot values ---- complete! & fill nan's with a zero int value so that the bars will plot
data_to_plot = empty_df.fillna(count_unstacked).fillna(0)
UNSTACK for amino acid lists¶
Second and last dataframe will convey amino acids instead of percentages.
#Unstack index by count percentage and groupby amino acid position
mut_aa_unstacked = ns_bool_T['mut_aa'].unstack().groupby(['aa_position','OG_aa']).first()
mut_aa_test = empty_df.fillna(mut_aa_unstacked).fillna(" ")
one=mut_aa_test[1].to_list()
two=mut_aa_test[2].to_list()
three=mut_aa_test[3].to_list()
four=mut_aa_test[4].to_list()
five=mut_aa_test[5].to_list()
six=mut_aa_test[6].to_list()
mut_aa_test
alt_acid_number | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
aa_position | OG_aa | ||||||
277 | Q | ||||||
278 | G | ||||||
279 | N | ||||||
280 | G | ||||||
281 | Q | ||||||
282 | G | ||||||
283 | H | R | |||||
284 | N | ||||||
285 | M | ||||||
286 | P | ||||||
287 | N | ||||||
288 | D | N | G | ||||
289 | P | ||||||
290 | N | ||||||
291 | R | ||||||
292 | N | ||||||
293 | V | A | |||||
294 | D | N | |||||
295 | E | K | A | D | |||
296 | N | D | S | ||||
297 | A | D | S | ||||
298 | N | K | |||||
299 | A | G | S | D | |||
300 | N | ||||||
301 | S | N | D | ||||
302 | A | D | G | P | V | ||
303 | V | A | E | G | |||
304 | K | ||||||
305 | N | K | |||||
306 | N | ||||||
307 | N | ||||||
308 | N | ||||||
309 | E | ||||||
310 | E | V | |||||
311 | P | ||||||
312 | S | ||||||
313 | D | ||||||
314 | K | Q | |||||
315 | H | ||||||
316 | I | ||||||
317 | K | E | T | ||||
318 | E | Q | K | D | |||
319 | Y | ||||||
320 | L | I | |||||
321 | N | K | Q | T | P | ||
322 | K | T | N | I | R | E | L |
323 | I | M | |||||
324 | Q | K | R | ||||
325 | N | Y | |||||
326 | S | A | |||||
327 | L | I | |||||
328 | S | ||||||
329 | T | ||||||
330 | E | ||||||
331 | W | ||||||
332 | S | ||||||
333 | P | ||||||
334 | C | ||||||
335 | S | ||||||
336 | V | ||||||
337 | T | ||||||
338 | C | ||||||
339 | G | ||||||
340 | N | ||||||
341 | G | ||||||
342 | I | ||||||
343 | Q | ||||||
344 | V | ||||||
345 | R | ||||||
346 | I | ||||||
347 | K | ||||||
348 | P | ||||||
349 | G | D | S | ||||
350 | S | ||||||
351 | A | ||||||
352 | N | D | G | ||||
353 | K | ||||||
354 | P | S | L | ||||
355 | K | R | |||||
356 | D | N | E | ||||
357 | E | Q | K | ||||
358 | L | ||||||
359 | D | N | |||||
360 | Y | ||||||
361 | A | E | I | K | S | ||
362 | N |
Insert empty rows for all amino acids in sequence.
#Use 'test.reset_index().index' to set in plotting to set x_ticks
test=data_to_plot.copy()
test.reset_index().index
RangeIndex(start=0, stop=86, step=1)
Plotting¶
There are three versions of this plot that we will go through one by one:
Showing allele proportions for Pf7
Showing allele proportions for one subpopulation
Showing allele proportions for Pf7 by subpopulations
Showing allele proportions for Pf7¶
We can assign a color to each rank of the alternate amino acid (1-6). We will display the corresponding colors in the legend.
# Make color dictionary
column_list=[1,2,3,4,5,6,]
color_list=['#0a2530', '#0796a9', '#d5e369', '#dc9922', '#943722', '#4e1b0f',]
columns_and_colors = (zip(column_list, color_list))
colors=dict(columns_and_colors)
We will create a stacked bar graph illustrating the percentage of samples with alternative amino acids. The primary axis (ax1) will display the stacked bars for each position, while secondary axes (ax2) underneath will depict alternative amino acid labels.
# Set figure
fig= plt.figure()
# Create first axis
ax1 = fig.add_axes((0.1,0.5,0.8,0.6))
# Plot bar graph on ax 1
ax1 = data_to_plot.plot(kind='bar',
stacked=True,
y=[1,2,3,4,5,6],
figsize=(50,10),
width=0.9,
color=[colors[i] for i in column_list],
linewidth = 0,
grid=True,
ax=ax1,
sharex = True,
legend=False)
#--------------------------------------------------------------------------------------------------------------------
#BAR GRAPH
#--------------------------------------------------------------------------------------------------------------------
#y axis of bar graph
ax1.set_ylabel("% of samples with alternative amino acid ", fontsize=22)
ax1.set_yticks([0,20,40,60,80,100])
ax1.set_yticklabels(["0","20","40","60","80","100"], fontweight='bold',fontsize=20)
#ax1.set_yscale('log')
#--------------------------------------------------------------------------------------------------------------------
#x axis of bar graph
ax1.set_xlim(-1,85)
ax1.set_xticks(test.reset_index().index)
ax1.set_xlabel(' ')
ax1.set_xticklabels(aa_position, fontweight='bold', fontsize=18)
#--------------------------------------------------------------------------------------------------------------------
# create secondary axis underneath
#--------------------------------------------------------------------------------------------------------------------
#Create OG aa x axis
ax2 = fig.add_axes((0.1,0.42,0.8,0.0))
ax2.yaxis.set_visible(False) # hide the yaxis
ax2.set_xlim(-1,85)
ax2.set_xticks(test.reset_index().index)
og_aa_plot=list("QGNGQGHNMPNDPNRNVDENANANSAVKNNNNEEPSDKHIKEYLNKIQNSLSTEWSPCSVTCGNGIQVRIKPGSANKPKDELDYAN")
ax2.set_xticklabels(og_aa_plot,fontweight='bold',fontsize=25)
#--------------------------------------------------------------------------------------------------------------------
#Create 1st alt x axis
ax2 = fig.add_axes((0.1,0.35,0.8,0.0), frameon=False)
ax2.yaxis.set_visible(False) # hide the yaxis
ax2.set_xlim(-1,85)
ax2.set_xticks(test.reset_index().index)
ax2.tick_params(length=0)
ax2.set_xticklabels(one, fontsize=25)
#--------------------------------------------------------------------------------------------------------------------
#Create 2nd alt x axis
ax2 = fig.add_axes((0.1,0.32,0.8,0.0), frameon=False)
ax2.yaxis.set_visible(False) # hide the yaxis
ax2.set_xlim(-1,85)
ax2.set_xticks(test.reset_index().index)
ax2.tick_params(length=0)
ax2.set_xticklabels(two,fontsize=25)
#--------------------------------------------------------------------------------------------------------------------
#Create 3rd alt x axis
ax2 = fig.add_axes((0.1,0.29,0.8,0.0), frameon=False)
ax2.yaxis.set_visible(False) # hide the yaxis
ax2.set_xlim(-1,85)
ax2.set_xticks(test.reset_index().index)
ax2.tick_params(length=0)
ax2.set_xticklabels(three,fontsize=25)
#--------------------------------------------------------------------------------------------------------------------
#Create 4rd alt x axis
ax2 = fig.add_axes((0.1,0.26,0.8,0.0), frameon=False)
ax2.yaxis.set_visible(False) # hide the yaxis
ax2.set_xlim(-1,85)
ax2.set_xticks(test.reset_index().index)
ax2.tick_params(length=0)
ax2.set_xticklabels(four,fontsize=25)
#--------------------------------------------------------------------------------------------------------------------
#Create 5rd alt x axis
ax2 = fig.add_axes((0.1,0.23,0.8,0.0), frameon=False)
ax2.yaxis.set_visible(False) # hide the yaxis
ax2.set_xlim(-1,85)
ax2.set_xticks(test.reset_index().index)
ax2.tick_params(length=0)
ax2.set_xticklabels(five,fontsize=25)
#--------------------------------------------------------------------------------------------------------------------
#Create 6rd alt x axis
ax2 = fig.add_axes((0.1,0.20,0.8,0.0), frameon=False)
ax2.yaxis.set_visible(False) # hide the yaxis
ax2.set_xlim(-1,85)
ax2.set_xticks(test.reset_index().index)
ax2.tick_params(length=0)
ax2.set_xticklabels(six,fontsize=25)
# Legend
# Add text boxes for alternative amino acid labelling
fig.text(0.09,0.38, " REF", bbox=dict(facecolor='white' ), fontsize=18, fontweight='bold')
fig.text(0.09,0.32, " 1 ", bbox=dict(facecolor=colors[1]), fontsize=18, color='white' )
fig.text(0.09,0.29, " 2 ", bbox=dict(facecolor=colors[2]), fontsize=18, color='white')
fig.text(0.09,0.26, " 3 ", bbox=dict(facecolor=colors[3]), fontsize=18 )
fig.text(0.09,0.23, " 4 ", bbox=dict(facecolor=colors[4]), fontsize=18 )
fig.text(0.09,0.20, " 5 ", bbox=dict(facecolor=colors[5]), fontsize=18,color='white' )
fig.text(0.09,0.17, " 6 ", bbox=dict(facecolor=colors[6]), fontsize=18,color='white' )
fig.text(0.081,0.10, "Alternate Amino Acid",rotation=90, fontsize=20)
Text(0.081, 0.1, 'Alternate Amino Acid')
Figure Legend: Variation in c-terminal of csp in Pf7 samples.
Showing allele proportions for one subpopulation¶
To generate the depicted plot for a specific population, we need to filter the data corresponding to that population and normalize the allele proportions based on the population size. We will create a function that performs these two steps, and the remaining part of the plot remains the same as above.
def make_pop_plots(pop_bool, pop, pop_size):
ns_bool_T = pop_bool[[pop, 'aa_change', 'aa_position', 'OG_aa', 'mut_aa']]
# Add count columnns
ns_bool_T=ns_bool_T.copy()
ns_bool_T.loc[:, 'count'] = ns_bool_T[[pop]].sum(axis=1) # create column for count
# Add count percentage
ns_bool_T.loc[:, 'count_percentage'] = ns_bool_T['count'].apply(lambda x: (x/pop_size)*100) # Normalize by population
# Add ranking column
ns_bool_T=ns_bool_T.reset_index(drop=True).sort_values(by=['aa_position', 'count_percentage'], ascending=[True, False])
ns_bool_T['alt_acid_number'] = ns_bool_T['aa_change'].map(rank_dict)
# Set index
ns_bool_T = ns_bool_T.set_index(['aa_position', 'OG_aa','alt_acid_number'])
# Unstack index by count percentage and groupby amino acid position
count_unstacked = ns_bool_T['count_percentage'].unstack().groupby(['aa_position','OG_aa']).first()
# Create new data frame with empty calls and merge with this one!
aa_position=np.arange(277,363)
og_aa=list("QGNGQGHNMPNDPNRNVDENANANSAVKNNNNEEPSDKHIKEYLNKIQNSLSTEWSPCSVTCGNGIQVRIKPGSANKPKDELDYANDIEKKICKMEKCSSVFNVVNSSIGLIMVL")
tuples=list(zip(aa_position,og_aa))
cols = count_unstacked.columns
index = pd.MultiIndex.from_tuples(tuples, names=['aa_position', 'OG_aa'])
empty_df = pd.DataFrame(np.nan,index, cols)
count_unstacked
empty_df
# Fill empty data frame with data to plot values ---- complete! & fill nan's with a zero int value so that the bars will plot
pd.set_option('display.max_rows', 115)
data_to_plot = empty_df.fillna(count_unstacked).fillna(0)
# Turn count into an int and only select rows with count >0
ns_bool_T['count']=ns_bool_T['count'].apply(lambda x: int(x))
ns_bool_T.loc[ns_bool_T['count']> 0] #this is diff from all populations, need to select only those with count > 0
# Unstack index by count percentage and groupby amino acid position
mut_aa_unstacked = ns_bool_T.loc[ns_bool_T['count']> 0]['mut_aa'].unstack().groupby(['aa_position','OG_aa']).first()
# Create alternative amino acid axis context
mut_aa_test = empty_df.fillna(mut_aa_unstacked).fillna(" ")
mut_aa_test
one=mut_aa_test[1].to_list()
two=mut_aa_test[2].to_list()
three=mut_aa_test[3].to_list()
four=mut_aa_test[4].to_list()
five=mut_aa_test[5].to_list()
six=mut_aa_test[6].to_list()
# Make index range for column lenght in df.plot
test=data_to_plot.copy()
test.reset_index().index
og_aa_plot=list("QGNGQGHNMPNDPNRNVDENANANSAVKNNNNEEPSDKHIKEYLNKIQNSLSTEWSPCSVTCGNGIQVRIKPGSANKPKDELDYAN")
#Set figure
fig= plt.figure()
#Create first axis
ax1 = fig.add_axes((0.1,0.5,0.8,0.6))
#Plot bar graph on ax 1
ax1 = data_to_plot.plot(kind='bar',
stacked=True,
y=[1,2,3,4,5,6],
figsize=(50,10),
width=1,
color=[colors[i] for i in data_to_plot.columns],
linewidth = 0,
grid=True,
ax=ax1,
sharex = True,
legend=False)
ax1.set_title(pop, fontsize=30)
#--------------------------------------------------------------------------------------------------------------------
#BAR GRAPH
#--------------------------------------------------------------------------------------------------------------------
#y axis of bar graph
ax1.set_ylabel("% of samples with alternative amino acid ", fontsize=22)
ax1.set_yticks([0,20,40,60,80,100])
ax1.set_yticklabels(["0","20","40","60","80","100"], fontweight='bold',fontsize=20)
#ax1.set_yscale('log')
#--------------------------------------------------------------------------------------------------------------------
#x axis of bar graph
ax1.set_xlim(-1,85)
ax1.set_xticks(test.reset_index().index)
ax1.set_xlabel(' ')
ax1.set_xticklabels(aa_position, fontweight='bold', fontsize=18)
#--------------------------------------------------------------------------------------------------------------------
# create secondary axis underneath
#--------------------------------------------------------------------------------------------------------------------
#Create OG aa x axis
ax2 = fig.add_axes((0.1,0.42,0.8,0.0))
ax2.yaxis.set_visible(False) # hide the yaxis
ax2.set_xlim(-1,85)
ax2.set_xticks(test.reset_index().index)
ax2.set_xticklabels(og_aa_plot,fontweight='bold',fontsize=25)
#--------------------------------------------------------------------------------------------------------------------
#Create 1st alt x axis
ax2 = fig.add_axes((0.1,0.35,0.8,0.0), frameon=False)
ax2.yaxis.set_visible(False) # hide the yaxis
ax2.set_xlim(-1,85)
ax2.set_xticks(test.reset_index().index)
ax2.tick_params(length=0)
ax2.set_xticklabels(one, fontsize=25)
#--------------------------------------------------------------------------------------------------------------------
#Create 2nd alt x axis
ax2 = fig.add_axes((0.1,0.32,0.8,0.0), frameon=False)
ax2.yaxis.set_visible(False) # hide the yaxis
ax2.set_xlim(-1,85)
ax2.set_xticks(test.reset_index().index)
ax2.tick_params(length=0)
ax2.set_xticklabels(two,fontsize=25)
#--------------------------------------------------------------------------------------------------------------------
#Create 3rd alt x axis
ax2 = fig.add_axes((0.1,0.29,0.8,0.0), frameon=False)
ax2.yaxis.set_visible(False) # hide the yaxis
ax2.set_xlim(-1,85)
ax2.set_xticks(test.reset_index().index)
ax2.tick_params(length=0)
ax2.set_xticklabels(three,fontsize=25)
#--------------------------------------------------------------------------------------------------------------------
#Create 4rd alt x axis
ax2 = fig.add_axes((0.1,0.26,0.8,0.0), frameon=False)
ax2.yaxis.set_visible(False) # hide the yaxis
ax2.set_xlim(-1,85)
ax2.set_xticks(test.reset_index().index)
ax2.tick_params(length=0)
ax2.set_xticklabels(four,fontsize=25)
#--------------------------------------------------------------------------------------------------------------------
#Create 5rd alt x axis
ax2 = fig.add_axes((0.1,0.23,0.8,0.0), frameon=False)
ax2.yaxis.set_visible(False) # hide the yaxis
ax2.set_xlim(-1,85)
ax2.set_xticks(test.reset_index().index)
ax2.tick_params(length=0)
ax2.set_xticklabels(five,fontsize=25)
#--------------------------------------------------------------------------------------------------------------------
#Create 6rd alt x axis
ax2 = fig.add_axes((0.1,0.20,0.8,0.0), frameon=False)
ax2.yaxis.set_visible(False) # hide the yaxis
ax2.set_xlim(-1,85)
ax2.set_xticks(test.reset_index().index)
ax2.tick_params(length=0)
ax2.set_xticklabels(six,fontsize=25)
#--------------------------------------------------------------------------------------------------------------------
#--------------------------------------------------------------------------------------------------------------------
#Add text boxes for alternative amino acid labelling
fig.text(0.09,0.38, " REF", bbox=dict(facecolor='white' ), fontsize=18, fontweight='bold')
fig.text(0.09,0.32, " 1 ", bbox=dict(facecolor='#1d3557'), fontsize=18, color='white' )
fig.text(0.09,0.29, " 2 ", bbox=dict(facecolor='#457b9d'), fontsize=18, color='white')
fig.text(0.09,0.26, " 3 ", bbox=dict(facecolor='#a8dadc'), fontsize=18 )
fig.text(0.09,0.23, " 4 ", bbox=dict(facecolor='#f0f3bd'), fontsize=18 )
fig.text(0.09,0.20, " 5 ", bbox=dict(facecolor='#ec9a9a'), fontsize=18 )
fig.text(0.09,0.17, " 6 ", bbox=dict(facecolor='#e63946'), fontsize=18 )
fig.text(0.081,0.10, "Alternate Amino Acid",rotation=90, fontsize=20)
return fig
Let’s create a plot for the West African subpopulation.
fig_AF_W = make_pop_plots(pop_bool, 'AF-W', 3450)
Figure Legend: Variation in c-terminal of csp in samples from West Africa.
Showing allele proportions for Pf7 by subpopulations¶
The final figure will display each subpopulation stacked on top of each other with a shared x-axis. We will update the make_pop_plots function to generate a subplot.
def make_pop_plots2(pop_bool, pop, pop_size, subplot_num):
"""
Generate a stacked bar graph representing the percentage of samples with alternative amino acids
for a given population.
Parameters:
pop_bool (pd.DataFrame): DataFrame containing boolean values for alternative amino acid presence.
pop (str): Name of the population.
pop_size (int): Size of the population.
subplot_num (AxesSubplot): Subplot to use for plotting.
Returns:
AxesSubplot: The subplot containing the generated stacked bar graph.
"""
# Extract relevant columns from the DataFrame
ns_bool_T = pop_bool[[pop, 'aa_change', 'aa_position', 'OG_aa', 'mut_aa']]
# Create a copy of the DataFrame for modification
ns_bool_T = ns_bool_T.copy()
# Add columns for count and count percentage
ns_bool_T.loc[:, 'count'] = ns_bool_T[[pop]].sum(axis=1)
ns_bool_T.loc[:, 'count_percentage'] = ns_bool_T['count'].apply(lambda x: (x / pop_size) * 100)
# Add a ranking column and set the index
ns_bool_T = ns_bool_T.reset_index(drop=True).sort_values(by=['aa_position', 'count_percentage'], ascending=[True, False])
ns_bool_T.loc[:, 'alt_acid_number'] = ns_bool_T['aa_change'].map(rank_dict)
ns_bool_T = ns_bool_T.set_index(['aa_position', 'OG_aa', 'alt_acid_number'])
# Unstack the index by count percentage and group by amino acid position
count_unstacked = ns_bool_T['count_percentage'].unstack().groupby(['aa_position', 'OG_aa']).first()
# Create a new DataFrame with empty cells and merge it with the existing one
aa_position = np.arange(277, 363)
og_aa = list("QGNGQGHNMPNDPNRNVDENANANSAVKNNNNEEPSDKHIKEYLNKIQNSLSTEWSPCSVTCGNGIQVRIKPGSANKPKDELDYANDIEKKICKMEKCSSVFNVVNSSIGLIMVL")
tuples = list(zip(aa_position, og_aa))
cols = count_unstacked.columns
index = pd.MultiIndex.from_tuples(tuples, names=['aa_position', 'OG_aa'])
empty_df = pd.DataFrame(np.nan, index, cols)
data_to_plot = empty_df.fillna(count_unstacked).fillna(0)
# Turn count into an int and select rows with count > 0
ns_bool_T['count'] = ns_bool_T['count'].apply(lambda x: int(x))
ns_bool_T.loc[ns_bool_T['count'] > 0]
# Unstack the index by count percentage and group by amino acid position
mut_aa_unstacked = ns_bool_T.loc[ns_bool_T['count'] > 0]['mut_aa'].unstack().groupby(['aa_position', 'OG_aa']).first()
# Plot the stacked bar graph on the specified subplot
subplot_num = data_to_plot.plot(kind='bar',
stacked=True,
y=[1, 2, 3, 4, 5, 6],
figsize=(15, 10),
width=0.9,
color=[colors[i] for i in column_list],
linewidth=1,
ax=subplot_num,
sharex=True,
legend=False)
# Create a twin axis for additional annotations
twin = subplot_num.twinx()
twin.set_ylabel(pop, fontsize=8)
twin.set_yticks([])
twin.set_yticklabels([])
# Customize y-axis and x-axis of the bar graph
subplot_num.set_yticks([0, 20, 40, 60, 80, 100])
subplot_num.set_yticklabels(["0", "20", "40", "60", "80", "100"], fontsize=10)
subplot_num.set_xlim(-1, 85)
subplot_num.set_xticks(data_to_plot.reset_index().index)
subplot_num.set_xlabel(' ')
subplot_num.set_xticklabels(aa_position, fontweight='bold', fontsize=10)
return subplot_num
Now, we will invoke the function for each subpopulation, establish the x-axis, and incorporate a legend.
# Set up a 10-subplot figure with shared x and y axes
fig, (ax1, ax2, ax3, ax4, ax5, ax6, ax7, ax8, ax9, ax10) = plt.subplots(10, sharex=True, sharey=True, figsize=(50, 200))
# Plot stacked bar graphs for different populations using the make_pop_plots2 function
make_pop_plots2(pop_bool, 'SA', 152, ax1)
make_pop_plots2(pop_bool, 'AF-W', 3450, ax2)
make_pop_plots2(pop_bool, 'AF-C', 291, ax3)
make_pop_plots2(pop_bool, 'AF-NE', 83, ax4)
make_pop_plots2(pop_bool, 'AF-E', 927, ax5)
make_pop_plots2(pop_bool, 'AS-S-E', 135, ax6)
make_pop_plots2(pop_bool, 'AS-S-FE', 1066, ax7)
make_pop_plots2(pop_bool, 'AS-SE-W', 1710, ax8)
make_pop_plots2(pop_bool, 'AS-SE-E', 3166, ax9)
make_pop_plots2(pop_bool, 'OC-NG', 274, ax10)
# Create secondary axes underneath for alternative amino acid labels
# Create OG aa x-axis
ax11 = fig.add_axes((0.1, 0.07, 0.8, 0.0), frameon=False)
ax11.yaxis.set_visible(False)
ax11.set_xlim(-4, 85)
ax11.set_xticks(test.reset_index().index)
ax11.tick_params(length=0)
ax11.set_xticklabels(og_aa_plot, fontweight='bold', fontsize=10)
# Create alternative x-axes for different amino acid labels
for i, labels in enumerate([one, two, three, four, five, six]):
ax11 = fig.add_axes((0.1, 0.07 - (i + 1) * 0.01, 0.8, 0.0), frameon=False)
ax11.yaxis.set_visible(False)
ax11.set_xlim(-4, 85)
ax11.set_xticks(test.reset_index().index)
ax11.tick_params(length=0)
ax11.set_xticklabels(labels, fontsize=10)
# Add text boxes for alternative amino acid labeling
fig.text(0.1, 0.064, " REF", bbox=dict(facecolor='white'), fontsize=8, fontweight='bold')
for i, color in enumerate(colors.values(), start=1):
fig.text(0.1, 0.048 - i * 0.008, f" {i} ", bbox=dict(facecolor=color, edgecolor='black', boxstyle='round,pad=0.3'), fontsize=5, color='white')
# Add text for the alternative amino acid axis
fig.text(0.09, 0.001, "Alt. Amino Acid", rotation=90, fontsize=7)
Text(0.09, 0.001, 'Alt. Amino Acid')
Figure Legend: Variation in c-terminal of csp. The x-axis shows amino acid positions and the y-axis the proportion of non-reference alleles in different sub-populations. Different alternative amino acids are represented by different colours as represented by the legend below the x-axis.
Save the plot¶
# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else
fig.savefig('/content/drive/My Drive/figure_csp_c_terminal_barstacked.pdf', dpi=480, bbox_inches="tight") # increase the dpi for higher resolution