Plot eba175 C Allele Proportions
Contents
Plot eba175 C Allele Proportions¶
Introduction¶
This notebook recreates Supplement Figure 11 from the Pf7 paper - a bar plot which shows the proportions of the eba175 C allele in the major subpopulations.
Eba175 is a vaccine candidate gene with two distinct different allelic forms, known as the F- and C-types. Investigating the prevalence of the eba175 C allele can help assess whether eba175 experiences balancing selection, likely influenced by negative frequency-dependent selection driven by interactions with the human immune system.
This notebook should take approximately two minutes to run.
Setup¶
Install the malariagen Python package:
!pip install -q --no-warn-conflicts malariagen_data
import malariagen_data
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 133.2/133.2 kB 2.8 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 42.4 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.4/10.4 MB 77.1 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 57.4 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 302.5/302.5 kB 15.8 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 138.7/138.7 kB 10.8 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 48.5 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.1/8.1 MB 96.2 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 206.9/206.9 kB 20.5 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 233.6/233.6 kB 22.0 MB/s eta 0:00:00
?25h Preparing metadata (setup.py) ... ?25l?25hdone
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.7/6.7 MB 60.2 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 34.6 MB/s eta 0:00:00
?25h Building wheel for asciitree (setup.py) ... ?25l?25hdone
Import required python libraries that are installed at colab by default.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import collections
from google.colab import drive
Access Pf7 Data¶
We use the malariagen data package to load sample data and metadata.
release_data = malariagen_data.Pf7()
sample_metadata = release_data.sample_metadata()
For this figure, we also need to load the eba175 gene allele frequencies in major populations.
This data is available as a .txt file from a web link. We can directly read the data into a dataframe using the read_csv function in pandas!
If you click the left sidebar folder icon, you should see this file appear.
# Read in calls file direct from MalariaGEN website
df_eba = pd.read_csv(
'https://www.malariagen.net/wp-content/uploads/2023/11/Pf7_eba175_callset.txt'
, sep='\t'
, index_col=0
, low_memory=False
)
# Check shape (should contain 16,203 as this is the number of QC pass samples in Pf7)
print(df_eba.shape)
# View the new dataframe stucture
df_eba.head(3)
(16203, 3)
f_frac | c_frac | eba175_call | |
---|---|---|---|
Sample | |||
FP0008-C | 0.863946 | 0.000000 | F |
FP0009-C | 0.003988 | 0.627346 | Mixed |
FP0010-CW | 0.666667 | 0.000000 | F |
Combine data into a single dataframe¶
In order to create the figure, it’s convenient to work with combined data.
We retain all samples from sample_metadata which pass QC (n = 16,203).
# Merge the two dataframes on identical Sample IDs
df_calls = (
sample_metadata.loc[sample_metadata['QC pass']]
.merge(df_eba, on='Sample')
)
print(df_calls.shape)
# View the new dataframe stucture
df_calls.head(3)
(16203, 20)
Sample | Study | Country | Admin level 1 | Country latitude | Country longitude | Admin level 1 latitude | Admin level 1 longitude | Year | ENA | All samples same case | Population | % callable | QC pass | Exclusion reason | Sample type | Sample was in Pf6 | f_frac | c_frac | eba175_call | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | FP0008-C | 1147-PF-MR-CONWAY | Mauritania | Hodh el Gharbi | 20.265149 | -10.337093 | 16.565426 | -9.832345 | 2014.0 | ERR1081237 | FP0008-C | AF-W | 82.16 | True | Analysis_set | gDNA | True | 0.863946 | 0.000000 | F |
1 | FP0009-C | 1147-PF-MR-CONWAY | Mauritania | Hodh el Gharbi | 20.265149 | -10.337093 | 16.565426 | -9.832345 | 2014.0 | ERR1081238 | FP0009-C | AF-W | 88.85 | True | Analysis_set | gDNA | True | 0.003988 | 0.627346 | Mixed |
2 | FP0010-CW | 1147-PF-MR-CONWAY | Mauritania | Hodh el Gharbi | 20.265149 | -10.337093 | 16.565426 | -9.832345 | 2014.0 | ERR2889621 | FP0010-CW | AF-W | 86.46 | True | Analysis_set | sWGA | False | 0.666667 | 0.000000 | F |
Distribution of allelic types by population¶
First, we check the frequency of allelic types in the overall dataset before checking in each population.
num_F = np.count_nonzero(df_calls['eba175_call'] == 'F')
num_C = np.count_nonzero(df_calls['eba175_call'] == 'C')
proportion_C = num_C / (num_C + num_F)
print(f"We see {num_F:,} samples with the F allele and {num_C:,} samples with the C allele, \
meaning that overall {proportion_C:.0%} of samples that have an unambiguous call have the rarer C allele.")
We see 7,380 samples with the F allele and 3,364 samples with the C allele, meaning that overall 31% of samples that have an unambiguous call have the rarer C allele.
Now, we will examine individual population allelic frequencies to generate the corresponding figure.
First we define the populations (for more details on populations, please visit the Supplement).
# Define populations in an ordered dictionary
populations = collections.OrderedDict()
populations['SA'] = 'South America'
populations['AF-W'] = 'West Africa'
populations['AF-C'] = 'Central Africa'
populations['AF-NE'] = 'Northeast Africa'
populations['AF-E'] = 'East Africa'
populations['AS-S-E'] = 'Eastern South Asia'
populations['AS-S-FE'] = 'Far-eastern South Asia'
populations['AS-SE-W'] = 'Western Southeast Asia'
populations['AS-SE-E'] = 'Eastern Southeast Asia'
populations['OC-NG'] = 'Oceania'
Next, we need a function to aggregate allele frequencies in each population
# Function to aggregate population frequencies
def my_agg(x):
names = collections.OrderedDict()
names['Number F'] = np.count_nonzero(x['eba175_call'] == 'F')
names['Number C'] = np.count_nonzero(x['eba175_call'] == 'C')
names['Number F or C'] = names['Number F'] + names['Number C']
names['Proportion C'] = names['Number C'] / names['Number F or C'] if names['Number F or C'] > 0 else np.nan
return pd.Series(names)
# Summary table with population aggregates
df_pop_summary = df_calls.groupby('Population').apply(my_agg).loc[populations.keys()]
df_pop_summary
Number F | Number C | Number F or C | Proportion C | |
---|---|---|---|---|
Population | ||||
SA | 28.0 | 44.0 | 72.0 | 0.611111 |
AF-W | 3111.0 | 762.0 | 3873.0 | 0.196747 |
AF-C | 191.0 | 30.0 | 221.0 | 0.135747 |
AF-NE | 45.0 | 48.0 | 93.0 | 0.516129 |
AF-E | 751.0 | 210.0 | 961.0 | 0.218522 |
AS-S-E | 90.0 | 15.0 | 105.0 | 0.142857 |
AS-S-FE | 712.0 | 343.0 | 1055.0 | 0.325118 |
AS-SE-W | 702.0 | 741.0 | 1443.0 | 0.513514 |
AS-SE-E | 1655.0 | 1056.0 | 2711.0 | 0.389524 |
OC-NG | 95.0 | 115.0 | 210.0 | 0.547619 |
Make the figure¶
First, we set-up population-specific colour-codes.
# Create an ordered dictionary which maps the codes for major sub-populations -from west to east- to a colour code.
population_colours = collections.OrderedDict()
population_colours['SA'] = "#4daf4a"
population_colours['AF-W'] = "#e31a1c"
population_colours['AF-C'] = "#fd8d3c"
population_colours['AF-NE'] = "#bb8129"
population_colours['AF-E'] = "#fecc5c"
population_colours['AS-S-E'] = "#dfc0eb"
population_colours['AS-S-FE'] = "#984ea3"
population_colours['AS-SE-W'] = "#9ecae1"
population_colours['AS-SE-E'] = "#3182bd"
population_colours['OC-NG'] = "#f781bf"
We are ready to make the bar plot using df_pop_summary data.
# Set the figure size
fig, ax = plt.subplots(1, 1, figsize=(5, 3))
# Set the chart variables
ax.bar(df_pop_summary.index, df_pop_summary['Proportion C'], color=population_colours.values())
# Set the labels
ax.set_xticklabels(df_pop_summary.index, rotation=90)
ax.set_xlabel('Population')
ax.set_ylabel('Proportion of C allele')
# Hide the right spine (axis line) of the plot
ax.spines['right'].set_visible(False)
# Hide the top spine (axis line) of the plot
ax.spines['top'].set_visible(False)
# Set the y-axis limits
ax.set_ylim(0, 1)
# Add a horizontal line showing %50 proportion
ax.axhline(0.5, color='lightgray')
# Adjust figure to ensure everything fits nicely
fig.tight_layout()
<ipython-input-10-1f8c815b50e5>:6: UserWarning: FixedFormatter should only be used together with FixedLocator
ax.set_xticklabels(df_pop_summary.index, rotation=90)
Figure legend: Proportion of C allele of eba175 in different major subpopulations. Horizontal line shows 50% proportion.
Save the figure¶
We can output this to a location in Google Drive
First we need to connect Google Drive by running the following:
# You will need to authorise Google Colab access to Google Drive
drive.mount('/content/drive')
Mounted at /content/drive
# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else
fig.savefig('/content/drive/My Drive/Proportion_of_C_allele_by_population_20210720.pdf')
fig.savefig('/content/drive/My Drive/Proportion_of_C_allele_by_population_20210720.png', dpi=480) # increase the dpi for higher resolution
Distribution of allelic types at a more granular level¶
Here we look at the population characteristics organized by country, first-level administrative division, and year.
Then, we identify key summary statistics, minimum and maximum proportions of C allele, within each year and country.
This table was not included in the Pf7 manuscript.
# Group the df_calls by variables (Population, Country, Admin level 1, and Year)
# and apply the custom aggregation function 'my_agg' to calculate the allele frequencies
df_admin1_year_summary = df_calls.groupby(['Population', 'Country', 'Admin level 1', 'Year']).apply(my_agg)
# Print the shape of the resulting DataFrame
print(df_admin1_year_summary.shape)
# Calculate and print the maximum and minimum proportions of the C-type allele
# considering n = number of samples with an F or C type allele called
# min and max C allele proportions where n >= 30
print(f"Max proporiton C (n > 30) = {df_admin1_year_summary.loc[df_admin1_year_summary['Number F or C'] > 30, 'Proportion C'].max()}")
print(f"Min proporiton C (n > 30) = {df_admin1_year_summary.loc[df_admin1_year_summary['Number F or C'] > 30, 'Proportion C'].min()}")
# display 200 rows of resulting df where n > 30
pd.options.display.max_rows = 200
df_admin1_year_summary.loc[df_admin1_year_summary['Number F or C'] > 30]
(297, 4)
Max proporiton C (n > 30) = 0.8048780487804879
Min proporiton C (n > 30) = 0.018518518518518517
Number F | Number C | Number F or C | Proportion C | ||||
---|---|---|---|---|---|---|---|
Population | Country | Admin level 1 | Year | ||||
AF-C | Democratic Republic of the Congo | Kinshasa | 2012.0 | 84.0 | 16.0 | 100.0 | 0.160000 |
2013.0 | 51.0 | 13.0 | 64.0 | 0.203125 | |||
2014.0 | 31.0 | 1.0 | 32.0 | 0.031250 | |||
AF-E | Kenya | Kilifi | 1996.0 | 31.0 | 9.0 | 40.0 | 0.225000 |
2007.0 | 42.0 | 8.0 | 50.0 | 0.160000 | |||
2009.0 | 34.0 | 5.0 | 39.0 | 0.128205 | |||
2010.0 | 47.0 | 19.0 | 66.0 | 0.287879 | |||
2011.0 | 39.0 | 8.0 | 47.0 | 0.170213 | |||
2012.0 | 49.0 | 16.0 | 65.0 | 0.246154 | |||
Malawi | Chikwawa | 2011.0 | 109.0 | 23.0 | 132.0 | 0.174242 | |
Tanzania | Kigoma | 2014.0 | 54.0 | 28.0 | 82.0 | 0.341463 | |
Lindi | 2013.0 | 30.0 | 4.0 | 34.0 | 0.117647 | ||
Tanga | 2013.0 | 75.0 | 15.0 | 90.0 | 0.166667 | ||
2014.0 | 70.0 | 27.0 | 97.0 | 0.278351 | |||
AF-NE | Sudan | Khartoum | 2017.0 | 17.0 | 26.0 | 43.0 | 0.604651 |
AF-W | Benin | Littoral | 2016.0 | 53.0 | 1.0 | 54.0 | 0.018519 |
Burkina Faso | Haut-Bassins | 2008.0 | 36.0 | 2.0 | 38.0 | 0.052632 | |
Cameroon | Sud-Ouest | 2013.0 | 95.0 | 26.0 | 121.0 | 0.214876 | |
Côte d'Ivoire | Abidjan | 2013.0 | 38.0 | 2.0 | 40.0 | 0.050000 | |
Gabon | Wouleu-Ntem | 2014.0 | 31.0 | 8.0 | 39.0 | 0.205128 | |
Gambia | North Bank | 1984.0 | 53.0 | 11.0 | 64.0 | 0.171875 | |
Upper River | 2013.0 | 25.0 | 14.0 | 39.0 | 0.358974 | ||
2014.0 | 106.0 | 62.0 | 168.0 | 0.369048 | |||
2015.0 | 33.0 | 8.0 | 41.0 | 0.195122 | |||
2016.0 | 37.0 | 13.0 | 50.0 | 0.260000 | |||
2017.0 | 36.0 | 8.0 | 44.0 | 0.181818 | |||
Western | 2008.0 | 43.0 | 19.0 | 62.0 | 0.306452 | ||
2015.0 | 54.0 | 26.0 | 80.0 | 0.325000 | |||
Ghana | Ashanti | 2014.0 | 87.0 | 18.0 | 105.0 | 0.171429 | |
2015.0 | 61.0 | 13.0 | 74.0 | 0.175676 | |||
Central | 2014.0 | 51.0 | 12.0 | 63.0 | 0.190476 | ||
Greater Accra | 2018.0 | 64.0 | 32.0 | 96.0 | 0.333333 | ||
Upper East | 2009.0 | 51.0 | 8.0 | 59.0 | 0.135593 | ||
2010.0 | 94.0 | 5.0 | 99.0 | 0.050505 | |||
2011.0 | 45.0 | 3.0 | 48.0 | 0.062500 | |||
2013.0 | 111.0 | 8.0 | 119.0 | 0.067227 | |||
2014.0 | 22.0 | 12.0 | 34.0 | 0.352941 | |||
2015.0 | 121.0 | 33.0 | 154.0 | 0.214286 | |||
2016.0 | 110.0 | 44.0 | 154.0 | 0.285714 | |||
2017.0 | 319.0 | 64.0 | 383.0 | 0.167102 | |||
2018.0 | 356.0 | 73.0 | 429.0 | 0.170163 | |||
Guinea | Nzerekore | 2011.0 | 44.0 | 11.0 | 55.0 | 0.200000 | |
Mali | Bamako | 2013.0 | 80.0 | 13.0 | 93.0 | 0.139785 | |
Kayes | 2015.0 | 35.0 | 15.0 | 50.0 | 0.300000 | ||
2016.0 | 27.0 | 12.0 | 39.0 | 0.307692 | |||
Koulikoro | 2013.0 | 49.0 | 11.0 | 60.0 | 0.183333 | ||
2014.0 | 40.0 | 1.0 | 41.0 | 0.024390 | |||
2016.0 | 139.0 | 37.0 | 176.0 | 0.210227 | |||
Nigeria | Lagos | 2017.0 | 24.0 | 14.0 | 38.0 | 0.368421 | |
Senegal | Dakar | 2013.0 | 31.0 | 13.0 | 44.0 | 0.295455 | |
Sedhiou | 2015.0 | 43.0 | 7.0 | 50.0 | 0.140000 | ||
AS-S-E | India | Odisha | 2017.0 | 28.0 | 6.0 | 34.0 | 0.176471 |
West Bengal | 2017.0 | 40.0 | 5.0 | 45.0 | 0.111111 | ||
AS-S-FE | Bangladesh | Chittagong | 2012.0 | 23.0 | 8.0 | 31.0 | 0.258065 |
2015.0 | 234.0 | 132.0 | 366.0 | 0.360656 | |||
2016.0 | 408.0 | 194.0 | 602.0 | 0.322259 | |||
AS-SE-E | Cambodia | Pailin | 2011.0 | 26.0 | 12.0 | 38.0 | 0.315789 |
Preah Vihear | 2011.0 | 49.0 | 10.0 | 59.0 | 0.169492 | ||
2012.0 | 24.0 | 8.0 | 32.0 | 0.250000 | |||
Pursat | 2010.0 | 34.0 | 49.0 | 83.0 | 0.590361 | ||
2011.0 | 30.0 | 47.0 | 77.0 | 0.610390 | |||
2013.0 | 8.0 | 27.0 | 35.0 | 0.771429 | |||
Ratanakiri | 2010.0 | 23.0 | 9.0 | 32.0 | 0.281250 | ||
2011.0 | 38.0 | 20.0 | 58.0 | 0.344828 | |||
2016.0 | 17.0 | 37.0 | 54.0 | 0.685185 | |||
Laos | Attapeu | 2011.0 | 23.0 | 13.0 | 36.0 | 0.361111 | |
2017.0 | 39.0 | 19.0 | 58.0 | 0.327586 | |||
2018.0 | 25.0 | 20.0 | 45.0 | 0.444444 | |||
Champasak | 2017.0 | 66.0 | 15.0 | 81.0 | 0.185185 | ||
2018.0 | 54.0 | 44.0 | 98.0 | 0.448980 | |||
Salavan | 2017.0 | 71.0 | 25.0 | 96.0 | 0.260417 | ||
2018.0 | 36.0 | 4.0 | 40.0 | 0.100000 | |||
Savannakhet | 2017.0 | 153.0 | 66.0 | 219.0 | 0.301370 | ||
2018.0 | 59.0 | 54.0 | 113.0 | 0.477876 | |||
Vietnam | Binh Phuoc | 2010.0 | 33.0 | 23.0 | 56.0 | 0.410714 | |
2011.0 | 53.0 | 21.0 | 74.0 | 0.283784 | |||
2015.0 | 24.0 | 8.0 | 32.0 | 0.250000 | |||
2017.0 | 21.0 | 19.0 | 40.0 | 0.475000 | |||
2018.0 | 85.0 | 83.0 | 168.0 | 0.494048 | |||
Dak Lak | 2017.0 | 46.0 | 41.0 | 87.0 | 0.471264 | ||
Dak Nong | 2017.0 | 18.0 | 23.0 | 41.0 | 0.560976 | ||
Gia Lai | 2017.0 | 104.0 | 54.0 | 158.0 | 0.341772 | ||
2018.0 | 75.0 | 32.0 | 107.0 | 0.299065 | |||
AS-SE-W | Myanmar | Bago | 2012.0 | 15.0 | 21.0 | 36.0 | 0.583333 |
Kayin | 2016.0 | 156.0 | 182.0 | 338.0 | 0.538462 | ||
2017.0 | 117.0 | 121.0 | 238.0 | 0.508403 | |||
Mandalay | 2013.0 | 14.0 | 32.0 | 46.0 | 0.695652 | ||
Tanintharyi | 2011.0 | 8.0 | 33.0 | 41.0 | 0.804878 | ||
Thailand | Tak | 2002.0 | 27.0 | 13.0 | 40.0 | 0.325000 | |
2003.0 | 33.0 | 27.0 | 60.0 | 0.450000 | |||
2004.0 | 31.0 | 24.0 | 55.0 | 0.436364 | |||
2005.0 | 22.0 | 15.0 | 37.0 | 0.405405 | |||
2008.0 | 109.0 | 69.0 | 178.0 | 0.387640 | |||
2011.0 | 18.0 | 18.0 | 36.0 | 0.500000 | |||
2012.0 | 34.0 | 32.0 | 66.0 | 0.484848 | |||
2013.0 | 25.0 | 48.0 | 73.0 | 0.657534 | |||
OC-NG | Papua New Guinea | East Sepik | 2017.0 | 30.0 | 24.0 | 54.0 | 0.444444 |
Distribution of allelic types in each country, first-level administrative division and year. This table presents a breakdown of longitudinal and geospatial data concerning the distribution of allelic types (F and C) across 97 combinations of first-level administrative divisions and years for which over 30 samples have either an F or a C type called. The fact that we see both types present in all 97 combinations of admin division and year adds weight to the argument that eba175 is under balancing selection.