Plot sample collection per country¶

Introduction¶

This notebook creates a bar plot showing the number of samples in the Pf8 release, broken down by country. Each bar also shows how many of that country's samples passed or failed quality control (QC). The notebook then creates a second figure, which compares the number of samples per country in the MalariaGEN Pf8 release with the number in the Pf7 release.

This notebook should take approximately 1 minute to run.

Setup¶

Install and import the malariagen_data Python package:

!pip install malariagen_data -q --no-warn-conflicts
import malariagen_data

Import the required Python libraries, which are installed on Colab by default.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import collections
from google.colab import drive

Access Pf8 Data¶

We use the malariagen_data package to load the release data.

release_data = malariagen_data.Pf8()
sample_metadata = release_data.sample_metadata()

sample_metadata.head(3)

	Sample	Study	Country	Admin level 1	Country latitude	Country longitude	Admin level 1 latitude	Admin level 1 longitude	Year	ENA	All samples same case	Population	% callable	QC pass	Exclusion reason	Sample type	Sample was in Pf7
0	FP0008-C	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR1081237	FP0008-C	AF-W	82.48	True	Analysis_set	gDNA	True
1	FP0009-C	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR1081238	FP0009-C	AF-W	88.95	True	Analysis_set	gDNA	True
2	FP0010-CW	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR2889621	FP0010-CW	AF-W	87.01	True	Analysis_set	sWGA	True

We can start exploring the data by answering these questions:

How many samples pass QC?
How many samples are there in each country?

# Calculate the total number of samples
total_sample_number = sample_metadata.Sample.count()

# Calculate the number of samples that passed QC
qc_pass_count = (sample_metadata['QC pass'] == True).sum()

# Calculate the number of samples that failed QC
qc_fail_count = (sample_metadata['QC pass'] == False).sum()

print(f"We see {total_sample_number} samples of which {qc_pass_count} QC-pass and {qc_fail_count} QC fail in the overall Pf8 dataset.")

We see 33325 samples of which 24409 QC-pass and 8916 QC fail in the overall Pf8 dataset.
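
As a quick sanity check on the printed totals, the pass and fail counts should add up to the total, and their ratio gives the overall QC pass rate. The sketch below simply re-uses the three numbers printed above; the variable names are illustrative and not part of the notebook.

```python
# Numbers copied from the printed output above
total_samples = 33325
qc_pass = 24409
qc_fail = 8916

# Every sample is either QC pass or QC fail, so the counts must add up
assert qc_pass + qc_fail == total_samples

# Overall proportion of samples passing QC
pass_rate = qc_pass / total_samples
print(f"QC pass rate: {pass_rate:.1%}")
```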

# Calculate the number of samples in each country
sample_metadata['Country'].value_counts()

	count
Country
Ghana	6653
Vietnam	2700
Mali	2428
Cambodia	2282
Kenya	2142
Gambia	1998
Laos	1994
Bangladesh	1658
Democratic Republic of the Congo	1549
Mozambique	1348
Nigeria	1303
Myanmar	1268
Thailand	1157
Tanzania	1144
Malawi	681
Sudan	356
Benin	334
India	318
Cameroon	294
Papua New Guinea	251
Guinea	199
Colombia	167
Senegal	155
Indonesia	133
Peru	106
Mauritania	104
Côte d'Ivoire	71
Gabon	59
Burkina Faso	58
Ethiopia	35
Madagascar	25
Uganda	15
Honduras	8
Venezuela	2

dtype: int64

Figure preparation: Defining populations¶

Countries are grouped into ten major sub-populations based on their geographic and genetic characteristics.

The dataframe has a Population column that contains abbreviated names. For clarity, we want to display the full names in the figure.

# Define populations in an ordered dictionary
populations = collections.OrderedDict()
populations['SA'] = 'South America'
populations['AF-W'] = 'West Africa'
populations['AF-C'] = 'Central Africa'
populations['AF-NE'] = 'Northeast Africa'
populations['AF-E'] = 'East Africa'
populations['AS-S-E'] = 'Eastern South Asia'
populations['AS-S-FE'] = 'Far-eastern South Asia'
populations['AS-SE-W'] = 'Western Southeast Asia'
populations['AS-SE-E'] = 'Eastern Southeast Asia'
populations['OC-NG'] = 'Oceania'

# Map the full population names into the df using the Population column and the populations dictionary (the new column is named 'Continent')
sample_metadata['Continent'] = sample_metadata['Population'].map(populations)
sample_metadata.head(3)

	Sample	Study	Country	Admin level 1	Country latitude	Country longitude	Admin level 1 latitude	Admin level 1 longitude	Year	ENA	All samples same case	Population	% callable	QC pass	Exclusion reason	Sample type	Sample was in Pf7	Continent
0	FP0008-C	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR1081237	FP0008-C	AF-W	82.48	True	Analysis_set	gDNA	True	West Africa
1	FP0009-C	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR1081238	FP0009-C	AF-W	88.95	True	Analysis_set	gDNA	True	West Africa
2	FP0010-CW	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR2889621	FP0010-CW	AF-W	87.01	True	Analysis_set	sWGA	True	West Africa
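
Series.map returns NaN for any value that has no entry in the dictionary, so it can be worth checking that every population code was matched. The following is a minimal sketch on made-up data: toy_populations, toy_codes and the code 'XX-1' are hypothetical and do not come from Pf8.

```python
import pandas as pd

# Hypothetical subset of the code-to-name dictionary
toy_populations = {'AF-W': 'West Africa', 'AF-E': 'East Africa'}

# Toy Population column; 'XX-1' is a made-up code with no dictionary entry
toy_codes = pd.Series(['AF-W', 'AF-E', 'XX-1', 'AF-W'], name='Population')

# Codes without a dictionary entry become NaN after mapping
mapped = toy_codes.map(toy_populations)

# List the codes that failed to map
unmapped_codes = sorted(toy_codes[mapped.isna()].unique())
print(unmapped_codes)
```

An empty list would mean that every code in the column has a full name.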

# Create an ordered dictionary which maps the codes for major sub-populations -from west to east- to a colour code.
population_colours = collections.OrderedDict()
population_colours['SA']       = "#4daf4a"
population_colours['AF-W']     = "#e31a1c"
population_colours['AF-C']     = "#fd8d3c"
population_colours['AF-NE']    = "#bb8129"
population_colours['AF-E']     = "#fecc5c"
population_colours['AS-S-E']  = "#dfc0eb"
population_colours['AS-S-FE']  = "#984ea3"
population_colours['AS-SE-W'] = "#9ecae1"
population_colours['AS-SE-E'] = "#3182bd"
population_colours['OC-NG']    = "#f781bf"

# Map population colours into the df by using Population column and population_colours dictionary
sample_metadata['population_colour'] = sample_metadata['Population'].map(population_colours)

Figure preparation: Sort countries in geographic order¶

We want to sort the countries on the x-axis in geographic order, that is, from left to right by location: first by major sub-population, then from west to east within each sub-population.

Using longitudes to locate country¶

To do this, we use the longitude of each country, which is stored in the Country longitude column of the dataset.

# Find the mean longitude of the samples in each population
mean_population_longitude = sample_metadata.groupby('Population')['Country longitude'].mean()

# Add a new column that holds the mean longitude of each sample's population
sample_metadata['Population_long'] = sample_metadata['Population'].map(mean_population_longitude)
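
Because the mean is taken over samples rather than over countries, countries with many samples pull the population mean towards their own longitude. A minimal sketch on made-up values (the toy table and its longitudes are hypothetical) illustrates the groupby-then-map pattern:

```python
import pandas as pd

# Toy table: three AF-W samples (two share a country longitude) and one AF-E sample
toy = pd.DataFrame({
    'Population': ['AF-W', 'AF-W', 'AF-W', 'AF-E'],
    'Country longitude': [-10.0, -10.0, 2.0, 35.0],
})

# Mean longitude per population, computed across samples
mean_long = toy.groupby('Population')['Country longitude'].mean()

# Broadcast the population mean back onto every sample row
toy['Population_long'] = toy['Population'].map(mean_long)
print(toy)
```

Here the AF-W mean is -6.0 rather than -4.0 (the midpoint of the two distinct longitudes), because the longitude -10.0 is counted twice.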

Splitting countries with multi-populations¶

We identified three countries (Kenya, India, and Thailand) where the sampling locations are associated with more than one major sub-population. For example, Kenya has sampling locations from AF-NE and AF-E, and this causes problems with ordering on country longitude because AF-NE and AF-E become mixed up in the table.

To accurately represent this diversity, we create two new columns, Country_or_admin1 and Country_or_admin1_long, in our sample metadata.

These columns categorize these countries based on their first-level administrative divisions.

# Create a duplicate column with country names
sample_metadata['Country_or_admin1'] = sample_metadata['Country']
sample_metadata['Country_or_admin1_long'] = sample_metadata['Country longitude']

# Rename each 'Admin level 1' of split-countries
sample_metadata.loc[(sample_metadata['Country'] == 'Kenya') & (sample_metadata['Admin level 1'] == 'Kilifi'), 'Country_or_admin1'] = 'Kenya, Kilifi'
sample_metadata.loc[(sample_metadata['Country'] == 'Kenya') & (sample_metadata['Admin level 1'] == 'Kisumu'), 'Country_or_admin1'] = 'Kenya, Kisumu'
sample_metadata.loc[(sample_metadata['Country'] == 'India') & (sample_metadata['Admin level 1'] == 'Tripura'), 'Country_or_admin1'] = 'India, Tripura'
sample_metadata.loc[(sample_metadata['Country'] == 'India') & (sample_metadata['Admin level 1'] == 'Odisha'), 'Country_or_admin1'] = 'India, Odisha or West Bengal'
sample_metadata.loc[(sample_metadata['Country'] == 'India') & (sample_metadata['Admin level 1'] == 'West Bengal'), 'Country_or_admin1'] = 'India, Odisha or West Bengal'
sample_metadata.loc[(sample_metadata['Country'] == 'Thailand') & (sample_metadata['Admin level 1'] == 'Sisakhet'), 'Country_or_admin1'] = 'Thailand, Sisakhet or Ubon Ratchathani'
sample_metadata.loc[(sample_metadata['Country'] == 'Thailand') & (sample_metadata['Admin level 1'] == 'Ubon Ratchathani'), 'Country_or_admin1'] = 'Thailand, Sisakhet or Ubon Ratchathani'
sample_metadata.loc[(sample_metadata['Country'] == 'Thailand') & (sample_metadata['Admin level 1'] == 'Tak'), 'Country_or_admin1'] = 'Thailand, Tak or Ranong'
sample_metadata.loc[(sample_metadata['Country'] == 'Thailand') & (sample_metadata['Admin level 1'] == 'Ranong'), 'Country_or_admin1'] = 'Thailand, Tak or Ranong'

# Set longitude to that of admin1 for split countries
sample_metadata.loc[
    sample_metadata['Country_or_admin1'] != sample_metadata['Country'],
    'Country_or_admin1_long'
] = sample_metadata.loc[
    sample_metadata['Country_or_admin1'] != sample_metadata['Country'],
    'Admin level 1 longitude'
]

# Set longitude to that of admin1 with most samples for countries with more than one admin1 in population
sample_metadata.loc[
    sample_metadata['Country_or_admin1'] == 'India, Odisha or West Bengal',
    'Country_or_admin1_long'
] = sample_metadata.loc[
    ( sample_metadata['Country'] == 'India' )
     & ( sample_metadata['Admin level 1'] == 'Odisha' ),
    'Country_or_admin1_long'
].values[0]
sample_metadata.loc[
    sample_metadata['Country_or_admin1'] == 'Thailand, Tak or Ranong',
    'Country_or_admin1_long'
] = sample_metadata.loc[
    ( sample_metadata['Country'] == 'Thailand' )
     & ( sample_metadata['Admin level 1'] == 'Tak' ),
    'Country_or_admin1_long'
].values[0]
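
The relabelling step above repeats one .loc assignment per administrative division. As a sketch of a more compact alternative, the labels could be kept in a lookup table keyed on (country, admin1) pairs. The names toy and split_labels are hypothetical, and the toy rows are made up for illustration.

```python
import pandas as pd

# Toy rows standing in for sample_metadata (made-up data)
toy = pd.DataFrame({
    'Country': ['Kenya', 'Kenya', 'India', 'Ghana'],
    'Admin level 1': ['Kilifi', 'Kisumu', 'Odisha', 'Upper East'],
})

# Hypothetical lookup table: (country, admin1) -> label used on the x-axis
split_labels = {
    ('Kenya', 'Kilifi'): 'Kenya, Kilifi',
    ('Kenya', 'Kisumu'): 'Kenya, Kisumu',
    ('India', 'Odisha'): 'India, Odisha or West Bengal',
}

# Use the label when the pair is listed, otherwise fall back to the country name
keys = zip(toy['Country'], toy['Admin level 1'])
toy['Country_or_admin1'] = [split_labels.get(key, key[0]) for key in keys]
print(toy['Country_or_admin1'].tolist())
```

Rows whose pair is not in the table keep the plain country name, which matches the behaviour of copying the Country column first and overwriting only the split countries.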

Next, we want to arrange the divisions from the same countries adjacent to each other in order to facilitate meaningful comparisons when we look at the figure.

In order to do that we simply adjust their longitude values.

# Adjust the longitude values to appear first or last
sample_metadata.loc[sample_metadata['Country_or_admin1'] == 'Kenya, Kisumu', 'Country_or_admin1_long'] = 40 # Want it to appear last in AF-NE
sample_metadata.loc[sample_metadata['Country_or_admin1'] == 'Kenya, Kilifi', 'Country_or_admin1_long'] = 34 # Want it to appear first in AF-E
sample_metadata.loc[sample_metadata['Country_or_admin1'] == 'India, Tripura', 'Country_or_admin1_long'] = 90 # Want it to appear first in AS-S-FE
sample_metadata.loc[sample_metadata['Country_or_admin1'] == 'Thailand, Sisakhet or Ubon Ratchathani', 'Country_or_admin1_long'] = 103 # Want it to appear first in AS-SE-E
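
To see why adjusting the longitudes puts the two Kenyan bars next to each other, it helps to look at how a two-key sort behaves. The sketch below uses plain Python tuples with values rounded from this notebook's data; it only illustrates the ordering and is not part of the notebook.

```python
# (label, mean longitude of its population, adjusted country/admin1 longitude)
rows = [
    ('Ethiopia', 31.9, 39.6),
    ('Kenya, Kilifi', 36.2, 34.0),   # forced to be first in AF-E
    ('Kenya, Kisumu', 31.9, 40.0),   # forced to be last in AF-NE
    ('Malawi', 36.2, 34.3),
    ('Sudan', 31.9, 30.0),
]

# Sort by population longitude first, then by country/admin1 longitude
ordered = [label for label, _, _ in sorted(rows, key=lambda r: (r[1], r[2]))]
print(ordered)
```

Kisumu ends the AF-NE group and Kilifi starts the AF-E group, so the two Kenyan bars sit side by side at the boundary between the two populations.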

Sorting countries¶

Now the countries are ready to be sorted geographically.

df_country_or_admin1 = (
    pd.DataFrame(
        sample_metadata
        .groupby(['Continent', 'Population', 'population_colour',
                  'Country_or_admin1', 'Population_long',
                  'Country_or_admin1_long'])
        .size()
    )
    .reset_index()
    .set_index('Country_or_admin1')
    .sort_values(['Population_long', 'Country_or_admin1_long'])
    .rename(columns={0: 'Frequency (number of samples)'})
)

print(df_country_or_admin1.shape)
df_country_or_admin1

(37, 6)

	Continent	Population	population_colour	Population_long	Country_or_admin1_long	Frequency (number of samples)
Country_or_admin1
Honduras	South America	SA	#4daf4a	-73.895869	-86.616200	8
Peru	South America	SA	#4daf4a	-73.895869	-74.356842	106
Colombia	South America	SA	#4daf4a	-73.895869	-73.086731	167
Venezuela	South America	SA	#4daf4a	-73.895869	-66.145936	2
Gambia	West Africa	AF-W	#e31a1c	-2.748816	-15.372910	1998
Senegal	West Africa	AF-W	#e31a1c	-2.748816	-14.470363	155
Guinea	West Africa	AF-W	#e31a1c	-2.748816	-10.936960	199
Mauritania	West Africa	AF-W	#e31a1c	-2.748816	-10.337093	104
Côte d'Ivoire	West Africa	AF-W	#e31a1c	-2.748816	-5.554446	71
Mali	West Africa	AF-W	#e31a1c	-2.748816	-3.522152	2428
Burkina Faso	West Africa	AF-W	#e31a1c	-2.748816	-1.745660	58
Ghana	West Africa	AF-W	#e31a1c	-2.748816	-1.210711	6653
Benin	West Africa	AF-W	#e31a1c	-2.748816	2.339713	334
Nigeria	West Africa	AF-W	#e31a1c	-2.748816	8.097575	1303
Gabon	West Africa	AF-W	#e31a1c	-2.748816	11.784989	59
Cameroon	West Africa	AF-W	#e31a1c	-2.748816	12.741504	294
Democratic Republic of the Congo	Central Africa	AF-C	#fd8d3c	23.660758	23.660758	1549
Sudan	Northeast Africa	AF-NE	#bb8129	31.865806	30.005646	356
Uganda	Northeast Africa	AF-NE	#bb8129	31.865806	32.391932	15
Ethiopia	Northeast Africa	AF-NE	#bb8129	31.865806	39.626195	35
Kenya, Kisumu	Northeast Africa	AF-NE	#bb8129	31.865806	40.000000	64
Kenya, Kilifi	East Africa	AF-E	#fecc5c	36.189028	34.000000	2078
Malawi	East Africa	AF-E	#fecc5c	36.189028	34.300482	681
Tanzania	East Africa	AF-E	#fecc5c	36.189028	34.825685	1144
Mozambique	East Africa	AF-E	#fecc5c	36.189028	35.551437	1348
Madagascar	East Africa	AF-E	#fecc5c	36.189028	46.698618	25
India, Odisha or West Bengal	Eastern South Asia	AS-S-E	#dfc0eb	79.622525	84.418059	246
India, Tripura	Far-eastern South Asia	AS-S-FE	#984ea3	89.833945	90.000000	72
Bangladesh	Far-eastern South Asia	AS-S-FE	#984ea3	89.833945	90.277384	1658
Myanmar	Western Southeast Asia	AS-SE-W	#9ecae1	98.489677	96.510201	1268
Thailand, Tak or Ranong	Western Southeast Asia	AS-SE-W	#9ecae1	98.489677	98.791050	994
Thailand, Sisakhet or Ubon Ratchathani	Eastern Southeast Asia	AS-SE-E	#3182bd	105.125266	103.000000	163
Laos	Eastern Southeast Asia	AS-SE-E	#3182bd	105.125266	103.768157	1994
Cambodia	Eastern Southeast Asia	AS-SE-E	#3182bd	105.125266	104.916873	2282
Vietnam	Eastern Southeast Asia	AS-SE-E	#3182bd	105.125266	106.551796	2700
Indonesia	Oceania	OC-NG	#f781bf	135.577208	117.314980	133
Papua New Guinea	Oceania	OC-NG	#f781bf	135.577208	145.254007	251

We want to separate and sort the QC-pass samples, which will help to distinguish them from QC-fail samples in the figure.

df_country_or_admin1_pass = (
    pd.DataFrame(
        sample_metadata
        .loc[sample_metadata['QC pass']]
        .groupby(['Continent', 'Population',
                  'population_colour', 'Country_or_admin1',
                  'Population_long', 'Country_or_admin1_long'])
        .size()
    )
    .reset_index()
    .set_index('Country_or_admin1')
    .sort_values(['Population_long', 'Country_or_admin1_long'])
    .rename(columns={0: 'Frequency (number of samples)'})
)
print(df_country_or_admin1_pass.shape)
df_country_or_admin1_pass.head()

(36, 6)

	Continent	Population	population_colour	Population_long	Country_or_admin1_long	Frequency (number of samples)
Country_or_admin1
Peru	South America	SA	#4daf4a	-73.895869	-74.356842	85
Colombia	South America	SA	#4daf4a	-73.895869	-73.086731	140
Venezuela	South America	SA	#4daf4a	-73.895869	-66.145936	2
Gambia	West Africa	AF-W	#e31a1c	-2.748816	-15.372910	1376
Senegal	West Africa	AF-W	#e31a1c	-2.748816	-14.470363	151

Some countries might have only QC-fail samples and no QC-pass samples. To ensure that these countries are represented in the QC pass dataset, we will merge the QC-pass dataset with the overall dataset of samples without the ‘Frequency (number of samples)’ column. This way, we can later fill the frequency value of those countries that have only QC-fail samples as ‘0’.

df_country_or_admin1_pass= df_country_or_admin1_pass.merge(
        df_country_or_admin1.drop(columns=['Frequency (number of samples)']),
        on=[col for col in df_country_or_admin1.columns if col != 'Frequency (number of samples)'],
        how='right',
    ).fillna(0).set_axis(df_country_or_admin1.index)

# Convert the 'Frequency (number of samples)' column to integer type
df_country_or_admin1_pass['Frequency (number of samples)'] = df_country_or_admin1_pass['Frequency (number of samples)'].astype(int)

df_country_or_admin1_pass.head()

	Continent	Population	population_colour	Population_long	Country_or_admin1_long	Frequency (number of samples)
Country_or_admin1
Honduras	South America	SA	#4daf4a	-73.895869	-86.616200	0
Peru	South America	SA	#4daf4a	-73.895869	-74.356842	85
Colombia	South America	SA	#4daf4a	-73.895869	-73.086731	140
Venezuela	South America	SA	#4daf4a	-73.895869	-66.145936	2
Gambia	West Africa	AF-W	#e31a1c	-2.748816	-15.372910	1376
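
The same alignment can also be expressed with reindex, which may be easier to read when only the counts are needed. This is a sketch of an alternative technique, not the method used in this notebook; all_counts and pass_counts are hypothetical names, and the numbers are taken from the tables above.

```python
import pandas as pd

# Total samples per country, in the desired plotting order
all_counts = pd.Series({'Honduras': 8, 'Peru': 106, 'Colombia': 167})

# QC-pass samples per country; Honduras is absent because it has none
pass_counts = pd.Series({'Peru': 85, 'Colombia': 140})

# Align to the full index, filling missing countries with 0
pass_aligned = pass_counts.reindex(all_counts.index, fill_value=0)
print(pass_aligned)
```

Using fill_value=0 avoids the intermediate NaN values, so the counts stay integers without a separate astype(int) step.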

Let’s check that every country in the overall dataset also appears in the QC-pass dataset (an empty set means none are missing):

set(df_country_or_admin1.index) - set(df_country_or_admin1_pass.index)

set()
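
A set difference confirms membership but ignores order. Since the two bar layers are drawn by position, a stricter check is to compare the two indexes directly. The sketch below uses made-up indexes (idx_all, idx_same and idx_shuffled are hypothetical names) to show the difference between the two checks.

```python
import pandas as pd

idx_all = pd.Index(['Honduras', 'Peru', 'Colombia'])
idx_same = pd.Index(['Honduras', 'Peru', 'Colombia'])
idx_shuffled = pd.Index(['Peru', 'Honduras', 'Colombia'])

# Set comparison ignores order, so the shuffled index still matches
same_members = set(idx_all) == set(idx_shuffled)

# Index.equals compares element by element, so order matters
same_order = idx_all.equals(idx_shuffled)

print(same_members, same_order, idx_all.equals(idx_same))
```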

Finally, we shorten some long country names, or split them across lines, so that the labels fit within the figure.

# rename the long-name countries in total samples df
df_country_or_admin1.rename(index={'Democratic Republic of the Congo': 'DRC'},inplace=True)
df_country_or_admin1.rename(index={'India, Odisha or West Bengal': 'India, Odisha\nor West Bengal'},inplace=True)
df_country_or_admin1.rename(index={'Thailand, Tak or Ranong': 'Thailand, Tak\nor Ranong'},inplace=True)
df_country_or_admin1.rename(index={'Thailand, Sisakhet or Ubon Ratchathani': 'Thailand,\nSisakhet or\nUbon Ratchathani'},inplace=True)

# rename the long-name countries in QC-pass samples df
df_country_or_admin1_pass.rename(index={'Democratic Republic of the Congo': 'DRC'},inplace=True)
df_country_or_admin1_pass.rename(index={'India, Odisha or West Bengal': 'India, Odisha\nor West Bengal'},inplace=True)
df_country_or_admin1_pass.rename(index={'Thailand, Tak or Ranong': 'Thailand, Tak\nor Ranong'},inplace=True)
df_country_or_admin1_pass.rename(index={'Thailand, Sisakhet or Ubon Ratchathani': 'Thailand,\nSisakhet or\nUbon Ratchathani'},inplace=True)

Make the figure¶

We have the following considerations when making this figure:

QC-fail samples are shown as the lightly shaded (transparent) portion of each bar, while QC-pass samples have a solid background, so that the two can be told apart.
Lines and annotations at the bottom show both continent and population.
The y-axis is truncated at 3,000 samples for visual clarity. With 6,653 samples, Ghana is the only country affected by this truncation, so its QC-fail and QC-pass counts are annotated above its bar.

# Adjust the figure size
fig, ax = plt.subplots(1, 1, figsize=(26, 14))

# Add the plot title
ax.set_title('Figure 1: Breakdown of QC pass samples per country')

# Create the bars for all samples with a transparent (alpha = 0.2) background
ax.bar(
    np.arange(len(df_country_or_admin1)),
    df_country_or_admin1['Frequency (number of samples)'],
    edgecolor = df_country_or_admin1['population_colour'],
    color = df_country_or_admin1['population_colour'],
    alpha = 0.2
)
# Create the bars for QC pass with solid-colour background
ax.bar(
    np.arange(len(df_country_or_admin1_pass)),
    df_country_or_admin1_pass['Frequency (number of samples)'],
    color = df_country_or_admin1['population_colour'],
    edgecolor = df_country_or_admin1['population_colour'],
)
# Set x-axis labels and rotate them for readability
ax.set_xticks(np.arange(len(df_country_or_admin1_pass)))
ax.set_xticklabels(df_country_or_admin1_pass.index, rotation=90)
ax.grid(True, axis='y')
# Set the y-axis limit to truncate bars at a maximum of 3000
ax.set_ylim(0, 3000)
# Set axis labels
ax.set_xlabel('Country or region',fontsize=15)
ax.set_ylabel('Frequency (number of samples)',fontsize=15)
trans = ax.get_xaxis_transform()
# Add specific annotation to Ghana
total_samples = collections.OrderedDict()
pass_samples = collections.OrderedDict()
x_pos = collections.OrderedDict()
# Set the index number for Ghana
x_pos['Ghana'] = 11
for country in x_pos:
    total_samples[country] = df_country_or_admin1.loc[country, 'Frequency (number of samples)']
    pass_samples[country] = df_country_or_admin1_pass.loc[country, 'Frequency (number of samples)']
    ax.annotate(f"{total_samples[country] - pass_samples[country]:,} /", xy=(x_pos[country], 1.1), xycoords=trans, ha="center", va="top")
    ax.annotate(f"{pass_samples[country]:,}", xy=(x_pos[country], 1.05), xycoords=trans, ha="center", va="top")
y_offset = -0.6
text_offset = 0.05
x_offset = 0.3

# Add annotations for Continents
ax.annotate('Continent', xy=(-3, y_offset-text_offset), xycoords=trans, ha="left", va="top",fontsize=18)
ax.annotate('South\nAmerica', xy=(1.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([0-x_offset, 3+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('Africa', xy=(13.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([4-x_offset, 25+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('Asia', xy=(30.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([25.7, 34.5],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('Oceania', xy=(35.8, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([34.8, 36.8],[y_offset, y_offset], color="k", transform=trans, clip_on=False)

y_offset = -0.45
text_offset = 0.04
x_offset = 0.3

# Add annotations for Populations
ax.annotate('Population', xy=(-3, y_offset-text_offset), xycoords=trans, ha="left", va="top",fontsize=18)
ax.annotate('SA', xy=(1.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([0-x_offset, 3+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AF-W', xy=(9.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([4-x_offset, 15+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AF-C', xy=(16, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([16-x_offset, 16+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AF-NE', xy=(18.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([17-x_offset, 20+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AF-E', xy=(23, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([21-x_offset, 25+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AS-S-E', xy=(26, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([26-x_offset, 26+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AS-S-FE', xy=(27.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([27-x_offset, 28+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AS-SE-W', xy=(29.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([29-x_offset, 30+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AS-SE-E', xy=(32.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([30.8, 34.4],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('OC-NG', xy=(35.8, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
_ = ax.plot([34.8, 36.8],[y_offset, y_offset], color="k", transform=trans, clip_on=False)


# Customize tick label fonts and spacing
for tick in ax.xaxis.get_major_ticks():
    tick.label1.set_fontsize(14)
ax.tick_params(axis='x', pad=1)

fig.tight_layout()

../../_images/SamplesByCountry_Barplot_41_0.png

Figure Legend. Breakdown of samples by country. Opaque colours within bars represent samples which passed QC. The more transparent portion of each bar represents samples that failed QC. The y-axis is truncated at 3,000 samples, with the numbers of QC-fail / QC-pass samples in Ghana shown above the bar. Bars are coloured according to the major sub-population to which the location is assigned.
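
The bar position for Ghana is hard-coded as 11 in the plotting code, which would silently break if the country order changed. As a sketch of one alternative, the position can be looked up from the index, and the bars that exceed the y-axis limit can be found from the counts. The toy table and the names toy_counts, y_limit and truncated are hypothetical.

```python
import pandas as pd

# Toy version of the per-country table, indexed by country label
toy_counts = pd.DataFrame(
    {'Frequency (number of samples)': [1998, 2428, 6653, 334]},
    index=pd.Index(['Gambia', 'Mali', 'Ghana', 'Benin'], name='Country_or_admin1'),
)

# Position of Ghana on the x-axis, derived from the index instead of typed in
ghana_pos = toy_counts.index.get_loc('Ghana')

# Countries whose bar would be cut off by the y-axis limit
y_limit = 3000
truncated = toy_counts.index[
    toy_counts['Frequency (number of samples)'] > y_limit
].tolist()
print(ghana_pos, truncated)
```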

Figure 2: New samples in Pf8¶

This time, we are interested in plotting how many new samples each country has in Pf8.

df_country_or_admin1_pf8 = (
    pd.DataFrame(
        sample_metadata
        .groupby(['Continent', 'Population', 'population_colour',
                  'Country_or_admin1', 'Population_long',
                  'Country_or_admin1_long'])
        .size()
    )
    .reset_index()
    .set_index('Country_or_admin1')
    .sort_values(['Population_long', 'Country_or_admin1_long'])
    .rename(columns={0: 'Frequency (number of samples)'})
)
print(df_country_or_admin1_pf8.shape)

df_country_or_admin1_pf7 = (
    pd.DataFrame(
        sample_metadata
        .loc[sample_metadata['Sample was in Pf7']]
        .groupby(['Continent', 'Population',
                  'population_colour', 'Country_or_admin1',
                  'Population_long', 'Country_or_admin1_long'])
        .size()
    )
    .reset_index()
    .set_index('Country_or_admin1')
    .sort_values(['Population_long', 'Country_or_admin1_long'])
    .rename(columns={0: 'Frequency (number of samples)'})
)

print(df_country_or_admin1_pf7.shape)

(37, 6)
(36, 6)
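
The number of samples that are new in Pf8 can also be counted directly from the boolean Sample was in Pf7 column. The following is a minimal sketch on made-up rows; toy and summary are hypothetical names, and the counts are not real Pf8 figures.

```python
import pandas as pd

# Toy rows standing in for sample_metadata (made-up data)
toy = pd.DataFrame({
    'Country': ['Ghana', 'Ghana', 'Ghana', 'Mali', 'Mali'],
    'Sample was in Pf7': [True, False, False, True, True],
})

# Convert the flag to 0/1 so that summing counts the Pf7 samples
in_pf7 = toy['Sample was in Pf7'].astype(int)

# Per country: all Pf8 samples, and those already present in Pf7
summary = in_pf7.groupby(toy['Country']).agg(pf8_total='count', in_pf7='sum')

# New samples are the difference between the two
summary['new_in_pf8'] = summary['pf8_total'] - summary['in_pf7']
print(summary)
```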

As with QC status, some countries might have samples in Pf8 but none in Pf7. To ensure that these countries are represented in the Pf7 dataset, we merge the two datasets using the same technique as before.

df_country_or_admin1_pf7= df_country_or_admin1_pf7.merge(
        df_country_or_admin1_pf8.drop(columns=['Frequency (number of samples)']),
        on=[col for col in df_country_or_admin1_pf8.columns if col != 'Frequency (number of samples)'],
        how='right',
    ).fillna(0).set_axis(df_country_or_admin1_pf8.index)

# Convert the 'Frequency (number of samples)' column to integer type
df_country_or_admin1_pf7['Frequency (number of samples)'] = df_country_or_admin1_pf7['Frequency (number of samples)'].astype(int)

# rename the long-name countries in the Pf7 samples df
df_country_or_admin1_pf7.rename(index={'Democratic Republic of the Congo': 'DRC'},inplace=True)
df_country_or_admin1_pf7.rename(index={'India, Odisha or West Bengal': 'India, Odisha\nor West Bengal'},inplace=True)
df_country_or_admin1_pf7.rename(index={'Thailand, Tak or Ranong': 'Thailand, Tak\nor Ranong'},inplace=True)
df_country_or_admin1_pf7.rename(index={'Thailand, Sisakhet or Ubon Ratchathani': 'Thailand,\nSisakhet or\nUbon Ratchathani'},inplace=True)


# rename the long-name countries in the Pf8 (all samples) df
df_country_or_admin1_pf8.rename(index={'Democratic Republic of the Congo': 'DRC'},inplace=True)
df_country_or_admin1_pf8.rename(index={'India, Odisha or West Bengal': 'India, Odisha\nor West Bengal'},inplace=True)
df_country_or_admin1_pf8.rename(index={'Thailand, Tak or Ranong': 'Thailand, Tak\nor Ranong'},inplace=True)
df_country_or_admin1_pf8.rename(index={'Thailand, Sisakhet or Ubon Ratchathani': 'Thailand,\nSisakhet or\nUbon Ratchathani'},inplace=True)

Now, we can create the figure using the same code as for Figure 1.

# Adjust the figure size
fig2, ax = plt.subplots(1, 1, figsize=(26, 14))

# Add the plot title
ax.set_title('Figure 2: Breakdown of samples per country in Pf7 and Pf8')

# Create the bars for all Pf8 samples with a transparent (alpha = 0.2) background
ax.bar(
    np.arange(len(df_country_or_admin1_pf8)),
    df_country_or_admin1_pf8['Frequency (number of samples)'],
    color = df_country_or_admin1_pf8['population_colour'],
    edgecolor = df_country_or_admin1_pf8['population_colour'],
    alpha = 0.2
)
# Create the bars for samples that were in Pf7 with a solid-colour background
ax.bar(
    np.arange(len(df_country_or_admin1_pf7)),
    df_country_or_admin1_pf7['Frequency (number of samples)'],
    color = df_country_or_admin1_pf8['population_colour'],
    edgecolor = df_country_or_admin1_pf8['population_colour'],
)
# Set x-axis labels and rotate them for readability
ax.set_xticks(np.arange(len(df_country_or_admin1_pf7)))
ax.set_xticklabels(df_country_or_admin1_pf7.index, rotation=90)
ax.grid(True, axis='y')
# Set the y-axis limit to truncate bars at a maximum of 3000
ax.set_ylim(0, 3000)
# Set axis labels
ax.set_xlabel('Country or region',fontsize=15)
ax.set_ylabel('Frequency (number of samples)',fontsize=15)
trans = ax.get_xaxis_transform()
# Add specific annotation to Ghana
pf8_samples = collections.OrderedDict()
pf7_samples = collections.OrderedDict()
x_pos = collections.OrderedDict()
# Set the index number for Ghana
x_pos['Ghana'] = 11
for country in x_pos:
    pf8_samples[country] = df_country_or_admin1_pf8.loc[country, 'Frequency (number of samples)']
    pf7_samples[country] = df_country_or_admin1_pf7.loc[country, 'Frequency (number of samples)']
    ax.annotate(f"{pf8_samples[country] - pf7_samples[country]:,} /", xy=(x_pos[country], 1.1), xycoords=trans, ha="center", va="top", fontsize = 14)
    ax.annotate(f"{pf7_samples[country]:,}", xy=(x_pos[country], 1.05), xycoords=trans, ha="center", va="top", fontsize = 14)
y_offset = -0.6
text_offset = 0.05
x_offset = 0.3

# Add annotations for Continents
ax.annotate('Continent', xy=(-3, y_offset-text_offset), xycoords=trans, ha="left", va="top",fontsize=18)
ax.annotate('South\nAmerica', xy=(1.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([0-x_offset, 3+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('Africa', xy=(13.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([4-x_offset, 25+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('Asia', xy=(30.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([25.7, 34.5],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('Oceania', xy=(35.8, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([34.8, 36.8],[y_offset, y_offset], color="k", transform=trans, clip_on=False)

y_offset = -0.45
text_offset = 0.04
x_offset = 0.3

# Add annotations for Populations
ax.annotate('Population', xy=(-3, y_offset-text_offset), xycoords=trans, ha="left", va="top",fontsize=18)
ax.annotate('SA', xy=(1.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([0-x_offset, 3+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AF-W', xy=(9.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([4-x_offset, 15+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AF-C', xy=(16, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([16-x_offset, 16+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AF-NE', xy=(18.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([17-x_offset, 20+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AF-E', xy=(23, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([21-x_offset, 25+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AS-S-E', xy=(26, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([26-x_offset, 26+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AS-S-FE', xy=(27.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([27-x_offset, 28+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AS-SE-W', xy=(29.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([29-x_offset, 30+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AS-SE-E', xy=(32.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([30.8, 34.4],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('OC-NG', xy=(35.8, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
_ = ax.plot([34.8, 36.8],[y_offset, y_offset], color="k", transform=trans, clip_on=False)


# Customize tick label fonts and spacing
for tick in ax.xaxis.get_major_ticks():
    tick.label1.set_fontsize(14)
ax.tick_params(axis='x', pad=1)

fig2.tight_layout()

../../_images/SamplesByCountry_Barplot_49_0.png

Figure Legend. Samples new to Pf8, per country. Opaque colours within bars represent samples that were present in the previous release, Pf7. Samples which are new to Pf8 in each country are represented by the transparent portion of bars. The y-axis is truncated at 3,000 samples, with the numbers of new / Pf7 samples in Ghana shown above the bar. Bars are coloured according to the major sub-population to which the location is assigned. Kenya, India, and Thailand have locations in more than one sub-population, and therefore have a separate bar for each of their sub-populations.
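
The bracket positions for the population and continent annotations are typed in by hand in both figures. As a sketch of how they could be derived instead, the first and last bar position of each population can be computed from the ordered list of population codes. The list pops below is a made-up, shortened example, and spans and centres are hypothetical names.

```python
from itertools import groupby

# Population code of each bar, in plotting order (shortened toy example)
pops = ['SA', 'SA', 'AF-W', 'AF-W', 'AF-W', 'AF-C', 'AF-E', 'AF-E']

# Map each population to the (first, last) bar position it covers
spans = {}
position = 0
for code, run in groupby(pops):
    run_length = len(list(run))
    spans[code] = (position, position + run_length - 1)
    position += run_length

# The label for each bracket can be centred between its two ends
centres = {code: (first + last) / 2 for code, (first, last) in spans.items()}
print(spans)
print(centres)
```

This relies on the bars already being sorted so that each population forms one contiguous run, which is the case after sorting on the population longitude.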

Save the figure¶

We can save the figures to a location in Google Drive.

First we need to connect Google Drive by running the following:

# You will need to authorise Google Colab access to Google Drive
drive.mount('/content/drive')

Mounted at /content/drive

# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else

fig.savefig('/content/drive/My Drive/SamplesByCountry_QC_Barplot.pdf', dpi=480, bbox_inches = 'tight')
fig.savefig('/content/drive/My Drive/SamplesByCountry_QC_Barplot.png', dpi=480, bbox_inches = 'tight') # increase the dpi for higher resolution

fig2.savefig('/content/drive/My Drive/SamplesByCountry_NewSamples_Barplot.pdf', dpi=480, bbox_inches = 'tight')
fig2.savefig('/content/drive/My Drive/SamplesByCountry_NewSamples_Barplot.png', dpi=480, bbox_inches = 'tight')
