Plot sample collection per country

Introduction

This notebook creates a bar plot showing the number of samples in the Pf8 release, broken down by country. Each bar also shows how many of that country’s samples passed or failed quality control (QC). The notebook then creates a second figure, which compares the number of samples per country in the MalariaGEN Pf8 release with those in the Pf7 release.

This notebook should take approximately 1 minute to run.

Setup

Install and import the malariagen Python package:

!pip install malariagen_data -q --no-warn-conflicts
import malariagen_data

Import the required Python libraries, which are installed on Colab by default.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import collections
from google.colab import drive

Access Pf8 Data

We use the malariagen data package to load the release data.

release_data = malariagen_data.Pf8()
sample_metadata = release_data.sample_metadata()
sample_metadata.head(3)
Sample Study Country Admin level 1 Country latitude Country longitude Admin level 1 latitude Admin level 1 longitude Year ENA All samples same case Population % callable QC pass Exclusion reason Sample type Sample was in Pf7
0 FP0008-C 1147-PF-MR-CONWAY Mauritania Hodh el Gharbi 20.265149 -10.337093 16.565426 -9.832345 2014.0 ERR1081237 FP0008-C AF-W 82.48 True Analysis_set gDNA True
1 FP0009-C 1147-PF-MR-CONWAY Mauritania Hodh el Gharbi 20.265149 -10.337093 16.565426 -9.832345 2014.0 ERR1081238 FP0009-C AF-W 88.95 True Analysis_set gDNA True
2 FP0010-CW 1147-PF-MR-CONWAY Mauritania Hodh el Gharbi 20.265149 -10.337093 16.565426 -9.832345 2014.0 ERR2889621 FP0010-CW AF-W 87.01 True Analysis_set sWGA True

We can start exploring the data by answering these questions:

  • How many samples pass QC?

  • How many samples are there in each country?

# Calculate the total number of samples
total_sample_number = sample_metadata.Sample.count()

# Calculate the number of samples that passed QC
qc_pass_count = (sample_metadata['QC pass'] == True).sum()

# Calculate the number of samples that failed QC
qc_fail_count = (sample_metadata['QC pass'] == False).sum()

print(f"We see {total_sample_number} samples of which {qc_pass_count} QC-pass and {qc_fail_count} QC fail in the overall Pf8 dataset.")
We see 33325 samples of which 24409 QC-pass and 8916 QC fail in the overall Pf8 dataset.
# Calculate the number of samples in each country
sample_metadata['Country'].value_counts()
count
Country
Ghana 6653
Vietnam 2700
Mali 2428
Cambodia 2282
Kenya 2142
Gambia 1998
Laos 1994
Bangladesh 1658
Democratic Republic of the Congo 1549
Mozambique 1348
Nigeria 1303
Myanmar 1268
Thailand 1157
Tanzania 1144
Malawi 681
Sudan 356
Benin 334
India 318
Cameroon 294
Papua New Guinea 251
Guinea 199
Colombia 167
Senegal 155
Indonesia 133
Peru 106
Mauritania 104
Côte d'Ivoire 71
Gabon 59
Burkina Faso 58
Ethiopia 35
Madagascar 25
Uganda 15
Honduras 8
Venezuela 2
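
The two questions above can also be answered in one step by cross-tabulating country against QC status. The sketch below uses a small toy frame in place of sample_metadata (the values are illustrative, not Pf8 counts) and assumes only that the real table has the Country and QC pass columns shown earlier.

```python
import pandas as pd

# Toy stand-in for sample_metadata (illustrative values, not Pf8 data)
toy = pd.DataFrame({
    'Country': ['Ghana', 'Ghana', 'Ghana', 'Mali', 'Mali', 'Honduras'],
    'QC pass': [True, True, False, True, False, False],
})

# Rows are countries, columns are the two QC outcomes
qc_by_country = pd.crosstab(toy['Country'], toy['QC pass'])
qc_by_country = qc_by_country.rename(columns={True: 'QC pass', False: 'QC fail'})

# Add a per-country total across both outcomes
qc_by_country['Total'] = qc_by_country.sum(axis=1)
print(qc_by_country)
```

A country with no QC-pass samples still gets a row, with 0 in the QC pass column, which is the situation the later merge step has to handle explicitly.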

Figure preparation: Defining populations

Countries are grouped into ten major sub-populations based on their geographic and genetic characteristics.

The dataframe has a Population column that contains abbreviated names. For clarity, we want to display the full names in the figure.

# Define populations in an ordered dictionary
populations = collections.OrderedDict()
populations['SA'] = 'South America'
populations['AF-W'] = 'West Africa'
populations['AF-C'] = 'Central Africa'
populations['AF-NE'] = 'Northeast Africa'
populations['AF-E'] = 'East Africa'
populations['AS-S-E'] = 'Eastern South Asia'
populations['AS-S-FE'] = 'Far-eastern South Asia'
populations['AS-SE-W'] = 'Western Southeast Asia'
populations['AS-SE-E'] = 'Eastern Southeast Asia'
populations['OC-NG'] = 'Oceania'

# Map the full population names into the df (stored in the 'Continent' column) by using the Population column and the populations dictionary
sample_metadata['Continent'] = sample_metadata['Population'].map(populations)
sample_metadata.head(3)
Sample Study Country Admin level 1 Country latitude Country longitude Admin level 1 latitude Admin level 1 longitude Year ENA All samples same case Population % callable QC pass Exclusion reason Sample type Sample was in Pf7 Continent
0 FP0008-C 1147-PF-MR-CONWAY Mauritania Hodh el Gharbi 20.265149 -10.337093 16.565426 -9.832345 2014.0 ERR1081237 FP0008-C AF-W 82.48 True Analysis_set gDNA True West Africa
1 FP0009-C 1147-PF-MR-CONWAY Mauritania Hodh el Gharbi 20.265149 -10.337093 16.565426 -9.832345 2014.0 ERR1081238 FP0009-C AF-W 88.95 True Analysis_set gDNA True West Africa
2 FP0010-CW 1147-PF-MR-CONWAY Mauritania Hodh el Gharbi 20.265149 -10.337093 16.565426 -9.832345 2014.0 ERR2889621 FP0010-CW AF-W 87.01 True Analysis_set sWGA True West Africa
# Create an ordered dictionary which maps the codes for major sub-populations -from west to east- to a colour code.
population_colours = collections.OrderedDict()
population_colours['SA']       = "#4daf4a"
population_colours['AF-W']     = "#e31a1c"
population_colours['AF-C']     = "#fd8d3c"
population_colours['AF-NE']    = "#bb8129"
population_colours['AF-E']     = "#fecc5c"
population_colours['AS-S-E']  = "#dfc0eb"
population_colours['AS-S-FE']  = "#984ea3"
population_colours['AS-SE-W'] = "#9ecae1"
population_colours['AS-SE-E'] = "#3182bd"
population_colours['OC-NG']    = "#f781bf"

# Map population colours into the df by using Population column and population_colours dictionary
sample_metadata['population_colour'] = sample_metadata['Population'].map(population_colours)
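
Because both lookups rely on Series.map with a dictionary, any population code missing from the dictionary becomes NaN, and pandas groupby drops NaN keys by default, so those samples would silently vanish from the later counts. A minimal check, sketched on a toy frame with a deliberately unknown code 'XX' (a hypothetical value, not one found in Pf8):

```python
import pandas as pd

# Hypothetical two-entry lookup and toy frame; 'XX' is an intentionally unknown code
toy_populations = {'AF-W': 'West Africa', 'AF-E': 'East Africa'}
toy = pd.DataFrame({'Population': ['AF-W', 'AF-E', 'XX']})

toy['Continent'] = toy['Population'].map(toy_populations)

# Codes missing from the dictionary are mapped to NaN
unmapped = toy.loc[toy['Continent'].isna(), 'Population'].unique()
print(list(unmapped))
```

An empty list means every code was covered by the dictionary.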

Figure preparation: Sort countries in geographic order

We want to sort the countries on the x-axis in geographic order, which means arranging them from left to right on the chart based on their geographical location, from west to east or by continents.

Using longitudes to locate country

To arrange the countries this way, we use each country’s longitude, which is stored in the Country longitude column of the dataset.

# Find the average country longitude of the samples in each population
mean_population_longitude = sample_metadata.groupby('Population')['Country longitude'].mean()

# Add a new column that holds the mean population longitude for each sample
sample_metadata['Population_long'] = sample_metadata['Population'].map(mean_population_longitude)
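
Since every row is a sample, this mean is weighted by sample count: countries with more samples pull the population longitude towards themselves. The toy sketch below (illustrative values only) shows the groupby-then-map pattern used above next to an equivalent one-step form using transform.

```python
import pandas as pd

# Toy frame: two AF-W samples and one AF-E sample (illustrative longitudes)
toy = pd.DataFrame({
    'Population': ['AF-W', 'AF-W', 'AF-E'],
    'Country longitude': [-10.0, -2.0, 35.0],
})

# Pattern used in the notebook: compute group means, then map them back
means = toy.groupby('Population')['Country longitude'].mean()
toy['Population_long'] = toy['Population'].map(means)

# Equivalent one-step form: transform returns one value per original row
toy['Population_long_alt'] = (
    toy.groupby('Population')['Country longitude'].transform('mean')
)
```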

Splitting countries with multi-populations

We identified three countries (Kenya, India, and Thailand) where the sampling locations are associated with more than one major sub-population. For example, Kenya has sampling locations in both AF-NE and AF-E, which causes problems when ordering by country longitude because AF-NE and AF-E become mixed up in the table.

To represent this accurately, we create two new columns in the sample metadata, Country_or_admin1 and Country_or_admin1_long.

These columns split these countries by their first-level administrative divisions.

# Create a duplicate column with country names
sample_metadata['Country_or_admin1'] = sample_metadata['Country']
sample_metadata['Country_or_admin1_long'] = sample_metadata['Country longitude']

# Rename each 'Admin level 1' of split-countries
sample_metadata.loc[(sample_metadata['Country'] == 'Kenya') & (sample_metadata['Admin level 1'] == 'Kilifi'), 'Country_or_admin1'] = 'Kenya, Kilifi'
sample_metadata.loc[(sample_metadata['Country'] == 'Kenya') & (sample_metadata['Admin level 1'] == 'Kisumu'), 'Country_or_admin1'] = 'Kenya, Kisumu'
sample_metadata.loc[(sample_metadata['Country'] == 'India') & (sample_metadata['Admin level 1'] == 'Tripura'), 'Country_or_admin1'] = 'India, Tripura'
sample_metadata.loc[(sample_metadata['Country'] == 'India') & (sample_metadata['Admin level 1'] == 'Odisha'), 'Country_or_admin1'] = 'India, Odisha or West Bengal'
sample_metadata.loc[(sample_metadata['Country'] == 'India') & (sample_metadata['Admin level 1'] == 'West Bengal'), 'Country_or_admin1'] = 'India, Odisha or West Bengal'
sample_metadata.loc[(sample_metadata['Country'] == 'Thailand') & (sample_metadata['Admin level 1'] == 'Sisakhet'), 'Country_or_admin1'] = 'Thailand, Sisakhet or Ubon Ratchathani'
sample_metadata.loc[(sample_metadata['Country'] == 'Thailand') & (sample_metadata['Admin level 1'] == 'Ubon Ratchathani'), 'Country_or_admin1'] = 'Thailand, Sisakhet or Ubon Ratchathani'
sample_metadata.loc[(sample_metadata['Country'] == 'Thailand') & (sample_metadata['Admin level 1'] == 'Tak'), 'Country_or_admin1'] = 'Thailand, Tak or Ranong'
sample_metadata.loc[(sample_metadata['Country'] == 'Thailand') & (sample_metadata['Admin level 1'] == 'Ranong'), 'Country_or_admin1'] = 'Thailand, Tak or Ranong'

# Set longitude to that of admin1 for split countries
sample_metadata.loc[
    sample_metadata['Country_or_admin1'] != sample_metadata['Country'],
    'Country_or_admin1_long'
] = sample_metadata.loc[
    sample_metadata['Country_or_admin1'] != sample_metadata['Country'],
    'Admin level 1 longitude'
]

# Set longitude to that of admin1 with most samples for countries with more than one admin1 in population
sample_metadata.loc[
    sample_metadata['Country_or_admin1'] == 'India, Odisha or West Bengal',
    'Country_or_admin1_long'
] = sample_metadata.loc[
    ( sample_metadata['Country'] == 'India' )
     & ( sample_metadata['Admin level 1'] == 'Odisha' ),
    'Country_or_admin1_long'
].values[0]
sample_metadata.loc[
    sample_metadata['Country_or_admin1'] == 'Thailand, Tak or Ranong',
    'Country_or_admin1_long'
] = sample_metadata.loc[
    ( sample_metadata['Country'] == 'Thailand' )
     & ( sample_metadata['Admin level 1'] == 'Tak' ),
    'Country_or_admin1_long'
].values[0]
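
The repeated .loc assignments above can also be expressed as a single lookup table keyed on (country, admin level 1) pairs. This is only a sketch of an alternative on a toy frame; the split_labels dictionary is a hypothetical helper and lists just three of the pairs used above.

```python
import pandas as pd

# Toy frame with the two columns the relabelling depends on
df = pd.DataFrame({
    'Country': ['Kenya', 'Kenya', 'India', 'Ghana'],
    'Admin level 1': ['Kilifi', 'Kisumu', 'Odisha', 'Upper East'],
})

# Hypothetical lookup: (country, admin1) -> label to show on the x-axis
split_labels = {
    ('Kenya', 'Kilifi'): 'Kenya, Kilifi',
    ('Kenya', 'Kisumu'): 'Kenya, Kisumu',
    ('India', 'Odisha'): 'India, Odisha or West Bengal',
}

# Fall back to the plain country name when the pair is not in the lookup
df['Country_or_admin1'] = [
    split_labels.get((country, admin1), country)
    for country, admin1 in zip(df['Country'], df['Admin level 1'])
]
```

Keeping the pairs in one dictionary makes it easier to see at a glance which divisions are merged into a shared label.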

Next, we want to arrange the divisions from the same countries adjacent to each other in order to facilitate meaningful comparisons when we look at the figure.

In order to do that we simply adjust their longitude values.

# Adjust the longitude values to appear first or last
sample_metadata.loc[sample_metadata['Country_or_admin1'] == 'Kenya, Kisumu', 'Country_or_admin1_long'] = 40 # Want it to appear last in AF-NE
sample_metadata.loc[sample_metadata['Country_or_admin1'] == 'Kenya, Kilifi', 'Country_or_admin1_long'] = 34 # Want it to appear first in AF-E
sample_metadata.loc[sample_metadata['Country_or_admin1'] == 'India, Tripura', 'Country_or_admin1_long'] = 90 # Want it to appear first in AS-S-FE
sample_metadata.loc[sample_metadata['Country_or_admin1'] == 'Thailand, Sisakhet or Ubon Ratchathani', 'Country_or_admin1_long'] = 103 # Want it to appear first in AS-SE-E

Sorting countries

Now the countries are ready to be sorted geographically.

df_country_or_admin1 = (
    pd.DataFrame(
        sample_metadata
        .groupby(['Continent', 'Population', 'population_colour',
                  'Country_or_admin1', 'Population_long',
                  'Country_or_admin1_long'])
        .size()
    )
    .reset_index()
    .set_index('Country_or_admin1')
    .sort_values(['Population_long', 'Country_or_admin1_long'])
    .rename(columns={0: 'Frequency (number of samples)'})
)

print(df_country_or_admin1.shape)
df_country_or_admin1
(37, 6)
Continent Population population_colour Population_long Country_or_admin1_long Frequency (number of samples)
Country_or_admin1
Honduras South America SA #4daf4a -73.895869 -86.616200 8
Peru South America SA #4daf4a -73.895869 -74.356842 106
Colombia South America SA #4daf4a -73.895869 -73.086731 167
Venezuela South America SA #4daf4a -73.895869 -66.145936 2
Gambia West Africa AF-W #e31a1c -2.748816 -15.372910 1998
Senegal West Africa AF-W #e31a1c -2.748816 -14.470363 155
Guinea West Africa AF-W #e31a1c -2.748816 -10.936960 199
Mauritania West Africa AF-W #e31a1c -2.748816 -10.337093 104
Côte d'Ivoire West Africa AF-W #e31a1c -2.748816 -5.554446 71
Mali West Africa AF-W #e31a1c -2.748816 -3.522152 2428
Burkina Faso West Africa AF-W #e31a1c -2.748816 -1.745660 58
Ghana West Africa AF-W #e31a1c -2.748816 -1.210711 6653
Benin West Africa AF-W #e31a1c -2.748816 2.339713 334
Nigeria West Africa AF-W #e31a1c -2.748816 8.097575 1303
Gabon West Africa AF-W #e31a1c -2.748816 11.784989 59
Cameroon West Africa AF-W #e31a1c -2.748816 12.741504 294
Democratic Republic of the Congo Central Africa AF-C #fd8d3c 23.660758 23.660758 1549
Sudan Northeast Africa AF-NE #bb8129 31.865806 30.005646 356
Uganda Northeast Africa AF-NE #bb8129 31.865806 32.391932 15
Ethiopia Northeast Africa AF-NE #bb8129 31.865806 39.626195 35
Kenya, Kisumu Northeast Africa AF-NE #bb8129 31.865806 40.000000 64
Kenya, Kilifi East Africa AF-E #fecc5c 36.189028 34.000000 2078
Malawi East Africa AF-E #fecc5c 36.189028 34.300482 681
Tanzania East Africa AF-E #fecc5c 36.189028 34.825685 1144
Mozambique East Africa AF-E #fecc5c 36.189028 35.551437 1348
Madagascar East Africa AF-E #fecc5c 36.189028 46.698618 25
India, Odisha or West Bengal Eastern South Asia AS-S-E #dfc0eb 79.622525 84.418059 246
India, Tripura Far-eastern South Asia AS-S-FE #984ea3 89.833945 90.000000 72
Bangladesh Far-eastern South Asia AS-S-FE #984ea3 89.833945 90.277384 1658
Myanmar Western Southeast Asia AS-SE-W #9ecae1 98.489677 96.510201 1268
Thailand, Tak or Ranong Western Southeast Asia AS-SE-W #9ecae1 98.489677 98.791050 994
Thailand, Sisakhet or Ubon Ratchathani Eastern Southeast Asia AS-SE-E #3182bd 105.125266 103.000000 163
Laos Eastern Southeast Asia AS-SE-E #3182bd 105.125266 103.768157 1994
Cambodia Eastern Southeast Asia AS-SE-E #3182bd 105.125266 104.916873 2282
Vietnam Eastern Southeast Asia AS-SE-E #3182bd 105.125266 106.551796 2700
Indonesia Oceania OC-NG #f781bf 135.577208 117.314980 133
Papua New Guinea Oceania OC-NG #f781bf 135.577208 145.254007 251
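
The count-and-sort step can be sketched more compactly by naming the count column directly in reset_index, which avoids the DataFrame wrapper and the rename of column 0. The toy frame below is illustrative and keeps only the columns needed for ordering.

```python
import pandas as pd

# Toy per-sample rows (illustrative longitudes)
toy = pd.DataFrame({
    'Population': ['AF-W', 'AF-W', 'AF-W', 'SA', 'SA'],
    'Country_or_admin1': ['Mali', 'Gambia', 'Mali', 'Peru', 'Peru'],
    'Population_long': [-2.7, -2.7, -2.7, -73.9, -73.9],
    'Country_or_admin1_long': [-3.5, -15.4, -3.5, -74.4, -74.4],
})

counts = (
    toy
    .groupby(['Population', 'Country_or_admin1',
              'Population_long', 'Country_or_admin1_long'])
    .size()
    .reset_index(name='Frequency (number of samples)')
    .set_index('Country_or_admin1')
    # Populations are ordered west to east first, then countries within each population
    .sort_values(['Population_long', 'Country_or_admin1_long'])
)
print(counts)
```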

We want to separate and sort the QC-pass samples, which will help to distinguish them from QC-fail samples in the figure.

df_country_or_admin1_pass = (
    pd.DataFrame(
        sample_metadata
        .loc[sample_metadata['QC pass']]
        .groupby(['Continent', 'Population',
                  'population_colour', 'Country_or_admin1',
                  'Population_long', 'Country_or_admin1_long'])
        .size()
    )
    .reset_index()
    .set_index('Country_or_admin1')
    .sort_values(['Population_long', 'Country_or_admin1_long'])
    .rename(columns={0: 'Frequency (number of samples)'})
)
print(df_country_or_admin1_pass.shape)
df_country_or_admin1_pass.head()
(36, 6)
Continent Population population_colour Population_long Country_or_admin1_long Frequency (number of samples)
Country_or_admin1
Peru South America SA #4daf4a -73.895869 -74.356842 85
Colombia South America SA #4daf4a -73.895869 -73.086731 140
Venezuela South America SA #4daf4a -73.895869 -66.145936 2
Gambia West Africa AF-W #e31a1c -2.748816 -15.372910 1376
Senegal West Africa AF-W #e31a1c -2.748816 -14.470363 151

Some countries might have only QC-fail samples and no QC-pass samples. To ensure that these countries are represented in the QC pass dataset, we will merge the QC-pass dataset with the overall dataset of samples without the ‘Frequency (number of samples)’ column. This way, we can later fill the frequency value of those countries that have only QC-fail samples as ‘0’.

df_country_or_admin1_pass= df_country_or_admin1_pass.merge(
        df_country_or_admin1.drop(columns=['Frequency (number of samples)']),
        on=[col for col in df_country_or_admin1.columns if col != 'Frequency (number of samples)'],
        how='right',
    ).fillna(0).set_axis(df_country_or_admin1.index)

# Convert the 'Frequency (number of samples)' column to integer type
df_country_or_admin1_pass['Frequency (number of samples)'] = df_country_or_admin1_pass['Frequency (number of samples)'].astype(int)

df_country_or_admin1_pass.head()
Continent Population population_colour Population_long Country_or_admin1_long Frequency (number of samples)
Country_or_admin1
Honduras South America SA #4daf4a -73.895869 -86.616200 0
Peru South America SA #4daf4a -73.895869 -74.356842 85
Colombia South America SA #4daf4a -73.895869 -73.086731 140
Venezuela South America SA #4daf4a -73.895869 -66.145936 2
Gambia West Africa AF-W #e31a1c -2.748816 -15.372910 1376
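
When only the counts are needed, the same zero-filling can be sketched with reindex instead of a merge. The example below works on toy per-country counts (illustrative values) and is an alternative, not the approach used in this notebook.

```python
import pandas as pd

# Toy per-sample rows: Honduras has no QC-pass samples
toy = pd.DataFrame({
    'Country_or_admin1': ['Honduras', 'Honduras', 'Peru', 'Peru', 'Peru'],
    'QC pass': [False, False, True, True, False],
})

total_counts = toy.groupby('Country_or_admin1').size()
pass_counts = toy.loc[toy['QC pass']].groupby('Country_or_admin1').size()

# Align QC-pass counts to the full list of countries, filling gaps with 0
pass_counts_full = pass_counts.reindex(total_counts.index, fill_value=0)
print(pass_counts_full)
```

Because fill_value is an integer, the result stays an integer series and needs no later type conversion.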

Let’s make sure we have the same countries in the same order in both datasets:

set(df_country_or_admin1.index) - set(df_country_or_admin1_pass.index)
set()
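
A set difference confirms that no country is missing, but it says nothing about order. A stricter check, sketched here on two toy indexes, is Index.equals, which is only true when the labels match position by position.

```python
import pandas as pd

# Two toy indexes with the same members in a different order
a = pd.Index(['Peru', 'Colombia', 'Ghana'])
b = pd.Index(['Colombia', 'Peru', 'Ghana'])

same_members = set(a) == set(b)   # membership only
same_order = a.equals(b)          # membership and position
print(same_members, same_order)   # prints: True False
```

Applied to the two dataframes above, the check would compare their indexes in the same way.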

Finally, we shorten some long country names so that the x-axis labels do not take up too much of the figure.

# rename the long-name countries in total samples df
df_country_or_admin1.rename(index={'Democratic Republic of the Congo': 'DRC'},inplace=True)
df_country_or_admin1.rename(index={'India, Odisha or West Bengal': 'India, Odisha\nor West Bengal'},inplace=True)
df_country_or_admin1.rename(index={'Thailand, Tak or Ranong': 'Thailand, Tak\nor Ranong'},inplace=True)
df_country_or_admin1.rename(index={'Thailand, Sisakhet or Ubon Ratchathani': 'Thailand,\nSisakhet or\nUbon Ratchathani'},inplace=True)

# rename the long-name countries in QC-pass samples df
df_country_or_admin1_pass.rename(index={'Democratic Republic of the Congo': 'DRC'},inplace=True)
df_country_or_admin1_pass.rename(index={'India, Odisha or West Bengal': 'India, Odisha\nor West Bengal'},inplace=True)
df_country_or_admin1_pass.rename(index={'Thailand, Tak or Ranong': 'Thailand, Tak\nor Ranong'},inplace=True)
df_country_or_admin1_pass.rename(index={'Thailand, Sisakhet or Ubon Ratchathani': 'Thailand,\nSisakhet or\nUbon Ratchathani'},inplace=True)
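
Keeping the short names in a single dictionary and applying it to every frame is one way to guarantee that the labels stay identical across dataframes. The sketch below uses two toy frames; short_names, frame_all and frame_pass are hypothetical names.

```python
import pandas as pd

# One shared lookup of long name -> short label
short_names = {
    'Democratic Republic of the Congo': 'DRC',
    'India, Odisha or West Bengal': 'India, Odisha\nor West Bengal',
}

# Toy frames standing in for the total and QC-pass tables
frame_all = pd.DataFrame({'n': [3, 5]}, index=['Democratic Republic of the Congo', 'Ghana'])
frame_pass = pd.DataFrame({'n': [2, 4]}, index=['Democratic Republic of the Congo', 'Ghana'])

# Apply the same mapping to both frames; labels not in the lookup are left unchanged
frame_all, frame_pass = (f.rename(index=short_names) for f in (frame_all, frame_pass))
```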

Make the figure

We have the following considerations when making this figure:

  1. QC-fail samples are drawn with a transparent fill, while QC-pass samples have a solid fill, so that the two can be told apart.

  2. Lines and annotations at the bottom for both continent and population.

  3. The y-axis is truncated at 3,000 samples for visual clarity. With 6,653 samples, Ghana is affected by this truncation. Therefore, the QC-fail and QC-pass counts are annotated above Ghana’s bar.
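
The first and third considerations can be illustrated on a reduced scale. The sketch below uses made-up counts for three bars, overlays solid bars on transparent ones, truncates the y-axis, and annotates any bar that exceeds the limit. It assumes a non-interactive Agg backend so that it runs outside a notebook; all variable names are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Toy counts (illustrative only)
labels = ['A', 'B', 'C']
totals = np.array([500, 6653, 1200])
passed = np.array([400, 5000, 900])
colours = ['#4daf4a', '#e31a1c', '#fecc5c']
y_max = 3000

fig_demo, ax_demo = plt.subplots(figsize=(6, 4))
x = np.arange(len(labels))

# Transparent bars show all samples; solid bars on top show QC-pass samples
ax_demo.bar(x, totals, color=colours, edgecolor=colours, alpha=0.2)
ax_demo.bar(x, passed, color=colours, edgecolor=colours)
ax_demo.set_xticks(x)
ax_demo.set_xticklabels(labels)
ax_demo.set_ylim(0, y_max)

# x in data coordinates, y in axes coordinates (1.0 is the top of the axes)
trans = ax_demo.get_xaxis_transform()
for i in np.flatnonzero(totals > y_max):
    fail_n = int(totals[i] - passed[i])
    pass_n = int(passed[i])
    ax_demo.annotate(f"{fail_n:,} / {pass_n:,}", xy=(i, 1.05),
                     xycoords=trans, ha="center", va="top")
```

Using the x-axis transform keeps the annotation just above the axes regardless of the y-limit chosen.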

# Adjust the figure size
fig, ax = plt.subplots(1, 1, figsize=(26, 14))

# Add the plot title
ax.set_title('Figure 1: Breakdown of QC pass samples per country')

# Create the bars for all samples with a transparent background
ax.bar(
    np.arange(len(df_country_or_admin1)),
    df_country_or_admin1['Frequency (number of samples)'],
    edgecolor = df_country_or_admin1['population_colour'],
    color = df_country_or_admin1['population_colour'],
    alpha = 0.2
)
# Create the bars for QC pass with solid-colour background
ax.bar(
    np.arange(len(df_country_or_admin1_pass)),
    df_country_or_admin1_pass['Frequency (number of samples)'],
    color = df_country_or_admin1['population_colour'],
    edgecolor = df_country_or_admin1['population_colour'],
)
# Set x-axis labels and rotate them for readability
ax.set_xticks(np.arange(len(df_country_or_admin1_pass)))
ax.set_xticklabels(df_country_or_admin1_pass.index, rotation=90)
ax.grid(True, axis='y')
# Set the y-axis limit to truncate bars at a maximum of 3000
ax.set_ylim(0, 3000)
# Set axis labels
ax.set_xlabel('Country or region',fontsize=15)
ax.set_ylabel('Frequency (number of samples)',fontsize=15)
trans = ax.get_xaxis_transform()
# Add specific annotation to Ghana
total_samples = collections.OrderedDict()
pass_samples = collections.OrderedDict()
x_pos = collections.OrderedDict()
# Set the index number for Ghana
x_pos['Ghana'] = 11
for country in x_pos:
    total_samples[country] = df_country_or_admin1.loc[country, 'Frequency (number of samples)']
    pass_samples[country] = df_country_or_admin1_pass.loc[country, 'Frequency (number of samples)']
    ax.annotate(f"{total_samples[country] - pass_samples[country]:,} /", xy=(x_pos[country], 1.1), xycoords=trans, ha="center", va="top")
    ax.annotate(f"{pass_samples[country]:,}", xy=(x_pos[country], 1.05), xycoords=trans, ha="center", va="top")
y_offset = -0.6
text_offset = 0.05
x_offset = 0.3

# Add annotations for Continents
ax.annotate('Continent', xy=(-3, y_offset-text_offset), xycoords=trans, ha="left", va="top",fontsize=18)
ax.annotate('South\nAmerica', xy=(1.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([0-x_offset, 3+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('Africa', xy=(13.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([4-x_offset, 25+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('Asia', xy=(30.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([25.7, 34.5],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('Oceania', xy=(35.8, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([34.8, 36.8],[y_offset, y_offset], color="k", transform=trans, clip_on=False)

y_offset = -0.45
text_offset = 0.04
x_offset = 0.3

# Add annotations for Populations
ax.annotate('Population', xy=(-3, y_offset-text_offset), xycoords=trans, ha="left", va="top",fontsize=18)
ax.annotate('SA', xy=(1.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([0-x_offset, 3+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AF-W', xy=(9.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([4-x_offset, 15+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AF-C', xy=(16, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([16-x_offset, 16+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AF-NE', xy=(18.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([17-x_offset, 20+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AF-E', xy=(23, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([21-x_offset, 25+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AS-S-E', xy=(26, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([26-x_offset, 26+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AS-S-FE', xy=(27.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([27-x_offset, 28+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AS-SE-W', xy=(29.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([29-x_offset, 30+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AS-SE-E', xy=(32.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([30.8, 34.4],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('OC-NG', xy=(35.8, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
_ = ax.plot([34.8, 36.8],[y_offset, y_offset], color="k", transform=trans, clip_on=False)


# Customize tick label fonts and spacing
for tick in ax.xaxis.get_major_ticks():
    tick.label1.set_fontsize(14)
ax.tick_params(axis='x', pad=1)

fig.tight_layout()

Figure Legend. Breakdown of samples by country. Opaque colours within bars represent samples which passed QC. The more transparent portion of each bar represents samples that failed QC. The y-axis is truncated at 3,000 samples, with the numbers of QC fail / QC pass samples in Ghana shown above the bar. Bars are coloured according to the major sub-population to which the location is assigned.
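
The plotting code sets Ghana’s x position to 11 by hand, which would go stale if a country were added or the sort order changed. One way to derive such positions from the sorted table is sketched below on a toy frame; the threshold y_max mirrors the y-axis limit and the counts are illustrative apart from Ghana’s total.

```python
import pandas as pd

# Toy sorted table standing in for df_country_or_admin1
toy_counts = pd.DataFrame(
    {'Frequency (number of samples)': [8, 1998, 6653]},
    index=pd.Index(['Honduras', 'Gambia', 'Ghana'], name='Country_or_admin1'),
)

y_max = 3000

# Countries whose bars exceed the y-axis limit
truncated = toy_counts.index[toy_counts['Frequency (number of samples)'] > y_max]

# Bar position of each truncated country, looked up from the index
x_pos = {country: toy_counts.index.get_loc(country) for country in truncated}
print(x_pos)  # prints: {'Ghana': 2}
```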

Figure 2: New samples in Pf8

This time, we are interested in plotting how many new samples each country has in Pf8.

df_country_or_admin1_pf8 = (
    pd.DataFrame(
        sample_metadata
        .groupby(['Continent', 'Population', 'population_colour',
                  'Country_or_admin1', 'Population_long',
                  'Country_or_admin1_long'])
        .size()
    )
    .reset_index()
    .set_index('Country_or_admin1')
    .sort_values(['Population_long', 'Country_or_admin1_long'])
    .rename(columns={0: 'Frequency (number of samples)'})
)
print(df_country_or_admin1_pf8.shape)

df_country_or_admin1_pf7 = (
    pd.DataFrame(
        sample_metadata
        .loc[sample_metadata['Sample was in Pf7']]
        .groupby(['Continent', 'Population',
                  'population_colour', 'Country_or_admin1',
                  'Population_long', 'Country_or_admin1_long'])
        .size()
    )
    .reset_index()
    .set_index('Country_or_admin1')
    .sort_values(['Population_long', 'Country_or_admin1_long'])
    .rename(columns={0: 'Frequency (number of samples)'})
)

print(df_country_or_admin1_pf7.shape)
(37, 6)
(36, 6)

As with QC status, some countries that have samples in Pf8 might have none in Pf7. To ensure that these countries are represented in the Pf7 dataset, we merge the two datasets using the same technique.

df_country_or_admin1_pf7= df_country_or_admin1_pf7.merge(
        df_country_or_admin1_pf8.drop(columns=['Frequency (number of samples)']),
        on=[col for col in df_country_or_admin1_pf8.columns if col != 'Frequency (number of samples)'],
        how='right',
    ).fillna(0).set_axis(df_country_or_admin1_pf8.index)

# Convert the 'Frequency (number of samples)' column to integer type
df_country_or_admin1_pf7['Frequency (number of samples)'] = df_country_or_admin1_pf7['Frequency (number of samples)'].astype(int)

# rename the long-name countries in Pf7 samples df
df_country_or_admin1_pf7.rename(index={'Democratic Republic of the Congo': 'DRC'},inplace=True)
df_country_or_admin1_pf7.rename(index={'India, Odisha or West Bengal': 'India, Odisha\nor West Bengal'},inplace=True)
df_country_or_admin1_pf7.rename(index={'Thailand, Tak or Ranong': 'Thailand, Tak\nor Ranong'},inplace=True)
df_country_or_admin1_pf7.rename(index={'Thailand, Sisakhet or Ubon Ratchathani': 'Thailand,\nSisakhet or\nUbon Ratchathani'},inplace=True)


# rename the long-name countries in Pf8 samples df
df_country_or_admin1_pf8.rename(index={'Democratic Republic of the Congo': 'DRC'},inplace=True)
df_country_or_admin1_pf8.rename(index={'India, Odisha or West Bengal': 'India, Odisha\nor West Bengal'},inplace=True)
df_country_or_admin1_pf8.rename(index={'Thailand, Tak or Ranong': 'Thailand, Tak\nor Ranong'},inplace=True)
df_country_or_admin1_pf8.rename(index={'Thailand, Sisakhet or Ubon Ratchathani': 'Thailand,\nSisakhet or\nUbon Ratchathani'},inplace=True)
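
Because Sample was in Pf7 is a boolean column, the Pf8 total and the Pf7 count can also be obtained in one pass by counting and summing the flag per group. This is a sketch of an alternative on toy data (illustrative values); the column names pf8, pf7 and new_in_pf8 are hypothetical.

```python
import pandas as pd

# Toy per-sample rows: Honduras has no samples that were in Pf7
toy = pd.DataFrame({
    'Country_or_admin1': ['Ghana', 'Ghana', 'Ghana', 'Honduras'],
    'Sample was in Pf7': [True, True, False, False],
})

release_counts = (
    toy
    .groupby('Country_or_admin1')['Sample was in Pf7']
    # count = number of non-missing flags (all Pf8 samples), sum = number of True flags
    .agg(pf8='count', pf7='sum')
)
release_counts['new_in_pf8'] = release_counts['pf8'] - release_counts['pf7']
print(release_counts)
```

Countries without any Pf7 samples get a 0 directly, so no merge or fillna step is needed in this form.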

Now, we can create the figure using the same code as for Figure 1.

# Adjust the figure size
fig2, ax = plt.subplots(1, 1, figsize=(26, 14))

# Add the plot title
ax.set_title('Figure 2: Breakdown of samples per country in Pf7 and Pf8')

# Create the bars for all Pf8 samples with a transparent background
ax.bar(
    np.arange(len(df_country_or_admin1_pf8)),
    df_country_or_admin1_pf8['Frequency (number of samples)'],
    color = df_country_or_admin1_pf8['population_colour'],
    edgecolor = df_country_or_admin1_pf8['population_colour'],
    alpha = 0.2
)
# Create the bars for Pf7 with solid-colour background
ax.bar(
    np.arange(len(df_country_or_admin1_pf7)),
    df_country_or_admin1_pf7['Frequency (number of samples)'],
    color = df_country_or_admin1_pf8['population_colour'],
    edgecolor = df_country_or_admin1_pf8['population_colour'],
)
# Set x-axis labels and rotate them for readability
ax.set_xticks(np.arange(len(df_country_or_admin1_pf7)))
ax.set_xticklabels(df_country_or_admin1_pf7.index, rotation=90)
ax.grid(True, axis='y')
# Set the y-axis limit to truncate bars at a maximum of 3000
ax.set_ylim(0, 3000)
# Set axis labels
ax.set_xlabel('Country or region',fontsize=15)
ax.set_ylabel('Frequency (number of samples)',fontsize=15)
trans = ax.get_xaxis_transform()
# Add specific annotation to Ghana
pf8_samples = collections.OrderedDict()
pf7_samples = collections.OrderedDict()
x_pos = collections.OrderedDict()
# Set the index number for Ghana
x_pos['Ghana'] = 11
for country in x_pos:
    pf8_samples[country] = df_country_or_admin1_pf8.loc[country, 'Frequency (number of samples)']
    pf7_samples[country] = df_country_or_admin1_pf7.loc[country, 'Frequency (number of samples)']
    ax.annotate(f"{pf8_samples[country] - pf7_samples[country]:,} /", xy=(x_pos[country], 1.1), xycoords=trans, ha="center", va="top", fontsize = 14)
    ax.annotate(f"{pf7_samples[country]:,}", xy=(x_pos[country], 1.05), xycoords=trans, ha="center", va="top", fontsize = 14)
y_offset = -0.6
text_offset = 0.05
x_offset = 0.3

# Add annotations for Continents
ax.annotate('Continent', xy=(-3, y_offset-text_offset), xycoords=trans, ha="left", va="top",fontsize=18)
ax.annotate('South\nAmerica', xy=(1.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([0-x_offset, 3+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('Africa', xy=(13.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([4-x_offset, 25+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('Asia', xy=(30.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([25.7, 34.5],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('Oceania', xy=(35.8, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([34.8, 36.8],[y_offset, y_offset], color="k", transform=trans, clip_on=False)

y_offset = -0.45
text_offset = 0.04
x_offset = 0.3

# Add annotations for Populations
ax.annotate('Population', xy=(-3, y_offset-text_offset), xycoords=trans, ha="left", va="top",fontsize=18)
ax.annotate('SA', xy=(1.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([0-x_offset, 3+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AF-W', xy=(9.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([4-x_offset, 15+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AF-C', xy=(16, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([16-x_offset, 16+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AF-NE', xy=(18.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([17-x_offset, 20+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AF-E', xy=(23, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([21-x_offset, 25+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AS-S-E', xy=(26, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([26-x_offset, 26+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AS-S-FE', xy=(27.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([27-x_offset, 28+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AS-SE-W', xy=(29.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([29-x_offset, 30+x_offset],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('AS-SE-E', xy=(32.5, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
ax.plot([30.8, 34.4],[y_offset, y_offset], color="k", transform=trans, clip_on=False)
ax.annotate('OC-NG', xy=(35.8, y_offset-text_offset), xycoords=trans, ha="center", va="top",fontsize=16)
_ = ax.plot([34.8, 36.8],[y_offset, y_offset], color="k", transform=trans, clip_on=False)


# Customize tick label fonts and spacing
for tick in ax.xaxis.get_major_ticks():
    tick.label1.set_fontsize(14)
ax.tick_params(axis='x', pad=1)

fig2.tight_layout()

Figure Legend. Samples new to Pf8, per country. Opaque colours within bars represent samples that were present in the previous release, Pf7. Samples which are new to Pf8 in each country are represented by the transparent portion of bars. The y-axis is truncated at 3,000 samples, with the numbers of new Pf8 samples / Pf7 samples in Ghana shown above the bar. Bars are coloured according to the major sub-population to which the location is assigned. Kenya, India, and Thailand have locations in more than one sub-population, and therefore have a bar for each of those sub-populations.
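
Both figures hard-code the x-ranges of the population brackets (for example 4 to 15 for AF-W). As a sketch of how those ranges could be derived from the sorted table instead, the toy example below records the first and last bar position of each population; the pos column and the spans table are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy sorted table: one row per bar, in plotting order
toy = pd.DataFrame(
    {'Population': ['SA', 'SA', 'AF-W', 'AF-W', 'AF-W', 'AF-C']},
    index=['Peru', 'Colombia', 'Gambia', 'Mali', 'Ghana', 'DRC'],
)

# Bar position of each row, then first/last position per population
spans = (
    toy.assign(pos=np.arange(len(toy)))
    .groupby('Population', sort=False)['pos']
    .agg(['min', 'max'])
)

# Label position is the midpoint of each bracket
spans['centre'] = (spans['min'] + spans['max']) / 2
print(spans)
```

Each row of spans gives the bracket ends and label position for one population, which could then be passed to ax.plot and ax.annotate in a loop.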

Save the figure

We can output the figures to a location in Google Drive.

First we need to connect Google Drive by running the following:

# You will need to authorise Google Colab access to Google Drive
drive.mount('/content/drive')
Mounted at /content/drive
# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else

fig.savefig('/content/drive/My Drive/SamplesByCountry_QC_Barplot.pdf', dpi=480, bbox_inches = 'tight')
fig.savefig('/content/drive/My Drive/SamplesByCountry_QC_Barplot.png', dpi=480, bbox_inches = 'tight') # increase the dpi for higher resolution

fig2.savefig('/content/drive/My Drive/SamplesByCountry_NewSamples_Barplot.pdf', dpi=480, bbox_inches = 'tight')
fig2.savefig('/content/drive/My Drive/SamplesByCountry_NewSamples_Barplot.png', dpi=480, bbox_inches = 'tight')
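
Outside Colab there is no Drive to mount, and the figures can be written to any local folder with the same savefig call. The sketch below saves a small stand-in figure to a temporary directory; out_dir is a hypothetical location that would be replaced with a path of your choice.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Small stand-in figure (the real notebook would use fig and fig2)
fig_demo, ax_demo = plt.subplots()
ax_demo.bar([0, 1], [3, 5])

# Hypothetical output folder; replace with your own path
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'SamplesByCountry_QC_Barplot.png')

# bbox_inches='tight' keeps annotations drawn outside the axes in the saved image
fig_demo.savefig(out_path, dpi=150, bbox_inches='tight')
print(os.path.exists(out_path))
```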