Visualising sample collections over time

Introduction

This notebook creates a figure showing the distribution of Plasmodium falciparum samples by year of collection. The figure highlights the samples added in the latest release (Pf8) relative to those already included in the previous release (Pf7).

This notebook should take less than 2 minutes to run.

Setup

First, we need to ensure that the packages needed to run the analysis are installed. The code below installs the MalariaGEN data package, which is used to access the Pf8 data via the cloud.

# Install package
!pip install -q --no-warn-conflicts malariagen_data

Load the required Python libraries:

# Import previously installed packages
import malariagen_data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from google.colab import drive

Data Access

First load the Pf8 metadata, then output the top of the dataframe to check it is what we are expecting:

# Load Pf8 data
release_data = malariagen_data.Pf8()
sample_metadata = release_data.sample_metadata()

# View the first few lines of the metadata
sample_metadata.head()
  | Sample | Study | Country | Admin level 1 | Country latitude | Country longitude | Admin level 1 latitude | Admin level 1 longitude | Year | ENA | All samples same case | Population | % callable | QC pass | Exclusion reason | Sample type | Sample was in Pf7
0 | FP0008-C | 1147-PF-MR-CONWAY | Mauritania | Hodh el Gharbi | 20.265149 | -10.337093 | 16.565426 | -9.832345 | 2014.0 | ERR1081237 | FP0008-C | AF-W | 82.48 | True | Analysis_set | gDNA | True
1 | FP0009-C | 1147-PF-MR-CONWAY | Mauritania | Hodh el Gharbi | 20.265149 | -10.337093 | 16.565426 | -9.832345 | 2014.0 | ERR1081238 | FP0009-C | AF-W | 88.95 | True | Analysis_set | gDNA | True
2 | FP0010-CW | 1147-PF-MR-CONWAY | Mauritania | Hodh el Gharbi | 20.265149 | -10.337093 | 16.565426 | -9.832345 | 2014.0 | ERR2889621 | FP0010-CW | AF-W | 87.01 | True | Analysis_set | sWGA | True
3 | FP0011-CW | 1147-PF-MR-CONWAY | Mauritania | Hodh el Gharbi | 20.265149 | -10.337093 | 16.565426 | -9.832345 | 2014.0 | ERR2889624 | FP0011-CW | AF-W | 86.95 | True | Analysis_set | sWGA | True
4 | FP0012-CW | 1147-PF-MR-CONWAY | Mauritania | Hodh el Gharbi | 20.265149 | -10.337093 | 16.565426 | -9.832345 | 2014.0 | ERR2889627 | FP0012-CW | AF-W | 89.86 | True | Analysis_set | sWGA | True

Basic summaries of data

It is often useful to summarise the data by different variables. This gives a better sense of how the data are distributed, and is a good sanity check that the numbers are what we expect. Below are some simple summaries of the samples for three meta-variables included in the dataset:

  1. Population

  2. Year

  3. Study

# Count the samples from each population (dropna=False keeps samples with no population assigned)
sample_metadata['Population'].value_counts(dropna=False)
count
Population
AF-W 13656
AS-SE-E 7139
AF-E 5276
AS-SE-W 2262
AS-S-FE 1730
AF-C 1549
AF-NE 470
OC-NG 384
NaN 330
SA 283
AS-S-E 246
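
The population counts above can be broken down further by release. The following is a minimal sketch using pd.crosstab; it runs on a small made-up table rather than the real metadata, and the names toy_metadata and the 'Unassigned' label are assumptions introduced for illustration only.

```python
import pandas as pd

# Toy stand-in for the Pf8 metadata (values are made up for illustration)
toy_metadata = pd.DataFrame({
    'Population': ['AF-W', 'AF-W', 'AF-E', 'AS-SE-E', None, 'AF-W'],
    'Sample was in Pf7': [True, False, True, True, False, False],
})

# Label missing populations explicitly so they are not silently dropped
population = toy_metadata['Population'].fillna('Unassigned')

# Turn the boolean flag into readable release labels
release = toy_metadata['Sample was in Pf7'].map({True: 'Pf7', False: 'New to Pf8'})

# Rows: population; columns: release; cells: number of samples
population_by_release = pd.crosstab(population, release)
print(population_by_release)
```

To apply the same idea to the real data, replace toy_metadata with sample_metadata.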

# Summarise the collection years: samples were collected between 1966 and 2022, with a mean collection year of ~2015 (median 2016)
pd.to_numeric(sample_metadata['Year'], errors='coerce').describe()
Year
count 32995.000000
mean 2014.979057
std 5.250368
min 1966.000000
25% 2013.000000
50% 2016.000000
75% 2018.000000
max 2022.000000
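
The summary above describes the collection years themselves. To look at the number of samples per year instead, one option is to count the samples in each year and then average those counts. The sketch below does this on a made-up Series (toy_years is an assumption for illustration). Note that this average only covers years with at least one sample.

```python
import pandas as pd

# Toy stand-in for the 'Year' column, including one missing value
toy_years = pd.Series([2014.0, 2014.0, 2016.0, None, 2018.0, 2018.0, 2018.0], name='Year')

# Coerce to numeric, drop missing years and convert to whole years
years = pd.to_numeric(toy_years, errors='coerce').dropna().astype(int)

# Number of samples collected in each year, in chronological order
samples_per_year = years.value_counts().sort_index()

# Mean number of samples per year, over years with at least one sample
mean_samples_per_sampled_year = samples_per_year.mean()
print(samples_per_year)
print(mean_samples_per_sampled_year)
```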

# Then we calculate the number of samples, based on whether they were included in the Pf7 release.
# Here 'False' means these samples were made available as part of the new Pf8 dataset (a total of 12,461).
sample_metadata['Sample was in Pf7'].value_counts(dropna=False)
count
Sample was in Pf7
True 20864
False 12461
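
As a quick worked check, the two counts above can be turned into a total and a proportion. The first part of the sketch below uses the numbers copied from the output. The second part shows one way to get the same proportion directly from a boolean column, using a made-up Series (toy_flags is an assumption for illustration).

```python
import pandas as pd

# Counts copied from the value_counts output above
in_pf7 = 20864
new_to_pf8 = 12461

total = in_pf7 + new_to_pf8
share_new = new_to_pf8 / total
print(f'{total} samples in total, {share_new:.1%} new to Pf8')

# The same proportion from a boolean column: the mean of a boolean mask
# is the fraction of True values (toy data, for illustration only)
toy_flags = pd.Series([True, True, True, False], name='Sample was in Pf7')
toy_share_new = (~toy_flags).mean()
print(toy_share_new)
```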

Plotting sample collection date over time

The next step is to visualise when the samples were collected over time.

Here, we generate a histogram that shows when the samples were collected, separated into those new to the Pf8 data release and those already included in the previous release (Pf7). In this plot, the x-axis shows the year of sample collection, while the y-axis shows the number of samples collected. The blue bars show the samples which were included in the Pf7 dataset, while the red bars show the new samples included in the Pf8 release. The figure is generated using the matplotlib Python package, an introduction to which can be found here: https://matplotlib.org/stable/tutorials/pyplot.html.

As shown in the plot, many new samples were included in the Pf8 release (12,461), most of which were collected between 2018 and 2021.

# Convert 'Year' to numeric (values that cannot be parsed become NaN)
years = pd.to_numeric(sample_metadata['Year'], errors='coerce')
# Set the min and max year to determine the range of years to include in the plot
min_year = years.min()
max_year = years.max()

# Define one bin per year; the stop value of max_year + 2 adds the right-hand edge of the final year's bin
bins = np.arange(min_year, max_year + 2)

# Split the years by release for stacking, dropping samples with no collection year
in_pf7 = sample_metadata['Sample was in Pf7']
pf7_samples = years[in_pf7].dropna()
pf8_samples = years[~in_pf7].dropna()

# Set up figure size and style
fig, ax = plt.subplots(figsize=(8, 4))

# Plot stacked histogram (stacked bar plot)
ax.hist(
    [pf7_samples, pf8_samples],  # Data for both groups
    bins=bins,
    stacked=True,  # Stacked bars
    color=['#67a9cf', '#ef8a62'],  # Colors for Pf7 and Pf8
    edgecolor='black',  # Border for better visibility
    label=['Pf7', 'New to Pf8']
)

# Include reference lines for the y-axis only
ax.grid(True, axis='y', linestyle='--', alpha=0.6)
ax.set_xticks(np.arange(1965, 2025, 5))

# Set axis labels
ax.set_xlabel('Year')
ax.set_ylabel('Number of samples')

# Add the legend
ax.legend(loc='upper left')

# Minimize padding around the plot
fig.tight_layout()

plt.show()
[Figure: stacked histogram of the number of samples per collection year, with Pf7 samples in blue and samples new to Pf8 in red]
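
The yearly bins used for the histogram can be checked numerically with np.histogram. The sketch below uses made-up years (toy_years is an assumption for illustration) to show why the bin edges run to max_year + 2: np.arange excludes its stop value, and the final bin needs a right-hand edge at max_year + 1. With one fewer edge, the last two years would be merged into a single bin, because the final bin of a histogram includes its right edge.

```python
import numpy as np

# Toy collection years (made up for illustration)
toy_years = np.array([2019, 2019, 2020, 2021, 2021, 2021])
min_year, max_year = toy_years.min(), toy_years.max()

# One bin per year: edges 2019, 2020, 2021, 2022
bins = np.arange(min_year, max_year + 2)
counts, edges = np.histogram(toy_years, bins=bins)
print(edges)   # bin edges
print(counts)  # samples per year

# With one fewer edge, 2020 and 2021 fall into the same (closed) final bin
short_counts, _ = np.histogram(toy_years, bins=np.arange(min_year, max_year + 1))
print(short_counts)
```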

Save the figure:

# Mount Google Drive
drive.mount('/content/drive')
Mounted at /content/drive

# Save samples over time plot
# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else
file_path = '/content/drive/My Drive/'
file_name = 'Pf8_sample_collections_over_time'

# Save the figure as both .png and .pdf files
fig.savefig(f'{file_path}{file_name}.png', dpi=250)
fig.savefig(f'{file_path}{file_name}.pdf')
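
The code above assumes the notebook is running in Google Colab. Outside Colab, the same figure can be written to a local folder instead. The following is a minimal sketch under that assumption: it draws a placeholder bar chart and saves it to a temporary directory. The names output_dir, png_path and pdf_path are hypothetical, and the matplotlib.use('Agg') line is only needed when running as a script without a display (leave it out in a notebook).

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Placeholder figure (made-up values), standing in for the histogram above
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar([2019, 2020, 2021], [2, 1, 3])
ax.set_xlabel('Year')
ax.set_ylabel('Number of samples')

# Write to a temporary local folder; change this to any folder you prefer
output_dir = Path(tempfile.mkdtemp())
file_name = 'Pf8_sample_collections_over_time'
png_path = output_dir / f'{file_name}.png'
pdf_path = output_dir / f'{file_name}.pdf'

fig.savefig(png_path, dpi=250)
fig.savefig(pdf_path)
plt.close(fig)
print(png_path, pdf_path)
```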

Conclusions

In summary, in this notebook we accessed the MalariaGEN data available on the cloud for the latest Plasmodium falciparum data release (Pf8). After producing some basic summaries of the data, we created a stacked histogram of the number of samples per collection year, split by whether each sample was already in Pf7 or is new to Pf8.