Visualising sample collections over time¶

Introduction¶

This notebook creates a figure which shows the distribution of Plasmodium falciparum samples, based on when they were collected. The figure highlights the additional data collected for the latest release (Pf8), in relation to those collected for previous work (Pf7).

This notebook should take less than 2 minutes to run.

Setup¶

First, we need to ensure we have the packages installed which are needed to run the analysis. The code below installs the MalariaGEN data package, which is used to access the Pf8 data via the cloud.

!pip install malariagen_data -q --no-warn-conflicts
import malariagen_data

  Installing build dependencies ... ?25l?25hdone
  Getting requirements to build wheel ... ?25l?25hdone
  Preparing metadata (pyproject.toml) ... ?25l?25hdone
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.0/4.0 MB 41.4 MB/s eta 0:00:00
?25h  Preparing metadata (setup.py) ... ?25l?25hdone
  Preparing metadata (setup.py) ... ?25l?25hdone
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71.7/71.7 kB 2.2 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 775.9/775.9 kB 35.1 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 25.9/25.9 MB 40.1 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.7/8.7 MB 58.1 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 210.6/210.6 kB 14.2 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 53.2 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.3/3.3 MB 52.4 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.8/7.8 MB 58.9 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.1/78.1 kB 5.7 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 101.7/101.7 kB 7.3 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.9/8.9 MB 87.7 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 228.0/228.0 kB 15.5 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.4/13.4 MB 85.8 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 60.6 MB/s eta 0:00:00
?25h  Building wheel for malariagen_data (pyproject.toml) ... ?25l?25hdone
  Building wheel for dash-cytoscape (setup.py) ... ?25l?25hdone
  Building wheel for asciitree (setup.py) ... ?25l?25hdone

# Install package
!pip install -q --no-warn-conflicts malariagen_data

Load the required Python libraries:

# Import previously installed packages
import malariagen_data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from google.colab import drive

Data Access¶

First load the Pf8 metadata, then output the top of the dataframe to check it is what we are expecting:

# Load Pf8 data
release_data = malariagen_data.Pf8()
sample_metadata = release_data.sample_metadata()

# View the first few lines of the metadata
sample_metadata.head()

	Sample	Study	Country	Admin level 1	Country latitude	Country longitude	Admin level 1 latitude	Admin level 1 longitude	Year	ENA	All samples same case	Population	% callable	QC pass	Exclusion reason	Sample type	Sample was in Pf7
0	FP0008-C	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR1081237	FP0008-C	AF-W	82.48	True	Analysis_set	gDNA	True
1	FP0009-C	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR1081238	FP0009-C	AF-W	88.95	True	Analysis_set	gDNA	True
2	FP0010-CW	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR2889621	FP0010-CW	AF-W	87.01	True	Analysis_set	sWGA	True
3	FP0011-CW	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR2889624	FP0011-CW	AF-W	86.95	True	Analysis_set	sWGA	True
4	FP0012-CW	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR2889627	FP0012-CW	AF-W	89.86	True	Analysis_set	sWGA	True

Basic summaries of data¶

It is often useful to summarise the data by different variables. This gives a get a better sense of how the data is distributed, and is generally a good sense check to make sure the numbers are what we are expecting. Below are some simple summaries of the samples for three meta-variables included in the dataset:

Population
Year
Study

# Estimate the counts of samples from each population
sample_metadata['Population'].value_counts(dropna=False)

	count
Population
AF-W	13656
AS-SE-E	7139
AF-E	5276
AS-SE-W	2262
AS-S-FE	1730
AF-C	1549
AF-NE	470
OC-NG	384
NaN	330
SA	283
AS-S-E	246

dtype: int64

# Then we calculate the mean number of samples per year of the study, there was an average of ~2015 samples collected per year
pd.to_numeric(sample_metadata['Year'], errors='coerce').describe()

	Year
count	32995.000000
mean	2014.979057
std	5.250368
min	1966.000000
25%	2013.000000
50%	2016.000000
75%	2018.000000
max	2022.000000

dtype: float64

# Then we calculate the number of samples, based on whether they were included in the Pf7 release.
# Here 'False' means these samples were made available as part of the new Pf8 dataset (a total of 12,461).
sample_metadata['Sample was in Pf7'].value_counts(dropna=False)

	count
Sample was in Pf7
True	20864
False	12461

dtype: int64

Plotting sample collection date over time¶

The next step is to visualise when the samples were collected over time.

Here, we generate a histogram that shows when the samples were collected, separated into those included in the Pf8 data release, versus those included in the previous release (Pf7). In this plot, the x axis shows the year of sample collection, while the Y axis shows the count of samples collected. The blue bars show the samples which were included in the Pf7 dataset, while the red bars show the new samples included in the Pf8 release. The figure was generated using the matplotlib Python package, a introduction to which can be found here: https://matplotlib.org/stable/tutorials/pyplot.html.

As shown on the plot, there were many new samples included in the Pf8 release (12,461), which were collected mostly between 2018-2021.

# Convert 'Year' to numeric and filter data
years = pd.to_numeric(sample_metadata['Year'], errors='coerce')
# Set the min and max year to determine the range of years to include in the plot
min_year = pd.to_numeric(sample_metadata['Year'], errors='coerce').min()
max_year = pd.to_numeric(sample_metadata['Year'], errors='coerce').max()

# Define bins for each year
bins = np.arange(min_year, max_year + 2)

# Create grouped data for stacking
pf7_samples = pd.to_numeric(sample_metadata.loc[sample_metadata['Sample was in Pf7'], 'Year'], errors='coerce')
pf8_samples = pd.to_numeric(sample_metadata.loc[~sample_metadata['Sample was in Pf7'], 'Year'], errors='coerce')

# Set up figure size and style
fig, ax = plt.subplots(figsize=(8, 4))

# Plot stacked histogram (stacked bar plot)
ax.hist(
    [pf7_samples, pf8_samples],  # Data for both groups
    bins=bins,
    stacked=True,  # Stacked bars
    color=['#67a9cf', '#ef8a62'],  # Colors for Pf7 and Pf8
    edgecolor='black',  # Border for better visibility
    label=['Pf7', 'New to Pf8']
)

# Include reference lines for the y-axis only
ax.grid(True, axis='y', linestyle='--', alpha=0.6)
ax.set_xticks(np.arange(1965, 2025, 5))

# Set axis labels and title
ax.set_xlabel('Year')
ax.set_ylabel('Number of samples')

# Set axis legend
ax.legend(loc='upper left')

# Minimize padding around the plot
fig.tight_layout()

plt.show()

../../_images/Visualising_sample_collections_over_time_14_0.png

Save the figure:¶

# Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive

# Save samples over time plot
# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else
file_path = '/content/drive/My Drive/'
file_name = 'Pf8_sample_collections_over_time'

# We save as both .png and .PDF files
fig.savefig(f'{file_path}{file_name}.png', dpi=250)
fig.savefig(f'{file_path}{file_name}.pdf')

Conclusions¶

In summary, in this notebook, we accessed the MalariaGEN data available on the cloud for the latest Plasmodium falciparum data release (Pf8). After doing some basic summaries of the data, we created histograms of the samples per year, broken down by whether or not the sample was in Pf7 compared to Pf8.

MalariaGEN parasite data user guide

Visualising sample collections over time

Contents