Create sample location summary table¶

Introduction¶

This notebook creates a table - a summary of samples broken down by major sub-population, country, and first administrative division - using the publicly available Pf8 dataset for Plasmodium falciparum.

This notebook should take approximately one minute to run.

Setup¶

Install the malariagen Python package:

!pip install malariagen_data -q --no-warn-conflicts

  Installing build dependencies ... ?25l?25hdone
  Getting requirements to build wheel ... ?25l?25hdone
  Preparing metadata (pyproject.toml) ... ?25l?25hdone
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.0/4.0 MB 51.6 MB/s eta 0:00:00
?25h  Preparing metadata (setup.py) ... ?25l?25hdone
  Preparing metadata (setup.py) ... ?25l?25hdone
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71.7/71.7 kB 5.5 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 765.4/765.4 kB 40.9 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.4/24.4 MB 65.1 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.1/8.1 MB 91.3 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 210.7/210.7 kB 14.2 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 77.2 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 81.3 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.8/7.8 MB 83.0 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 47.0 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77.4/77.4 kB 5.7 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.6/8.6 MB 85.2 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 228.0/228.0 kB 16.1 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43.3/43.3 kB 2.9 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.6/12.6 MB 26.8 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 43.2 MB/s eta 0:00:00
?25h  Building wheel for malariagen_data (pyproject.toml) ... ?25l?25hdone
  Building wheel for dash-cytoscape (setup.py) ... ?25l?25hdone
  Building wheel for asciitree (setup.py) ... ?25l?25hdone

Import required libraries:

import numpy as np
import dask
import dask.array as da
from dask.diagnostics.progress import ProgressBar
import malariagen_data
import pandas as pd
import collections
from google.colab import drive

Access Pf8 Data and Metadata¶

Here we define the data and metadata as pandas dataframes

release_data = malariagen_data.Pf8()
sample_metadata = release_data.sample_metadata()

Produce the Table of Samples per Admin Division Level 1¶

Define a dictionary for sub-population names¶

To make the eventual table output easier to interpret, we add a column ‘Region’ to the dataframe. The dictionary matches the codes for each major sub-population to their full names. E.g. ‘SA’ becomes ‘South America (SA)’ in the ‘Region’ column.

We initially give a numeric prefix to the sub-populations so that they will appear in our desired order (west to east) in the table. This will be removed later on.

# Define a dictionary to link population code with the full name of each region
population_legend = collections.OrderedDict()
population_legend['SA']       = "01. South America"
population_legend['AF-W']     = "02. Africa - West"
population_legend['AF-C']     = "03. Africa - Central"
population_legend['AF-NE']    = "04. Africa - Northeast"
population_legend['AF-E']     = "05. Africa - East"
population_legend['AS-S-E']  = "06. Asia - South - East"
population_legend['AS-S-FE']  = "07. Asia - South - Far East"
population_legend['AS-SE-W'] = "08. Asia - Southeast - West"
population_legend['AS-SE-E'] = "09. Asia - Southeast - East"
population_legend['OC-NG']    = "10. Oceania - New Guinea"

# Include full name of each population as 'Region' in the sample_metadata dataframe
sample_metadata['Region'] = sample_metadata['Population'].apply(lambda x: f'{population_legend[x]} ({x})' if x in population_legend else '11. Unverified identity')

Add a longitude column for sorting the table¶

We already have longitude information for country and admin division which we could use to sort the table in our desired west to east order. However, there are some locations who’s subpopulation region and country longitude mean they don’t sort in a consistent west-to-east manner.

Here we add a new longitude column for sorting the table ‘country_long_table’ which keeps all longitude values the same, except for locations which need a fake longitude to enable them to sort properly with the rest of their major sub-population or country.

We make the new column so that the original longitude values are retained, in case they were ever needed in the future.

sample_metadata['country_long_table'] = sample_metadata['Country longitude'] # Add the new column for sorting the table
sample_metadata.loc[sample_metadata['Admin level 1'] == 'Kisumu', 'country_long_table'] = 40 # Want it to appear last in AF-NE
sample_metadata.loc[sample_metadata['Admin level 1'] == 'Kilifi', 'country_long_table'] = 34 # Want it to appear first in AF-E

Aggredate the data by our fields of interest¶

Here we group by Region, Country, Admin level 1 and count the number of sequenced samples, then output to a dataframe, then order the locations from West to East based on country.

# Aggregate data for all samples and store in a dataframe 'agg1'
agg1 = (
    sample_metadata.groupby(
        ['Region','Country','Admin level 1','country_long_table'])
    .size()
    .to_frame('Sequenced samples')
    .reset_index()
    .sort_values(by = ['Region','country_long_table','Admin level 1'])
)

# Do this again, but just for samples passing QC rather than all samples
# We define new dataframes to keep each stage of data manipulation separate
agg2 = (
    sample_metadata.groupby(
        ['Region','Country','Admin level 1','country_long_table','QC pass']
        )
    .size()
    .to_frame('Analysis set samples')
    .reset_index()
    .sort_values(by = ['Region','country_long_table','Admin level 1'])
)
agg3 = agg2.loc[agg2['QC pass'] == True]

Merge these aggregated dataframes and drop duplicate columns¶

table_df = (
    pd.merge(agg1, agg3, on = 'Admin level 1', how = 'left',suffixes=('', '_y'))
    .sort_values(by = ['Region','country_long_table', 'Admin level 1'])
    ) # Merge
table_df.drop(table_df.filter(regex='_y$').columns, axis=1, inplace=True) # Drop duplicate columns
table_df_trimmed = table_df.drop(['country_long_table','QC pass'], axis = 1) # Drop further uneccesary columns and make a new df to avoid confusion

Trim off the number from Region column¶

# Remove the first 3 characters from each value in the column
table_df_trimmed['Region'] = table_df_trimmed['Region'].str[3:]

Add in samples with unverified identity¶

# Get our unverified location samples and count them
ui_samps = sample_metadata[sample_metadata['Region'] == '11. Unverified identity']
no_ui = len(ui_samps)
no_ui_qc = len(ui_samps.loc[ui_samps['QC pass'] == True])

# Create a mini dataframe to append to our main table
ui_sum = {
    'Region': ['Unverified identity'],
    'Country': [''],
    'Admin level 1': [''],
    'Sequenced samples': [no_ui],
    'Analysis set samples': [no_ui_qc]
}
ui_sum_df = pd.DataFrame(ui_sum)

# Add to the main table
table_df_ui = pd.concat([table_df_trimmed, ui_sum_df], ignore_index = True)
table_df_ui['Analysis set samples'] = table_df_ui['Analysis set samples'].fillna(0).astype('int') # Convert from float to integer

Add a totals row at the bottom¶

total_samps = table_df_ui['Sequenced samples'].agg('sum')
total_analysis = table_df_ui['Analysis set samples'].agg('sum')
totals =  {
    'Region': ['Total'],
    'Country': [''],
    'Admin level 1': [''],
    'Sequenced samples': [total_samps],
    'Analysis set samples': [total_analysis]
}
totals_df = pd.DataFrame(totals)
table_df_ui_totals = pd.concat([table_df_ui, totals_df], ignore_index = True)

# This has all the info we need - but it could still be clearer to read
# Make it so there is just one name per Region / Country
pd.options.display.max_rows = 300
table_df_final = table_df_ui_totals.set_index(['Region', 'Country', 'Admin level 1'])
table_df_final

			Sequenced samples	Analysis set samples
Region	Country	Admin level 1
South America (SA)	Honduras	Francisco Morazan	8	0
	Peru	Loreto	106	85
	Colombia	Cauca	146	123
		Choco	3	3
		Nariño	7	6
		Norte de Santander	8	5
		Valle del Cauca	3	3
	Venezuela	Bolivar	2	2
Africa - West (AF-W)	Gambia	Banjul	61	55
		North Bank	252	183
		Upper River	1121	646
		Western	564	492
	Senegal	Dakar	93	91
	Senegal	Sedhiou	62	60
	Guinea	Faranah	60	37
	Guinea	Nzerekore	139	112
	Mauritania	Guidimaka	23	21
		Hodh ech Chargui	40	32
		Hodh el Gharbi	41	39
	Côte d'Ivoire	Abidjan	71	71
	Mali	Bamako	215	209
		Kayes	379	250
		Koulikoro	1399	658
		Mopti	9	8
		Segou	265	71
		Sikasso	161	57
	Burkina Faso	Haut-Bassins	58	56
	Ghana	Ashanti	452	391
		Brong Ahafo	69	50
		Central	686	309
		Eastern	110	74
		Greater Accra	474	435
		Upper East	4821	3605
		Volta	41	41
	Benin	Atlantique	57	45
	Benin	Littoral	277	100
	Nigeria	Adamawa	72	37
		Cross River	73	16
		Delta	31	13
		Federal Capital Territory	5	0
		Kano	37	22
		Kebbi	71	25
		Kwara	8	5
		Lagos	188	129
		Ogun	96	38
		Ondo	56	29
		Osun	55	15
		Oyo	611	71
	Gabon	Wouleu-Ntem	59	55
	Cameroon	Sud-Ouest	294	261
Africa - Central (AF-C)	Democratic Republic of the Congo	Kinshasa	1549	1212
Africa - Northeast (AF-NE)	Sudan	Al Jazirah	17	1
		Blue Nile	128	1
		Kassala	41	18
		Khartoum	154	85
		North Kordofan	16	10
	Uganda	Apac	15	10
	Ethiopia	Amhara	15	10
		Oromia	19	12
		Southern Nations, Nationalities and Peoples	1	0
	Kenya	Kisumu	64	62
Africa - East (AF-E)	Kenya	Kilifi	2078	1872
	Malawi	Chikwawa	629	254
	Malawi	Zomba	52	34
	Tanzania	Kagera	61	52
		Kigoma	199	143
		Lindi	79	65
		Morogoro	34	32
		Pwani	447	436
		Tanga	324	297
	Mozambique	Cabo Delgado	171	142
		Gaza	171	95
		Inhambane	117	99
		Maputo	603	417
		Sofala	65	23
		Tete	128	103
		Zambezia	93	85
	Madagascar	Fianarantsoa	1	1
	Madagascar	Mahajanga	24	23
Asia - South - East (AS-S-E)	India	Odisha	124	93
Asia - South - East (AS-S-E)	India	West Bengal	122	95
Asia - South - Far East (AS-S-FE)	India	Tripura	72	62
Asia - South - Far East (AS-S-FE)	Bangladesh	Chittagong	1658	1315
Asia - Southeast - West (AS-SE-W)	Myanmar	Bago	124	89
		Kachin	28	26
		Kayin	768	645
		Mandalay	120	114
		Rakhine	19	9
		Sagaing	93	38
		Shan	65	30
		Tanintharyi	51	50
	Thailand	Ranong	27	20
	Thailand	Tak	967	867
Asia - Southeast - East (AS-SE-E)	Thailand	Sisakhet	152	97
	Thailand	Ubon Ratchathani	11	10
	Laos	Attapeu	538	444
		Champasak	282	244
		Salavan	240	214
		Savannakhet	891	783
		Sekong	43	38
	Cambodia	Battambang	65	51
		Koh Kong	5	5
		Pailin	286	187
		Preah Vihear	567	457
		Pursat	671	459
		Ratanakiri	420	357
		Stueng Traeng	268	232
	Vietnam	Bac Lieu	4	1
		Binh Phuoc	892	754
		Binh Thuan	11	0
		Dak Lak	311	300
		Dak Nong	116	112
		Gia Lai	815	756
		Khanh Hoa	95	82
		Ninh Thuan	215	85
		Phu Yen	94	89
		Quang Nam	95	76
		Quang Tri	52	47
Oceania - New Guinea (OC-NG)	Indonesia	Papua	133	119
	Papua New Guinea	East Sepik	166	149
		Madang	55	44
		Milne Bay	30	29
Unverified identity			330	0
Total			33325	24409

Table Legend¶

Breakdown of Pf8 samples by geography. Sites are divided into ten populations. Note that a) samples from Kisumu in western Kenya have been assigned to the Africa - Northeast (AF-NE) population, whereas samples from Kilifi in southern Kenya have been assigned to the Africa - East (AF-E) population, b) samples from Odisha and West Bengal in India to the west of Bangladesh have been assigned to the Asia - South Asia - West (AS-S-E) population, whereas samples from Tripura in India to the east of Bangladesh have been assigned to the Asia - South Asia - East (AS-S-FE) population and c) samples from Ranong and Tak in western Thailand have been assigned to the Western SE Asia (AS-SE-W) region, whereas samples from Sisakhet in eastern Thailand have been assigned to the Eastern SE Asia (AS-SE-E) region. Unverified identity refers to samples which contain suspected lab contaminants, are pure or mixed with lab strains, or had an incorrectly labelled continent (such as samples from returning travellers).

Write the table to a file¶

We can output this to a location in Google Drive

First we need to connect Google Drive by running the following:

# You will need to authorise Google Colab access to Google Drive
drive.mount('/content/drive')

Mounted at /content/drive

# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else
table_df_final.to_excel('/content/drive/My Drive/summary_table_sample_summary.xlsx')

MalariaGEN parasite data user guide

Create sample location summary table

Contents