Summarise Samples by Administrative Division¶

Introduction¶

This notebook recreates Supplementary Table 1 from the Pf7 paper - a summary of samples broken down by major sub-population, country, and first administrative division. Samples are sorted so that their corresponding countries appear in a West to East order in the table.

This notebook should take approximately one minute to run.

Setup¶

Install the malariagen Python package:

!pip install -q --no-warn-conflicts malariagen_data

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 133.2/133.2 kB 2.3 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 14.1 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.4/10.4 MB 31.0 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 40.9 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 302.5/302.5 kB 8.8 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 138.7/138.7 kB 5.0 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 40.1 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.1/8.1 MB 46.8 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 206.9/206.9 kB 13.8 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 233.6/233.6 kB 10.2 MB/s eta 0:00:00
?25h  Preparing metadata (setup.py) ... ?25l?25hdone
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.7/6.7 MB 37.0 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 57.1 MB/s eta 0:00:00
?25h  Building wheel for asciitree (setup.py) ... ?25l?25hdone

Import required libraries:

import numpy as np
import dask
import dask.array as da
from dask.diagnostics.progress import ProgressBar
import malariagen_data
import pandas as pd
import collections
from google.colab import drive

Access Pf7 Data and Metadata¶

Here we define the data and metadata as pandas dataframes

release_data = malariagen_data.Pf7()
sample_metadata = release_data.sample_metadata()

Produce the Table of Samples per Admin Division Level 1¶

####Define a dictionary for sub-population names

To make the eventual table output easier to interpret, we add a column ‘Region’ to the dataframe. The dictionary matches the codes for each major sub-population to their full names. E.g. ‘SA’ becomes ‘South America (SA)’ in the ‘Region’ column.

We initially give a numeric prefix to the sub-populations so that they will appear in our desired order (west to east) in the table. This will be removed later on.

# Define a dictionary to link population code with the full name of each region
population_legend = collections.OrderedDict()
population_legend['SA']       = "01. South America"
population_legend['AF-W']     = "02. Africa - West"
population_legend['AF-C']     = "03. Africa - Central"
population_legend['AF-NE']    = "04. Africa - Northeast"
population_legend['AF-E']     = "05. Africa - East"
population_legend['AS-S-E']  = "06. Asia - South - East"
population_legend['AS-S-FE']  = "07. Asia - South - Far East"
population_legend['AS-SE-W'] = "08. Asia - Southeast - West"
population_legend['AS-SE-E'] = "09. Asia - Southeast - East"
population_legend['OC-NG']    = "10. Oceania - New Guinea"

# Include full name of each population as 'Region' in the pf7_metadata dataframe
sample_metadata['Region'] = sample_metadata['Population'].apply(lambda x: f'{population_legend[x]} ({x})' if x in population_legend else '11. Unverified identity')

####Add a longitude column for sorting the table We already have longitude information for country and admin division which we could use to sort the table in our desired west to east order. However, there are some locations who’s subpopulation region and country longitude mean they don’t sort in a consistent west-to-east manner.

Here we add a new longitude column for sorting the table ‘country_long_table’ which keeps all longitude values the same, except for locations which need a fake longitude to enable them to sort properly with the rest of their major sub-population or country.

We make the new column so that the original longitude values are retained, in case they were ever needed in the future.

sample_metadata['country_long_table'] = sample_metadata['Country longitude'] # Add the new column for sorting the table
sample_metadata.loc[sample_metadata['Admin level 1'] == 'Kisumu', 'country_long_table'] = 40 # Want it to appear last in AF-NE
sample_metadata.loc[sample_metadata['Admin level 1'] == 'Kilifi', 'country_long_table'] = 34 # Want it to appear first in AF-E

Aggregate the data by our fields of interest¶

Here we group by Region, Country, Admin level 1 and count the number of sequenced samples, then output to a dataframe, then order the locations from West to East based on country.

# Aggregate data for all samples and store in a dataframe 'agg1'
agg1 = (
    sample_metadata.groupby(
        ['Region','Country','Admin level 1','country_long_table'])
    .size()
    .to_frame('Sequenced samples')
    .reset_index()
    .sort_values(by = ['Region','country_long_table','Admin level 1'])
)

# Do this again, but just for samples passing QC rather than all samples
# We define new dataframes to keep each stage of data manipulation separate
agg2 = (
    sample_metadata.groupby(
        ['Region','Country','Admin level 1','country_long_table','QC pass']
        )
    .size()
    .to_frame('Analysis set samples')
    .reset_index()
    .sort_values(by = ['Region','country_long_table','Admin level 1'])
)
agg3 = agg2.loc[agg2['QC pass'] == True]

Merge these aggregated dataframes and drop duplicate columns¶

table_df = (
    pd.merge(agg1, agg3, on = 'Admin level 1', how = 'left',suffixes=('', '_y'))
    .sort_values(by = ['Region','country_long_table', 'Admin level 1'])
    ) # Merge
table_df.drop(table_df.filter(regex='_y$').columns, axis=1, inplace=True) # Drop duplicate columns
table_df_trimmed = table_df.drop(['country_long_table','QC pass'], axis = 1) # Drop further uneccesary columns and make a new df to avoid confusion

Trim off the number from Region column¶

# Remove the first 3 characters from each value in the column
table_df_trimmed['Region'] = table_df_trimmed['Region'].str[3:]

Add in samples with unverified identity¶

# Get our unverified location samples and count them
ui_samps = sample_metadata[sample_metadata['Region'] == '11. Unverified identity']
no_ui = len(ui_samps)
no_ui_qc = len(ui_samps.loc[ui_samps['QC pass'] == True])

# Create a mini dataframe to append to our main table
ui_sum = {
    'Region': ['Unverified identity'],
    'Country': [''],
    'Admin level 1': [''],
    'Sequenced samples': [no_ui],
    'Analysis set samples': [no_ui_qc]
}
ui_sum_df = pd.DataFrame(ui_sum)

# Add to the main table
table_df_ui = pd.concat([table_df_trimmed, ui_sum_df], ignore_index = True)
table_df_ui['Analysis set samples'] = table_df_ui['Analysis set samples'].fillna(0).astype('int') # Convert from float to integer

Add a totals row at the bottom¶

total_samps = table_df_ui['Sequenced samples'].agg('sum')
total_analysis = table_df_ui['Analysis set samples'].agg('sum')
totals =  {
    'Region': ['Total'],
    'Country': [''],
    'Admin level 1': [''],
    'Sequenced samples': [total_samps],
    'Analysis set samples': [total_analysis]
}
totals_df = pd.DataFrame(totals)
table_df_ui_totals = pd.concat([table_df_ui, totals_df], ignore_index = True)

# This has all the info we need - but it could still be clearer to read
# Make it so there is just one name per Region / Country
pd.options.display.max_rows = 300
table_df_final = table_df_ui_totals.set_index(['Region', 'Country', 'Admin level 1'])
table_df_final

			Sequenced samples	Analysis set samples
Region	Country	Admin level 1
South America (SA)	Peru	Loreto	21	21
	Colombia	Cauca	146	123
		Choco	3	3
		Nariño	7	6
		Valle del Cauca	3	3
	Venezuela	Bolivar	2	2
Africa - West (AF-W)	Gambia	North Bank	252	186
		Upper River	760	452
		Western	235	225
	Senegal	Dakar	93	91
	Senegal	Sedhiou	62	59
	Guinea	Faranah	60	37
	Guinea	Nzerekore	139	114
	Mauritania	Guidimaka	23	21
		Hodh ech Chargui	40	32
		Hodh el Gharbi	41	39
	Côte d'Ivoire	Abidjan	71	71
	Mali	Bamako	215	209
		Kayes	379	250
		Koulikoro	991	614
		Mopti	9	8
		Segou	49	29
		Sikasso	161	57
	Burkina Faso	Haut-Bassins	58	57
	Ghana	Ashanti	286	278
		Brong Ahafo	69	50
		Central	175	104
		Eastern	21	20
		Greater Accra	198	184
		Upper East	3300	2454
		Volta	41	41
	Benin	Atlantique	57	45
	Benin	Littoral	277	105
	Nigeria	Kwara	8	5
	Nigeria	Lagos	132	105
	Gabon	Wouleu-Ntem	59	55
	Cameroon	Sud-Ouest	294	264
Africa - Central (AF-C)	Democratic Republic of the Congo	Kinshasa	573	520
Africa - Northeast (AF-NE)	Sudan	Blue Nile	66	0
		Kassala	13	9
		Khartoum	124	67
	Uganda	Apac	15	12
	Ethiopia	Amhara	15	10
	Ethiopia	Oromia	19	11
	Kenya	Kisumu	64	63
Africa - East (AF-E)	Kenya	Kilifi	662	627
	Malawi	Chikwawa	319	231
	Malawi	Zomba	52	34
	Tanzania	Kagera	61	52
		Kigoma	199	143
		Lindi	79	65
		Morogoro	34	32
		Tanga	324	297
	Mozambique	Gaza	91	34
	Madagascar	Fianarantsoa	1	1
	Madagascar	Mahajanga	24	23
Asia - South - East (AS-S-E)	India	Odisha	122	114
Asia - South - East (AS-S-E)	India	West Bengal	122	119
Asia - South - Far East (AS-S-FE)	India	Tripura	72	67
Asia - South - Far East (AS-S-FE)	Bangladesh	Chittagong	1658	1310
Asia - Southeast - West (AS-SE-W)	Myanmar	Bago	124	89
		Kachin	28	26
		Kayin	760	631
		Mandalay	120	114
		Rakhine	19	7
		Sagaing	93	38
		Shan	65	30
		Tanintharyi	51	50
	Thailand	Ranong	27	20
	Thailand	Tak	967	875
Asia - Southeast - East (AS-SE-E)	Thailand	Sisakhet	112	59
	Laos	Attapeu	210	204
		Champasak	218	208
		Salavan	147	144
		Savannakhet	452	411
		Sekong	25	24
	Cambodia	Battambang	65	51
		Koh Kong	5	5
		Pailin	286	191
		Preah Vihear	216	150
		Pursat	671	460
		Ratanakiri	420	358
		Stueng Traeng	60	52
	Vietnam	Bac Lieu	4	1
		Binh Phuoc	751	657
		Binh Thuan	11	0
		Dak Lak	112	106
		Dak Nong	73	70
		Gia Lai	376	337
		Khanh Hoa	66	50
		Ninh Thuan	205	73
		Quang Nam	95	75
		Quang Tri	40	35
Oceania - New Guinea (OC-NG)	Indonesia	Papua	133	121
	Papua New Guinea	East Sepik	166	149
		Madang	55	43
		Milne Bay	30	29
Unverified identity			160	0
Total			20864	16203

Table Legend¶

Supplementary Table 1. Breakdown of analysis set samples by geography. Sites are divided into ten populations as described in the main text. Note that a) samples from Kisumu in western Kenya have been assigned to the Africa - Northeast (AF-NE) population, whereas samples from Kilifi in southern Kenya have been assigned to the Africa - East (AF-E) population, b) samples from Odisha and West Bengal in India to the west of Bangladesh have been assigned to the Asia - South Asia - West (AS-S-E) population, whereas samples from Tripura in India to the east of Bangladesh have been assigned to the Asia - South Asia - East (AS-S-FE) population and c) samples from Ranong and Tak in western Thailand have been assigned to the Western SE Asia (AS-SE-W) region, whereas samples from Sisakhet in eastern Thailand have been assigned to the Eastern SE Asia (AS-SE-E) region. 10 returning travellers were reported as returning from Ghana (3), Kenya (2), Uganda (2), Mozambique (1) and Papua, Indonesia (2) and were excluded from analysis. 65 samples which were identified as lab strains were also excluded from analysis.

Write the table to a file¶

We can output this to a location in Google Drive

First we need to connect Google Drive by running the following:

# You will need to authorise Google Colab access to Google Drive
drive.mount('/content/drive')

Mounted at /content/drive

# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else
table_df_final.to_excel('/content/drive/My Drive/Pf7_supp_table1.xlsx')

MalariaGEN parasite data user guide

Summarise Samples by Administrative Division

Contents