Summarise Samples by Administrative Division

Introduction

This notebook recreates Supplementary Table 1 from the Pf7 paper - a summary of samples broken down by major sub-population, country, and first administrative division. Samples are sorted so that their corresponding countries appear in a West to East order in the table.

This notebook should take approximately one minute to run.

Setup

Install the malariagen Python package:

!pip install -q --no-warn-conflicts malariagen_data
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 133.2/133.2 kB 2.3 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 14.1 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.4/10.4 MB 31.0 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 40.9 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 302.5/302.5 kB 8.8 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 138.7/138.7 kB 5.0 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 40.1 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.1/8.1 MB 46.8 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 206.9/206.9 kB 13.8 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 233.6/233.6 kB 10.2 MB/s eta 0:00:00
?25h  Preparing metadata (setup.py) ... ?25l?25hdone
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.7/6.7 MB 37.0 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 57.1 MB/s eta 0:00:00
?25h  Building wheel for asciitree (setup.py) ... ?25l?25hdone

Import required libraries:

import numpy as np
import dask
import dask.array as da
from dask.diagnostics.progress import ProgressBar
import malariagen_data
import pandas as pd
import collections
from google.colab import drive

Access Pf7 Data and Metadata

Here we define the data and metadata as pandas dataframes

release_data = malariagen_data.Pf7()
sample_metadata = release_data.sample_metadata()

Produce the Table of Samples per Admin Division Level 1

####Define a dictionary for sub-population names

To make the eventual table output easier to interpret, we add a column ‘Region’ to the dataframe. The dictionary matches the codes for each major sub-population to their full names. E.g. ‘SA’ becomes ‘South America (SA)’ in the ‘Region’ column.

We initially give a numeric prefix to the sub-populations so that they will appear in our desired order (west to east) in the table. This will be removed later on.

# Define a dictionary to link population code with the full name of each region
population_legend = collections.OrderedDict()
population_legend['SA']       = "01. South America"
population_legend['AF-W']     = "02. Africa - West"
population_legend['AF-C']     = "03. Africa - Central"
population_legend['AF-NE']    = "04. Africa - Northeast"
population_legend['AF-E']     = "05. Africa - East"
population_legend['AS-S-E']  = "06. Asia - South - East"
population_legend['AS-S-FE']  = "07. Asia - South - Far East"
population_legend['AS-SE-W'] = "08. Asia - Southeast - West"
population_legend['AS-SE-E'] = "09. Asia - Southeast - East"
population_legend['OC-NG']    = "10. Oceania - New Guinea"

# Include full name of each population as 'Region' in the pf7_metadata dataframe
sample_metadata['Region'] = sample_metadata['Population'].apply(lambda x: f'{population_legend[x]} ({x})' if x in population_legend else '11. Unverified identity')

####Add a longitude column for sorting the table We already have longitude information for country and admin division which we could use to sort the table in our desired west to east order. However, there are some locations who’s subpopulation region and country longitude mean they don’t sort in a consistent west-to-east manner.

Here we add a new longitude column for sorting the table ‘country_long_table’ which keeps all longitude values the same, except for locations which need a fake longitude to enable them to sort properly with the rest of their major sub-population or country.

We make the new column so that the original longitude values are retained, in case they were ever needed in the future.

sample_metadata['country_long_table'] = sample_metadata['Country longitude'] # Add the new column for sorting the table
sample_metadata.loc[sample_metadata['Admin level 1'] == 'Kisumu', 'country_long_table'] = 40 # Want it to appear last in AF-NE
sample_metadata.loc[sample_metadata['Admin level 1'] == 'Kilifi', 'country_long_table'] = 34 # Want it to appear first in AF-E

Aggregate the data by our fields of interest

Here we group by Region, Country, Admin level 1 and count the number of sequenced samples, then output to a dataframe, then order the locations from West to East based on country.

# Aggregate data for all samples and store in a dataframe 'agg1'
agg1 = (
    sample_metadata.groupby(
        ['Region','Country','Admin level 1','country_long_table'])
    .size()
    .to_frame('Sequenced samples')
    .reset_index()
    .sort_values(by = ['Region','country_long_table','Admin level 1'])
)
# Do this again, but just for samples passing QC rather than all samples
# We define new dataframes to keep each stage of data manipulation separate
agg2 = (
    sample_metadata.groupby(
        ['Region','Country','Admin level 1','country_long_table','QC pass']
        )
    .size()
    .to_frame('Analysis set samples')
    .reset_index()
    .sort_values(by = ['Region','country_long_table','Admin level 1'])
)
agg3 = agg2.loc[agg2['QC pass'] == True]

Merge these aggregated dataframes and drop duplicate columns

table_df = (
    pd.merge(agg1, agg3, on = 'Admin level 1', how = 'left',suffixes=('', '_y'))
    .sort_values(by = ['Region','country_long_table', 'Admin level 1'])
    ) # Merge
table_df.drop(table_df.filter(regex='_y$').columns, axis=1, inplace=True) # Drop duplicate columns
table_df_trimmed = table_df.drop(['country_long_table','QC pass'], axis = 1) # Drop further uneccesary columns and make a new df to avoid confusion

Trim off the number from Region column

# Remove the first 3 characters from each value in the column
table_df_trimmed['Region'] = table_df_trimmed['Region'].str[3:]

Add in samples with unverified identity

# Get our unverified location samples and count them
ui_samps = sample_metadata[sample_metadata['Region'] == '11. Unverified identity']
no_ui = len(ui_samps)
no_ui_qc = len(ui_samps.loc[ui_samps['QC pass'] == True])

# Create a mini dataframe to append to our main table
ui_sum = {
    'Region': ['Unverified identity'],
    'Country': [''],
    'Admin level 1': [''],
    'Sequenced samples': [no_ui],
    'Analysis set samples': [no_ui_qc]
}
ui_sum_df = pd.DataFrame(ui_sum)

# Add to the main table
table_df_ui = pd.concat([table_df_trimmed, ui_sum_df], ignore_index = True)
table_df_ui['Analysis set samples'] = table_df_ui['Analysis set samples'].fillna(0).astype('int') # Convert from float to integer

Add a totals row at the bottom

total_samps = table_df_ui['Sequenced samples'].agg('sum')
total_analysis = table_df_ui['Analysis set samples'].agg('sum')
totals =  {
    'Region': ['Total'],
    'Country': [''],
    'Admin level 1': [''],
    'Sequenced samples': [total_samps],
    'Analysis set samples': [total_analysis]
}
totals_df = pd.DataFrame(totals)
table_df_ui_totals = pd.concat([table_df_ui, totals_df], ignore_index = True)
# This has all the info we need - but it could still be clearer to read
# Make it so there is just one name per Region / Country
pd.options.display.max_rows = 300
table_df_final = table_df_ui_totals.set_index(['Region', 'Country', 'Admin level 1'])
table_df_final
Sequenced samples Analysis set samples
Region Country Admin level 1
South America (SA) Peru Loreto 21 21
Colombia Cauca 146 123
Choco 3 3
Nariño 7 6
Valle del Cauca 3 3
Venezuela Bolivar 2 2
Africa - West (AF-W) Gambia North Bank 252 186
Upper River 760 452
Western 235 225
Senegal Dakar 93 91
Sedhiou 62 59
Guinea Faranah 60 37
Nzerekore 139 114
Mauritania Guidimaka 23 21
Hodh ech Chargui 40 32
Hodh el Gharbi 41 39
Côte d'Ivoire Abidjan 71 71
Mali Bamako 215 209
Kayes 379 250
Koulikoro 991 614
Mopti 9 8
Segou 49 29
Sikasso 161 57
Burkina Faso Haut-Bassins 58 57
Ghana Ashanti 286 278
Brong Ahafo 69 50
Central 175 104
Eastern 21 20
Greater Accra 198 184
Upper East 3300 2454
Volta 41 41
Benin Atlantique 57 45
Littoral 277 105
Nigeria Kwara 8 5
Lagos 132 105
Gabon Wouleu-Ntem 59 55
Cameroon Sud-Ouest 294 264
Africa - Central (AF-C) Democratic Republic of the Congo Kinshasa 573 520
Africa - Northeast (AF-NE) Sudan Blue Nile 66 0
Kassala 13 9
Khartoum 124 67
Uganda Apac 15 12
Ethiopia Amhara 15 10
Oromia 19 11
Kenya Kisumu 64 63
Africa - East (AF-E) Kenya Kilifi 662 627
Malawi Chikwawa 319 231
Zomba 52 34
Tanzania Kagera 61 52
Kigoma 199 143
Lindi 79 65
Morogoro 34 32
Tanga 324 297
Mozambique Gaza 91 34
Madagascar Fianarantsoa 1 1
Mahajanga 24 23
Asia - South - East (AS-S-E) India Odisha 122 114
West Bengal 122 119
Asia - South - Far East (AS-S-FE) India Tripura 72 67
Bangladesh Chittagong 1658 1310
Asia - Southeast - West (AS-SE-W) Myanmar Bago 124 89
Kachin 28 26
Kayin 760 631
Mandalay 120 114
Rakhine 19 7
Sagaing 93 38
Shan 65 30
Tanintharyi 51 50
Thailand Ranong 27 20
Tak 967 875
Asia - Southeast - East (AS-SE-E) Thailand Sisakhet 112 59
Laos Attapeu 210 204
Champasak 218 208
Salavan 147 144
Savannakhet 452 411
Sekong 25 24
Cambodia Battambang 65 51
Koh Kong 5 5
Pailin 286 191
Preah Vihear 216 150
Pursat 671 460
Ratanakiri 420 358
Stueng Traeng 60 52
Vietnam Bac Lieu 4 1
Binh Phuoc 751 657
Binh Thuan 11 0
Dak Lak 112 106
Dak Nong 73 70
Gia Lai 376 337
Khanh Hoa 66 50
Ninh Thuan 205 73
Quang Nam 95 75
Quang Tri 40 35
Oceania - New Guinea (OC-NG) Indonesia Papua 133 121
Papua New Guinea East Sepik 166 149
Madang 55 43
Milne Bay 30 29
Unverified identity 160 0
Total 20864 16203

Table Legend

Supplementary Table 1. Breakdown of analysis set samples by geography. Sites are divided into ten populations as described in the main text. Note that a) samples from Kisumu in western Kenya have been assigned to the Africa - Northeast (AF-NE) population, whereas samples from Kilifi in southern Kenya have been assigned to the Africa - East (AF-E) population, b) samples from Odisha and West Bengal in India to the west of Bangladesh have been assigned to the Asia - South Asia - West (AS-S-E) population, whereas samples from Tripura in India to the east of Bangladesh have been assigned to the Asia - South Asia - East (AS-S-FE) population and c) samples from Ranong and Tak in western Thailand have been assigned to the Western SE Asia (AS-SE-W) region, whereas samples from Sisakhet in eastern Thailand have been assigned to the Eastern SE Asia (AS-SE-E) region. 10 returning travellers were reported as returning from Ghana (3), Kenya (2), Uganda (2), Mozambique (1) and Papua, Indonesia (2) and were excluded from analysis. 65 samples which were identified as lab strains were also excluded from analysis.

Write the table to a file

We can output this to a location in Google Drive

First we need to connect Google Drive by running the following:

# You will need to authorise Google Colab access to Google Drive
drive.mount('/content/drive')
Mounted at /content/drive
# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else
table_df_final.to_excel('/content/drive/My Drive/Pf7_supp_table1.xlsx')