Create sample location summary table

Introduction

This notebook creates a table - a summary of samples broken down by major sub-population, country, and first administrative division - using the publicly available Pf8 dataset for Plasmodium falciparum.

This notebook should take approximately one minute to run.

Setup

Install the malariagen Python package:

!pip install malariagen_data -q --no-warn-conflicts
  Installing build dependencies ... ?25l?25hdone
  Getting requirements to build wheel ... ?25l?25hdone
  Preparing metadata (pyproject.toml) ... ?25l?25hdone
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.0/4.0 MB 51.6 MB/s eta 0:00:00
?25h  Preparing metadata (setup.py) ... ?25l?25hdone
  Preparing metadata (setup.py) ... ?25l?25hdone
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71.7/71.7 kB 5.5 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 765.4/765.4 kB 40.9 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.4/24.4 MB 65.1 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.1/8.1 MB 91.3 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 210.7/210.7 kB 14.2 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 77.2 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 81.3 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.8/7.8 MB 83.0 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 47.0 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77.4/77.4 kB 5.7 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.6/8.6 MB 85.2 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 228.0/228.0 kB 16.1 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43.3/43.3 kB 2.9 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.6/12.6 MB 26.8 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 43.2 MB/s eta 0:00:00
?25h  Building wheel for malariagen_data (pyproject.toml) ... ?25l?25hdone
  Building wheel for dash-cytoscape (setup.py) ... ?25l?25hdone
  Building wheel for asciitree (setup.py) ... ?25l?25hdone

Import required libraries:

import numpy as np
import dask
import dask.array as da
from dask.diagnostics.progress import ProgressBar
import malariagen_data
import pandas as pd
import collections
from google.colab import drive

Access Pf8 Data and Metadata

Here we define the data and metadata as pandas dataframes

release_data = malariagen_data.Pf8()
sample_metadata = release_data.sample_metadata()

Produce the Table of Samples per Admin Division Level 1

Define a dictionary for sub-population names

To make the eventual table output easier to interpret, we add a column ‘Region’ to the dataframe. The dictionary matches the codes for each major sub-population to their full names. E.g. ‘SA’ becomes ‘South America (SA)’ in the ‘Region’ column.

We initially give a numeric prefix to the sub-populations so that they will appear in our desired order (west to east) in the table. This will be removed later on.

# Define a dictionary to link population code with the full name of each region
population_legend = collections.OrderedDict()
population_legend['SA']       = "01. South America"
population_legend['AF-W']     = "02. Africa - West"
population_legend['AF-C']     = "03. Africa - Central"
population_legend['AF-NE']    = "04. Africa - Northeast"
population_legend['AF-E']     = "05. Africa - East"
population_legend['AS-S-E']  = "06. Asia - South - East"
population_legend['AS-S-FE']  = "07. Asia - South - Far East"
population_legend['AS-SE-W'] = "08. Asia - Southeast - West"
population_legend['AS-SE-E'] = "09. Asia - Southeast - East"
population_legend['OC-NG']    = "10. Oceania - New Guinea"

# Include full name of each population as 'Region' in the sample_metadata dataframe
sample_metadata['Region'] = sample_metadata['Population'].apply(lambda x: f'{population_legend[x]} ({x})' if x in population_legend else '11. Unverified identity')

Add a longitude column for sorting the table

We already have longitude information for country and admin division which we could use to sort the table in our desired west to east order. However, there are some locations who’s subpopulation region and country longitude mean they don’t sort in a consistent west-to-east manner.

Here we add a new longitude column for sorting the table ‘country_long_table’ which keeps all longitude values the same, except for locations which need a fake longitude to enable them to sort properly with the rest of their major sub-population or country.

We make the new column so that the original longitude values are retained, in case they were ever needed in the future.

sample_metadata['country_long_table'] = sample_metadata['Country longitude'] # Add the new column for sorting the table
sample_metadata.loc[sample_metadata['Admin level 1'] == 'Kisumu', 'country_long_table'] = 40 # Want it to appear last in AF-NE
sample_metadata.loc[sample_metadata['Admin level 1'] == 'Kilifi', 'country_long_table'] = 34 # Want it to appear first in AF-E

Aggredate the data by our fields of interest

Here we group by Region, Country, Admin level 1 and count the number of sequenced samples, then output to a dataframe, then order the locations from West to East based on country.

# Aggregate data for all samples and store in a dataframe 'agg1'
agg1 = (
    sample_metadata.groupby(
        ['Region','Country','Admin level 1','country_long_table'])
    .size()
    .to_frame('Sequenced samples')
    .reset_index()
    .sort_values(by = ['Region','country_long_table','Admin level 1'])
)
# Do this again, but just for samples passing QC rather than all samples
# We define new dataframes to keep each stage of data manipulation separate
agg2 = (
    sample_metadata.groupby(
        ['Region','Country','Admin level 1','country_long_table','QC pass']
        )
    .size()
    .to_frame('Analysis set samples')
    .reset_index()
    .sort_values(by = ['Region','country_long_table','Admin level 1'])
)
agg3 = agg2.loc[agg2['QC pass'] == True]

Merge these aggregated dataframes and drop duplicate columns

table_df = (
    pd.merge(agg1, agg3, on = 'Admin level 1', how = 'left',suffixes=('', '_y'))
    .sort_values(by = ['Region','country_long_table', 'Admin level 1'])
    ) # Merge
table_df.drop(table_df.filter(regex='_y$').columns, axis=1, inplace=True) # Drop duplicate columns
table_df_trimmed = table_df.drop(['country_long_table','QC pass'], axis = 1) # Drop further uneccesary columns and make a new df to avoid confusion

Trim off the number from Region column

# Remove the first 3 characters from each value in the column
table_df_trimmed['Region'] = table_df_trimmed['Region'].str[3:]

Add in samples with unverified identity

# Get our unverified location samples and count them
ui_samps = sample_metadata[sample_metadata['Region'] == '11. Unverified identity']
no_ui = len(ui_samps)
no_ui_qc = len(ui_samps.loc[ui_samps['QC pass'] == True])

# Create a mini dataframe to append to our main table
ui_sum = {
    'Region': ['Unverified identity'],
    'Country': [''],
    'Admin level 1': [''],
    'Sequenced samples': [no_ui],
    'Analysis set samples': [no_ui_qc]
}
ui_sum_df = pd.DataFrame(ui_sum)

# Add to the main table
table_df_ui = pd.concat([table_df_trimmed, ui_sum_df], ignore_index = True)
table_df_ui['Analysis set samples'] = table_df_ui['Analysis set samples'].fillna(0).astype('int') # Convert from float to integer

Add a totals row at the bottom

total_samps = table_df_ui['Sequenced samples'].agg('sum')
total_analysis = table_df_ui['Analysis set samples'].agg('sum')
totals =  {
    'Region': ['Total'],
    'Country': [''],
    'Admin level 1': [''],
    'Sequenced samples': [total_samps],
    'Analysis set samples': [total_analysis]
}
totals_df = pd.DataFrame(totals)
table_df_ui_totals = pd.concat([table_df_ui, totals_df], ignore_index = True)
# This has all the info we need - but it could still be clearer to read
# Make it so there is just one name per Region / Country
pd.options.display.max_rows = 300
table_df_final = table_df_ui_totals.set_index(['Region', 'Country', 'Admin level 1'])
table_df_final
Sequenced samples Analysis set samples
Region Country Admin level 1
South America (SA) Honduras Francisco Morazan 8 0
Peru Loreto 106 85
Colombia Cauca 146 123
Choco 3 3
Nariño 7 6
Norte de Santander 8 5
Valle del Cauca 3 3
Venezuela Bolivar 2 2
Africa - West (AF-W) Gambia Banjul 61 55
North Bank 252 183
Upper River 1121 646
Western 564 492
Senegal Dakar 93 91
Sedhiou 62 60
Guinea Faranah 60 37
Nzerekore 139 112
Mauritania Guidimaka 23 21
Hodh ech Chargui 40 32
Hodh el Gharbi 41 39
Côte d'Ivoire Abidjan 71 71
Mali Bamako 215 209
Kayes 379 250
Koulikoro 1399 658
Mopti 9 8
Segou 265 71
Sikasso 161 57
Burkina Faso Haut-Bassins 58 56
Ghana Ashanti 452 391
Brong Ahafo 69 50
Central 686 309
Eastern 110 74
Greater Accra 474 435
Upper East 4821 3605
Volta 41 41
Benin Atlantique 57 45
Littoral 277 100
Nigeria Adamawa 72 37
Cross River 73 16
Delta 31 13
Federal Capital Territory 5 0
Kano 37 22
Kebbi 71 25
Kwara 8 5
Lagos 188 129
Ogun 96 38
Ondo 56 29
Osun 55 15
Oyo 611 71
Gabon Wouleu-Ntem 59 55
Cameroon Sud-Ouest 294 261
Africa - Central (AF-C) Democratic Republic of the Congo Kinshasa 1549 1212
Africa - Northeast (AF-NE) Sudan Al Jazirah 17 1
Blue Nile 128 1
Kassala 41 18
Khartoum 154 85
North Kordofan 16 10
Uganda Apac 15 10
Ethiopia Amhara 15 10
Oromia 19 12
Southern Nations, Nationalities and Peoples 1 0
Kenya Kisumu 64 62
Africa - East (AF-E) Kenya Kilifi 2078 1872
Malawi Chikwawa 629 254
Zomba 52 34
Tanzania Kagera 61 52
Kigoma 199 143
Lindi 79 65
Morogoro 34 32
Pwani 447 436
Tanga 324 297
Mozambique Cabo Delgado 171 142
Gaza 171 95
Inhambane 117 99
Maputo 603 417
Sofala 65 23
Tete 128 103
Zambezia 93 85
Madagascar Fianarantsoa 1 1
Mahajanga 24 23
Asia - South - East (AS-S-E) India Odisha 124 93
West Bengal 122 95
Asia - South - Far East (AS-S-FE) India Tripura 72 62
Bangladesh Chittagong 1658 1315
Asia - Southeast - West (AS-SE-W) Myanmar Bago 124 89
Kachin 28 26
Kayin 768 645
Mandalay 120 114
Rakhine 19 9
Sagaing 93 38
Shan 65 30
Tanintharyi 51 50
Thailand Ranong 27 20
Tak 967 867
Asia - Southeast - East (AS-SE-E) Thailand Sisakhet 152 97
Ubon Ratchathani 11 10
Laos Attapeu 538 444
Champasak 282 244
Salavan 240 214
Savannakhet 891 783
Sekong 43 38
Cambodia Battambang 65 51
Koh Kong 5 5
Pailin 286 187
Preah Vihear 567 457
Pursat 671 459
Ratanakiri 420 357
Stueng Traeng 268 232
Vietnam Bac Lieu 4 1
Binh Phuoc 892 754
Binh Thuan 11 0
Dak Lak 311 300
Dak Nong 116 112
Gia Lai 815 756
Khanh Hoa 95 82
Ninh Thuan 215 85
Phu Yen 94 89
Quang Nam 95 76
Quang Tri 52 47
Oceania - New Guinea (OC-NG) Indonesia Papua 133 119
Papua New Guinea East Sepik 166 149
Madang 55 44
Milne Bay 30 29
Unverified identity 330 0
Total 33325 24409

Table Legend

Breakdown of Pf8 samples by geography. Sites are divided into ten populations. Note that a) samples from Kisumu in western Kenya have been assigned to the Africa - Northeast (AF-NE) population, whereas samples from Kilifi in southern Kenya have been assigned to the Africa - East (AF-E) population, b) samples from Odisha and West Bengal in India to the west of Bangladesh have been assigned to the Asia - South Asia - West (AS-S-E) population, whereas samples from Tripura in India to the east of Bangladesh have been assigned to the Asia - South Asia - East (AS-S-FE) population and c) samples from Ranong and Tak in western Thailand have been assigned to the Western SE Asia (AS-SE-W) region, whereas samples from Sisakhet in eastern Thailand have been assigned to the Eastern SE Asia (AS-SE-E) region. Unverified identity refers to samples which contain suspected lab contaminants, are pure or mixed with lab strains, or had an incorrectly labelled continent (such as samples from returning travellers).

Write the table to a file

We can output this to a location in Google Drive

First we need to connect Google Drive by running the following:

# You will need to authorise Google Colab access to Google Drive
drive.mount('/content/drive')
Mounted at /content/drive
# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else
table_df_final.to_excel('/content/drive/My Drive/summary_table_sample_summary.xlsx')