Create sample location summary table
Contents
Create sample location summary table¶
Introduction¶
This notebook creates a table - a summary of samples broken down by major sub-population, country, and first administrative division - using the publicly available Pf8 dataset for Plasmodium falciparum.
This notebook should take approximately one minute to run.
Setup¶
Install the malariagen Python package:
!pip install malariagen_data -q --no-warn-conflicts
Installing build dependencies ... ?25l?25hdone
Getting requirements to build wheel ... ?25l?25hdone
Preparing metadata (pyproject.toml) ... ?25l?25hdone
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.0/4.0 MB 51.6 MB/s eta 0:00:00
?25h Preparing metadata (setup.py) ... ?25l?25hdone
Preparing metadata (setup.py) ... ?25l?25hdone
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71.7/71.7 kB 5.5 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 765.4/765.4 kB 40.9 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.4/24.4 MB 65.1 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.1/8.1 MB 91.3 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 210.7/210.7 kB 14.2 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 77.2 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 81.3 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.8/7.8 MB 83.0 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 47.0 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77.4/77.4 kB 5.7 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.6/8.6 MB 85.2 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 228.0/228.0 kB 16.1 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43.3/43.3 kB 2.9 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.6/12.6 MB 26.8 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 43.2 MB/s eta 0:00:00
?25h Building wheel for malariagen_data (pyproject.toml) ... ?25l?25hdone
Building wheel for dash-cytoscape (setup.py) ... ?25l?25hdone
Building wheel for asciitree (setup.py) ... ?25l?25hdone
Import required libraries:
import numpy as np
import dask
import dask.array as da
from dask.diagnostics.progress import ProgressBar
import malariagen_data
import pandas as pd
import collections
from google.colab import drive
Access Pf8 Data and Metadata¶
Here we define the data and metadata as pandas dataframes
release_data = malariagen_data.Pf8()
sample_metadata = release_data.sample_metadata()
Produce the Table of Samples per Admin Division Level 1¶
Define a dictionary for sub-population names¶
To make the eventual table output easier to interpret, we add a column ‘Region’ to the dataframe. The dictionary matches the codes for each major sub-population to their full names. E.g. ‘SA’ becomes ‘South America (SA)’ in the ‘Region’ column.
We initially give a numeric prefix to the sub-populations so that they will appear in our desired order (west to east) in the table. This will be removed later on.
# Define a dictionary to link population code with the full name of each region
population_legend = collections.OrderedDict()
population_legend['SA'] = "01. South America"
population_legend['AF-W'] = "02. Africa - West"
population_legend['AF-C'] = "03. Africa - Central"
population_legend['AF-NE'] = "04. Africa - Northeast"
population_legend['AF-E'] = "05. Africa - East"
population_legend['AS-S-E'] = "06. Asia - South - East"
population_legend['AS-S-FE'] = "07. Asia - South - Far East"
population_legend['AS-SE-W'] = "08. Asia - Southeast - West"
population_legend['AS-SE-E'] = "09. Asia - Southeast - East"
population_legend['OC-NG'] = "10. Oceania - New Guinea"
# Include full name of each population as 'Region' in the sample_metadata dataframe
sample_metadata['Region'] = sample_metadata['Population'].apply(lambda x: f'{population_legend[x]} ({x})' if x in population_legend else '11. Unverified identity')
Add a longitude column for sorting the table¶
We already have longitude information for country and admin division which we could use to sort the table in our desired west to east order. However, there are some locations who’s subpopulation region and country longitude mean they don’t sort in a consistent west-to-east manner.
Here we add a new longitude column for sorting the table ‘country_long_table’ which keeps all longitude values the same, except for locations which need a fake longitude to enable them to sort properly with the rest of their major sub-population or country.
We make the new column so that the original longitude values are retained, in case they were ever needed in the future.
sample_metadata['country_long_table'] = sample_metadata['Country longitude'] # Add the new column for sorting the table
sample_metadata.loc[sample_metadata['Admin level 1'] == 'Kisumu', 'country_long_table'] = 40 # Want it to appear last in AF-NE
sample_metadata.loc[sample_metadata['Admin level 1'] == 'Kilifi', 'country_long_table'] = 34 # Want it to appear first in AF-E
Aggredate the data by our fields of interest¶
Here we group by Region, Country, Admin level 1 and count the number of sequenced samples, then output to a dataframe, then order the locations from West to East based on country.
# Aggregate data for all samples and store in a dataframe 'agg1'
agg1 = (
sample_metadata.groupby(
['Region','Country','Admin level 1','country_long_table'])
.size()
.to_frame('Sequenced samples')
.reset_index()
.sort_values(by = ['Region','country_long_table','Admin level 1'])
)
# Do this again, but just for samples passing QC rather than all samples
# We define new dataframes to keep each stage of data manipulation separate
agg2 = (
sample_metadata.groupby(
['Region','Country','Admin level 1','country_long_table','QC pass']
)
.size()
.to_frame('Analysis set samples')
.reset_index()
.sort_values(by = ['Region','country_long_table','Admin level 1'])
)
agg3 = agg2.loc[agg2['QC pass'] == True]
Merge these aggregated dataframes and drop duplicate columns¶
table_df = (
pd.merge(agg1, agg3, on = 'Admin level 1', how = 'left',suffixes=('', '_y'))
.sort_values(by = ['Region','country_long_table', 'Admin level 1'])
) # Merge
table_df.drop(table_df.filter(regex='_y$').columns, axis=1, inplace=True) # Drop duplicate columns
table_df_trimmed = table_df.drop(['country_long_table','QC pass'], axis = 1) # Drop further uneccesary columns and make a new df to avoid confusion
Trim off the number from Region column¶
# Remove the first 3 characters from each value in the column
table_df_trimmed['Region'] = table_df_trimmed['Region'].str[3:]
Add in samples with unverified identity¶
# Get our unverified location samples and count them
ui_samps = sample_metadata[sample_metadata['Region'] == '11. Unverified identity']
no_ui = len(ui_samps)
no_ui_qc = len(ui_samps.loc[ui_samps['QC pass'] == True])
# Create a mini dataframe to append to our main table
ui_sum = {
'Region': ['Unverified identity'],
'Country': [''],
'Admin level 1': [''],
'Sequenced samples': [no_ui],
'Analysis set samples': [no_ui_qc]
}
ui_sum_df = pd.DataFrame(ui_sum)
# Add to the main table
table_df_ui = pd.concat([table_df_trimmed, ui_sum_df], ignore_index = True)
table_df_ui['Analysis set samples'] = table_df_ui['Analysis set samples'].fillna(0).astype('int') # Convert from float to integer
Add a totals row at the bottom¶
total_samps = table_df_ui['Sequenced samples'].agg('sum')
total_analysis = table_df_ui['Analysis set samples'].agg('sum')
totals = {
'Region': ['Total'],
'Country': [''],
'Admin level 1': [''],
'Sequenced samples': [total_samps],
'Analysis set samples': [total_analysis]
}
totals_df = pd.DataFrame(totals)
table_df_ui_totals = pd.concat([table_df_ui, totals_df], ignore_index = True)
# This has all the info we need - but it could still be clearer to read
# Make it so there is just one name per Region / Country
pd.options.display.max_rows = 300
table_df_final = table_df_ui_totals.set_index(['Region', 'Country', 'Admin level 1'])
table_df_final
Sequenced samples | Analysis set samples | |||
---|---|---|---|---|
Region | Country | Admin level 1 | ||
South America (SA) | Honduras | Francisco Morazan | 8 | 0 |
Peru | Loreto | 106 | 85 | |
Colombia | Cauca | 146 | 123 | |
Choco | 3 | 3 | ||
Nariño | 7 | 6 | ||
Norte de Santander | 8 | 5 | ||
Valle del Cauca | 3 | 3 | ||
Venezuela | Bolivar | 2 | 2 | |
Africa - West (AF-W) | Gambia | Banjul | 61 | 55 |
North Bank | 252 | 183 | ||
Upper River | 1121 | 646 | ||
Western | 564 | 492 | ||
Senegal | Dakar | 93 | 91 | |
Sedhiou | 62 | 60 | ||
Guinea | Faranah | 60 | 37 | |
Nzerekore | 139 | 112 | ||
Mauritania | Guidimaka | 23 | 21 | |
Hodh ech Chargui | 40 | 32 | ||
Hodh el Gharbi | 41 | 39 | ||
Côte d'Ivoire | Abidjan | 71 | 71 | |
Mali | Bamako | 215 | 209 | |
Kayes | 379 | 250 | ||
Koulikoro | 1399 | 658 | ||
Mopti | 9 | 8 | ||
Segou | 265 | 71 | ||
Sikasso | 161 | 57 | ||
Burkina Faso | Haut-Bassins | 58 | 56 | |
Ghana | Ashanti | 452 | 391 | |
Brong Ahafo | 69 | 50 | ||
Central | 686 | 309 | ||
Eastern | 110 | 74 | ||
Greater Accra | 474 | 435 | ||
Upper East | 4821 | 3605 | ||
Volta | 41 | 41 | ||
Benin | Atlantique | 57 | 45 | |
Littoral | 277 | 100 | ||
Nigeria | Adamawa | 72 | 37 | |
Cross River | 73 | 16 | ||
Delta | 31 | 13 | ||
Federal Capital Territory | 5 | 0 | ||
Kano | 37 | 22 | ||
Kebbi | 71 | 25 | ||
Kwara | 8 | 5 | ||
Lagos | 188 | 129 | ||
Ogun | 96 | 38 | ||
Ondo | 56 | 29 | ||
Osun | 55 | 15 | ||
Oyo | 611 | 71 | ||
Gabon | Wouleu-Ntem | 59 | 55 | |
Cameroon | Sud-Ouest | 294 | 261 | |
Africa - Central (AF-C) | Democratic Republic of the Congo | Kinshasa | 1549 | 1212 |
Africa - Northeast (AF-NE) | Sudan | Al Jazirah | 17 | 1 |
Blue Nile | 128 | 1 | ||
Kassala | 41 | 18 | ||
Khartoum | 154 | 85 | ||
North Kordofan | 16 | 10 | ||
Uganda | Apac | 15 | 10 | |
Ethiopia | Amhara | 15 | 10 | |
Oromia | 19 | 12 | ||
Southern Nations, Nationalities and Peoples | 1 | 0 | ||
Kenya | Kisumu | 64 | 62 | |
Africa - East (AF-E) | Kenya | Kilifi | 2078 | 1872 |
Malawi | Chikwawa | 629 | 254 | |
Zomba | 52 | 34 | ||
Tanzania | Kagera | 61 | 52 | |
Kigoma | 199 | 143 | ||
Lindi | 79 | 65 | ||
Morogoro | 34 | 32 | ||
Pwani | 447 | 436 | ||
Tanga | 324 | 297 | ||
Mozambique | Cabo Delgado | 171 | 142 | |
Gaza | 171 | 95 | ||
Inhambane | 117 | 99 | ||
Maputo | 603 | 417 | ||
Sofala | 65 | 23 | ||
Tete | 128 | 103 | ||
Zambezia | 93 | 85 | ||
Madagascar | Fianarantsoa | 1 | 1 | |
Mahajanga | 24 | 23 | ||
Asia - South - East (AS-S-E) | India | Odisha | 124 | 93 |
West Bengal | 122 | 95 | ||
Asia - South - Far East (AS-S-FE) | India | Tripura | 72 | 62 |
Bangladesh | Chittagong | 1658 | 1315 | |
Asia - Southeast - West (AS-SE-W) | Myanmar | Bago | 124 | 89 |
Kachin | 28 | 26 | ||
Kayin | 768 | 645 | ||
Mandalay | 120 | 114 | ||
Rakhine | 19 | 9 | ||
Sagaing | 93 | 38 | ||
Shan | 65 | 30 | ||
Tanintharyi | 51 | 50 | ||
Thailand | Ranong | 27 | 20 | |
Tak | 967 | 867 | ||
Asia - Southeast - East (AS-SE-E) | Thailand | Sisakhet | 152 | 97 |
Ubon Ratchathani | 11 | 10 | ||
Laos | Attapeu | 538 | 444 | |
Champasak | 282 | 244 | ||
Salavan | 240 | 214 | ||
Savannakhet | 891 | 783 | ||
Sekong | 43 | 38 | ||
Cambodia | Battambang | 65 | 51 | |
Koh Kong | 5 | 5 | ||
Pailin | 286 | 187 | ||
Preah Vihear | 567 | 457 | ||
Pursat | 671 | 459 | ||
Ratanakiri | 420 | 357 | ||
Stueng Traeng | 268 | 232 | ||
Vietnam | Bac Lieu | 4 | 1 | |
Binh Phuoc | 892 | 754 | ||
Binh Thuan | 11 | 0 | ||
Dak Lak | 311 | 300 | ||
Dak Nong | 116 | 112 | ||
Gia Lai | 815 | 756 | ||
Khanh Hoa | 95 | 82 | ||
Ninh Thuan | 215 | 85 | ||
Phu Yen | 94 | 89 | ||
Quang Nam | 95 | 76 | ||
Quang Tri | 52 | 47 | ||
Oceania - New Guinea (OC-NG) | Indonesia | Papua | 133 | 119 |
Papua New Guinea | East Sepik | 166 | 149 | |
Madang | 55 | 44 | ||
Milne Bay | 30 | 29 | ||
Unverified identity | 330 | 0 | ||
Total | 33325 | 24409 |
Table Legend¶
Breakdown of Pf8 samples by geography. Sites are divided into ten populations. Note that a) samples from Kisumu in western Kenya have been assigned to the Africa - Northeast (AF-NE) population, whereas samples from Kilifi in southern Kenya have been assigned to the Africa - East (AF-E) population, b) samples from Odisha and West Bengal in India to the west of Bangladesh have been assigned to the Asia - South Asia - West (AS-S-E) population, whereas samples from Tripura in India to the east of Bangladesh have been assigned to the Asia - South Asia - East (AS-S-FE) population and c) samples from Ranong and Tak in western Thailand have been assigned to the Western SE Asia (AS-SE-W) region, whereas samples from Sisakhet in eastern Thailand have been assigned to the Eastern SE Asia (AS-SE-E) region. Unverified identity refers to samples which contain suspected lab contaminants, are pure or mixed with lab strains, or had an incorrectly labelled continent (such as samples from returning travellers).
Write the table to a file¶
We can output this to a location in Google Drive
First we need to connect Google Drive by running the following:
# You will need to authorise Google Colab access to Google Drive
drive.mount('/content/drive')
Mounted at /content/drive
# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else
table_df_final.to_excel('/content/drive/My Drive/summary_table_sample_summary.xlsx')