Summarise Samples by Administrative Division
Contents
Summarise Samples by Administrative Division¶
Introduction¶
This notebook recreates Supplementary Table 1 from the Pf7 paper - a summary of samples broken down by major sub-population, country, and first administrative division. Samples are sorted so that their corresponding countries appear in a West to East order in the table.
This notebook should take approximately one minute to run.
Setup¶
Install the malariagen Python package:
!pip install -q --no-warn-conflicts malariagen_data
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 133.2/133.2 kB 2.3 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 14.1 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.4/10.4 MB 31.0 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 40.9 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 302.5/302.5 kB 8.8 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 138.7/138.7 kB 5.0 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 40.1 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.1/8.1 MB 46.8 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 206.9/206.9 kB 13.8 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 233.6/233.6 kB 10.2 MB/s eta 0:00:00
?25h Preparing metadata (setup.py) ... ?25l?25hdone
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.7/6.7 MB 37.0 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 57.1 MB/s eta 0:00:00
?25h Building wheel for asciitree (setup.py) ... ?25l?25hdone
Import required libraries:
import numpy as np
import dask
import dask.array as da
from dask.diagnostics.progress import ProgressBar
import malariagen_data
import pandas as pd
import collections
from google.colab import drive
Access Pf7 Data and Metadata¶
Here we define the data and metadata as pandas dataframes
release_data = malariagen_data.Pf7()
sample_metadata = release_data.sample_metadata()
Produce the Table of Samples per Admin Division Level 1¶
####Define a dictionary for sub-population names
To make the eventual table output easier to interpret, we add a column ‘Region’ to the dataframe. The dictionary matches the codes for each major sub-population to their full names. E.g. ‘SA’ becomes ‘South America (SA)’ in the ‘Region’ column.
We initially give a numeric prefix to the sub-populations so that they will appear in our desired order (west to east) in the table. This will be removed later on.
# Define a dictionary to link population code with the full name of each region
population_legend = collections.OrderedDict()
population_legend['SA'] = "01. South America"
population_legend['AF-W'] = "02. Africa - West"
population_legend['AF-C'] = "03. Africa - Central"
population_legend['AF-NE'] = "04. Africa - Northeast"
population_legend['AF-E'] = "05. Africa - East"
population_legend['AS-S-E'] = "06. Asia - South - East"
population_legend['AS-S-FE'] = "07. Asia - South - Far East"
population_legend['AS-SE-W'] = "08. Asia - Southeast - West"
population_legend['AS-SE-E'] = "09. Asia - Southeast - East"
population_legend['OC-NG'] = "10. Oceania - New Guinea"
# Include full name of each population as 'Region' in the pf7_metadata dataframe
sample_metadata['Region'] = sample_metadata['Population'].apply(lambda x: f'{population_legend[x]} ({x})' if x in population_legend else '11. Unverified identity')
####Add a longitude column for sorting the table We already have longitude information for country and admin division which we could use to sort the table in our desired west to east order. However, there are some locations who’s subpopulation region and country longitude mean they don’t sort in a consistent west-to-east manner.
Here we add a new longitude column for sorting the table ‘country_long_table’ which keeps all longitude values the same, except for locations which need a fake longitude to enable them to sort properly with the rest of their major sub-population or country.
We make the new column so that the original longitude values are retained, in case they were ever needed in the future.
sample_metadata['country_long_table'] = sample_metadata['Country longitude'] # Add the new column for sorting the table
sample_metadata.loc[sample_metadata['Admin level 1'] == 'Kisumu', 'country_long_table'] = 40 # Want it to appear last in AF-NE
sample_metadata.loc[sample_metadata['Admin level 1'] == 'Kilifi', 'country_long_table'] = 34 # Want it to appear first in AF-E
Aggregate the data by our fields of interest¶
Here we group by Region, Country, Admin level 1 and count the number of sequenced samples, then output to a dataframe, then order the locations from West to East based on country.
# Aggregate data for all samples and store in a dataframe 'agg1'
agg1 = (
sample_metadata.groupby(
['Region','Country','Admin level 1','country_long_table'])
.size()
.to_frame('Sequenced samples')
.reset_index()
.sort_values(by = ['Region','country_long_table','Admin level 1'])
)
# Do this again, but just for samples passing QC rather than all samples
# We define new dataframes to keep each stage of data manipulation separate
agg2 = (
sample_metadata.groupby(
['Region','Country','Admin level 1','country_long_table','QC pass']
)
.size()
.to_frame('Analysis set samples')
.reset_index()
.sort_values(by = ['Region','country_long_table','Admin level 1'])
)
agg3 = agg2.loc[agg2['QC pass'] == True]
Merge these aggregated dataframes and drop duplicate columns¶
table_df = (
pd.merge(agg1, agg3, on = 'Admin level 1', how = 'left',suffixes=('', '_y'))
.sort_values(by = ['Region','country_long_table', 'Admin level 1'])
) # Merge
table_df.drop(table_df.filter(regex='_y$').columns, axis=1, inplace=True) # Drop duplicate columns
table_df_trimmed = table_df.drop(['country_long_table','QC pass'], axis = 1) # Drop further uneccesary columns and make a new df to avoid confusion
Trim off the number from Region column¶
# Remove the first 3 characters from each value in the column
table_df_trimmed['Region'] = table_df_trimmed['Region'].str[3:]
Add in samples with unverified identity¶
# Get our unverified location samples and count them
ui_samps = sample_metadata[sample_metadata['Region'] == '11. Unverified identity']
no_ui = len(ui_samps)
no_ui_qc = len(ui_samps.loc[ui_samps['QC pass'] == True])
# Create a mini dataframe to append to our main table
ui_sum = {
'Region': ['Unverified identity'],
'Country': [''],
'Admin level 1': [''],
'Sequenced samples': [no_ui],
'Analysis set samples': [no_ui_qc]
}
ui_sum_df = pd.DataFrame(ui_sum)
# Add to the main table
table_df_ui = pd.concat([table_df_trimmed, ui_sum_df], ignore_index = True)
table_df_ui['Analysis set samples'] = table_df_ui['Analysis set samples'].fillna(0).astype('int') # Convert from float to integer
Add a totals row at the bottom¶
total_samps = table_df_ui['Sequenced samples'].agg('sum')
total_analysis = table_df_ui['Analysis set samples'].agg('sum')
totals = {
'Region': ['Total'],
'Country': [''],
'Admin level 1': [''],
'Sequenced samples': [total_samps],
'Analysis set samples': [total_analysis]
}
totals_df = pd.DataFrame(totals)
table_df_ui_totals = pd.concat([table_df_ui, totals_df], ignore_index = True)
# This has all the info we need - but it could still be clearer to read
# Make it so there is just one name per Region / Country
pd.options.display.max_rows = 300
table_df_final = table_df_ui_totals.set_index(['Region', 'Country', 'Admin level 1'])
table_df_final
Sequenced samples | Analysis set samples | |||
---|---|---|---|---|
Region | Country | Admin level 1 | ||
South America (SA) | Peru | Loreto | 21 | 21 |
Colombia | Cauca | 146 | 123 | |
Choco | 3 | 3 | ||
Nariño | 7 | 6 | ||
Valle del Cauca | 3 | 3 | ||
Venezuela | Bolivar | 2 | 2 | |
Africa - West (AF-W) | Gambia | North Bank | 252 | 186 |
Upper River | 760 | 452 | ||
Western | 235 | 225 | ||
Senegal | Dakar | 93 | 91 | |
Sedhiou | 62 | 59 | ||
Guinea | Faranah | 60 | 37 | |
Nzerekore | 139 | 114 | ||
Mauritania | Guidimaka | 23 | 21 | |
Hodh ech Chargui | 40 | 32 | ||
Hodh el Gharbi | 41 | 39 | ||
Côte d'Ivoire | Abidjan | 71 | 71 | |
Mali | Bamako | 215 | 209 | |
Kayes | 379 | 250 | ||
Koulikoro | 991 | 614 | ||
Mopti | 9 | 8 | ||
Segou | 49 | 29 | ||
Sikasso | 161 | 57 | ||
Burkina Faso | Haut-Bassins | 58 | 57 | |
Ghana | Ashanti | 286 | 278 | |
Brong Ahafo | 69 | 50 | ||
Central | 175 | 104 | ||
Eastern | 21 | 20 | ||
Greater Accra | 198 | 184 | ||
Upper East | 3300 | 2454 | ||
Volta | 41 | 41 | ||
Benin | Atlantique | 57 | 45 | |
Littoral | 277 | 105 | ||
Nigeria | Kwara | 8 | 5 | |
Lagos | 132 | 105 | ||
Gabon | Wouleu-Ntem | 59 | 55 | |
Cameroon | Sud-Ouest | 294 | 264 | |
Africa - Central (AF-C) | Democratic Republic of the Congo | Kinshasa | 573 | 520 |
Africa - Northeast (AF-NE) | Sudan | Blue Nile | 66 | 0 |
Kassala | 13 | 9 | ||
Khartoum | 124 | 67 | ||
Uganda | Apac | 15 | 12 | |
Ethiopia | Amhara | 15 | 10 | |
Oromia | 19 | 11 | ||
Kenya | Kisumu | 64 | 63 | |
Africa - East (AF-E) | Kenya | Kilifi | 662 | 627 |
Malawi | Chikwawa | 319 | 231 | |
Zomba | 52 | 34 | ||
Tanzania | Kagera | 61 | 52 | |
Kigoma | 199 | 143 | ||
Lindi | 79 | 65 | ||
Morogoro | 34 | 32 | ||
Tanga | 324 | 297 | ||
Mozambique | Gaza | 91 | 34 | |
Madagascar | Fianarantsoa | 1 | 1 | |
Mahajanga | 24 | 23 | ||
Asia - South - East (AS-S-E) | India | Odisha | 122 | 114 |
West Bengal | 122 | 119 | ||
Asia - South - Far East (AS-S-FE) | India | Tripura | 72 | 67 |
Bangladesh | Chittagong | 1658 | 1310 | |
Asia - Southeast - West (AS-SE-W) | Myanmar | Bago | 124 | 89 |
Kachin | 28 | 26 | ||
Kayin | 760 | 631 | ||
Mandalay | 120 | 114 | ||
Rakhine | 19 | 7 | ||
Sagaing | 93 | 38 | ||
Shan | 65 | 30 | ||
Tanintharyi | 51 | 50 | ||
Thailand | Ranong | 27 | 20 | |
Tak | 967 | 875 | ||
Asia - Southeast - East (AS-SE-E) | Thailand | Sisakhet | 112 | 59 |
Laos | Attapeu | 210 | 204 | |
Champasak | 218 | 208 | ||
Salavan | 147 | 144 | ||
Savannakhet | 452 | 411 | ||
Sekong | 25 | 24 | ||
Cambodia | Battambang | 65 | 51 | |
Koh Kong | 5 | 5 | ||
Pailin | 286 | 191 | ||
Preah Vihear | 216 | 150 | ||
Pursat | 671 | 460 | ||
Ratanakiri | 420 | 358 | ||
Stueng Traeng | 60 | 52 | ||
Vietnam | Bac Lieu | 4 | 1 | |
Binh Phuoc | 751 | 657 | ||
Binh Thuan | 11 | 0 | ||
Dak Lak | 112 | 106 | ||
Dak Nong | 73 | 70 | ||
Gia Lai | 376 | 337 | ||
Khanh Hoa | 66 | 50 | ||
Ninh Thuan | 205 | 73 | ||
Quang Nam | 95 | 75 | ||
Quang Tri | 40 | 35 | ||
Oceania - New Guinea (OC-NG) | Indonesia | Papua | 133 | 121 |
Papua New Guinea | East Sepik | 166 | 149 | |
Madang | 55 | 43 | ||
Milne Bay | 30 | 29 | ||
Unverified identity | 160 | 0 | ||
Total | 20864 | 16203 |
Table Legend¶
Supplementary Table 1. Breakdown of analysis set samples by geography. Sites are divided into ten populations as described in the main text. Note that a) samples from Kisumu in western Kenya have been assigned to the Africa - Northeast (AF-NE) population, whereas samples from Kilifi in southern Kenya have been assigned to the Africa - East (AF-E) population, b) samples from Odisha and West Bengal in India to the west of Bangladesh have been assigned to the Asia - South Asia - West (AS-S-E) population, whereas samples from Tripura in India to the east of Bangladesh have been assigned to the Asia - South Asia - East (AS-S-FE) population and c) samples from Ranong and Tak in western Thailand have been assigned to the Western SE Asia (AS-SE-W) region, whereas samples from Sisakhet in eastern Thailand have been assigned to the Eastern SE Asia (AS-SE-E) region. 10 returning travellers were reported as returning from Ghana (3), Kenya (2), Uganda (2), Mozambique (1) and Papua, Indonesia (2) and were excluded from analysis. 65 samples which were identified as lab strains were also excluded from analysis.
Write the table to a file¶
We can output this to a location in Google Drive
First we need to connect Google Drive by running the following:
# You will need to authorise Google Colab access to Google Drive
drive.mount('/content/drive')
Mounted at /content/drive
# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else
table_df_final.to_excel('/content/drive/My Drive/Pf7_supp_table1.xlsx')