Explore the Pf6+ dataset

In this notebook we are going to:

  1. Describe how to set up your environment to run these notebooks for analysis and exploration

  2. Show how to directly access the data without the need to download them

  3. Explore some of the data and metadata available

  4. Showcase the richness of this dataset

Before we dig into the data, we want to recognise that this resource was possible because of enormous team efforts around the world over the last 18 years, encompassing 61 studies in 30 different countries. With over 16,500 samples available, we hope to continue disseminating these unique resources and increase their accessibility, so that they ultimately translate into improvements for public health.

If you use this resource please remember to also cite the following papers:

A massive thank you to all the partners and contributors involved:

Pf6 (first list) and GenRe-Mekong (second list)
Ambroise Ahouidi, Mozam Ali, Jacob Almagro-Garcia,
Alfred Amambua-Ngwa, Chanaki Amaratunga, 
Roberto Amato, Lucas Amenga-Etego, 
Ben Andagalu, Tim J. C. Anderson,
Voahangy Andrianaranjaka, Tobias Apinjoh, 
Cristina Ariani, Elizabeth A Ashley, Sarah Auburn,
Gordon A. Awandare, Hampate Ba, Vito Baraka, 
Alyssa E. Barry, Philip Bejon, Gwladys I. Bertin, 
Maciej F. Boni, Steffen Borrmann, Teun Bousema, 
Oralee Branch, Peter C. Bull, George B. J. Busby, 
Thanat Chookajorn, Kesinee Chotivanich, 
Antoine Claessens, David Conway, Alister Craig, 
Umberto D Alessandro, Souleymane Dama, 
Nicholas PJ Day, Brigitte Denis, Mahamadou Diakite, 
Abdoulaye Djimdé, Christiane Dolecek, Arjen M Dondorp,
Chris Drakeley, Eleanor Drury, Patrick Duffy, 
Diego F. Echeverry, Thomas G. Egwang, Berhanu Erko,
Rick M. Fairhurst, Abdul Faiz, Caterina A. Fanello, 
Mark M. Fukuda, Dionicia Gamboa, Anita Ghansah, 
Lemu Golassa, Sonia Goncalves, William L. Hamilton, 
G. L. Abby Harrison, Lee Hart, Christa Henrichs, 
Tran Tinh Hien, Catherine A. Hill, Abraham Hodgson, 
Christina Hubbart, Mallika Imwong, Deus S. Ishengoma,
Scott A. Jackson, Chris G. Jacob, Ben Jeffery, 
Anna E. Jeffreys, Kimberly J. Johnson, 
Dushyanth Jyothi, Claire Kamaliddin, Edwin Kamau,
Mihir Kekre, Krzysztof Kluczynski, Theerarat Kochakarn, 
Abibatou Konaté, Dominic P. Kwiatkowski, 
Myat Phone Kyaw, Pharath Lim, Chanthap Lon,
Kovana M. Loua, Oumou Maïga-Ascofaré, Cinzia Malangone, 
Magnus Manske, Jutta Marfurt, Kevin Marsh, 
Mayfong Mayxay, Alistair Miles, Olivo Miotto, 
Victor Mobegi, Olugbenga A. Mokuolu, Jacqui Montgomery, 
Ivo Mueller, Paul N. Newton, Thuy Nguyen,
Thuy-Nhien Nguyen, Harald Noed, François Nosten,
Rintis Noviyanti, Alexis Nzila, 
Lynette I. Ochola-Oyier, Harold Ocholla, 
Abraham Oduro, Irene Omedo, Marie A. Onyamboko, 
Jean-Bosco Ouedraogo, Kolapo Oyebola, 
Richard D. Pearson, Norbert Peshu, 
Aung Pyae Phyo, Chris V. Plowe, Ric N. Price, 
Sasithon Pukrittayakamee,
Milijaona Randrianarivelojosia, 
Julian C. Rayner, Pascal Ringwald, Kirk A. Rockett, 
Katherine Rowlands, Lastenia Ruiz, David Saunders, 
Alex Shayo, Peter Siba, Victoria J. Simpson, 
Jim Stalker, Xin-zhuan Su, Colin Sutherland, 
Shannon Takala-Harrison, Livingstone Tavu, 
Vandana Thathy, Antoinette Tshefu, Federica Verra, 
Joseph Vinetz, Thomas E. Wellems, Jason Wendler, 
Nicholas J. White, Ian Wright, William Yavo, Htut Ye 
    
Christopher G Jacob, Nguyen Thuy-Nhien, 
Mayfong Mayxay, Richard J Maude, 
Huynh Hong Quang, Bouasy Hongvanthong, 
Viengxay Vanisaveth, Thang Ngo Duc,
Huy Rekol, Rob van der Pluijm, 
Lorenz von Seidlein, Rick Fairhurst,
François Nosten, Md Amir Hossain, 
Naomi Park, Scott Goodwin, 
Pascal Ringwald, 
Keobouphaphone Chindavongsa,
Paul Newton, Elizabeth Ashley, 
Sonexay Phalivong, Rapeephan Maude, 
Rithea Leang, Cheah Huch, 
Le Thanh Dong, Kim-Tuyen Nguyen,
Tran Minh Nhat, Tran Tinh Hien, 
Hoa Nguyen, Nicole Zdrojewski, 
Sara Canavati, Abdullah Abu Sayeed, 
Didar Uddin, Caroline Buckee, 
Caterina I Fanello, Marie Onyamboko, 
Thomas Peto, Rupam Tripura, 
Chanaki Amaratunga, Aung Myint Thu, 
Gilles Delmas, Jordi Landier, 
Daniel M Parker, Nguyen Hoang Chau, 
Dysoley Lek, Seila Suon, 
James Callery, Podjanee Jittamala, 
Borimas Hanboonkunupakarn,
Sasithon Pukrittayakamee, 
Aung Pyae Phyo, Frank Smithuis, 
Khin Lin, Myo Thant, 
Tin Maung Hlaing, Parthasarathi Satpathi, 
Sanghamitra Satpathi, Prativa K Behera, 
Amar Tripura, Subrata Baidya, 
Neena Valecha, Anupkumar R Anvikar, 
Akhter Ul Islam, Abul Faiz, 
Chanon Kunasol, Eleanor Drury,
Mihir Kekre, Mozam Ali,
Katie Love, Shavanthi Rajatileka,
Anna E Jeffreys, Kate Rowlands, 
Christina S Hubbart, Mehul Dhorda,
Ranitha Vongpromek, Namfon Kotanan,
Phrutsamon Wongnak, Jacob Almagro Garcia,
Richard D Pearson, Cristina V Ariani,
Thanat Chookajorn, Cinzia Malangone,
T Nguyen, Jim Stalker,
Ben Jeffery, Jonathan Keatley, 
Kimberly J Johnson, Dawn Muddyman,
Xin Hui S Chan, John Sillitoe,
Roberto Amato, Victoria Simpson,
Sonia Gonçalves, Kirk Rockett,
Nicholas P Day, Arjen M Dondorp,
Dominic P Kwiatkowski, Olivo Miotto
    

Pf6+ Resource Exploration

Setting up your environment and notebooks

There are two ways to access the Pf6+ data resource and analysis notebooks:

  • Run the notebooks in a Google Colaboratory (Google Colab) environment. All you need is a Google account.

  • Run them on a local Jupyter instance on your computer.

If you are running this on Google Colab

If you are running the notebooks on Colab, then run the cell below to clone the Pf6+ repo:

!git clone https://github.com/malariagen/Pf6plus.git 
!cp -r /content/Pf6plus/pf6plus_documentation/notebooks/data_analysis .

If you are running this locally

If you prefer to run the notebooks locally, additional setup steps are required. If you haven’t already done so, please follow the local setup instructions in the Pf6+ repo on GitHub.

Get started

In order to generate the plots as seen in these notebooks, first import the functions from the data_analysis directory, which contain all of the code you will need:

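# pandas is used below (pd.read_csv); import it explicitly rather than relying on the star imports
import pandas as pd

from data_analysis.map_samples import *
from data_analysis.plot_sample_info import *

# Run this so that plots also show in the static page
import bokeh.io
bokeh.io.output_notebook()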

Load the dataset

The Pf6+ data resource builds on the latest Pf6 and GenRe-Mekong data resources, integrating roughly 7,000 whole-genome sequenced samples with roughly 10,000 genotyped samples.

Pf6+ is a highly heterogeneous data resource that contains samples from multiple independent studies and technologies (whole-genome sequencing, amplicon sequencing, and Agena genotyping): care needs to be taken when interpreting the aggregated analysis contained in these notebooks and any further work you may do.

Import the Pf6+ data

The command below reads the data directly from the online repository, without needing to download it first. However, you can still decide to download the dataset locally from https://pf6plus.cog.sanger.ac.uk/pf6plus_metadata.tsv and modify the line below to point to the correct location.

# URL of the dataset; change accordingly if you prefer to work from a local, offline copy
pf6plus_fn = 'https://pf6plus.cog.sanger.ac.uk/pf6plus_metadata.tsv'

# Read the dataset
pf6plus_all = pd.read_csv(pf6plus_fn, sep='\t', index_col=0, low_memory=False)
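As a minimal sketch of the offline option described above, the helper below reads a local copy when one exists and falls back to the online file otherwise. The function name `load_pf6plus` and the default local filename are illustrative assumptions and are not part of the Pf6+ code.

```python
from pathlib import Path

import pandas as pd

# URL taken from the cell above
PF6PLUS_URL = 'https://pf6plus.cog.sanger.ac.uk/pf6plus_metadata.tsv'


def load_pf6plus(local_path='pf6plus_metadata.tsv', url=PF6PLUS_URL):
    """Read the Pf6+ table from local_path if it exists, otherwise from url.

    Hypothetical helper: the name and the default path are assumptions.
    """
    source = local_path if Path(local_path).exists() else url
    # Same read options as in the cell above
    return pd.read_csv(source, sep='\t', index_col=0, low_memory=False)
```

Calling `pf6plus_all = load_pf6plus()` would then behave like the cell above, but without network access once the file has been downloaded next to the notebook.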

Which information is available to explore?

print(pf6plus_all.shape)
pf6plus_all.head()
(16720, 166)
Study Year Country AdmDiv1 Population Process IncludeInAnalysis Latitude_country Longitude_country Latitude_adm1 Longitude_adm1 PfCRT Kelch PfDHFR PfEXO PGB Plasmepsin2/3 PfDHPS PfMDR1 Species Pf3D7_02_v3:376222 Pf3D7_02_v3:470013 Pf3D7_03_v3:656861 Pf3D7_04_v3:110442 Pf3D7_04_v3:881571 Pf3D7_05_v3:350933 Pf3D7_05_v3:369740 Pf3D7_06_v3:900278 Pf3D7_07_v3:1044052 Pf3D7_08_v3:1314831 Pf3D7_08_v3:413067 Pf3D7_09_v3:900277 Pf3D7_11_v3:1018899 Pf3D7_11_v3:1815412 Pf3D7_13_v3:1056452 Pf3D7_13_v3:1466422 Pf3D7_14_v3:137622 Pf3D7_14_v3:2164225 Pf3D7_01_v3:145515 Pf3D7_03_v3:548178 ... Pyrimethamine Sulfadoxine S-P S-P-IPTp PfCRT:72 PfCRT:74 PfCRT:75 PfCRT:76 PfCRT:93 PfCRT:97 PfCRT:218 PfCRT:220 PfCRT:271 PfCRT:333 PfCRT:353 PfCRT:371 PfDHFR:16 PfDHFR:51 PfDHFR:59 PfDHFR:108 PfDHFR:164 PfDHFR:306 PfDHPS:436 PfDHPS:437 PfDHPS:540 PfDHPS:581 PfDHPS:613 PfEXO:415 PfMDR1:86 PfMDR1:184 PfMDR1:1034 PfMDR1:1042 PfMDR1:1226 PfMDR1:1246 PfARPS10:127 PfARPS10:128 PfFD:193 PfCRT:326 PfCRT:356 PfMDR2:484
SampleId
FP0008-C 1147-PF-MR-CONWAY 2014 Mauritania Hodh el Gharbi WAF WGS True 20.265149 -10.337093 16.565426 -9.832345 ----- WT ANC[S/N]IS E VDN[T/I]DT WT -[A/S][A/G]KAA N[F/Y]D Pf G G T T G G T G N N N A T G C G T A N C ... Undetermined Undetermined Sensitive Sensitive - - - [T/K] T H I [S/A] E T G [I/R] A N C [S/N] I S [A/S] [A/G] K A A E N [F/Y] S N F D V D D N [T/I] T
FP0009-C 1147-PF-MR-CONWAY 2014 Mauritania Hodh el Gharbi WAF WGS True 20.265149 -10.337093 16.565426 -9.832345 ----- WT AIRNIS E VDNTDT WT -AAKAA YFY Pf G A T C A G T G T G A G C C T G T G T C ... Resistant Sensitive Resistant Sensitive - - - T T H I S E T G I A I R N I S A A K A A E Y F S N F Y V D D N T T
FP0015-C 1147-PF-MR-CONWAY 2014 Mauritania Hodh el Gharbi WAF WGS True 20.265149 -10.337093 16.565426 -9.832345 ----- WT AIRNIS E VDNIDT WT -[A/S][A/G]KAA N[F/Y]D Pf G A T C G G T G C N G A N G T G T A T C ... Resistant Undetermined Resistant Sensitive - - - T T H I S E T G I A I R N I S [A/S] [A/G] K A A E N [F/Y] S N F D V D D N I T
FP0016-C 1147-PF-MR-CONWAY 2014 Mauritania Hodh el Gharbi WAF WGS True 20.265149 -10.337093 16.565426 -9.832345 ----- WT ANCSIS E VDNIDT WT -SGKAA NFD Pf A G T C G G T G C G N A T G T G C A N C ... Sensitive Resistant Sensitive Sensitive - - - [T/K] T H I S E T G [I/R] A N C S I S S G K A A E N F S N F D V D D N I T
FP0017-C 1147-PF-MR-CONWAY 2014 Mauritania Hodh el Gharbi WAF WGS True 20.265149 -10.337093 16.565426 -9.832345 CVMNK WT ANCSIS E VDNIDT WT -[A/F]AKA[A/S] NYD Pf G G T T A A T G C A G A N N T G T G T C ... Sensitive Sensitive Sensitive Sensitive C M N K T H I A Q T G R A N C S I S [A/F] A K A [A/S] E N Y S N F D V D D N I T

5 rows × 166 columns

The dataset has 16,720 samples (rows) and 166 columns:

print(list(pf6plus_all))
['Study', 'Year', 'Country', 'AdmDiv1', 'Population', 'Process', 'IncludeInAnalysis', 'Latitude_country', 'Longitude_country', 'Latitude_adm1', 'Longitude_adm1', 'PfCRT', 'Kelch', 'PfDHFR', 'PfEXO', 'PGB', 'Plasmepsin2/3', 'PfDHPS', 'PfMDR1', 'Species', 'Pf3D7_02_v3:376222', 'Pf3D7_02_v3:470013', 'Pf3D7_03_v3:656861', 'Pf3D7_04_v3:110442', 'Pf3D7_04_v3:881571', 'Pf3D7_05_v3:350933', 'Pf3D7_05_v3:369740', 'Pf3D7_06_v3:900278', 'Pf3D7_07_v3:1044052', 'Pf3D7_08_v3:1314831', 'Pf3D7_08_v3:413067', 'Pf3D7_09_v3:900277', 'Pf3D7_11_v3:1018899', 'Pf3D7_11_v3:1815412', 'Pf3D7_13_v3:1056452', 'Pf3D7_13_v3:1466422', 'Pf3D7_14_v3:137622', 'Pf3D7_14_v3:2164225', 'Pf3D7_01_v3:145515', 'Pf3D7_03_v3:548178', 'Pf3D7_04_v3:1102392', 'Pf3D7_04_v3:139051', 'Pf3D7_04_v3:286542', 'Pf3D7_04_v3:529500', 'Pf3D7_05_v3:796714', 'Pf3D7_07_v3:1256331', 'Pf3D7_07_v3:461139', 'Pf3D7_07_v3:619957', 'Pf3D7_08_v3:417335', 'Pf3D7_09_v3:163977', 'Pf3D7_10_v3:317581', 'Pf3D7_10_v3:336274', 'Pf3D7_11_v3:1020397', 'Pf3D7_11_v3:1294107', 'Pf3D7_11_v3:1935227', 'Pf3D7_11_v3:477922', 'Pf3D7_12_v3:1663492', 'Pf3D7_12_v3:2171901', 'Pf3D7_13_v3:1233218', 'Pf3D7_13_v3:1867630', 'Pf3D7_13_v3:2377887', 'Pf3D7_14_v3:2355751', 'Pf3D7_14_v3:3046108', 'Pf3D7_02_v3:529709', 'Pf3D7_02_v3:714480', 'Pf3D7_03_v3:155697', 'Pf3D7_04_v3:1037656', 'Pf3D7_04_v3:648101', 'Pf3D7_05_v3:1204155', 'Pf3D7_06_v3:1282691', 'Pf3D7_06_v3:1289212', 'Pf3D7_07_v3:1066698', 'Pf3D7_07_v3:1213486', 'Pf3D7_07_v3:704373', 'Pf3D7_08_v3:1313202', 'Pf3D7_08_v3:339406', 'Pf3D7_08_v3:701557', 'Pf3D7_09_v3:452690', 'Pf3D7_09_v3:599655', 'Pf3D7_10_v3:1383789', 'Pf3D7_10_v3:1385894', 'Pf3D7_11_v3:1006911', 'Pf3D7_11_v3:1295068', 'Pf3D7_11_v3:1802201', 'Pf3D7_12_v3:1667593', 'Pf3D7_12_v3:1934745', 'Pf3D7_12_v3:858501', 'Pf3D7_13_v3:1419519', 'Pf3D7_13_v3:159086', 'Pf3D7_13_v3:2161975', 'Pf3D7_13_v3:2573828', 'Pf3D7_13_v3:388365', 'Pf3D7_14_v3:2625887', 'Pf3D7_14_v3:3126219', 'Pf3D7_14_v3:438592', 'Pf3D7_01_v3:179347', 'Pf3D7_01_v3:180554', 
'Pf3D7_01_v3:283144', 'Pf3D7_01_v3:535211', 'Pf3D7_02_v3:839620', 'Pf3D7_04_v3:426436', 'Pf3D7_04_v3:531138', 'Pf3D7_04_v3:891732', 'Pf3D7_05_v3:172801', 'Pf3D7_06_v3:574938', 'Pf3D7_07_v3:1308383', 'Pf3D7_07_v3:1358910', 'Pf3D7_07_v3:1359218', 'Pf3D7_07_v3:635985', 'Pf3D7_08_v3:1056829', 'Pf3D7_08_v3:150033', 'Pf3D7_08_v3:399774', 'Pf3D7_09_v3:1379145', 'Pf3D7_10_v3:1386850', 'Pf3D7_11_v3:1935031', 'Pf3D7_11_v3:408668', 'Pf3D7_11_v3:828596', 'Pf3D7_12_v3:857245', 'Pf3D7_14_v3:107014', 'Pf3D7_14_v3:1757603', 'Pf3D7_14_v3:2733656', 'GenBarcode', 'Artemisinin', 'Piperaquine', 'DHA-PPQ', 'Chloroquine', 'Pyrimethamine', 'Sulfadoxine', 'S-P', 'S-P-IPTp', 'PfCRT:72', 'PfCRT:74', 'PfCRT:75', 'PfCRT:76', 'PfCRT:93', 'PfCRT:97', 'PfCRT:218', 'PfCRT:220', 'PfCRT:271', 'PfCRT:333', 'PfCRT:353', 'PfCRT:371', 'PfDHFR:16', 'PfDHFR:51', 'PfDHFR:59', 'PfDHFR:108', 'PfDHFR:164', 'PfDHFR:306', 'PfDHPS:436', 'PfDHPS:437', 'PfDHPS:540', 'PfDHPS:581', 'PfDHPS:613', 'PfEXO:415', 'PfMDR1:86', 'PfMDR1:184', 'PfMDR1:1034', 'PfMDR1:1042', 'PfMDR1:1226', 'PfMDR1:1246', 'PfARPS10:127', 'PfARPS10:128', 'PfFD:193', 'PfCRT:326', 'PfCRT:356', 'PfMDR2:484']

Even though there is a lot of data to sift through here, in reality the columns can be divided into four main groups:

  1. Sample metadata

  2. Barcode SNPs

  3. Drug resistance SNPs and haplotypes

  4. Resistance classification

Let’s explore these groups one at a time.
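Before doing so, the grouping can be sketched in code. The rules below are a heuristic based only on the column names printed above (the `Pf3D7_` prefix, the `gene:position` pattern, and the haplotype and drug column names); the function names are hypothetical and the rules are an assumption about how the columns split, not an official schema.

```python
from collections import Counter

# Column names copied from the printed column list above
HAPLOTYPE_COLUMNS = {'PfCRT', 'Kelch', 'PfDHFR', 'PfEXO', 'PGB',
                     'Plasmepsin2/3', 'PfDHPS', 'PfMDR1'}
DRUG_COLUMNS = {'Artemisinin', 'Piperaquine', 'DHA-PPQ', 'Chloroquine',
                'Pyrimethamine', 'Sulfadoxine', 'S-P', 'S-P-IPTp'}


def column_group(name):
    """Assign a column name to one of the four groups (heuristic)."""
    if name.startswith('Pf3D7_') or name == 'GenBarcode':
        return 'Barcode SNPs'
    if ':' in name or name in HAPLOTYPE_COLUMNS:
        return 'Drug resistance SNPs and haplotypes'
    if name in DRUG_COLUMNS:
        return 'Resistance classification'
    # Everything else is treated as sample metadata
    return 'Sample metadata'


def count_column_groups(columns):
    """Count how many columns fall into each group."""
    return Counter(column_group(name) for name in columns)
```

Running `count_column_groups(pf6plus_all.columns)` gives a quick check that every one of the 166 columns lands in one of the four groups.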

1. Sample metadata

The metadata included here provide information on each sample, including the study (MalariaGEN partner) it belongs to, year of collection, country, level 1 administrative division, population, the process or type of technology used for sequencing the samples, geographical coordinates, and species.

metadata_ls = ['Study','Year','Country','AdmDiv1','Population',
               'Process','IncludeInAnalysis',
               'Latitude_country','Longitude_country','Latitude_adm1','Longitude_adm1',
               'Species']
pf6plus_all[metadata_ls[:5]].head()
Study Year Country AdmDiv1 Population
SampleId
FP0008-C 1147-PF-MR-CONWAY 2014 Mauritania Hodh el Gharbi WAF
FP0009-C 1147-PF-MR-CONWAY 2014 Mauritania Hodh el Gharbi WAF
FP0015-C 1147-PF-MR-CONWAY 2014 Mauritania Hodh el Gharbi WAF
FP0016-C 1147-PF-MR-CONWAY 2014 Mauritania Hodh el Gharbi WAF
FP0017-C 1147-PF-MR-CONWAY 2014 Mauritania Hodh el Gharbi WAF

As you can see here, every sample belongs to a study. All studies are described in detail on the MalariaGEN website: explore Pf6 partner studies and GenRe partner studies. These descriptions are particularly useful for context on each study's characteristics and epidemiological design, and for identifying the people involved.

pd.options.display.max_colwidth = 100
pd.options.display.max_rows = 65
pf6plus_all.groupby('Study')['Country'].unique().agg(lambda x: ', '.join(x)).to_frame('Countries')
Countries
Study
1001-PF-ML-DJIMDE Mali
1004-PF-BF-OUEDRAOGO Burkina Faso
1006-PF-GM-CONWAY Gambia
1007-PF-TZ-DUFFY Tanzania
1008-PF-SEA-RINGWALD Vietnam, Myanmar, Laos
1010-PF-TH-ANDERSON Thailand
1011-PF-KH-SU Cambodia
1012-PF-KH-WHITE Cambodia
1013-PF-PEGB-BRANCH Peru
1014-PF-SSA-SUTHERLAND Ghana, Mozambique, Uganda, Kenya
1015-PF-KE-NZILA Kenya
1016-PF-TH-NOSTEN Thailand
1017-PF-GH-AMENGA-ETEGO Ghana
1020-PF-VN-BONI Vietnam
1021-PF-PG-MUELLER Papua New Guinea
1022-PF-MW-OCHOLLA Malawi
1023-PF-CO-ECHEVERRI-GARCIA Colombia
1024-PF-UG-BOUSEMA Uganda
1026-PF-GN-CONWAY Guinea
1027-PF-KE-BULL Kenya
1031-PF-SEA-PLOWE Thailand, Cambodia, Bangladesh
1044-PF-KH-FAIRHURST Cambodia
1052-PF-TRAC-WHITE Thailand, Cambodia, Bangladesh, Vietnam, Myanmar, Laos, Democratic Republic of the Congo, Nigeria
1062-PF-PG-BARRY Papua New Guinea
1083-PF-GH-CONWAY Ghana
1093-PF-CM-APINJOH Cameroon
1094-PF-GH-AMENGA-ETEGO Ghana
1095-PF-TZ-ISHENGOMA Tanzania
1096-PF-GH-GHANSAH Ghana
1097-PF-ML-MAIGA Mali
1098-PF-ET-GOLASSA Ethiopia
1100-PF-CI-YAVO Côte d'Ivoire
1101-PF-CD-ONYAMBOKO Democratic Republic of the Congo
1102-PF-MG-RANDRIANARIVELOJOSIA Madagascar
1103-PF-PDN-GMSN-NGWA Nigeria
1107-PF-KEN-KAMAU Kenya
1125-PF-TH-NOSTEN Thailand
1127-PF-ML-SOULEYMANE Mali
1131-PF-BJ-BERTIN Benin
1134-PF-ML-CONWAY Mali
1135-PF-SN-CONWAY Senegal
1136-PF-GM-NGWA Gambia
1137-PF-GM-DALESSANDRO Gambia
1138-PF-CD-FANELLO Democratic Republic of the Congo
1141-PF-GM-CLAESSENS Gambia
1145-PF-PE-GAMBOA Peru
1146-PF-MULTI-PRICE Indonesia
1147-PF-MR-CONWAY Mauritania
1148-PF-BD-MAUDE Bangladesh
1151-PF-GH-AMENGA-ETEGO Ghana
1172-PF-KH-FAIRHURST-SM Cambodia
1179-PF-KH-TME-VONSEIDLEIN Cambodia
1180-PF-TRAC2-DONDORP Thailand, Bangladesh, Vietnam, Cambodia, Myanmar, Democratic Republic of the Congo, Laos, India
1181-PF-VN-THUYNHIEN Vietnam
1195-PF-TRAC2-DONDORP Thailand, Cambodia, Vietnam, Myanmar
1198-PF-METF-NOSTEN Myanmar, Thailand
1207-PF-KH-CNM-GENRE Cambodia
1208-PF-LA-CMPE-GENRE Laos
1209-PF-VN-IMPEQN-GENRE Vietnam
1210-PF-TH-MAUDE Thailand
1238-PF-VN-NIMPE-GENRE Vietnam

If, instead of a table, we want to look at all of the partner studies in this dataset and how many samples each contributed, we can plot them here, with each color representing a different study:

plot_samples_per_country_and_study_histogram(pf6plus_all)

You can filter out samples that fail QC using the IncludeInAnalysis column, which is set to True for all high-quality samples. Using this subset gives you higher confidence in your analysis results, because these samples have a good number of reads, are primarily Plasmodium falciparum infections (e.g. reduced risk of cross-species read mapping), and are unique in the dataset (e.g. when technical replicates or time series exist).

pf6plus = pf6plus_all[pf6plus_all['IncludeInAnalysis']]
pf6plus_all['IncludeInAnalysis'].value_counts()
True     13596
False     3124
Name: IncludeInAnalysis, dtype: int64

We can see that 13,596 of the 16,720 samples are flagged as high quality by the IncludeInAnalysis filter. From here onward, subsequent analysis uses only these high-quality samples (pf6plus), unless stated otherwise.
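To see where the QC failures concentrate, the pass and fail counts can be broken down by another column. The sketch below is one possible way to do this with `pd.crosstab`; the function name `qc_summary` and the output column labels are hypothetical, and it assumes `IncludeInAnalysis` is a boolean column, as the filtering above implies.

```python
import pandas as pd


def qc_summary(df, by='Process'):
    """Count passing and failing samples, and the pass fraction, per group."""
    table = pd.crosstab(df[by], df['IncludeInAnalysis'])
    # Make sure both columns exist even if one value is absent from the data
    table = table.reindex(columns=[False, True], fill_value=0)
    table.columns = ['Fail', 'Pass']
    table['PassFraction'] = table['Pass'] / (table['Pass'] + table['Fail'])
    return table
```

For example, `qc_summary(pf6plus_all, by='Process')` or `qc_summary(pf6plus_all, by='Study')` should be run on the unfiltered table, since the filtered one contains only passing samples.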

Samples are also grouped into analysis populations (the Population column), each consisting of parasites with a high degree of genetic similarity. There are 8 populations, corresponding to geographic regions: Central Africa (CAF), East Africa (EAF), Eastern S.E. Asia (ESEA), Oceania (OCE), South America (SAM), South Asia (SAS), West Africa (WAF), and Western S.E. Asia (WSEA). The countries that belong to each population are shown below.

pf6plus.groupby(['Population', 'Country']).size().to_frame('Number of samples')
Number of samples
Population Country
CAF Democratic Republic of the Congo 464
EAF Ethiopia 21
Kenya 110
Madagascar 24
Malawi 254
Mozambique 1
Tanzania 316
Uganda 13
ESEA Cambodia 1875
Laos 1518
Thailand 123
Vietnam 2302
OCE Indonesia 80
Papua New Guinea 121
SAM Colombia 16
Peru 21
SAS Bangladesh 1770
India 280
WAF Benin 36
Burkina Faso 56
Cameroon 235
Côte d'Ivoire 70
Gambia 219
Ghana 851
Guinea 149
Mali 426
Mauritania 76
Nigeria 29
Senegal 84
WSEA Myanmar 1160
Thailand 896

You can also see which Process (technology) each sample was generated with. The dataset contains genotyping data from the following technologies (the counts below are computed on pf6plus_all, so they include samples that fail QC):

  • whole-genome sequencing

  • amplicon sequencing

  • Agena genotyping

pf6plus_all.groupby('Process').size().to_frame('Number of samples')
Number of samples
Process
Agena 6150
AmpSeq-V1 1041
AmpSeq-V2 2432
WGS 7097

2. Barcode SNPs

This group contains genotypes at 101 SNPs with medium/high global frequency.

barcode_ls = [col for col in pf6plus.columns if col.startswith('Pf3D7_')]
print(len(barcode_ls))
pf6plus[barcode_ls].head()
101
Pf3D7_02_v3:376222 Pf3D7_02_v3:470013 Pf3D7_03_v3:656861 Pf3D7_04_v3:110442 Pf3D7_04_v3:881571 Pf3D7_05_v3:350933 Pf3D7_05_v3:369740 Pf3D7_06_v3:900278 Pf3D7_07_v3:1044052 Pf3D7_08_v3:1314831 Pf3D7_08_v3:413067 Pf3D7_09_v3:900277 Pf3D7_11_v3:1018899 Pf3D7_11_v3:1815412 Pf3D7_13_v3:1056452 Pf3D7_13_v3:1466422 Pf3D7_14_v3:137622 Pf3D7_14_v3:2164225 Pf3D7_01_v3:145515 Pf3D7_03_v3:548178 Pf3D7_04_v3:1102392 Pf3D7_04_v3:139051 Pf3D7_04_v3:286542 Pf3D7_04_v3:529500 Pf3D7_05_v3:796714 Pf3D7_07_v3:1256331 Pf3D7_07_v3:461139 Pf3D7_07_v3:619957 Pf3D7_08_v3:417335 Pf3D7_09_v3:163977 Pf3D7_10_v3:317581 Pf3D7_10_v3:336274 Pf3D7_11_v3:1020397 Pf3D7_11_v3:1294107 Pf3D7_11_v3:1935227 Pf3D7_11_v3:477922 Pf3D7_12_v3:1663492 Pf3D7_12_v3:2171901 Pf3D7_13_v3:1233218 Pf3D7_13_v3:1867630 ... Pf3D7_11_v3:1006911 Pf3D7_11_v3:1295068 Pf3D7_11_v3:1802201 Pf3D7_12_v3:1667593 Pf3D7_12_v3:1934745 Pf3D7_12_v3:858501 Pf3D7_13_v3:1419519 Pf3D7_13_v3:159086 Pf3D7_13_v3:2161975 Pf3D7_13_v3:2573828 Pf3D7_13_v3:388365 Pf3D7_14_v3:2625887 Pf3D7_14_v3:3126219 Pf3D7_14_v3:438592 Pf3D7_01_v3:179347 Pf3D7_01_v3:180554 Pf3D7_01_v3:283144 Pf3D7_01_v3:535211 Pf3D7_02_v3:839620 Pf3D7_04_v3:426436 Pf3D7_04_v3:531138 Pf3D7_04_v3:891732 Pf3D7_05_v3:172801 Pf3D7_06_v3:574938 Pf3D7_07_v3:1308383 Pf3D7_07_v3:1358910 Pf3D7_07_v3:1359218 Pf3D7_07_v3:635985 Pf3D7_08_v3:1056829 Pf3D7_08_v3:150033 Pf3D7_08_v3:399774 Pf3D7_09_v3:1379145 Pf3D7_10_v3:1386850 Pf3D7_11_v3:1935031 Pf3D7_11_v3:408668 Pf3D7_11_v3:828596 Pf3D7_12_v3:857245 Pf3D7_14_v3:107014 Pf3D7_14_v3:1757603 Pf3D7_14_v3:2733656
SampleId
FP0008-C G G T T G G T G N N N A T G C G T A N C A T G A G C N C C N T A T C N T G A T G ... A G G C G C N A T C C N C A G G G N C A T A G N C A T C A T C G N T T C A A A T
FP0009-C G A T C A G T G T G A G C C T G T G T C A T G A A T A G C T A A T T A T G A T C ... A A G T A C T A A A C G T A G A G C T A T A A C T G T C A T C G C T A C A A A C
FP0015-C G A T C G G T G C N G A N G T G T A T C A G G G N T G G C T A G N C A T G A T G ... A A G C G N C G T A C C N A G N G T C A T A G A C A A T N T N A T T T C A A N N
FP0016-C A G T C G G T G C G N A T G T G C A N C A G G A A T G G C T A N C C A T G A C G ... A A G C G C N G T N C C C A G G X T C A G A G N T A A T A T C G T T T C A A G C
FP0017-C G G T T A A T G C A G A N N T G T G T C A T G A N T G N C C A G T C A C N T C G ... A A G T A A N A A A C N N A G G G T T A G A A A N N A C A N C A C T T C A A A N

5 rows × 101 columns

Let’s look at the allele counts for one of these SNPs:

pf6plus[barcode_ls[0]].value_counts()
A    6960
G    4741
N    1026
X     869
Name: Pf3D7_02_v3:376222, dtype: int64
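The counts above can be turned into allele frequencies. The sketch below assumes that `N` and `X` are not single-allele calls and should be excluded from the denominator; that interpretation is an assumption and should be checked against the Pf6+ documentation. The function name `allele_frequencies` is hypothetical.

```python
from collections import Counter


def allele_frequencies(calls, non_calls=('N', 'X')):
    """Return allele -> frequency among single-allele calls.

    Assumption: values listed in non_calls are excluded from the total.
    Non-string values (such as missing data) are ignored.
    """
    counts = Counter(c for c in calls
                     if isinstance(c, str) and c not in non_calls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in counts.items()}
```

It accepts any iterable of calls, so `allele_frequencies(pf6plus[barcode_ls[0]])` works directly on a column of the table.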

If you want to explore all the SNPs together for each of the samples, you can use the GenBarcode field:

pf6plus['GenBarcode'].head()
SampleId
FP0008-C    GGTTGGTGNNNATGCGTANCATGAGCNCCNTATCNTGATGCANTTANANGAAGNNAGNGNNAGGCGCNATCCNCAGGGNCATAGNCATCATCGNTT...
FP0009-C    GATCAGTGTGAGCCTGTGTCATGAATAGCTAATTATGATCATCTTAAAAAAAGATATGGATAAGTACTAAACGTAGAGCTATAACTGTCATCGCTA...
FP0015-C    GATCGGTGCNGANGTGTATCAGGGNTGGCTAGNCATGATGATTTTATGNAAAGANANACATAAGCGNCGTACCNAGNGTCATAGACAATNTNATTT...
FP0016-C    AGTCGGTGCGNATGTGCANCAGGAATGGCTANCCATGACGATTTTANGCAAAAATAXGGATAAGCGCNGTNCCCAGGXTCAGAGNTAATATCGTTT...
FP0017-C    GGTTAATGCAGANNTGTGTCATGANTGNCCAGTCACNTCGCATTTAAGAAANGNTAGGGATAAGTAANAAACNNAGGGTTAGAAANNACANCACTT...
Name: GenBarcode, dtype: object
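One common use of such barcodes is to compare samples pairwise. The sketch below computes the fraction of differing positions between two barcode strings, skipping positions where either sample has `N` or `X`. Treating those two symbols as non-comparable is an assumption, and `barcode_distance` is a hypothetical name, not a function from the Pf6+ code.

```python
def barcode_distance(a, b, non_calls=frozenset('NX')):
    """Fraction of differing positions among positions called in both barcodes.

    Returns NaN when no position can be compared.
    """
    if len(a) != len(b):
        raise ValueError('barcodes must have the same length')
    compared = 0
    differing = 0
    for x, y in zip(a, b):
        # Skip positions without a single-allele call in either sample
        if x in non_calls or y in non_calls:
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else float('nan')
```

For example, `barcode_distance(pf6plus['GenBarcode'].iloc[0], pf6plus['GenBarcode'].iloc[1])` compares the first two samples shown above.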

3. Drug resistance SNPs and haplotypes

This group contains information on 36 SNPs from the 8 genes listed below, all associated with or relevant for antimalarial resistance:

dr_snp_ls = ['PfCRT:72','PfCRT:74','PfCRT:75','PfCRT:76','PfCRT:93','PfCRT:97','PfCRT:218','PfCRT:220',
          'PfCRT:271', 'PfCRT:326','PfCRT:333', 'PfCRT:353', 'PfCRT:356', 'PfCRT:371',
          'PfDHFR:16','PfDHFR:51','PfDHFR:59','PfDHFR:108','PfDHFR:164','PfDHFR:306',
          'PfDHPS:436','PfDHPS:437','PfDHPS:540','PfDHPS:581','PfDHPS:613',
          'PfEXO:415',
          'PfMDR1:86','PfMDR1:184','PfMDR1:1034','PfMDR1:1042','PfMDR1:1226','PfMDR1:1246',
          'PfARPS10:127','PfARPS10:128','PfFD:193','PfMDR2:484']

pf6plus[dr_snp_ls].head()
PfCRT:72 PfCRT:74 PfCRT:75 PfCRT:76 PfCRT:93 PfCRT:97 PfCRT:218 PfCRT:220 PfCRT:271 PfCRT:326 PfCRT:333 PfCRT:353 PfCRT:356 PfCRT:371 PfDHFR:16 PfDHFR:51 PfDHFR:59 PfDHFR:108 PfDHFR:164 PfDHFR:306 PfDHPS:436 PfDHPS:437 PfDHPS:540 PfDHPS:581 PfDHPS:613 PfEXO:415 PfMDR1:86 PfMDR1:184 PfMDR1:1034 PfMDR1:1042 PfMDR1:1226 PfMDR1:1246 PfARPS10:127 PfARPS10:128 PfFD:193 PfMDR2:484
SampleId
FP0008-C - - - [T/K] T H I [S/A] E N T G [T/I] [I/R] A N C [S/N] I S [A/S] [A/G] K A A E N [F/Y] S N F D V D D T
FP0009-C - - - T T H I S E N T G T I A I R N I S A A K A A E Y F S N F Y V D D T
FP0015-C - - - T T H I S E N T G I I A I R N I S [A/S] [A/G] K A A E N [F/Y] S N F D V D D T
FP0016-C - - - [T/K] T H I S E N T G I [I/R] A N C S I S S G K A A E N F S N F D V D D T
FP0017-C C M N K T H I A Q N T G I R A N C S I S [A/F] A K A [A/S] E N Y S N F D V D D T

It also contains information on common haplotypes from 8 genes associated with or relevant for antimalarial resistance:

dr_hap_ls = ['PfCRT','Kelch','PfDHFR','PfEXO','PGB','Plasmepsin2/3','PfDHPS','PfMDR1']
pf6plus[dr_hap_ls].head()
PfCRT Kelch PfDHFR PfEXO PGB Plasmepsin2/3 PfDHPS PfMDR1
SampleId
FP0008-C ----- WT ANC[S/N]IS E VDN[T/I]DT WT -[A/S][A/G]KAA N[F/Y]D
FP0009-C ----- WT AIRNIS E VDNTDT WT -AAKAA YFY
FP0015-C ----- WT AIRNIS E VDNIDT WT -[A/S][A/G]KAA N[F/Y]D
FP0016-C ----- WT ANCSIS E VDNIDT WT -SGKAA NFD
FP0017-C CVMNK WT ANCSIS E VDNIDT WT -[A/F]AKA[A/S] NYD
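Some haplotype strings above contain bracketed entries such as `[S/N]`. A minimal sketch for splitting such a string into one entry per position is shown below, assuming that a bracketed group stands for a single position with more than one observed amino acid. That reading is an assumption, and the function names are hypothetical. It only makes sense for position-wise haplotypes such as PfDHFR or PfDHPS, not for labels such as `WT`.

```python
import re

# Either a bracketed group like [S/N] or a single character
_POSITION = re.compile(r'\[([^\]]+)\]|(.)')


def split_haplotype(haplotype):
    """Split a haplotype string into a list with one entry per position."""
    return [m.group(1) or m.group(2) for m in _POSITION.finditer(haplotype)]


def has_mixed_call(haplotype):
    """True if any position lists more than one amino acid."""
    return any('/' in position for position in split_haplotype(haplotype))
```

With this, `pf6plus['PfDHFR'].map(has_mixed_call)` would flag samples whose PfDHFR haplotype contains at least one bracketed position.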

4. Drug resistance classification (aka phenotypes)

Each sample is classified as Resistant, Sensitive, or Undetermined for each of eight major antimalarial drugs or drug combinations, based on well-recognised markers of resistance. Details of the methods can be found in the Pf6 and GenRe-Mekong publications.

drugs_ls = ['Artemisinin','Piperaquine','DHA-PPQ','Chloroquine','Pyrimethamine','Sulfadoxine','S-P','S-P-IPTp']
pf6plus[drugs_ls].head()
Artemisinin Piperaquine DHA-PPQ Chloroquine Pyrimethamine Sulfadoxine S-P S-P-IPTp
SampleId
FP0008-C Sensitive Sensitive Sensitive Undetermined Undetermined Undetermined Sensitive Sensitive
FP0009-C Sensitive Sensitive Sensitive Resistant Resistant Sensitive Resistant Sensitive
FP0015-C Sensitive Sensitive Sensitive Resistant Resistant Undetermined Resistant Sensitive
FP0016-C Sensitive Sensitive Sensitive Undetermined Sensitive Resistant Sensitive Sensitive
FP0017-C Sensitive Sensitive Sensitive Sensitive Sensitive Sensitive Sensitive Sensitive
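These classifications can be summarised per population. The sketch below computes the fraction of Resistant samples among those with a determined classification, which means Undetermined samples are dropped from the denominator. That choice is an assumption and may not match the method used in the publications. The function name `resistance_frequency` is hypothetical.

```python
import pandas as pd


def resistance_frequency(df, drug, by='Population'):
    """Fraction of Resistant samples per group, ignoring Undetermined ones."""
    determined = df[df[drug].isin(['Resistant', 'Sensitive'])]
    is_resistant = determined[drug] == 'Resistant'
    # The mean of a boolean Series is the fraction of True values
    return is_resistant.groupby(determined[by]).mean()
```

For example, `resistance_frequency(pf6plus, 'Chloroquine')` returns one value per analysis population. Given the heterogeneity of the dataset noted earlier, such aggregated numbers should be interpreted with care.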

Geographical distribution of samples in Pf6+

The maps below show the geographical distribution of the samples, and how combining the Pf6 and GenRe-Mekong data releases provides greater global coverage.

Click on a pin to find out more about the data collected at each site.

Samples in the Pf6 data release

map_samples(pf6plus, 'Pf6')

Samples in the GenRe-Mekong data release

map_samples(pf6plus, 'GenRe')

Samples in Pf6+ (union of the above)

map_samples(pf6plus, 'Pf6+')

Temporal distribution of samples in Pf6+

As for the geographical coverage above, combining the two resources allows for a wider and more complete temporal spread.

plot_temporal_samples_histogram(pf6plus)