Explore the Pf6+ dataset¶
In this notebook we are going to:
Describe how to set up your environment to run these notebooks for analysis and exploration
Show how to directly access the data without the need to download them
Explore some of the data and metadata available
Showcase the richness of this dataset
Before we dig into the data, we want to recognise that this resource was possible because of enormous team efforts around the world over the last 18 years, encompassing 61 studies in 30 different countries. With over 16,500 samples available, we hope to continue disseminating these unique resources and increase their accessibility, so that they ultimately translate into improvements for public health.
If you use this resource please remember to also cite the following papers:
A massive thank you to all the partners and contributors involved:
Pf6 | GenRe-Mekong |
---|---|
Ambroise Ahouidi, Mozam Ali, Jacob Almagro-Garcia,
Alfred Amambua-Ngwa, Chanaki Amaratunga,
Roberto Amato, Lucas Amenga-Etego,
Ben Andagalu, Tim J. C. Anderson,
Voahangy Andrianaranjaka, Tobias Apinjoh,
Cristina Ariani, Elizabeth A Ashley, Sarah Auburn,
Gordon A. Awandare, Hampate Ba, Vito Baraka,
Alyssa E. Barry, Philip Bejon, Gwladys I. Bertin,
Maciej F. Boni, Steffen Borrmann, Teun Bousema,
Oralee Branch, Peter C. Bull, George B. J. Busby,
Thanat Chookajorn, Kesinee Chotivanich,
Antoine Claessens, David Conway, Alister Craig,
Umberto D'Alessandro, Souleymane Dama,
Nicholas PJ Day, Brigitte Denis, Mahamadou Diakite,
Abdoulaye Djimdé, Christiane Dolecek, Arjen M Dondorp,
Chris Drakeley, Eleanor Drury, Patrick Duffy,
Diego F. Echeverry, Thomas G. Egwang, Berhanu Erko,
Rick M. Fairhurst, Abdul Faiz, Caterina A. Fanello,
Mark M. Fukuda, Dionicia Gamboa, Anita Ghansah,
Lemu Golassa, Sonia Goncalves, William L. Hamilton,
G. L. Abby Harrison, Lee Hart, Christa Henrichs,
Tran Tinh Hien, Catherine A. Hill, Abraham Hodgson,
Christina Hubbart, Mallika Imwong, Deus S. Ishengoma,
Scott A. Jackson, Chris G. Jacob, Ben Jeffery,
Anna E. Jeffreys, Kimberly J. Johnson,
Dushyanth Jyothi, Claire Kamaliddin, Edwin Kamau,
Mihir Kekre, Krzysztof Kluczynski, Theerarat Kochakarn,
Abibatou Konaté, Dominic P. Kwiatkowski,
Myat Phone Kyaw, Pharath Lim, Chanthap Lon,
Kovana M. Loua, Oumou Maïga-Ascofaré, Cinzia Malangone,
Magnus Manske, Jutta Marfurt, Kevin Marsh,
Mayfong Mayxay, Alistair Miles, Olivo Miotto,
Victor Mobegi, Olugbenga A. Mokuolu, Jacqui Montgomery,
Ivo Mueller, Paul N. Newton, Thuy Nguyen,
Thuy-Nhien Nguyen, Harald Noed, François Nosten,
Rintis Noviyanti, Alexis Nzila,
Lynette I. Ochola-Oyier, Harold Ocholla,
Abraham Oduro, Irene Omedo, Marie A. Onyamboko,
Jean-Bosco Ouedraogo, Kolapo Oyebola,
Richard D. Pearson, Norbert Peshu,
Aung Pyae Phyo, Chris V. Plowe, Ric N. Price,
Sasithon Pukrittayakamee,
Milijaona Randrianarivelojosia,
Julian C. Rayner, Pascal Ringwald, Kirk A. Rockett,
Katherine Rowlands, Lastenia Ruiz, David Saunders,
Alex Shayo, Peter Siba, Victoria J. Simpson,
Jim Stalker, Xin-zhuan Su, Colin Sutherland,
Shannon Takala-Harrison, Livingstone Tavu,
Vandana Thathy, Antoinette Tshefu, Federica Verra,
Joseph Vinetz, Thomas E. Wellems, Jason Wendler,
Nicholas J. White, Ian Wright, William Yavo, Htut Ye
|
Christopher G Jacob, Nguyen Thuy-Nhien,
Mayfong Mayxay, Richard J Maude,
Huynh Hong Quang, Bouasy Hongvanthong,
Viengxay Vanisaveth, Thang Ngo Duc,
Huy Rekol, Rob van der Pluijm,
Lorenz von Seidlein, Rick Fairhurst,
François Nosten, Md Amir Hossain,
Naomi Park, Scott Goodwin,
Pascal Ringwald,
Keobouphaphone Chindavongsa,
Paul Newton, Elizabeth Ashley,
Sonexay Phalivong, Rapeephan Maude,
Rithea Leang, Cheah Huch,
Le Thanh Dong, Kim-Tuyen Nguyen,
Tran Minh Nhat, Tran Tinh Hien,
Hoa Nguyen, Nicole Zdrojewski,
Sara Canavati, Abdullah Abu Sayeed,
Didar Uddin, Caroline Buckee,
Caterina I Fanello, Marie Onyamboko,
Thomas Peto, Rupam Tripura,
Chanaki Amaratunga, Aung Myint Thu,
Gilles Delmas, Jordi Landier,
Daniel M Parker, Nguyen Hoang Chau,
Dysoley Lek, Seila Suon,
James Callery, Podjanee Jittamala,
Borimas Hanboonkunupakarn,
Sasithon Pukrittayakamee,
Aung Pyae Phyo, Frank Smithuis,
Khin Lin, Myo Thant,
Tin Maung Hlaing, Parthasarathi Satpathi,
Sanghamitra Satpathi, Prativa K Behera,
Amar Tripura, Subrata Baidya,
Neena Valecha, Anupkumar R Anvikar,
Akhter Ul Islam, Abul Faiz,
Chanon Kunasol, Eleanor Drury,
Mihir Kekre, Mozam Ali,
Katie Love, Shavanthi Rajatileka,
Anna E Jeffreys, Kate Rowlands,
Christina S Hubbart, Mehul Dhorda,
Ranitha Vongpromek, Namfon Kotanan,
Phrutsamon Wongnak, Jacob Almagro Garcia,
Richard D Pearson, Cristina V Ariani,
Thanat Chookajorn, Cinzia Malangone,
T Nguyen, Jim Stalker,
Ben Jeffery, Jonathan Keatley,
Kimberly J Johnson, Dawn Muddyman,
Xin Hui S Chan, John Sillitoe,
Roberto Amato, Victoria Simpson,
Sonia Gonçalves, Kirk Rockett,
Nicholas P Day, Arjen M Dondorp,
Dominic P Kwiatkowski, Olivo Miotto
|
Pf6+ Resource Exploration¶
Setting up your environment and notebooks¶
There are two ways to access the Pf6+ data resource and analysis notebooks, either:
Run the notebooks in a Google Colaboratory (Colab) environment. All you need is a Google account.
Run the notebooks on a local Jupyter instance on your computer.
If you are running this on Google Colab¶
If you are running the notebooks on Colab, then run the cell below to clone the Pf6+ repo:
!git clone https://github.com/malariagen/Pf6plus.git
!cp -r /content/Pf6plus/pf6plus_documentation/notebooks/data_analysis .
If you are running this locally¶
If you prefer to run the notebooks locally, some additional steps are required. If you haven’t already, please follow the local setup instructions in the Pf6+ repo on GitHub.
Get started¶
To generate the plots shown in these notebooks, first import the functions from the data_analysis directory, which contains all of the code you will need:
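# pandas is used below as `pd`; import it explicitly rather than relying on the star imports
import pandas as pd
from data_analysis.map_samples import *
from data_analysis.plot_sample_info import *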
# run to ensure plots show in static page
import bokeh.io
bokeh.io.output_notebook()
Load the dataset¶
The Pf6+ data resource builds on the latest Pf6 and GenRe-Mekong data resources, integrating roughly 7,000 whole-genome sequenced samples with roughly 10,000 genotyped samples.
Pf6+ is a highly heterogeneous data resource that contains samples from multiple independent studies and technologies (whole-genome sequencing, amplicon sequencing, and Agena genotyping): care needs to be taken when interpreting the aggregated analysis contained in these notebooks and any further work you may do.
Import the Pf6+ data¶
The command below reads the data directly from the online repository, without needing to download it first. However, you can still decide to download the dataset locally from https://pf6plus.cog.sanger.ac.uk/pf6plus_metadata.tsv and modify the line below to point to the correct location.
# URL of the dataset; change accordingly if you prefer to work from a local, offline copy
pf6plus_fn = 'https://pf6plus.cog.sanger.ac.uk/pf6plus_metadata.tsv'
# Read the dataset
pf6plus_all = pd.read_csv(pf6plus_fn, sep='\t', index_col=0, low_memory=False)
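If you switch between an online session and an offline copy, the choice of source can be made automatically. The sketch below is only an illustration: the helper name `metadata_source` and the default local file name are assumptions, not part of the Pf6+ code. It returns the local path when the file exists and the remote URL otherwise.

```python
from pathlib import Path

# URL taken from the notebook; the local file name is a hypothetical default.
PF6PLUS_URL = "https://pf6plus.cog.sanger.ac.uk/pf6plus_metadata.tsv"


def metadata_source(local_copy="pf6plus_metadata.tsv", url=PF6PLUS_URL):
    """Return the local file path if it exists, otherwise the remote URL."""
    path = Path(local_copy)
    return str(path) if path.is_file() else url


pf6plus_fn = metadata_source()
# The result can then be passed to the same read call used above:
# pf6plus_all = pd.read_csv(pf6plus_fn, sep='\t', index_col=0, low_memory=False)
```

Because `pd.read_csv` accepts both local paths and URLs, the rest of the notebook does not need to change.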
Which information is available to explore?¶
print(pf6plus_all.shape)
pf6plus_all.head()
(16720, 166)
Study | Year | Country | AdmDiv1 | Population | Process | IncludeInAnalysis | Latitude_country | Longitude_country | Latitude_adm1 | Longitude_adm1 | PfCRT | Kelch | PfDHFR | PfEXO | PGB | Plasmepsin2/3 | PfDHPS | PfMDR1 | Species | Pf3D7_02_v3:376222 | Pf3D7_02_v3:470013 | Pf3D7_03_v3:656861 | Pf3D7_04_v3:110442 | Pf3D7_04_v3:881571 | Pf3D7_05_v3:350933 | Pf3D7_05_v3:369740 | Pf3D7_06_v3:900278 | Pf3D7_07_v3:1044052 | Pf3D7_08_v3:1314831 | Pf3D7_08_v3:413067 | Pf3D7_09_v3:900277 | Pf3D7_11_v3:1018899 | Pf3D7_11_v3:1815412 | Pf3D7_13_v3:1056452 | Pf3D7_13_v3:1466422 | Pf3D7_14_v3:137622 | Pf3D7_14_v3:2164225 | Pf3D7_01_v3:145515 | Pf3D7_03_v3:548178 | ... | Pyrimethamine | Sulfadoxine | S-P | S-P-IPTp | PfCRT:72 | PfCRT:74 | PfCRT:75 | PfCRT:76 | PfCRT:93 | PfCRT:97 | PfCRT:218 | PfCRT:220 | PfCRT:271 | PfCRT:333 | PfCRT:353 | PfCRT:371 | PfDHFR:16 | PfDHFR:51 | PfDHFR:59 | PfDHFR:108 | PfDHFR:164 | PfDHFR:306 | PfDHPS:436 | PfDHPS:437 | PfDHPS:540 | PfDHPS:581 | PfDHPS:613 | PfEXO:415 | PfMDR1:86 | PfMDR1:184 | PfMDR1:1034 | PfMDR1:1042 | PfMDR1:1226 | PfMDR1:1246 | PfARPS10:127 | PfARPS10:128 | PfFD:193 | PfCRT:326 | PfCRT:356 | PfMDR2:484 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SampleId | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
FP0008-C | 1147-PF-MR-CONWAY | 2014 | Mauritania | Hodh el Gharbi | WAF | WGS | True | 20.265149 | -10.337093 | 16.565426 | -9.832345 | ----- | WT | ANC[S/N]IS | E | VDN[T/I]DT | WT | -[A/S][A/G]KAA | N[F/Y]D | Pf | G | G | T | T | G | G | T | G | N | N | N | A | T | G | C | G | T | A | N | C | ... | Undetermined | Undetermined | Sensitive | Sensitive | - | - | - | [T/K] | T | H | I | [S/A] | E | T | G | [I/R] | A | N | C | [S/N] | I | S | [A/S] | [A/G] | K | A | A | E | N | [F/Y] | S | N | F | D | V | D | D | N | [T/I] | T |
FP0009-C | 1147-PF-MR-CONWAY | 2014 | Mauritania | Hodh el Gharbi | WAF | WGS | True | 20.265149 | -10.337093 | 16.565426 | -9.832345 | ----- | WT | AIRNIS | E | VDNTDT | WT | -AAKAA | YFY | Pf | G | A | T | C | A | G | T | G | T | G | A | G | C | C | T | G | T | G | T | C | ... | Resistant | Sensitive | Resistant | Sensitive | - | - | - | T | T | H | I | S | E | T | G | I | A | I | R | N | I | S | A | A | K | A | A | E | Y | F | S | N | F | Y | V | D | D | N | T | T |
FP0015-C | 1147-PF-MR-CONWAY | 2014 | Mauritania | Hodh el Gharbi | WAF | WGS | True | 20.265149 | -10.337093 | 16.565426 | -9.832345 | ----- | WT | AIRNIS | E | VDNIDT | WT | -[A/S][A/G]KAA | N[F/Y]D | Pf | G | A | T | C | G | G | T | G | C | N | G | A | N | G | T | G | T | A | T | C | ... | Resistant | Undetermined | Resistant | Sensitive | - | - | - | T | T | H | I | S | E | T | G | I | A | I | R | N | I | S | [A/S] | [A/G] | K | A | A | E | N | [F/Y] | S | N | F | D | V | D | D | N | I | T |
FP0016-C | 1147-PF-MR-CONWAY | 2014 | Mauritania | Hodh el Gharbi | WAF | WGS | True | 20.265149 | -10.337093 | 16.565426 | -9.832345 | ----- | WT | ANCSIS | E | VDNIDT | WT | -SGKAA | NFD | Pf | A | G | T | C | G | G | T | G | C | G | N | A | T | G | T | G | C | A | N | C | ... | Sensitive | Resistant | Sensitive | Sensitive | - | - | - | [T/K] | T | H | I | S | E | T | G | [I/R] | A | N | C | S | I | S | S | G | K | A | A | E | N | F | S | N | F | D | V | D | D | N | I | T |
FP0017-C | 1147-PF-MR-CONWAY | 2014 | Mauritania | Hodh el Gharbi | WAF | WGS | True | 20.265149 | -10.337093 | 16.565426 | -9.832345 | CVMNK | WT | ANCSIS | E | VDNIDT | WT | -[A/F]AKA[A/S] | NYD | Pf | G | G | T | T | A | A | T | G | C | A | G | A | N | N | T | G | T | G | T | C | ... | Sensitive | Sensitive | Sensitive | Sensitive | C | M | N | K | T | H | I | A | Q | T | G | R | A | N | C | S | I | S | [A/F] | A | K | A | [A/S] | E | N | Y | S | N | F | D | V | D | D | N | I | T |
5 rows × 166 columns
The dataset has 16,720 samples (rows) and 166 columns:
print(list(pf6plus_all))
['Study', 'Year', 'Country', 'AdmDiv1', 'Population', 'Process', 'IncludeInAnalysis', 'Latitude_country', 'Longitude_country', 'Latitude_adm1', 'Longitude_adm1', 'PfCRT', 'Kelch', 'PfDHFR', 'PfEXO', 'PGB', 'Plasmepsin2/3', 'PfDHPS', 'PfMDR1', 'Species', 'Pf3D7_02_v3:376222', 'Pf3D7_02_v3:470013', 'Pf3D7_03_v3:656861', 'Pf3D7_04_v3:110442', 'Pf3D7_04_v3:881571', 'Pf3D7_05_v3:350933', 'Pf3D7_05_v3:369740', 'Pf3D7_06_v3:900278', 'Pf3D7_07_v3:1044052', 'Pf3D7_08_v3:1314831', 'Pf3D7_08_v3:413067', 'Pf3D7_09_v3:900277', 'Pf3D7_11_v3:1018899', 'Pf3D7_11_v3:1815412', 'Pf3D7_13_v3:1056452', 'Pf3D7_13_v3:1466422', 'Pf3D7_14_v3:137622', 'Pf3D7_14_v3:2164225', 'Pf3D7_01_v3:145515', 'Pf3D7_03_v3:548178', 'Pf3D7_04_v3:1102392', 'Pf3D7_04_v3:139051', 'Pf3D7_04_v3:286542', 'Pf3D7_04_v3:529500', 'Pf3D7_05_v3:796714', 'Pf3D7_07_v3:1256331', 'Pf3D7_07_v3:461139', 'Pf3D7_07_v3:619957', 'Pf3D7_08_v3:417335', 'Pf3D7_09_v3:163977', 'Pf3D7_10_v3:317581', 'Pf3D7_10_v3:336274', 'Pf3D7_11_v3:1020397', 'Pf3D7_11_v3:1294107', 'Pf3D7_11_v3:1935227', 'Pf3D7_11_v3:477922', 'Pf3D7_12_v3:1663492', 'Pf3D7_12_v3:2171901', 'Pf3D7_13_v3:1233218', 'Pf3D7_13_v3:1867630', 'Pf3D7_13_v3:2377887', 'Pf3D7_14_v3:2355751', 'Pf3D7_14_v3:3046108', 'Pf3D7_02_v3:529709', 'Pf3D7_02_v3:714480', 'Pf3D7_03_v3:155697', 'Pf3D7_04_v3:1037656', 'Pf3D7_04_v3:648101', 'Pf3D7_05_v3:1204155', 'Pf3D7_06_v3:1282691', 'Pf3D7_06_v3:1289212', 'Pf3D7_07_v3:1066698', 'Pf3D7_07_v3:1213486', 'Pf3D7_07_v3:704373', 'Pf3D7_08_v3:1313202', 'Pf3D7_08_v3:339406', 'Pf3D7_08_v3:701557', 'Pf3D7_09_v3:452690', 'Pf3D7_09_v3:599655', 'Pf3D7_10_v3:1383789', 'Pf3D7_10_v3:1385894', 'Pf3D7_11_v3:1006911', 'Pf3D7_11_v3:1295068', 'Pf3D7_11_v3:1802201', 'Pf3D7_12_v3:1667593', 'Pf3D7_12_v3:1934745', 'Pf3D7_12_v3:858501', 'Pf3D7_13_v3:1419519', 'Pf3D7_13_v3:159086', 'Pf3D7_13_v3:2161975', 'Pf3D7_13_v3:2573828', 'Pf3D7_13_v3:388365', 'Pf3D7_14_v3:2625887', 'Pf3D7_14_v3:3126219', 'Pf3D7_14_v3:438592', 'Pf3D7_01_v3:179347', 'Pf3D7_01_v3:180554', 
'Pf3D7_01_v3:283144', 'Pf3D7_01_v3:535211', 'Pf3D7_02_v3:839620', 'Pf3D7_04_v3:426436', 'Pf3D7_04_v3:531138', 'Pf3D7_04_v3:891732', 'Pf3D7_05_v3:172801', 'Pf3D7_06_v3:574938', 'Pf3D7_07_v3:1308383', 'Pf3D7_07_v3:1358910', 'Pf3D7_07_v3:1359218', 'Pf3D7_07_v3:635985', 'Pf3D7_08_v3:1056829', 'Pf3D7_08_v3:150033', 'Pf3D7_08_v3:399774', 'Pf3D7_09_v3:1379145', 'Pf3D7_10_v3:1386850', 'Pf3D7_11_v3:1935031', 'Pf3D7_11_v3:408668', 'Pf3D7_11_v3:828596', 'Pf3D7_12_v3:857245', 'Pf3D7_14_v3:107014', 'Pf3D7_14_v3:1757603', 'Pf3D7_14_v3:2733656', 'GenBarcode', 'Artemisinin', 'Piperaquine', 'DHA-PPQ', 'Chloroquine', 'Pyrimethamine', 'Sulfadoxine', 'S-P', 'S-P-IPTp', 'PfCRT:72', 'PfCRT:74', 'PfCRT:75', 'PfCRT:76', 'PfCRT:93', 'PfCRT:97', 'PfCRT:218', 'PfCRT:220', 'PfCRT:271', 'PfCRT:333', 'PfCRT:353', 'PfCRT:371', 'PfDHFR:16', 'PfDHFR:51', 'PfDHFR:59', 'PfDHFR:108', 'PfDHFR:164', 'PfDHFR:306', 'PfDHPS:436', 'PfDHPS:437', 'PfDHPS:540', 'PfDHPS:581', 'PfDHPS:613', 'PfEXO:415', 'PfMDR1:86', 'PfMDR1:184', 'PfMDR1:1034', 'PfMDR1:1042', 'PfMDR1:1226', 'PfMDR1:1246', 'PfARPS10:127', 'PfARPS10:128', 'PfFD:193', 'PfCRT:326', 'PfCRT:356', 'PfMDR2:484']
Even though there is a lot of data to sift through here, in reality the columns can be divided into four main groups:
Sample metadata
Barcode SNPs
Drug resistance SNPs and haplotypes
Resistance classification
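As a rough illustration of this grouping, the sketch below assigns a column name to one of the four groups using only the naming patterns visible in the column list above. The function name `column_group` is hypothetical, and placing GenBarcode with the barcode SNPs is an assumption made for this example.

```python
# Column names copied from the printed column list; the grouping logic is an
# illustrative assumption, not code from the Pf6+ repository.
METADATA = {
    "Study", "Year", "Country", "AdmDiv1", "Population", "Process",
    "IncludeInAnalysis", "Latitude_country", "Longitude_country",
    "Latitude_adm1", "Longitude_adm1", "Species",
}
HAPLOTYPES = {
    "PfCRT", "Kelch", "PfDHFR", "PfEXO", "PGB",
    "Plasmepsin2/3", "PfDHPS", "PfMDR1",
}
DRUGS = {
    "Artemisinin", "Piperaquine", "DHA-PPQ", "Chloroquine",
    "Pyrimethamine", "Sulfadoxine", "S-P", "S-P-IPTp",
}


def column_group(name):
    """Map a column name to one of the four groups described above."""
    if name in METADATA:
        return "Sample metadata"
    # Barcode columns also contain ':', so test for them before the SNP rule.
    if name.startswith("Pf3D7_") or name == "GenBarcode":
        return "Barcode SNPs"
    if name in HAPLOTYPES or ":" in name:
        return "Drug resistance SNPs and haplotypes"
    if name in DRUGS:
        return "Resistance classification"
    return "Unknown"
```

For example, `column_group("PfCRT:76")` falls in the drug resistance group, while `column_group("Pf3D7_02_v3:376222")` is a barcode SNP.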
Let’s explore these groups one at a time.
1. Sample metadata¶
The metadata included here provide information on each sample, including the study (MalariaGEN partner) it belongs to, year of collection, country, level 1 administrative division, population, the process or type of technology used for sequencing the samples, geographical coordinates, and species.
metadata_ls = ['Study','Year','Country','AdmDiv1','Population',
'Process','IncludeInAnalysis',
'Latitude_country','Longitude_country','Latitude_adm1','Longitude_adm1',
'Species']
pf6plus_all[metadata_ls[:5]].head()
Study | Year | Country | AdmDiv1 | Population | |
---|---|---|---|---|---|
SampleId | |||||
FP0008-C | 1147-PF-MR-CONWAY | 2014 | Mauritania | Hodh el Gharbi | WAF |
FP0009-C | 1147-PF-MR-CONWAY | 2014 | Mauritania | Hodh el Gharbi | WAF |
FP0015-C | 1147-PF-MR-CONWAY | 2014 | Mauritania | Hodh el Gharbi | WAF |
FP0016-C | 1147-PF-MR-CONWAY | 2014 | Mauritania | Hodh el Gharbi | WAF |
FP0017-C | 1147-PF-MR-CONWAY | 2014 | Mauritania | Hodh el Gharbi | WAF |
As you can see here, every sample belongs to a study. All studies are described in detail on the MalariaGEN website: explore Pf6 partner studies and GenRe partner studies. These descriptions are particularly useful for understanding each study's characteristics and epidemiological design, and for finding out who was involved.
pd.options.display.max_colwidth = 100
pd.options.display.max_rows = 65
pf6plus_all.groupby('Study')['Country'].unique().agg(lambda x: ', '.join(x)).to_frame('Countries')
Countries | |
---|---|
Study | |
1001-PF-ML-DJIMDE | Mali |
1004-PF-BF-OUEDRAOGO | Burkina Faso |
1006-PF-GM-CONWAY | Gambia |
1007-PF-TZ-DUFFY | Tanzania |
1008-PF-SEA-RINGWALD | Vietnam, Myanmar, Laos |
1010-PF-TH-ANDERSON | Thailand |
1011-PF-KH-SU | Cambodia |
1012-PF-KH-WHITE | Cambodia |
1013-PF-PEGB-BRANCH | Peru |
1014-PF-SSA-SUTHERLAND | Ghana, Mozambique, Uganda, Kenya |
1015-PF-KE-NZILA | Kenya |
1016-PF-TH-NOSTEN | Thailand |
1017-PF-GH-AMENGA-ETEGO | Ghana |
1020-PF-VN-BONI | Vietnam |
1021-PF-PG-MUELLER | Papua New Guinea |
1022-PF-MW-OCHOLLA | Malawi |
1023-PF-CO-ECHEVERRI-GARCIA | Colombia |
1024-PF-UG-BOUSEMA | Uganda |
1026-PF-GN-CONWAY | Guinea |
1027-PF-KE-BULL | Kenya |
1031-PF-SEA-PLOWE | Thailand, Cambodia, Bangladesh |
1044-PF-KH-FAIRHURST | Cambodia |
1052-PF-TRAC-WHITE | Thailand, Cambodia, Bangladesh, Vietnam, Myanmar, Laos, Democratic Republic of the Congo, Nigeria |
1062-PF-PG-BARRY | Papua New Guinea |
1083-PF-GH-CONWAY | Ghana |
1093-PF-CM-APINJOH | Cameroon |
1094-PF-GH-AMENGA-ETEGO | Ghana |
1095-PF-TZ-ISHENGOMA | Tanzania |
1096-PF-GH-GHANSAH | Ghana |
1097-PF-ML-MAIGA | Mali |
1098-PF-ET-GOLASSA | Ethiopia |
1100-PF-CI-YAVO | Côte d'Ivoire |
1101-PF-CD-ONYAMBOKO | Democratic Republic of the Congo |
1102-PF-MG-RANDRIANARIVELOJOSIA | Madagascar |
1103-PF-PDN-GMSN-NGWA | Nigeria |
1107-PF-KEN-KAMAU | Kenya |
1125-PF-TH-NOSTEN | Thailand |
1127-PF-ML-SOULEYMANE | Mali |
1131-PF-BJ-BERTIN | Benin |
1134-PF-ML-CONWAY | Mali |
1135-PF-SN-CONWAY | Senegal |
1136-PF-GM-NGWA | Gambia |
1137-PF-GM-DALESSANDRO | Gambia |
1138-PF-CD-FANELLO | Democratic Republic of the Congo |
1141-PF-GM-CLAESSENS | Gambia |
1145-PF-PE-GAMBOA | Peru |
1146-PF-MULTI-PRICE | Indonesia |
1147-PF-MR-CONWAY | Mauritania |
1148-PF-BD-MAUDE | Bangladesh |
1151-PF-GH-AMENGA-ETEGO | Ghana |
1172-PF-KH-FAIRHURST-SM | Cambodia |
1179-PF-KH-TME-VONSEIDLEIN | Cambodia |
1180-PF-TRAC2-DONDORP | Thailand, Bangladesh, Vietnam, Cambodia, Myanmar, Democratic Republic of the Congo, Laos, India |
1181-PF-VN-THUYNHIEN | Vietnam |
1195-PF-TRAC2-DONDORP | Thailand, Cambodia, Vietnam, Myanmar |
1198-PF-METF-NOSTEN | Myanmar, Thailand |
1207-PF-KH-CNM-GENRE | Cambodia |
1208-PF-LA-CMPE-GENRE | Laos |
1209-PF-VN-IMPEQN-GENRE | Vietnam |
1210-PF-TH-MAUDE | Thailand |
1238-PF-VN-NIMPE-GENRE | Vietnam |
If, instead of a table, we want to see all of the partner studies in this dataset and how many samples each contributed, we can plot them here, with each color representing a different study:
plot_samples_per_country_and_study_histogram(pf6plus_all)
You can filter out samples that fail QC using the IncludeInAnalysis column, which is set to True for all high-quality samples. Using this subset gives you higher confidence in your analysis results, because these samples have a good number of reads, are primarily Plasmodium falciparum infections (reducing the risk of cross-species read mapping), and are unique in the dataset (only one sample is kept when technical replicates or time series exist).
pf6plus = pf6plus_all[pf6plus_all['IncludeInAnalysis']]
pf6plus_all['IncludeInAnalysis'].value_counts()
True 13596
False 3124
Name: IncludeInAnalysis, dtype: int64
We can see that 13,596 of the 16,720 samples are high quality according to the IncludeInAnalysis filter. From here onward, all subsequent analyses use only these high-quality samples.
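The same filter can be summarised as a pass rate. The sketch below uses a small toy table with made-up sample IDs rather than the real dataset, and the helper name `qc_pass_rate` is hypothetical.

```python
import pandas as pd


def qc_pass_rate(samples, flag="IncludeInAnalysis"):
    """Return (number of samples passing QC, fraction passing QC)."""
    n_pass = int(samples[flag].sum())
    return n_pass, n_pass / len(samples)


# Toy frame with hypothetical sample IDs, used only to illustrate the filter.
toy = pd.DataFrame(
    {"IncludeInAnalysis": [True, True, True, False]},
    index=["S1", "S2", "S3", "S4"],
)
n_pass, fraction = qc_pass_rate(toy)

# Boolean indexing keeps only the rows where the flag is True.
toy_high_quality = toy[toy["IncludeInAnalysis"]]
```

Applied to the counts shown above, 13,596 of 16,720 samples pass, which is about 81%.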
Samples are also conveniently grouped into analysis populations (the Population column), each consisting of parasites with a high degree of genetic similarity. This clusters parasites into 8 geographic regions: Central Africa (CAF), East Africa (EAF), Eastern S.E. Asia (ESEA), Oceania (OCE), South America (SAM), South Asia (SAS), West Africa (WAF), and Western S.E. Asia (WSEA). The countries that belong to each analysis Population are shown below.
pf6plus.groupby(['Population', 'Country']).size().to_frame('Number of samples')
Population | Country | Number of samples |
---|---|---|
CAF | Democratic Republic of the Congo | 464 |
EAF | Ethiopia | 21 |
EAF | Kenya | 110 |
EAF | Madagascar | 24 |
EAF | Malawi | 254 |
EAF | Mozambique | 1 |
EAF | Tanzania | 316 |
EAF | Uganda | 13 |
ESEA | Cambodia | 1875 |
ESEA | Laos | 1518 |
ESEA | Thailand | 123 |
ESEA | Vietnam | 2302 |
OCE | Indonesia | 80 |
OCE | Papua New Guinea | 121 |
SAM | Colombia | 16 |
SAM | Peru | 21 |
SAS | Bangladesh | 1770 |
SAS | India | 280 |
WAF | Benin | 36 |
WAF | Burkina Faso | 56 |
WAF | Cameroon | 235 |
WAF | Côte d'Ivoire | 70 |
WAF | Gambia | 219 |
WAF | Ghana | 851 |
WAF | Guinea | 149 |
WAF | Mali | 426 |
WAF | Mauritania | 76 |
WAF | Nigeria | 29 |
WAF | Senegal | 84 |
WSEA | Myanmar | 1160 |
WSEA | Thailand | 896 |
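When labelling plots or tables, it can help to expand the population codes into the region names given above. This is a minimal sketch; the dictionary and helper names are assumptions made for illustration.

```python
# Codes and region names as listed in the text above.
POPULATION_NAMES = {
    "CAF": "Central Africa",
    "EAF": "East Africa",
    "ESEA": "Eastern S.E. Asia",
    "OCE": "Oceania",
    "SAM": "South America",
    "SAS": "South Asia",
    "WAF": "West Africa",
    "WSEA": "Western S.E. Asia",
}


def population_label(code):
    """Return a readable label such as 'West Africa (WAF)' for a code."""
    return f"{POPULATION_NAMES.get(code, 'Unknown')} ({code})"
```

With pandas, the same dictionary can be applied to the whole column, for example `pf6plus['Population'].map(POPULATION_NAMES)`.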
You can also see which Process (or technology) each sample was generated with, as this dataset consists of genotyping data from the following:
whole-genome sequencing
amplicon sequencing
Agena genotyping
pf6plus_all.groupby('Process').size().to_frame('Number of samples')
Number of samples | |
---|---|
Process | |
Agena | 6150 |
AmpSeq-V1 | 1041 |
AmpSeq-V2 | 2432 |
WGS | 7097 |
2. Barcode SNPs¶
This group includes genotypes for 101 SNPs with medium to high global frequency.
barcode_ls = [col for col in pf6plus.columns if col.startswith('Pf3D7_')]
print(len(barcode_ls))
pf6plus[barcode_ls].head()
101
Pf3D7_02_v3:376222 | Pf3D7_02_v3:470013 | Pf3D7_03_v3:656861 | Pf3D7_04_v3:110442 | Pf3D7_04_v3:881571 | Pf3D7_05_v3:350933 | Pf3D7_05_v3:369740 | Pf3D7_06_v3:900278 | Pf3D7_07_v3:1044052 | Pf3D7_08_v3:1314831 | Pf3D7_08_v3:413067 | Pf3D7_09_v3:900277 | Pf3D7_11_v3:1018899 | Pf3D7_11_v3:1815412 | Pf3D7_13_v3:1056452 | Pf3D7_13_v3:1466422 | Pf3D7_14_v3:137622 | Pf3D7_14_v3:2164225 | Pf3D7_01_v3:145515 | Pf3D7_03_v3:548178 | Pf3D7_04_v3:1102392 | Pf3D7_04_v3:139051 | Pf3D7_04_v3:286542 | Pf3D7_04_v3:529500 | Pf3D7_05_v3:796714 | Pf3D7_07_v3:1256331 | Pf3D7_07_v3:461139 | Pf3D7_07_v3:619957 | Pf3D7_08_v3:417335 | Pf3D7_09_v3:163977 | Pf3D7_10_v3:317581 | Pf3D7_10_v3:336274 | Pf3D7_11_v3:1020397 | Pf3D7_11_v3:1294107 | Pf3D7_11_v3:1935227 | Pf3D7_11_v3:477922 | Pf3D7_12_v3:1663492 | Pf3D7_12_v3:2171901 | Pf3D7_13_v3:1233218 | Pf3D7_13_v3:1867630 | ... | Pf3D7_11_v3:1006911 | Pf3D7_11_v3:1295068 | Pf3D7_11_v3:1802201 | Pf3D7_12_v3:1667593 | Pf3D7_12_v3:1934745 | Pf3D7_12_v3:858501 | Pf3D7_13_v3:1419519 | Pf3D7_13_v3:159086 | Pf3D7_13_v3:2161975 | Pf3D7_13_v3:2573828 | Pf3D7_13_v3:388365 | Pf3D7_14_v3:2625887 | Pf3D7_14_v3:3126219 | Pf3D7_14_v3:438592 | Pf3D7_01_v3:179347 | Pf3D7_01_v3:180554 | Pf3D7_01_v3:283144 | Pf3D7_01_v3:535211 | Pf3D7_02_v3:839620 | Pf3D7_04_v3:426436 | Pf3D7_04_v3:531138 | Pf3D7_04_v3:891732 | Pf3D7_05_v3:172801 | Pf3D7_06_v3:574938 | Pf3D7_07_v3:1308383 | Pf3D7_07_v3:1358910 | Pf3D7_07_v3:1359218 | Pf3D7_07_v3:635985 | Pf3D7_08_v3:1056829 | Pf3D7_08_v3:150033 | Pf3D7_08_v3:399774 | Pf3D7_09_v3:1379145 | Pf3D7_10_v3:1386850 | Pf3D7_11_v3:1935031 | Pf3D7_11_v3:408668 | Pf3D7_11_v3:828596 | Pf3D7_12_v3:857245 | Pf3D7_14_v3:107014 | Pf3D7_14_v3:1757603 | Pf3D7_14_v3:2733656 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SampleId | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
FP0008-C | G | G | T | T | G | G | T | G | N | N | N | A | T | G | C | G | T | A | N | C | A | T | G | A | G | C | N | C | C | N | T | A | T | C | N | T | G | A | T | G | ... | A | G | G | C | G | C | N | A | T | C | C | N | C | A | G | G | G | N | C | A | T | A | G | N | C | A | T | C | A | T | C | G | N | T | T | C | A | A | A | T |
FP0009-C | G | A | T | C | A | G | T | G | T | G | A | G | C | C | T | G | T | G | T | C | A | T | G | A | A | T | A | G | C | T | A | A | T | T | A | T | G | A | T | C | ... | A | A | G | T | A | C | T | A | A | A | C | G | T | A | G | A | G | C | T | A | T | A | A | C | T | G | T | C | A | T | C | G | C | T | A | C | A | A | A | C |
FP0015-C | G | A | T | C | G | G | T | G | C | N | G | A | N | G | T | G | T | A | T | C | A | G | G | G | N | T | G | G | C | T | A | G | N | C | A | T | G | A | T | G | ... | A | A | G | C | G | N | C | G | T | A | C | C | N | A | G | N | G | T | C | A | T | A | G | A | C | A | A | T | N | T | N | A | T | T | T | C | A | A | N | N |
FP0016-C | A | G | T | C | G | G | T | G | C | G | N | A | T | G | T | G | C | A | N | C | A | G | G | A | A | T | G | G | C | T | A | N | C | C | A | T | G | A | C | G | ... | A | A | G | C | G | C | N | G | T | N | C | C | C | A | G | G | X | T | C | A | G | A | G | N | T | A | A | T | A | T | C | G | T | T | T | C | A | A | G | C |
FP0017-C | G | G | T | T | A | A | T | G | C | A | G | A | N | N | T | G | T | G | T | C | A | T | G | A | N | T | G | N | C | C | A | G | T | C | A | C | N | T | C | G | ... | A | A | G | T | A | A | N | A | A | A | C | N | N | A | G | G | G | T | T | A | G | A | A | A | N | N | A | C | A | N | C | A | C | T | T | C | A | A | A | N |
5 rows × 101 columns
Let’s look at the frequency of one of these SNPs:
pf6plus[barcode_ls[0]].value_counts()
A 6960
G 4741
N 1026
X 869
Name: Pf3D7_02_v3:376222, dtype: int64
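To turn these counts into allele frequencies, one option is to keep only the single-nucleotide calls. The sketch below assumes that N and X are calls that should be excluded from the denominator; the notebook does not define them here, so treat that as an assumption. The function name is hypothetical.

```python
def allele_frequencies(call_counts, alleles=("A", "C", "G", "T")):
    """Frequencies among single-nucleotide calls, ignoring other symbols."""
    clean = {call: n for call, n in call_counts.items() if call in alleles}
    total = sum(clean.values())
    if total == 0:
        raise ValueError("no single-nucleotide calls to compute frequencies from")
    return {call: n / total for call, n in clean.items()}


# Counts copied from the value_counts output above.
counts = {"A": 6960, "G": 4741, "N": 1026, "X": 869}
freqs = allele_frequencies(counts)
```

With these counts, A accounts for 6,960 of the 11,701 single-nucleotide calls, or roughly 59%.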
If you want to explore all the SNPs together for each of the samples, you can use the GenBarcode field:
pf6plus['GenBarcode'].head()
SampleId
FP0008-C GGTTGGTGNNNATGCGTANCATGAGCNCCNTATCNTGATGCANTTANANGAAGNNAGNGNNAGGCGCNATCCNCAGGGNCATAGNCATCATCGNTT...
FP0009-C GATCAGTGTGAGCCTGTGTCATGAATAGCTAATTATGATCATCTTAAAAAAAGATATGGATAAGTACTAAACGTAGAGCTATAACTGTCATCGCTA...
FP0015-C GATCGGTGCNGANGTGTATCAGGGNTGGCTAGNCATGATGATTTTATGNAAAGANANACATAAGCGNCGTACCNAGNGTCATAGACAATNTNATTT...
FP0016-C AGTCGGTGCGNATGTGCANCAGGAATGGCTANCCATGACGATTTTANGCAAAAATAXGGATAAGCGCNGTNCCCAGGXTCAGAGNTAATATCGTTT...
FP0017-C GGTTAATGCAGANNTGTGTCATGANTGNCCAGTCACNTCGCATTTAAGAAANGNTAGGGATAAGTAANAAACNNAGGGTTAGAAANNACANCACTT...
Name: GenBarcode, dtype: object
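In the rows shown above, each GenBarcode string is the concatenation of the per-SNP calls in column order, so two samples can be compared position by position. The sketch below counts mismatches only where both samples have an A, C, G or T call. Skipping N and X is an assumption of this example, and the function name is hypothetical.

```python
VALID_ALLELES = set("ACGT")


def compare_barcodes(barcode_a, barcode_b):
    """Return (mismatches, comparable_positions) for two equal-length barcodes."""
    if len(barcode_a) != len(barcode_b):
        raise ValueError("barcodes must have the same length")
    mismatches = 0
    comparable = 0
    for call_a, call_b in zip(barcode_a, barcode_b):
        # Only positions with a single-nucleotide call in both samples count.
        if call_a in VALID_ALLELES and call_b in VALID_ALLELES:
            comparable += 1
            if call_a != call_b:
                mismatches += 1
    return mismatches, comparable


# First 20 barcode positions of FP0008-C and FP0009-C, copied from the table above.
fp0008 = "GGTTGGTGNNNATGCGTANC"
fp0009 = "GATCAGTGTGAGCCTGTGTC"
mismatches, comparable = compare_barcodes(fp0008, fp0009)
```

For these two 20-position prefixes, 16 positions are comparable and 8 of them differ.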
3. Drug resistance SNPs and haplotypes¶
Information on the 36 SNPs listed below, from 8 genes associated with or relevant for antimalarial resistance:
dr_snp_ls = ['PfCRT:72','PfCRT:74','PfCRT:75','PfCRT:76','PfCRT:93','PfCRT:97','PfCRT:218','PfCRT:220',
'PfCRT:271', 'PfCRT:326','PfCRT:333', 'PfCRT:353', 'PfCRT:356', 'PfCRT:371',
'PfDHFR:16','PfDHFR:51','PfDHFR:59','PfDHFR:108','PfDHFR:164','PfDHFR:306',
'PfDHPS:436','PfDHPS:437','PfDHPS:540','PfDHPS:581','PfDHPS:613',
'PfEXO:415',
'PfMDR1:86','PfMDR1:184','PfMDR1:1034','PfMDR1:1042','PfMDR1:1226','PfMDR1:1246',
'PfARPS10:127','PfARPS10:128','PfFD:193','PfMDR2:484']
pf6plus[dr_snp_ls].head()
PfCRT:72 | PfCRT:74 | PfCRT:75 | PfCRT:76 | PfCRT:93 | PfCRT:97 | PfCRT:218 | PfCRT:220 | PfCRT:271 | PfCRT:326 | PfCRT:333 | PfCRT:353 | PfCRT:356 | PfCRT:371 | PfDHFR:16 | PfDHFR:51 | PfDHFR:59 | PfDHFR:108 | PfDHFR:164 | PfDHFR:306 | PfDHPS:436 | PfDHPS:437 | PfDHPS:540 | PfDHPS:581 | PfDHPS:613 | PfEXO:415 | PfMDR1:86 | PfMDR1:184 | PfMDR1:1034 | PfMDR1:1042 | PfMDR1:1226 | PfMDR1:1246 | PfARPS10:127 | PfARPS10:128 | PfFD:193 | PfMDR2:484 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SampleId | ||||||||||||||||||||||||||||||||||||
FP0008-C | - | - | - | [T/K] | T | H | I | [S/A] | E | N | T | G | [T/I] | [I/R] | A | N | C | [S/N] | I | S | [A/S] | [A/G] | K | A | A | E | N | [F/Y] | S | N | F | D | V | D | D | T |
FP0009-C | - | - | - | T | T | H | I | S | E | N | T | G | T | I | A | I | R | N | I | S | A | A | K | A | A | E | Y | F | S | N | F | Y | V | D | D | T |
FP0015-C | - | - | - | T | T | H | I | S | E | N | T | G | I | I | A | I | R | N | I | S | [A/S] | [A/G] | K | A | A | E | N | [F/Y] | S | N | F | D | V | D | D | T |
FP0016-C | - | - | - | [T/K] | T | H | I | S | E | N | T | G | I | [I/R] | A | N | C | S | I | S | S | G | K | A | A | E | N | F | S | N | F | D | V | D | D | T |
FP0017-C | C | M | N | K | T | H | I | A | Q | N | T | G | I | R | A | N | C | S | I | S | [A/F] | A | K | A | [A/S] | E | N | Y | S | N | F | D | V | D | D | T |
Information on common haplotypes from 8 genes associated with or relevant for antimalarial resistance:
dr_hap_ls = ['PfCRT','Kelch','PfDHFR','PfEXO','PGB','Plasmepsin2/3','PfDHPS','PfMDR1']
pf6plus[dr_hap_ls].head()
PfCRT | Kelch | PfDHFR | PfEXO | PGB | Plasmepsin2/3 | PfDHPS | PfMDR1 | |
---|---|---|---|---|---|---|---|---|
SampleId | ||||||||
FP0008-C | ----- | WT | ANC[S/N]IS | E | VDN[T/I]DT | WT | -[A/S][A/G]KAA | N[F/Y]D |
FP0009-C | ----- | WT | AIRNIS | E | VDNTDT | WT | -AAKAA | YFY |
FP0015-C | ----- | WT | AIRNIS | E | VDNIDT | WT | -[A/S][A/G]KAA | N[F/Y]D |
FP0016-C | ----- | WT | ANCSIS | E | VDNIDT | WT | -SGKAA | NFD |
FP0017-C | CVMNK | WT | ANCSIS | E | VDNIDT | WT | -[A/F]AKA[A/S] | NYD |
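The position-wise haplotype strings (for example PfDHFR, PfDHPS and PfMDR1) line up with the per-SNP columns shown earlier, with bracketed entries such as [S/N] occupying a single position. The sketch below splits such a string into one token per position. Reading a bracketed token as a mixed call is an assumption, and the function names are hypothetical. It does not apply to values such as WT.

```python
import re

# A token is either a bracketed group like "[S/N]" or any single character.
_TOKEN = re.compile(r"\[[^\]]*\]|.")


def split_haplotype(haplotype):
    """Split a position-wise haplotype string into one token per position."""
    return _TOKEN.findall(haplotype)


def has_mixed_call(haplotype):
    """True if any position carries a bracketed (multi-allele) token."""
    return any(token.startswith("[") for token in split_haplotype(haplotype))


# PfDHFR haplotype of FP0008-C, copied from the table above.
dhfr_tokens = split_haplotype("ANC[S/N]IS")
```

The six tokens correspond to PfDHFR positions 16, 51, 59, 108, 164 and 306 in the SNP table.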
4. Drug resistance classification (aka phenotypes)¶
Each sample is classified as Resistant, Sensitive, or Undetermined for eight major antimalarial drugs or combination therapies, based on well-recognised markers of resistance. Details of the methods can be found in the Pf6 and GenRe-Mekong publications.
drugs_ls = ['Artemisinin','Piperaquine','DHA-PPQ','Chloroquine','Pyrimethamine','Sulfadoxine','S-P','S-P-IPTp']
pf6plus[drugs_ls].head()
Artemisinin | Piperaquine | DHA-PPQ | Chloroquine | Pyrimethamine | Sulfadoxine | S-P | S-P-IPTp | |
---|---|---|---|---|---|---|---|---|
SampleId | ||||||||
FP0008-C | Sensitive | Sensitive | Sensitive | Undetermined | Undetermined | Undetermined | Sensitive | Sensitive |
FP0009-C | Sensitive | Sensitive | Sensitive | Resistant | Resistant | Sensitive | Resistant | Sensitive |
FP0015-C | Sensitive | Sensitive | Sensitive | Resistant | Resistant | Undetermined | Resistant | Sensitive |
FP0016-C | Sensitive | Sensitive | Sensitive | Undetermined | Sensitive | Resistant | Sensitive | Sensitive |
FP0017-C | Sensitive | Sensitive | Sensitive | Sensitive | Sensitive | Sensitive | Sensitive | Sensitive |
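These classifications can be aggregated into a resistance frequency per group. The sketch below uses the five rows shown above for two of the drugs. Excluding Undetermined samples from the denominator is a choice made for this example, not necessarily the method used in the publications, and the function name is hypothetical.

```python
import pandas as pd


def resistant_fraction(samples, drug, by="Country"):
    """Fraction of Resistant calls per group, ignoring Undetermined samples."""
    determined = samples[samples[drug] != "Undetermined"]
    is_resistant = determined[drug] == "Resistant"
    return is_resistant.groupby(determined[by]).mean()


# The five samples displayed above (all from Mauritania).
toy = pd.DataFrame(
    {
        "Country": ["Mauritania"] * 5,
        "Chloroquine": ["Undetermined", "Resistant", "Resistant",
                        "Undetermined", "Sensitive"],
        "Pyrimethamine": ["Undetermined", "Resistant", "Resistant",
                          "Sensitive", "Sensitive"],
    },
    index=["FP0008-C", "FP0009-C", "FP0015-C", "FP0016-C", "FP0017-C"],
)
chloroquine = resistant_fraction(toy, "Chloroquine")
pyrimethamine = resistant_fraction(toy, "Pyrimethamine")
```

With only five samples these fractions are not meaningful estimates; on the full table the same call would be `resistant_fraction(pf6plus, 'Chloroquine')`.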
Geographical distribution of samples in Pf6+¶
The maps below show the geographical distribution of samples in each data release, and how combining the releases provides greater global coverage.
Click on the pin to find out more about the data collected at each site.
Samples in the Pf6 data release¶
map_samples(pf6plus, 'Pf6')
Samples in the GenRe-Mekong data release¶
map_samples(pf6plus, 'GenRe')
Samples in Pf6+ (union of the above)¶
map_samples(pf6plus, 'Pf6+')
Temporal distribution of samples in Pf6+¶
As with the geographical coverage above, combining the two resources gives a wider and more complete temporal spread.
plot_temporal_samples_histogram(pf6plus)