Analysing patterns of drug resistance¶
This notebook introduces a series of convenience functions that allow you to:
Tabulate and plot the frequency the prevalence of drug resistance over time, e.g. to compare the distribution in different countries at specific time points
Tabulate and plot haplotype frequencies at different geographical scale (population, country and level 1 administrative division) over time, e.g. to analyse regional diversity and haplotype compositions
Setup¶
The utility functions are available from the Pf6+ repo. Before we can start our analysis, we will have to clone the repository to the execution environment. This is a very simple step that depends on where your running the notebook
If you are running this on Google Colab¶
!git clone https://github.com/malariagen/Pf6plus.git
!cp -r /content/Pf6plus/pf6plus_documentation/notebooks/data_analysis .
If you are running this locally¶
There are some steps you need to follow to run the notebooks locally. If you haven’t already, please follow these instructions.
Get started¶
In order to generate the plots as seen in these notebooks, first import the functions from the data_analysis
directory, which contain all of the code you will need:
from data_analysis.plot_dr_prevalence import *
from data_analysis.plot_haplotype_frequency import *
from data_analysis.tabulate_drug_resistance import *
Now we load bokeh
to make the plots in this notebook interactive.
Load the dataset¶
The command below reads the data directly from the online repository, without needing to download it first. However, you can still decide to download the dataset locally from https://pf6plus.cog.sanger.ac.uk/pf6plus_metadata.tsv and modify the line below to point to the correct location.
# URL of the dataset; change accordingly if you prefer to work from a local, offline copy
pf6plus_fn = 'https://pf6plus.cog.sanger.ac.uk/pf6plus_metadata.tsv'
# Read the dataset
pf6plus_all = pd.read_csv(pf6plus_fn, sep='\t', index_col=0, low_memory=False)
Next, we filter out the samples which have IncludeInAnalysis
set to True
. This will retain only high-quality samples (from WGS, AmpSeq and Agena).
pf6plus = pf6plus_all.loc[pf6plus_all['IncludeInAnalysis']]
Prevalence of drug resistance¶
Each sample is classified into Resistant
, Sensitive
, or Undetermined
for the eight major antimalarial drugs or combination therapies, based on well-recognised markers of resistance. Details of the methods can be found in the Pf6 and GenRe-Mekong publications.
The next sections show some of the utility functions available to explore these data. You can also use these functions to analyse your own data and the next notebook (called Phenotyper) will show how to do that.
In interpreting these results, keep in mind that the number of samples collected across different locations in different years varies widely in the data resource. To minimise some of the biases, in the plots shown below a threshold is set to only include country (or) population/year combinations with n_samples > 25
. You can change this default value by using the threshold
flag, but please be cautious. Also keep in mind that the epidemiological context of each individual study varies a lot, so not all aggregations are equally meaningful.
Tabulate prevalence of drug resistance¶
To start exploring the dataset we can tabulate different combinations of drugs in different countries and years. Let’s imagine we are particularly interested in sulfadoxine-pyrimethamine resistance (S-P
) and want to know its prevalence in all sampled countries within pf6plus
. The method value_counts()
from the the Python library pandas
is particularly useful:
pf6plus['S-P'].value_counts()
Resistant 8564
Sensitive 2862
Undetermined 2170
Name: S-P, dtype: int64
We can see here that the vast majority of the ~13,000 high-quality samples in pf6plus
are resistant to S-P
.
We could then decide to focus on one specific country, in this case we will look at Ghana:
pf6plus[pf6plus['Country']=='Ghana']['S-P'].value_counts()
Resistant 440
Undetermined 212
Sensitive 199
Name: S-P, dtype: int64
If we exclude the 212
Undetermined
samples, we see that 440
out of 639
(440
resistant and 199
sensitive) samples from Ghana (or 69%) are classified as resistant to S-P according to their genetic status.
While we could build more complex expressions to, for example, stratify by year, analyse multiple countries, etc, the function tabulate_drug_resistant
simplifies it a lot:
# use help(name_of_function) to access the documentation notes
help(tabulate_drug_resistant)
Help on function tabulate_drug_resistant in module data_analysis.tabulate_drug_resistance:
tabulate_drug_resistant(data, drug, country=None, population=None, years=None, bin=False)
Tabulate the frequency of drug resistant samples per country/year
Parameters:
- drug: Any of the drugs in the Pf6+ dataframe ['Artemisinin', 'Chloroquine', 'DHA-PPQ', 'Piperaquine', 'Pyrimethamine', 'S-P', 'S-P-IPTp', 'Sulfadoxine']
- country: Any of the countries in the Pf6+ dataframe (if specified, population value is not used) ['Bangladesh', 'Benin', 'Burkina Faso', 'Cambodia', 'Cameroon',
'Colombia', "Côte d'Ivoire", 'Democratic Republic of the Congo',
'Ethiopia', 'Gambia', 'Ghana', 'Guinea', 'India', 'Indonesia',
'Kenya', 'Laos', 'Madagascar', 'Malawi', 'Mali', 'Mauritania',
'Mozambique', 'Myanmar', 'Nigeria', 'Papua New Guinea', 'Peru',
'Senegal', 'Tanzania', 'Thailand', 'Uganda', 'Vietnam']
- population: Any of the populations in the Pf6+ dataframe ['CAF', 'EAF', 'ESEA', 'OCE', 'SAM', 'SAS', 'WAF', 'WSEA']
- years: An array with the year(s) in the Pf6+ dataframe [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]
- bin: If True, all the years between the specified values will be used. If False, individual years are used.
Returns:
A dataframe showing the number of Resistant, Sensitive & Undetermined
samples using the drug/country/year (or) drug/country/year combination provided. . The total number
of samples and drug resistant frequency is also provided.
tabulate_drug_resistant(pf6plus, 'S-P')
S-P resistant samples
S-P | Resistant | Sensitive | Undetermined | Total | Resistant Frequency |
---|---|---|---|---|---|
Country | |||||
Bangladesh | 715 | 834 | 221 | 1770 | 0.46 |
Benin | 34 | 1 | 1 | 36 | 0.97 |
Burkina Faso | 10 | 27 | 19 | 56 | 0.27 |
Cambodia | 1697 | 96 | 82 | 1875 | 0.95 |
Cameroon | 230 | 2 | 3 | 235 | 0.99 |
Colombia | 0 | 16 | 0 | 16 | 0.00 |
Côte d'Ivoire | 34 | 28 | 8 | 70 | 0.55 |
Democratic Republic of the Congo | 268 | 55 | 141 | 464 | 0.83 |
Ethiopia | 18 | 2 | 1 | 21 | 0.90 |
Gambia | 182 | 25 | 12 | 219 | 0.88 |
Ghana | 440 | 199 | 212 | 851 | 0.69 |
Guinea | 108 | 18 | 23 | 149 | 0.86 |
India | 46 | 229 | 5 | 280 | 0.17 |
Indonesia | 0 | 80 | 0 | 80 | 0.00 |
Kenya | 86 | 17 | 7 | 110 | 0.83 |
Laos | 730 | 493 | 295 | 1518 | 0.60 |
Madagascar | 17 | 7 | 0 | 24 | 0.71 |
Malawi | 243 | 3 | 8 | 254 | 0.99 |
Mali | 193 | 142 | 91 | 426 | 0.58 |
Mauritania | 47 | 20 | 9 | 76 | 0.70 |
Mozambique | 1 | 0 | 0 | 1 | 1.00 |
Myanmar | 581 | 224 | 355 | 1160 | 0.72 |
Nigeria | 27 | 2 | 0 | 29 | 0.93 |
Papua New Guinea | 0 | 121 | 0 | 121 | 0.00 |
Peru | 0 | 21 | 0 | 21 | 0.00 |
Senegal | 76 | 6 | 2 | 84 | 0.93 |
Tanzania | 256 | 31 | 29 | 316 | 0.89 |
Thailand | 905 | 69 | 45 | 1019 | 0.93 |
Uganda | 11 | 1 | 1 | 13 | 0.92 |
Vietnam | 1609 | 93 | 600 | 2302 | 0.95 |
The proportion of samples classified as S-P resistant in this dataset seems to be high in the West Africa: has this always been the case? Let’s aggregate all samples in this population and stratify by year this time:
tabulate_drug_resistant(pf6plus, 'S-P', population='WAF')
S-P resistant samples in WAF
S-P | Resistant | Sensitive | Undetermined | Total | Resistant Frequency |
---|---|---|---|---|---|
Year | |||||
2007 | 26 | 35 | 24 | 170 | 0.43 |
2008 | 70 | 38 | 21 | 258 | 0.65 |
2009 | 51 | 32 | 35 | 236 | 0.61 |
2010 | 66 | 45 | 45 | 312 | 0.59 |
2011 | 157 | 50 | 42 | 498 | 0.76 |
2012 | 51 | 11 | 19 | 162 | 0.82 |
2013 | 595 | 169 | 134 | 1796 | 0.78 |
2014 | 336 | 81 | 37 | 908 | 0.81 |
2015 | 29 | 9 | 23 | 122 | 0.76 |
Plot prevalence of drug resistance¶
This function explores the frequencies of drug resistant variants within different spatial (country/analysis population) and temporal (year) combinations. This is quite a useful way of visualising time series data in geographic regions & years of interest.
You can also hover over the data points to see the specific numbers being plotted, and zoom in and out to change the scale.
help(plot_dr_prevalence)
Help on function plot_dr_prevalence in module data_analysis.plot_dr_prevalence:
plot_dr_prevalence(data, drugs, country=None, population=None, years=None, bin=False, threshold=25)
Plot the prevalence of resistant samples per country/year
Parameters:
- drug: Any/list of the drugs in the Pf6+ dataframe ['Artemisinin', 'Chloroquine', 'DHA-PPQ', 'Piperaquine', 'Pyrimethamine', 'S-P', 'S-P-IPTp', 'Sulfadoxine']
- country: Any of the countries in the Pf6+ dataframe (if specified, population value is not used) ['Bangladesh', 'Benin', 'Burkina Faso', 'Cambodia', 'Cameroon',
'Colombia', "Côte d'Ivoire", 'Democratic Republic of the Congo',
'Ethiopia', 'Gambia', 'Ghana', 'Guinea', 'India', 'Indonesia',
'Kenya', 'Laos', 'Madagascar', 'Malawi', 'Mali', 'Mauritania',
'Mozambique', 'Myanmar', 'Nigeria', 'Papua New Guinea', 'Peru',
'Senegal', 'Tanzania', 'Thailand', 'Uganda', 'Vietnam']
- population: Any of the populations in the Pf6+ dataframe ['CAF', 'EAF', 'ESEA', 'OCE', 'SAM', 'SAS', 'WAF', 'WSEA']
- year: Any/list of the years in the Pf6+ dataframe [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]
- bin: If True, all the years between the specified values will be used. If False, individual years are used.
- threshold: To increase confidence on disperse data, only use country (or) population/year combinations with n_samples>threshold (default=25)
Returns:
A series of plots (one per drug) showing the prevalence of resistant variants the drug/country/year (or) drug/country/year combination provided.
We can continue by exploring prevalance patterns specific to West Africa, by looking at the information available for specific drugs in Ghana. By comparing an individual country versus the analysis population’s average, we can make it easier to spot different trends emerging in the data.
plot_dr_prevalence(pf6plus, drugs=['S-P','Chloroquine'], country = 'Ghana', population = 'WAF')
/Users/ra4/Documents/GitHub/Pf6plus/pf6plus_documentation/notebooks/data_analysis/plot_dr_prevalence.py:37: RuntimeWarning: invalid value encountered in long_scalars
round(row[0] / (row[0] + row[1]), 2)
BokehDeprecationWarning: plot_width and plot_height was deprecated in Bokeh 2.4.0 and will be removed, use width or height instead.
We can see that overall in Ghana, the prevalence of S-P
resistance is quite similar to the overall prevalence in the rest of West Africa in this time period (left plot). When plotting Chloroquine
resistance, however, we can see that chloroquine resistance is considerably lower when compared to the rest of West Africa.
The drug resistance pattern shown here warrants further investigation, so next we can take a look at another country (Mali) and see how this trend looks within the same analysis population of West Africa (WAF).
plot_dr_prevalence(pf6plus, drugs=['S-P','Chloroquine'], country = 'Mali', population = 'WAF')
BokehDeprecationWarning: plot_width and plot_height was deprecated in Bokeh 2.4.0 and will be removed, use width or height instead.
In this case, Mali has similar, although slightly lower S-P
prevalence compared to West Africa overall (left plot). The Chloroquine
resistance prevalence shows a more complex picture from prevalence fluctuations between sampling years in WAF, but we can see that overall prevalence between Mali and the rest of WAF are roughly comparable for the years in which samples were collected in Mali.
Plot the most common haplotypes per population/country¶
Drug resistance haplotypes allow us to visualise key drug resistance positions in the genome together, and look out for changes within these structures. By plotting the top haplotypes in a region, we can spot emerging trends.
help(plot_haplotype_frequency)
Help on function plot_haplotype_frequency in module data_analysis.plot_haplotype_frequency:
plot_haplotype_frequency(data, gene, num_top_haplotypes=5, threshold=25, countries=None, populations=None, years=None, bin=False)
Plot the top n haplotypes on a specife gene per country (or) population per year
Parameters:
- gene: Any of the genes in the Pf6+ dataframe ['PfCRT', 'Kelch', 'PfDHFR', 'PfEXO', 'PGB', 'Plasmepsin2/3', 'PfDHPS', 'PfMDR1']
- country: Any of the countries in the Pf6+ dataframe (if specified, population value is not used) ['Bangladesh', 'Benin', 'Burkina Faso', 'Cambodia', 'Cameroon',
'Colombia', "Côte d'Ivoire", 'Democratic Republic of the Congo',
'Ethiopia', 'Gambia', 'Ghana', 'Guinea', 'India', 'Indonesia',
'Kenya', 'Laos', 'Madagascar', 'Malawi', 'Mali', 'Mauritania',
'Mozambique', 'Myanmar', 'Nigeria', 'Papua New Guinea', 'Peru',
'Senegal', 'Tanzania', 'Thailand', 'Uganda', 'Vietnam']
- population: Any of the populations in the Pf6+ dataframe ['CAF', 'EAF', 'ESEA', 'OCE', 'SAM', 'SAS', 'WAF', 'WSEA']
- year: Any/list of the years in the Pf6+ dataframe [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]
- bin: If True, all the years between the specified values will be used. If False, individual years are used.
- threshold: To increase confidence on disperse data, only use country (or) population/year combinations with n_samples>threshold (default=25)
Returns:
A series of plots (one per gene) showing the frequency of the top n haplotypes the drug/country/year (or) drug/country/year combination provided. Haplotypes outside the top n, are showed as part of the `Other` category.
Let’s take a look at a common drug resistance locus that is involved in mefloquine resistance, PfMDR1
. We can compare the haplotype frequencies for PfMDR1
in two analysis populations: WSEA
and ESEA
.
plot_haplotype_frequency(pf6plus, 'PfMDR1', num_top_haplotypes=2, populations = ['WSEA', 'ESEA'])
BokehDeprecationWarning: plot_width and plot_height was deprecated in Bokeh 2.4.0 and will be removed, use width or height instead.
<data_analysis.plotting.Subplots at 0x122240f10>
Interestingly, the predominant PfMDR1 haplotype in WSEA (left plot) is NYD
in blue, which maintains a stable high frequency over the entire sampling period. ESEA, on the other hand (right plot), has two moderate frequency PfMDR1
haplotypes for several years, followed by the orange NFD
haplotype eventually gaining dominance over the NYD
haplotype over time. Despite these two regions being quite geographically close, the PfMDR1
haplotypes are experiencing what seems to be very different selective pressures at this locus.