Analysing patterns of drug resistance¶

This notebook introduces a series of convenience functions that allow you to:

Tabulate and plot the frequency the prevalence of drug resistance over time, e.g. to compare the distribution in different countries at specific time points
Tabulate and plot haplotype frequencies at different geographical scale (population, country and level 1 administrative division) over time, e.g. to analyse regional diversity and haplotype compositions

Setup¶

The utility functions are available from the Pf6+ repo. Before we can start our analysis, we will have to clone the repository to the execution environment. This is a very simple step that depends on where your running the notebook

If you are running this on Google Colab¶

!git clone https://github.com/malariagen/Pf6plus.git 
!cp -r /content/Pf6plus/pf6plus_documentation/notebooks/data_analysis .

If you are running this locally¶

There are some steps you need to follow to run the notebooks locally. If you haven’t already, please follow these instructions.

Get started¶

In order to generate the plots as seen in these notebooks, first import the functions from the data_analysis directory, which contain all of the code you will need:

from data_analysis.plot_dr_prevalence import *
from data_analysis.plot_haplotype_frequency import *
from data_analysis.tabulate_drug_resistance import *

Now we load bokeh to make the plots in this notebook interactive.

import bokeh.io
bokeh.io.output_notebook()

Loading BokehJS ...

Load the dataset¶

The command below reads the data directly from the online repository, without needing to download it first. However, you can still decide to download the dataset locally from https://pf6plus.cog.sanger.ac.uk/pf6plus_metadata.tsv and modify the line below to point to the correct location.

# URL of the dataset; change accordingly if you prefer to work from a local, offline copy
pf6plus_fn = 'https://pf6plus.cog.sanger.ac.uk/pf6plus_metadata.tsv'

# Read the dataset
pf6plus_all = pd.read_csv(pf6plus_fn, sep='\t', index_col=0, low_memory=False)

Next, we filter out the samples which have IncludeInAnalysis set to True. This will retain only high-quality samples (from WGS, AmpSeq and Agena).

pf6plus = pf6plus_all.loc[pf6plus_all['IncludeInAnalysis']]

Prevalence of drug resistance¶

Each sample is classified into Resistant, Sensitive, or Undetermined for the eight major antimalarial drugs or combination therapies, based on well-recognised markers of resistance. Details of the methods can be found in the Pf6 and GenRe-Mekong publications.

The next sections show some of the utility functions available to explore these data. You can also use these functions to analyse your own data and the next notebook (called Phenotyper) will show how to do that.

In interpreting these results, keep in mind that the number of samples collected across different locations in different years varies widely in the data resource. To minimise some of the biases, in the plots shown below a threshold is set to only include country (or) population/year combinations with n_samples > 25. You can change this default value by using the threshold flag, but please be cautious. Also keep in mind that the epidemiological context of each individual study varies a lot, so not all aggregations are equally meaningful.

Tabulate prevalence of drug resistance¶

To start exploring the dataset we can tabulate different combinations of drugs in different countries and years. Let’s imagine we are particularly interested in sulfadoxine-pyrimethamine resistance (S-P) and want to know its prevalence in all sampled countries within pf6plus. The method value_counts() from the the Python library pandas is particularly useful:

pf6plus['S-P'].value_counts()

Resistant       8564
Sensitive       2862
Undetermined    2170
Name: S-P, dtype: int64

We can see here that the vast majority of the ~13,000 high-quality samples in pf6plus are resistant to S-P.

We could then decide to focus on one specific country, in this case we will look at Ghana:

pf6plus[pf6plus['Country']=='Ghana']['S-P'].value_counts()

Resistant       440
Undetermined    212
Sensitive       199
Name: S-P, dtype: int64

If we exclude the 212 Undetermined samples, we see that 440 out of 639 (440 resistant and 199 sensitive) samples from Ghana (or 69%) are classified as resistant to S-P according to their genetic status.

While we could build more complex expressions to, for example, stratify by year, analyse multiple countries, etc, the function tabulate_drug_resistant simplifies it a lot:

# use help(name_of_function) to access the documentation notes
help(tabulate_drug_resistant)

Help on function tabulate_drug_resistant in module data_analysis.tabulate_drug_resistance:

tabulate_drug_resistant(data, drug, country=None, population=None, years=None, bin=False)
    Tabulate the frequency of drug resistant samples per country/year
    
    Parameters:
      - drug: Any of the drugs in the Pf6+ dataframe ['Artemisinin', 'Chloroquine', 'DHA-PPQ', 'Piperaquine', 'Pyrimethamine', 'S-P', 'S-P-IPTp', 'Sulfadoxine']
      - country: Any of the countries in the Pf6+ dataframe (if specified, population value is not used) ['Bangladesh', 'Benin', 'Burkina Faso', 'Cambodia', 'Cameroon',
       'Colombia', "Côte d'Ivoire", 'Democratic Republic of the Congo',
       'Ethiopia', 'Gambia', 'Ghana', 'Guinea', 'India', 'Indonesia',
       'Kenya', 'Laos', 'Madagascar', 'Malawi', 'Mali', 'Mauritania',
       'Mozambique', 'Myanmar', 'Nigeria', 'Papua New Guinea', 'Peru',
       'Senegal', 'Tanzania', 'Thailand', 'Uganda', 'Vietnam']
      - population: Any of the populations in the Pf6+ dataframe ['CAF', 'EAF', 'ESEA', 'OCE', 'SAM', 'SAS', 'WAF', 'WSEA']
      - years: An array with the year(s) in the Pf6+ dataframe [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]
      - bin: If True, all the years between the specified values will be used. If False, individual years are used.
    
    Returns:
      A dataframe showing the number of Resistant, Sensitive & Undetermined
      samples using the drug/country/year (or) drug/country/year combination provided. . The total number
      of samples and drug resistant frequency is also provided.

tabulate_drug_resistant(pf6plus, 'S-P')

S-P resistant samples

S-P	Resistant	Sensitive	Undetermined	Total	Resistant Frequency
Country
Bangladesh	715	834	221	1770	0.46
Benin	34	1	1	36	0.97
Burkina Faso	10	27	19	56	0.27
Cambodia	1697	96	82	1875	0.95
Cameroon	230	2	3	235	0.99
Colombia	0	16	0	16	0.00
Côte d'Ivoire	34	28	8	70	0.55
Democratic Republic of the Congo	268	55	141	464	0.83
Ethiopia	18	2	1	21	0.90
Gambia	182	25	12	219	0.88
Ghana	440	199	212	851	0.69
Guinea	108	18	23	149	0.86
India	46	229	5	280	0.17
Indonesia	0	80	0	80	0.00
Kenya	86	17	7	110	0.83
Laos	730	493	295	1518	0.60
Madagascar	17	7	0	24	0.71
Malawi	243	3	8	254	0.99
Mali	193	142	91	426	0.58
Mauritania	47	20	9	76	0.70
Mozambique	1	0	0	1	1.00
Myanmar	581	224	355	1160	0.72
Nigeria	27	2	0	29	0.93
Papua New Guinea	0	121	0	121	0.00
Peru	0	21	0	21	0.00
Senegal	76	6	2	84	0.93
Tanzania	256	31	29	316	0.89
Thailand	905	69	45	1019	0.93
Uganda	11	1	1	13	0.92
Vietnam	1609	93	600	2302	0.95

The proportion of samples classified as S-P resistant in this dataset seems to be high in the West Africa: has this always been the case? Let’s aggregate all samples in this population and stratify by year this time:

tabulate_drug_resistant(pf6plus, 'S-P', population='WAF')

S-P resistant samples in WAF

S-P	Resistant	Sensitive	Undetermined	Total	Resistant Frequency
Year
2007	26	35	24	170	0.43
2008	70	38	21	258	0.65
2009	51	32	35	236	0.61
2010	66	45	45	312	0.59
2011	157	50	42	498	0.76
2012	51	11	19	162	0.82
2013	595	169	134	1796	0.78
2014	336	81	37	908	0.81
2015	29	9	23	122	0.76

Plot prevalence of drug resistance¶

This function explores the frequencies of drug resistant variants within different spatial (country/analysis population) and temporal (year) combinations. This is quite a useful way of visualising time series data in geographic regions & years of interest.

You can also hover over the data points to see the specific numbers being plotted, and zoom in and out to change the scale.

help(plot_dr_prevalence)

Help on function plot_dr_prevalence in module data_analysis.plot_dr_prevalence:

plot_dr_prevalence(data, drugs, country=None, population=None, years=None, bin=False, threshold=25)
    Plot the prevalence of resistant samples per country/year
    
    Parameters:
      - drug: Any/list of the drugs in the Pf6+ dataframe ['Artemisinin', 'Chloroquine', 'DHA-PPQ', 'Piperaquine', 'Pyrimethamine', 'S-P', 'S-P-IPTp', 'Sulfadoxine']
      - country: Any of the countries in the Pf6+ dataframe (if specified, population value is not used) ['Bangladesh', 'Benin', 'Burkina Faso', 'Cambodia', 'Cameroon',
       'Colombia', "Côte d'Ivoire", 'Democratic Republic of the Congo',
       'Ethiopia', 'Gambia', 'Ghana', 'Guinea', 'India', 'Indonesia',
       'Kenya', 'Laos', 'Madagascar', 'Malawi', 'Mali', 'Mauritania',
       'Mozambique', 'Myanmar', 'Nigeria', 'Papua New Guinea', 'Peru',
       'Senegal', 'Tanzania', 'Thailand', 'Uganda', 'Vietnam']
      - population: Any of the populations in the Pf6+ dataframe ['CAF', 'EAF', 'ESEA', 'OCE', 'SAM', 'SAS', 'WAF', 'WSEA']
      - year: Any/list of the years in the Pf6+ dataframe [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]
      - bin: If True, all the years between the specified values will be used. If False, individual years are used.
      - threshold: To increase confidence on disperse data, only use country (or) population/year combinations with n_samples>threshold (default=25)
    
    Returns:
      A series of plots (one per drug) showing the prevalence of resistant variants the drug/country/year (or) drug/country/year combination provided.

We can continue by exploring prevalance patterns specific to West Africa, by looking at the information available for specific drugs in Ghana. By comparing an individual country versus the analysis population’s average, we can make it easier to spot different trends emerging in the data.

plot_dr_prevalence(pf6plus, drugs=['S-P','Chloroquine'], country = 'Ghana', population = 'WAF')

/Users/ra4/Documents/GitHub/Pf6plus/pf6plus_documentation/notebooks/data_analysis/plot_dr_prevalence.py:37: RuntimeWarning: invalid value encountered in long_scalars
  round(row[0] / (row[0] + row[1]), 2)
BokehDeprecationWarning: plot_width and plot_height was deprecated in Bokeh 2.4.0 and will be removed, use width or height instead.

We can see that overall in Ghana, the prevalence of S-P resistance is quite similar to the overall prevalence in the rest of West Africa in this time period (left plot). When plotting Chloroquine resistance, however, we can see that chloroquine resistance is considerably lower when compared to the rest of West Africa.

The drug resistance pattern shown here warrants further investigation, so next we can take a look at another country (Mali) and see how this trend looks within the same analysis population of West Africa (WAF).

plot_dr_prevalence(pf6plus, drugs=['S-P','Chloroquine'], country = 'Mali', population = 'WAF')

BokehDeprecationWarning: plot_width and plot_height was deprecated in Bokeh 2.4.0 and will be removed, use width or height instead.

In this case, Mali has similar, although slightly lower S-P prevalence compared to West Africa overall (left plot). The Chloroquine resistance prevalence shows a more complex picture from prevalence fluctuations between sampling years in WAF, but we can see that overall prevalence between Mali and the rest of WAF are roughly comparable for the years in which samples were collected in Mali.

Plot the most common haplotypes per population/country¶

Drug resistance haplotypes allow us to visualise key drug resistance positions in the genome together, and look out for changes within these structures. By plotting the top haplotypes in a region, we can spot emerging trends.

help(plot_haplotype_frequency)

Help on function plot_haplotype_frequency in module data_analysis.plot_haplotype_frequency:

plot_haplotype_frequency(data, gene, num_top_haplotypes=5, threshold=25, countries=None, populations=None, years=None, bin=False)
    Plot the top n haplotypes on a specife gene per country (or) population per year
    
    Parameters:
      - gene: Any of the genes in the Pf6+ dataframe ['PfCRT', 'Kelch', 'PfDHFR', 'PfEXO', 'PGB', 'Plasmepsin2/3', 'PfDHPS', 'PfMDR1']
      - country: Any of the countries in the Pf6+ dataframe (if specified, population value is not used) ['Bangladesh', 'Benin', 'Burkina Faso', 'Cambodia', 'Cameroon',
       'Colombia', "Côte d'Ivoire", 'Democratic Republic of the Congo',
       'Ethiopia', 'Gambia', 'Ghana', 'Guinea', 'India', 'Indonesia',
       'Kenya', 'Laos', 'Madagascar', 'Malawi', 'Mali', 'Mauritania',
       'Mozambique', 'Myanmar', 'Nigeria', 'Papua New Guinea', 'Peru',
       'Senegal', 'Tanzania', 'Thailand', 'Uganda', 'Vietnam']
      - population: Any of the populations in the Pf6+ dataframe ['CAF', 'EAF', 'ESEA', 'OCE', 'SAM', 'SAS', 'WAF', 'WSEA']
      - year: Any/list of the years in the Pf6+ dataframe [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]
      - bin: If True, all the years between the specified values will be used. If False, individual years are used.
      - threshold: To increase confidence on disperse data, only use country (or) population/year combinations with n_samples>threshold (default=25)
    
    Returns:
      A series of plots (one per gene) showing the frequency of the top n haplotypes the drug/country/year (or) drug/country/year combination provided. Haplotypes outside the top n, are showed as part of the `Other` category.

Let’s take a look at a common drug resistance locus that is involved in mefloquine resistance, PfMDR1. We can compare the haplotype frequencies for PfMDR1 in two analysis populations: WSEA and ESEA.

plot_haplotype_frequency(pf6plus, 'PfMDR1', num_top_haplotypes=2, populations = ['WSEA', 'ESEA'])

BokehDeprecationWarning: plot_width and plot_height was deprecated in Bokeh 2.4.0 and will be removed, use width or height instead.

<data_analysis.plotting.Subplots at 0x122240f10>

Interestingly, the predominant PfMDR1 haplotype in WSEA (left plot) is NYD in blue, which maintains a stable high frequency over the entire sampling period. ESEA, on the other hand (right plot), has two moderate frequency PfMDR1 haplotypes for several years, followed by the orange NFD haplotype eventually gaining dominance over the NYD haplotype over time. Despite these two regions being quite geographically close, the PfMDR1 haplotypes are experiencing what seems to be very different selective pressures at this locus.

Pf6+ User Guide