Analysing patterns of drug resistance

This notebook introduces a series of convenience functions that allow you to:

  • Tabulate and plot the frequency the prevalence of drug resistance over time, e.g. to compare the distribution in different countries at specific time points

  • Tabulate and plot haplotype frequencies at different geographical scale (population, country and level 1 administrative division) over time, e.g. to analyse regional diversity and haplotype compositions

Setup

The utility functions are available from the Pf6+ repo. Before we can start our analysis, we will have to clone the repository to the execution environment. This is a very simple step that depends on where your running the notebook

If you are running this on Google Colab

!git clone https://github.com/malariagen/Pf6plus.git 
!cp -r /content/Pf6plus/pf6plus_documentation/notebooks/data_analysis .

If you are running this locally

There are some steps you need to follow to run the notebooks locally. If you haven’t already, please follow these instructions.

Get started

In order to generate the plots as seen in these notebooks, first import the functions from the data_analysis directory, which contain all of the code you will need:

from data_analysis.plot_dr_prevalence import *
from data_analysis.plot_haplotype_frequency import *
from data_analysis.tabulate_drug_resistance import *

Now we load bokeh to make the plots in this notebook interactive.

import bokeh.io
bokeh.io.output_notebook()
Loading BokehJS ...

Load the dataset

The command below reads the data directly from the online repository, without needing to download it first. However, you can still decide to download the dataset locally from https://pf6plus.cog.sanger.ac.uk/pf6plus_metadata.tsv and modify the line below to point to the correct location.

# URL of the dataset; change accordingly if you prefer to work from a local, offline copy
pf6plus_fn = 'https://pf6plus.cog.sanger.ac.uk/pf6plus_metadata.tsv'

# Read the dataset
pf6plus_all = pd.read_csv(pf6plus_fn, sep='\t', index_col=0, low_memory=False)

Next, we filter out the samples which have IncludeInAnalysis set to True. This will retain only high-quality samples (from WGS, AmpSeq and Agena).

pf6plus = pf6plus_all.loc[pf6plus_all['IncludeInAnalysis']]

Prevalence of drug resistance

Each sample is classified into Resistant, Sensitive, or Undetermined for the eight major antimalarial drugs or combination therapies, based on well-recognised markers of resistance. Details of the methods can be found in the Pf6 and GenRe-Mekong publications.

The next sections show some of the utility functions available to explore these data. You can also use these functions to analyse your own data and the next notebook (called Phenotyper) will show how to do that.

In interpreting these results, keep in mind that the number of samples collected across different locations in different years varies widely in the data resource. To minimise some of the biases, in the plots shown below a threshold is set to only include country (or) population/year combinations with n_samples > 25. You can change this default value by using the threshold flag, but please be cautious. Also keep in mind that the epidemiological context of each individual study varies a lot, so not all aggregations are equally meaningful.

Tabulate prevalence of drug resistance

To start exploring the dataset we can tabulate different combinations of drugs in different countries and years. Let’s imagine we are particularly interested in sulfadoxine-pyrimethamine resistance (S-P) and want to know its prevalence in all sampled countries within pf6plus. The method value_counts() from the the Python library pandas is particularly useful:

pf6plus['S-P'].value_counts()
Resistant       8564
Sensitive       2862
Undetermined    2170
Name: S-P, dtype: int64

We can see here that the vast majority of the ~13,000 high-quality samples in pf6plus are resistant to S-P.

We could then decide to focus on one specific country, in this case we will look at Ghana:

pf6plus[pf6plus['Country']=='Ghana']['S-P'].value_counts()
Resistant       440
Undetermined    212
Sensitive       199
Name: S-P, dtype: int64

If we exclude the 212 Undetermined samples, we see that 440 out of 639 (440 resistant and 199 sensitive) samples from Ghana (or 69%) are classified as resistant to S-P according to their genetic status.

While we could build more complex expressions to, for example, stratify by year, analyse multiple countries, etc, the function tabulate_drug_resistant simplifies it a lot:

# use help(name_of_function) to access the documentation notes
help(tabulate_drug_resistant)
Help on function tabulate_drug_resistant in module data_analysis.tabulate_drug_resistance:

tabulate_drug_resistant(data, drug, country=None, population=None, years=None, bin=False)
    Tabulate the frequency of drug resistant samples per country/year
    
    Parameters:
      - drug: Any of the drugs in the Pf6+ dataframe ['Artemisinin', 'Chloroquine', 'DHA-PPQ', 'Piperaquine', 'Pyrimethamine', 'S-P', 'S-P-IPTp', 'Sulfadoxine']
      - country: Any of the countries in the Pf6+ dataframe (if specified, population value is not used) ['Bangladesh', 'Benin', 'Burkina Faso', 'Cambodia', 'Cameroon',
       'Colombia', "Côte d'Ivoire", 'Democratic Republic of the Congo',
       'Ethiopia', 'Gambia', 'Ghana', 'Guinea', 'India', 'Indonesia',
       'Kenya', 'Laos', 'Madagascar', 'Malawi', 'Mali', 'Mauritania',
       'Mozambique', 'Myanmar', 'Nigeria', 'Papua New Guinea', 'Peru',
       'Senegal', 'Tanzania', 'Thailand', 'Uganda', 'Vietnam']
      - population: Any of the populations in the Pf6+ dataframe ['CAF', 'EAF', 'ESEA', 'OCE', 'SAM', 'SAS', 'WAF', 'WSEA']
      - years: An array with the year(s) in the Pf6+ dataframe [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]
      - bin: If True, all the years between the specified values will be used. If False, individual years are used.
    
    Returns:
      A dataframe showing the number of Resistant, Sensitive & Undetermined
      samples using the drug/country/year (or) drug/country/year combination provided. . The total number
      of samples and drug resistant frequency is also provided.
tabulate_drug_resistant(pf6plus, 'S-P')
S-P resistant samples 
S-P Resistant Sensitive Undetermined Total Resistant Frequency
Country
Bangladesh 715 834 221 1770 0.46
Benin 34 1 1 36 0.97
Burkina Faso 10 27 19 56 0.27
Cambodia 1697 96 82 1875 0.95
Cameroon 230 2 3 235 0.99
Colombia 0 16 0 16 0.00
Côte d'Ivoire 34 28 8 70 0.55
Democratic Republic of the Congo 268 55 141 464 0.83
Ethiopia 18 2 1 21 0.90
Gambia 182 25 12 219 0.88
Ghana 440 199 212 851 0.69
Guinea 108 18 23 149 0.86
India 46 229 5 280 0.17
Indonesia 0 80 0 80 0.00
Kenya 86 17 7 110 0.83
Laos 730 493 295 1518 0.60
Madagascar 17 7 0 24 0.71
Malawi 243 3 8 254 0.99
Mali 193 142 91 426 0.58
Mauritania 47 20 9 76 0.70
Mozambique 1 0 0 1 1.00
Myanmar 581 224 355 1160 0.72
Nigeria 27 2 0 29 0.93
Papua New Guinea 0 121 0 121 0.00
Peru 0 21 0 21 0.00
Senegal 76 6 2 84 0.93
Tanzania 256 31 29 316 0.89
Thailand 905 69 45 1019 0.93
Uganda 11 1 1 13 0.92
Vietnam 1609 93 600 2302 0.95

The proportion of samples classified as S-P resistant in this dataset seems to be high in the West Africa: has this always been the case? Let’s aggregate all samples in this population and stratify by year this time:

tabulate_drug_resistant(pf6plus, 'S-P', population='WAF')
S-P resistant samples in WAF 
S-P Resistant Sensitive Undetermined Total Resistant Frequency
Year
2007 26 35 24 170 0.43
2008 70 38 21 258 0.65
2009 51 32 35 236 0.61
2010 66 45 45 312 0.59
2011 157 50 42 498 0.76
2012 51 11 19 162 0.82
2013 595 169 134 1796 0.78
2014 336 81 37 908 0.81
2015 29 9 23 122 0.76

Plot prevalence of drug resistance

This function explores the frequencies of drug resistant variants within different spatial (country/analysis population) and temporal (year) combinations. This is quite a useful way of visualising time series data in geographic regions & years of interest.

You can also hover over the data points to see the specific numbers being plotted, and zoom in and out to change the scale.

help(plot_dr_prevalence)
Help on function plot_dr_prevalence in module data_analysis.plot_dr_prevalence:

plot_dr_prevalence(data, drugs, country=None, population=None, years=None, bin=False, threshold=25)
    Plot the prevalence of resistant samples per country/year
    
    Parameters:
      - drug: Any/list of the drugs in the Pf6+ dataframe ['Artemisinin', 'Chloroquine', 'DHA-PPQ', 'Piperaquine', 'Pyrimethamine', 'S-P', 'S-P-IPTp', 'Sulfadoxine']
      - country: Any of the countries in the Pf6+ dataframe (if specified, population value is not used) ['Bangladesh', 'Benin', 'Burkina Faso', 'Cambodia', 'Cameroon',
       'Colombia', "Côte d'Ivoire", 'Democratic Republic of the Congo',
       'Ethiopia', 'Gambia', 'Ghana', 'Guinea', 'India', 'Indonesia',
       'Kenya', 'Laos', 'Madagascar', 'Malawi', 'Mali', 'Mauritania',
       'Mozambique', 'Myanmar', 'Nigeria', 'Papua New Guinea', 'Peru',
       'Senegal', 'Tanzania', 'Thailand', 'Uganda', 'Vietnam']
      - population: Any of the populations in the Pf6+ dataframe ['CAF', 'EAF', 'ESEA', 'OCE', 'SAM', 'SAS', 'WAF', 'WSEA']
      - year: Any/list of the years in the Pf6+ dataframe [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]
      - bin: If True, all the years between the specified values will be used. If False, individual years are used.
      - threshold: To increase confidence on disperse data, only use country (or) population/year combinations with n_samples>threshold (default=25)
    
    Returns:
      A series of plots (one per drug) showing the prevalence of resistant variants the drug/country/year (or) drug/country/year combination provided.

We can continue by exploring prevalance patterns specific to West Africa, by looking at the information available for specific drugs in Ghana. By comparing an individual country versus the analysis population’s average, we can make it easier to spot different trends emerging in the data.

plot_dr_prevalence(pf6plus, drugs=['S-P','Chloroquine'], country = 'Ghana', population = 'WAF')
/Users/ra4/Documents/GitHub/Pf6plus/pf6plus_documentation/notebooks/data_analysis/plot_dr_prevalence.py:37: RuntimeWarning: invalid value encountered in long_scalars
  round(row[0] / (row[0] + row[1]), 2)
BokehDeprecationWarning: plot_width and plot_height was deprecated in Bokeh 2.4.0 and will be removed, use width or height instead.

We can see that overall in Ghana, the prevalence of S-P resistance is quite similar to the overall prevalence in the rest of West Africa in this time period (left plot). When plotting Chloroquine resistance, however, we can see that chloroquine resistance is considerably lower when compared to the rest of West Africa.

The drug resistance pattern shown here warrants further investigation, so next we can take a look at another country (Mali) and see how this trend looks within the same analysis population of West Africa (WAF).

plot_dr_prevalence(pf6plus, drugs=['S-P','Chloroquine'], country = 'Mali', population = 'WAF')
BokehDeprecationWarning: plot_width and plot_height was deprecated in Bokeh 2.4.0 and will be removed, use width or height instead.

In this case, Mali has similar, although slightly lower S-P prevalence compared to West Africa overall (left plot). The Chloroquine resistance prevalence shows a more complex picture from prevalence fluctuations between sampling years in WAF, but we can see that overall prevalence between Mali and the rest of WAF are roughly comparable for the years in which samples were collected in Mali.

Plot the most common haplotypes per population/country

Drug resistance haplotypes allow us to visualise key drug resistance positions in the genome together, and look out for changes within these structures. By plotting the top haplotypes in a region, we can spot emerging trends.

help(plot_haplotype_frequency)
Help on function plot_haplotype_frequency in module data_analysis.plot_haplotype_frequency:

plot_haplotype_frequency(data, gene, num_top_haplotypes=5, threshold=25, countries=None, populations=None, years=None, bin=False)
    Plot the top n haplotypes on a specife gene per country (or) population per year
    
    Parameters:
      - gene: Any of the genes in the Pf6+ dataframe ['PfCRT', 'Kelch', 'PfDHFR', 'PfEXO', 'PGB', 'Plasmepsin2/3', 'PfDHPS', 'PfMDR1']
      - country: Any of the countries in the Pf6+ dataframe (if specified, population value is not used) ['Bangladesh', 'Benin', 'Burkina Faso', 'Cambodia', 'Cameroon',
       'Colombia', "Côte d'Ivoire", 'Democratic Republic of the Congo',
       'Ethiopia', 'Gambia', 'Ghana', 'Guinea', 'India', 'Indonesia',
       'Kenya', 'Laos', 'Madagascar', 'Malawi', 'Mali', 'Mauritania',
       'Mozambique', 'Myanmar', 'Nigeria', 'Papua New Guinea', 'Peru',
       'Senegal', 'Tanzania', 'Thailand', 'Uganda', 'Vietnam']
      - population: Any of the populations in the Pf6+ dataframe ['CAF', 'EAF', 'ESEA', 'OCE', 'SAM', 'SAS', 'WAF', 'WSEA']
      - year: Any/list of the years in the Pf6+ dataframe [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]
      - bin: If True, all the years between the specified values will be used. If False, individual years are used.
      - threshold: To increase confidence on disperse data, only use country (or) population/year combinations with n_samples>threshold (default=25)
    
    Returns:
      A series of plots (one per gene) showing the frequency of the top n haplotypes the drug/country/year (or) drug/country/year combination provided. Haplotypes outside the top n, are showed as part of the `Other` category.

Let’s take a look at a common drug resistance locus that is involved in mefloquine resistance, PfMDR1. We can compare the haplotype frequencies for PfMDR1 in two analysis populations: WSEA and ESEA.

plot_haplotype_frequency(pf6plus, 'PfMDR1', num_top_haplotypes=2, populations = ['WSEA', 'ESEA'])
BokehDeprecationWarning: plot_width and plot_height was deprecated in Bokeh 2.4.0 and will be removed, use width or height instead.
<data_analysis.plotting.Subplots at 0x122240f10>

Interestingly, the predominant PfMDR1 haplotype in WSEA (left plot) is NYD in blue, which maintains a stable high frequency over the entire sampling period. ESEA, on the other hand (right plot), has two moderate frequency PfMDR1 haplotypes for several years, followed by the orange NFD haplotype eventually gaining dominance over the NYD haplotype over time. Despite these two regions being quite geographically close, the PfMDR1 haplotypes are experiencing what seems to be very different selective pressures at this locus.