Inference of drug resistance status

This notebook describes how to use phenotyper, a tool to predict drug resistance status from genetic data.

In brief, the tool applies a collection of rules, which have been curated from the literature, to determine if a set of genotypes confer resistance to a particular drug. Although we provide an up-to-date list of phenotyping rules (as a JSON file), users can create their own rules or adopt rules from a different source.

Check the full documentation here.

When to use phenotyper?

When your Genetic Report Card (GRC) doesn’t have information regarding resistant phenotypes or you would like to update previous inferred phenotypes with new rules. Note that any file with information on genotypes (as described below) can be used with phenotyper to infer drug resistance status.\n”

How to use?

Usage: ./phenotyper.r [options]
Evaluates a set of phenotyping rules on all the samples of the given data file.

Options:
        --datafile=DATAFILE
                Path to the data file, a tabular file with the required fields (first column reserverd for sample names).

        --rulesfile=RULESFILE
                Path to the JSON file that encodes the rules to be used for evaluating each drug.

        --ruleout=RULEOUT
                The field to be extracted from a rule when evaluated as true. Helps with debugging.

        --samplefield=SAMPLEFIELD
                The field on the data file that designates the sample id (optional).

        --verbose
                A flag to obtain verbose detailed output in the standard output (can generate very big logs, for tracebacks).

        --output=OUTPUT
                Base path for the output files.

        -h, --help
                Show this help message and exit

Which are the required fields for --datafile ?

You will need a tabulated file with the following fields (i.e. header):

crt_76[K]       crt_72-76[CVMNK]        dhfr_51[N]      dhfr_59[C]      dhfr_108[S]     dhfr_164[I]     dhps_437[G]     dhps_540[K]     dhps_581[A]     dhps_613[A]     k13_class       k13_alleles     mdr1_dup_call   pm2_dup_call    gch1_dup_call 

There is no need to include any metadata such as location given that phenotyper only uses genetic data for predicting drug resistance status. For clarity, we recommend the first field to refer to the sample id, e.g. sample, so you can easily cross-reference results with other metadata you might have.

For an example of the complete file used you can check mygrc_metadata_nophenotypes.tsv. If you have your GRC data at hand, this notebook will create the required input file for you, before moving on to using phenotyper.

What can I do with the output?

An easy way to visualise your results is by using 2_prevalence_DR.ipynb, In that notebook you can compare your data with the Pf6+ dataset (more than 13,000 samples worldwide), which will help in contextualizing your results.

If you are happy with the default phenotyping rules we’ve curated from the literature, you can follow with the procedure explained below. Otherwise, please check the Further technical details section for information on how the rules can be modified.

Setup

First, you will have to download phenotyper if you haven’t already.

!git clone https://github.com/malariagen/phenotyper.git

Running the following will allow you to run R and install the dependencies that phenotyper requires.

#required to use R inside the Jupyter Notebook
%load_ext rpy2.ipython

Then we need to install the R modules required by phenotyper.

%%R 

install.packages('optparse', quiet = TRUE)
install.packages('stringr', quiet = TRUE)
install.packages('stringi', quiet = TRUE)
install.packages('jsonlite', quiet = TRUE)

If you are running this on Google Colab

If you are running the notebooks on Colab, then run the cell below to clone the Pf6+ repo:

!git clone https://github.com/malariagen/Pf6plus.git 
!cp -r /content/Pf6plus/pf6plus_documentation/notebooks/data_analysis .

If you are running this locally

There are some steps you need to follow to run the notebooks locally. If you haven’t already, please follow these instructions.

Running phenotyper on your own data

You have two main options:

  1. Run the tool on your local machine, or

  2. Run the tool on Google Colab, in which case you can either (a) upload the GRC from your local machine, or (b) read it from Google Drive

Note: We have also included an example GRC in the repository so you can run through this example even if you don’t have a GRC. In this case, run the cell below and go straight to the section “Run phenotyper”.

%env PATH_TO_GRC=./phenotyper/ext/fake_grc/ghana_grc.tsv
env: PATH_TO_GRC=./phenotyper/ext/fake_grc/ghana_grc.tsv

Option 1 - Running locally

You need to specify where to find the GRC you would like to process. To do so, set the path to the GRC file here:

%env PATH_TO_GRC=#Path to GRC 

Option 2 - Running on Google Colab

Uploading the GRC from your local machine

You can upload a GRC file from your own machine to run phenotyper. Run the cell below and select the file from your local machine (it will open a dialog box).

from google.colab import files
uploaded = files.upload()

Once the file has been uploaded you should see the path: copy it into the cell below.

%env PATH_TO_GRC=#Enter the path here

Read the GRC from Google Drive

To use a file that is stored on your google drive, first connect to your personal Google Drive by running the cell below and following the instructions that are printed as output.

from google.colab import drive
drive.mount('/content/drive')

Copy the GRC data to google colab by running the following, replacing <Path to GRC on your drive> with the path where your file is stored on your google drive.

!cp -r <Path to GRC on your drive> .

Finally, run the following to set PATH_TO_GRC to where your GRC data is stored on google colab.

%env PATH_TO_GRC=/content/drive/<GRC file name>

Run phenotyper

Below, we set the paths to the phenotyper.r script, the example GRC, the rules file, and the output file containing the phenotypes. PATH_TO_PHENOTYPER, PATH_TO_GRC, and PATH_TO_RULES are set to the files cloned from git above (you may need to change the paths if you downloaded the tool somewhere else). Please set PATH_TO_OUTPUT to the path of the file you would like phenotyper to use as output.

If you just want to test the tool, use the examples below, no need to change any of the paths.

#SET VARIABLES FOR RUNNING PHENOTYPER HERE 
%env PATH_TO_PHENOTYPER=./phenotyper/phenotyper.r 
%env PATH_TO_RULES=./phenotyper/ext/rules/test_rules.json
%env PATH_TO_OUTPUT=./phenotypes.out

--verbose returns an execution log; which is useful to document and track how the samples have been classified

! $PATH_TO_PHENOTYPER \
  --datafile $PATH_TO_GRC  \
  --rulesfile $PATH_TO_RULES \
  --ruleout phenotype \
  --output $PATH_TO_OUTPUT \
Parsing rules file: ./phenotyper/ext/rules/test_rules.json
Loading data file: ./phenotyper/ext/fake_grc/ghana_grc.tsv
Applying rules, this may take a while...
5% done...
Elapsed time: 0.085 secs
Estimated remaining time: 1.35 secs
11% done...
Elapsed time: 0.123 secs
Estimated remaining time: 0.91 secs
17% done...
Elapsed time: 0.155 secs
Estimated remaining time: 0.72 secs
23% done...
Elapsed time: 0.184 secs
Estimated remaining time: 0.59 secs
29% done...
Elapsed time: 0.212 secs
Estimated remaining time: 0.5 secs
35% done...
Elapsed time: 0.243 secs
Estimated remaining time: 0.44 secs
41% done...
Elapsed time: 0.271 secs
Estimated remaining time: 0.38 secs
47% done...
Elapsed time: 0.3 secs
Estimated remaining time: 0.33 secs
53% done...
Elapsed time: 0.328 secs
Estimated remaining time: 0.29 secs
59% done...
Elapsed time: 0.358 secs
Estimated remaining time: 0.24 secs
65% done...
Elapsed time: 0.386 secs
Estimated remaining time: 0.2 secs
71% done...
Elapsed time: 0.414 secs
Estimated remaining time: 0.17 secs
77% done...
Elapsed time: 0.44 secs
Estimated remaining time: 0.13 secs
83% done...
Elapsed time: 0.473 secs
Estimated remaining time: 0.1 secs
89% done...
Elapsed time: 0.506 secs
Estimated remaining time: 0.06 secs
95% done...
Elapsed time: 0.538 secs
Estimated remaining time: 0.03 secs
Writing output at ./phenotypes.out

Explore the detailed output

# import packages to visualise results
import pandas as pd
from data_analysis.plot_dr_prevalence import plot_phenotype_bar_chart_compared_to_pf6plus
from data_analysis.map_samples import map_samples
import bokeh.io
import os 
# import results from Phenotyper run above
phenotyper_results = os.environ.get('PATH_TO_OUTPUT')
results = pd.read_csv(phenotyper_results, sep='\t', index_col=0, low_memory=False).reset_index()

# import pf6+ data 
pf6plus_metadata = 'https://pf6plus.cog.sanger.ac.uk/pf6plus_metadata.tsv'
pf6plus = pd.read_csv(pf6plus_metadata, sep='\t', index_col=0, low_memory=False)

# import GRC 
grc_metadata = os.environ.get('PATH_TO_GRC')
grc = pd.read_csv(grc_metadata, sep='\t', index_col=0, low_memory=False)
bokeh.io.output_notebook()

Let’s map the phenotypes we have found back to the sample metadata

results['SampleId'] = grc.index
results = results.reset_index().set_index('SampleId')
results_with_geo = pd.concat([results, pf6plus], axis=1, join="inner")

The maps below show how including the Pf6+ data can increase the number of samples and the coverage that you have for a country. Click on the icons to find out more.

map_samples(results_with_geo, zoom_to_start=4)
Make this Notebook Trusted to load map: File -> Trust Notebook
map_samples(pf6plus[pf6plus.Country == 'Ghana'].copy(),zoom_to_start=4)
Make this Notebook Trusted to load map: File -> Trust Notebook

The plots below compare the phenotypes found for the GRC data, which contains samples from Ghana, to the samples in the Pf6+ dataset that come from Ghana.

plot_phenotype_bar_chart_compared_to_pf6plus(results,pf6plus[pf6plus.Country == 'Ghana'].copy())

What’s next?

After obtaining phenotypes, you can go back to the 2_variants_prevalence.ipynb notebook to easily visualise your results and compare them with the Pf6+ dataset (over 13,500 samples with inferred phenotypes, collected across the world).


Further technical details

Rule example

Drugs are evaluated according to a set of rules specified in a JSON document. The only required fields for a rule are name and evaluation. Within evaluation one can access any of the fields/columns provided in the input data file using an R valid expression (see example below).

"Chloroquine":{
         "rules":[
            {
               "name":"Chloroquine-1",
               "change":"76-Het",
               "evaluation":"`crt_76[K]` %contains% ','",
               "interpretation":"Het",
               "phenotype":"Resistant/het",
               "analytics":"Resistant"
            },
            {
               "name":"Chloroquine-2",
               "change":"76-Missing",
               "evaluation":"`crt_76[K]` %==% '-'",
               "interpretation":"Missing",
               "phenotype":"Undetermined/missing",
               "analytics":"Undetermined"
            },
            {
               "name":"Chloroquine-3",
               "change":"K76",
               "evaluation":"`crt_76[K]` %==% 'K'",
               "interpretation":"WT",
               "phenotype":"Sensitive",
               "analytics":"Sensitive"
            },
            {
               "name":"Chloroquine-4",
               "change":"76T",
               "evaluation":"`crt_76[K]` %==% 'T'",
               "interpretation":"Mutant",
               "phenotype":"Resistant [1]",
               "analytics":"Resistant"
            },
            {
               "name":"Chloroquine-5",
               "change":"76-nonT",
               "evaluation":"T",
               "interpretation":"Unknown mutant",
               "phenotype":"Undetermined/unknown",
               "analytics":"Undetermined"
            }
         ]
      }

Technicalities

It is recommended to protect field names with backticks (`) in the JSON file as you’ll find weird characters there (for instance if they come from Excel files).

The software interprets the rules executing the R code given in evaluation with the help of some custom-made operators, like %==%, which means loose equality (symmetric any operation, as in any(a in b) or any(b in a)), to ease the evaluation. Please check there is no malicious code present when using rules files from other sources.

There are metarules that check the output status of a previously executed rule by referencing the name of the drug with @, for instance:

           {
               "name":"Dihydroartemisinin-piperaquine-3",
               "change":"Missing",
               "evaluation":"@Artemisinin@ %contains% 'Missing' || @Piperaquine@ %contains% 'Missing'",
               "interpretation":"Undetermined/Missing",
               "phenotype":"Missing",
               "analytics":"Undetermined"
            },

One can run the program in --verbose mode and save the log. The file will be huge (and the execution is very slow), but it can be useful for debugging.

###Important things to bear in mind

  1. For each drug the set of rules are executed sequentially.

  2. Order is important (top-down execution until finding one that evaluates to true).

  3. There should be always a default catch-up rule at the end (it will be executed if nothing else has been executed. A default rule just evaluates to True or T.

           {
               "name":"Chloroquine-5",
               "change":"76-nonT",
               "evaluation":"T",
               "interpretation":"Unknown mutant",
               "phenotype":"Undetermined/unknown",
               "analytics":"Undetermined"
            }
  1. The script will evaluate each rule according to the evaluation field (required).

  2. The output that will be obtained from a rule that evaluated to True must be specified using the --ruleout option. This allows for different outputs even when working with the same set of rules. Example: in our current phenotyping rules we made some arbitrary decisions (samples with het resistance markers are considered undetermined instead of resistant, you can imagine a different world where hets are resistant or dominant hets are resistant). However, one can check what difference these decisions make with the same set of rules by providing different output fields and indicating so when calling the program. For instance, using --ruleout het_phenotype instead of --ruleout phenotype (see example rule definition below)

         "rules":[
            {
               "name":"Chloroquine-1",
               "change":"76-Het",
               "evaluation":"`PfCRT:76` %==% 'K/T' || `PfCRT:76` %==% 'I/T'",
               "het_phenotype":"Resistant",
               "phenotype":"Undetermined",
               "analytics":"Resistant/Het"
            }, (...)