Inference of drug resistance status¶
This notebook describes how to use phenotyper
, a tool to predict drug resistance status from genetic data.
In brief, the tool applies a collection of rules, which have been curated from the literature, to determine if a set of genotypes confer resistance to a particular drug. Although we provide an up-to-date list of phenotyping rules (as a JSON file), users can create their own rules or adopt rules from a different source.
Check the full documentation here.
When to use phenotyper
?¶
When your Genetic Report Card (GRC) doesn’t have information regarding resistant phenotypes or you would like to update previous inferred phenotypes with new rules. Note that any file with information on genotypes (as described below) can be used with phenotyper
to infer drug resistance status.\n”
How to use?¶
Usage: ./phenotyper.r [options]
Evaluates a set of phenotyping rules on all the samples of the given data file.
Options:
--datafile=DATAFILE
Path to the data file, a tabular file with the required fields (first column reserverd for sample names).
--rulesfile=RULESFILE
Path to the JSON file that encodes the rules to be used for evaluating each drug.
--ruleout=RULEOUT
The field to be extracted from a rule when evaluated as true. Helps with debugging.
--samplefield=SAMPLEFIELD
The field on the data file that designates the sample id (optional).
--verbose
A flag to obtain verbose detailed output in the standard output (can generate very big logs, for tracebacks).
--output=OUTPUT
Base path for the output files.
-h, --help
Show this help message and exit
Which are the required fields for --datafile
?¶
You will need a tabulated file with the following fields (i.e. header):
crt_76[K] crt_72-76[CVMNK] dhfr_51[N] dhfr_59[C] dhfr_108[S] dhfr_164[I] dhps_437[G] dhps_540[K] dhps_581[A] dhps_613[A] k13_class k13_alleles mdr1_dup_call pm2_dup_call gch1_dup_call
There is no need to include any metadata such as location given that phenotyper
only uses genetic data for predicting drug resistance status. For clarity, we recommend the first field to refer to the sample id, e.g. sample
, so you can easily cross-reference results with other metadata you might have.
For an example of the complete file used you can check mygrc_metadata_nophenotypes.tsv
. If you have your GRC data at hand, this notebook will create the required input file for you, before moving on to using phenotyper
.
What can I do with the output?¶
An easy way to visualise your results is by using 2_prevalence_DR.ipynb
, In that notebook you can compare your data with the Pf6+ dataset (more than 13,000 samples worldwide), which will help in contextualizing your results.
If you are happy with the default phenotyping rules we’ve curated from the literature, you can follow with the procedure explained below. Otherwise, please check the Further technical details section for information on how the rules can be modified.
Setup¶
First, you will have to download phenotyper
if you haven’t already.
!git clone https://github.com/malariagen/phenotyper.git
Running the following will allow you to run R
and install the dependencies that phenotyper
requires.
#required to use R inside the Jupyter Notebook
%load_ext rpy2.ipython
Then we need to install the R
modules required by phenotyper
.
%%R
install.packages('optparse', quiet = TRUE)
install.packages('stringr', quiet = TRUE)
install.packages('stringi', quiet = TRUE)
install.packages('jsonlite', quiet = TRUE)
If you are running this on Google Colab¶
If you are running the notebooks on Colab, then run the cell below to clone the Pf6+ repo:
!git clone https://github.com/malariagen/Pf6plus.git
!cp -r /content/Pf6plus/pf6plus_documentation/notebooks/data_analysis .
If you are running this locally¶
There are some steps you need to follow to run the notebooks locally. If you haven’t already, please follow these instructions.
Running phenotyper
on your own data¶
You have two main options:
Run the tool on your local machine, or
Run the tool on Google Colab, in which case you can either (a) upload the GRC from your local machine, or (b) read it from Google Drive
Note: We have also included an example GRC in the repository so you can run through this example even if you don’t have a GRC. In this case, run the cell below and go straight to the section “Run phenotyper
”.
%env PATH_TO_GRC=./phenotyper/ext/fake_grc/ghana_grc.tsv
env: PATH_TO_GRC=./phenotyper/ext/fake_grc/ghana_grc.tsv
Option 1 - Running locally¶
You need to specify where to find the GRC you would like to process. To do so, set the path to the GRC file here:
%env PATH_TO_GRC=#Path to GRC
Option 2 - Running on Google Colab¶
Uploading the GRC from your local machine¶
You can upload a GRC file from your own machine to run phenotyper
. Run the cell below and select the file from your local machine (it will open a dialog box).
from google.colab import files
uploaded = files.upload()
Once the file has been uploaded you should see the path: copy it into the cell below.
%env PATH_TO_GRC=#Enter the path here
Read the GRC from Google Drive¶
To use a file that is stored on your google drive, first connect to your personal Google Drive by running the cell below and following the instructions that are printed as output.
from google.colab import drive
drive.mount('/content/drive')
Copy the GRC data to google colab by running the following, replacing <Path to GRC on your drive>
with the path where your file is stored on your google drive.
!cp -r <Path to GRC on your drive> .
Finally, run the following to set PATH_TO_GRC
to where your GRC data is stored on google colab.
%env PATH_TO_GRC=/content/drive/<GRC file name>
Run phenotyper
¶
Below, we set the paths to the phenotyper.r
script, the example GRC, the rules file, and the output file containing the phenotypes. PATH_TO_PHENOTYPER
, PATH_TO_GRC
, and PATH_TO_RULES
are set to the files cloned from git above (you may need to change the paths if you downloaded the tool somewhere else). Please set PATH_TO_OUTPUT
to the path of the file you would like phenotyper
to use as output.
If you just want to test the tool, use the examples below, no need to change any of the paths.
#SET VARIABLES FOR RUNNING PHENOTYPER HERE
%env PATH_TO_PHENOTYPER=./phenotyper/phenotyper.r
%env PATH_TO_RULES=./phenotyper/ext/rules/test_rules.json
%env PATH_TO_OUTPUT=./phenotypes.out
--verbose
returns an execution log; which is useful to document and track how the samples have been classified
! $PATH_TO_PHENOTYPER \
--datafile $PATH_TO_GRC \
--rulesfile $PATH_TO_RULES \
--ruleout phenotype \
--output $PATH_TO_OUTPUT \
Parsing rules file: ./phenotyper/ext/rules/test_rules.json
Loading data file: ./phenotyper/ext/fake_grc/ghana_grc.tsv
Applying rules, this may take a while...
5% done...
Elapsed time: 0.085 secs
Estimated remaining time: 1.35 secs
11% done...
Elapsed time: 0.123 secs
Estimated remaining time: 0.91 secs
17% done...
Elapsed time: 0.155 secs
Estimated remaining time: 0.72 secs
23% done...
Elapsed time: 0.184 secs
Estimated remaining time: 0.59 secs
29% done...
Elapsed time: 0.212 secs
Estimated remaining time: 0.5 secs
35% done...
Elapsed time: 0.243 secs
Estimated remaining time: 0.44 secs
41% done...
Elapsed time: 0.271 secs
Estimated remaining time: 0.38 secs
47% done...
Elapsed time: 0.3 secs
Estimated remaining time: 0.33 secs
53% done...
Elapsed time: 0.328 secs
Estimated remaining time: 0.29 secs
59% done...
Elapsed time: 0.358 secs
Estimated remaining time: 0.24 secs
65% done...
Elapsed time: 0.386 secs
Estimated remaining time: 0.2 secs
71% done...
Elapsed time: 0.414 secs
Estimated remaining time: 0.17 secs
77% done...
Elapsed time: 0.44 secs
Estimated remaining time: 0.13 secs
83% done...
Elapsed time: 0.473 secs
Estimated remaining time: 0.1 secs
89% done...
Elapsed time: 0.506 secs
Estimated remaining time: 0.06 secs
95% done...
Elapsed time: 0.538 secs
Estimated remaining time: 0.03 secs
Writing output at ./phenotypes.out
Explore the detailed output¶
# import packages to visualise results
import pandas as pd
from data_analysis.plot_dr_prevalence import plot_phenotype_bar_chart_compared_to_pf6plus
from data_analysis.map_samples import map_samples
import bokeh.io
import os
# import results from Phenotyper run above
phenotyper_results = os.environ.get('PATH_TO_OUTPUT')
results = pd.read_csv(phenotyper_results, sep='\t', index_col=0, low_memory=False).reset_index()
# import pf6+ data
pf6plus_metadata = 'https://pf6plus.cog.sanger.ac.uk/pf6plus_metadata.tsv'
pf6plus = pd.read_csv(pf6plus_metadata, sep='\t', index_col=0, low_memory=False)
# import GRC
grc_metadata = os.environ.get('PATH_TO_GRC')
grc = pd.read_csv(grc_metadata, sep='\t', index_col=0, low_memory=False)
bokeh.io.output_notebook()
Let’s map the phenotypes we have found back to the sample metadata
results['SampleId'] = grc.index
results = results.reset_index().set_index('SampleId')
results_with_geo = pd.concat([results, pf6plus], axis=1, join="inner")
The maps below show how including the Pf6+ data can increase the number of samples and the coverage that you have for a country. Click on the icons to find out more.
map_samples(results_with_geo, zoom_to_start=4)
map_samples(pf6plus[pf6plus.Country == 'Ghana'].copy(),zoom_to_start=4)
The plots below compare the phenotypes found for the GRC data, which contains samples from Ghana, to the samples in the Pf6+ dataset that come from Ghana.
plot_phenotype_bar_chart_compared_to_pf6plus(results,pf6plus[pf6plus.Country == 'Ghana'].copy())
What’s next?¶
After obtaining phenotypes, you can go back to the 2_variants_prevalence.ipynb
notebook to easily visualise your results and compare them with the Pf6+ dataset (over 13,500 samples with inferred phenotypes, collected across the world).
Further technical details¶
Rule example¶
Drugs are evaluated according to a set of rules specified in a JSON document. The only required fields for a rule are name and evaluation. Within evaluation one can access any of the fields/columns provided in the input data file using an R
valid expression (see example below).
"Chloroquine":{
"rules":[
{
"name":"Chloroquine-1",
"change":"76-Het",
"evaluation":"`crt_76[K]` %contains% ','",
"interpretation":"Het",
"phenotype":"Resistant/het",
"analytics":"Resistant"
},
{
"name":"Chloroquine-2",
"change":"76-Missing",
"evaluation":"`crt_76[K]` %==% '-'",
"interpretation":"Missing",
"phenotype":"Undetermined/missing",
"analytics":"Undetermined"
},
{
"name":"Chloroquine-3",
"change":"K76",
"evaluation":"`crt_76[K]` %==% 'K'",
"interpretation":"WT",
"phenotype":"Sensitive",
"analytics":"Sensitive"
},
{
"name":"Chloroquine-4",
"change":"76T",
"evaluation":"`crt_76[K]` %==% 'T'",
"interpretation":"Mutant",
"phenotype":"Resistant [1]",
"analytics":"Resistant"
},
{
"name":"Chloroquine-5",
"change":"76-nonT",
"evaluation":"T",
"interpretation":"Unknown mutant",
"phenotype":"Undetermined/unknown",
"analytics":"Undetermined"
}
]
}
Technicalities¶
It is recommended to protect field names with backticks (`) in the JSON file as you’ll find weird characters there (for instance if they come from Excel files).
The software interprets the rules executing the R
code given in evaluation with the help of some custom-made operators, like %==%
, which means loose equality (symmetric any
operation, as in any(a in b) or any(b in a)
), to ease the evaluation. Please check there is no malicious code present when using rules files from other sources.
There are metarules that check the output status of a previously executed rule by referencing the name of the drug with @
, for instance:
{
"name":"Dihydroartemisinin-piperaquine-3",
"change":"Missing",
"evaluation":"@Artemisinin@ %contains% 'Missing' || @Piperaquine@ %contains% 'Missing'",
"interpretation":"Undetermined/Missing",
"phenotype":"Missing",
"analytics":"Undetermined"
},
One can run the program in --verbose
mode and save the log. The file will be huge (and the execution is very slow), but it can be useful for debugging.
###Important things to bear in mind
For each drug the set of rules are executed sequentially.
Order is important (top-down execution until finding one that evaluates to true).
There should be always a default catch-up rule at the end (it will be executed if nothing else has been executed. A default rule just evaluates to
True
orT
.
{
"name":"Chloroquine-5",
"change":"76-nonT",
"evaluation":"T",
"interpretation":"Unknown mutant",
"phenotype":"Undetermined/unknown",
"analytics":"Undetermined"
}
The script will evaluate each rule according to the
evaluation
field (required).The output that will be obtained from a rule that evaluated to
True
must be specified using the--ruleout
option. This allows for different outputs even when working with the same set of rules. Example: in our current phenotyping rules we made some arbitrary decisions (samples with het resistance markers are considered undetermined instead of resistant, you can imagine a different world where hets are resistant or dominant hets are resistant). However, one can check what difference these decisions make with the same set of rules by providing different output fields and indicating so when calling the program. For instance, using--ruleout het_phenotype
instead of--ruleout phenotype
(see example rule definition below)
"rules":[
{
"name":"Chloroquine-1",
"change":"76-Het",
"evaluation":"`PfCRT:76` %==% 'K/T' || `PfCRT:76` %==% 'I/T'",
"het_phenotype":"Resistant",
"phenotype":"Undetermined",
"analytics":"Resistant/Het"
}, (...)