Drug Resistance Frequencies Across Pf7¶

Introduction¶

This notebook recreates main Table 2, Supplementary Table 4 from the Pf7 paper.

These tables summarise the number of samples associated with malarial drug resistance (DR) in different populations.

This notebook takes 1 minute to run.

Setup¶

Install and import the malariagen Python package:

!pip install -q --no-warn-conflicts malariagen_data
import malariagen_data

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 148.5/148.5 kB 3.0 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 11.5 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52.8/52.8 MB 12.5 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.2/10.2 MB 52.3 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.0/4.0 MB 45.8 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 302.5/302.5 kB 21.3 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 138.7/138.7 kB 13.6 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.9/20.9 MB 14.4 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.1/8.1 MB 24.8 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 206.9/206.9 kB 12.7 MB/s eta 0:00:00
?25h  Preparing metadata (setup.py) ... ?25l?25hdone
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.7/7.7 MB 86.1 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 68.3 MB/s eta 0:00:00
?25h  Building wheel for asciitree (setup.py) ... ?25l?25hdone

Import required python libraries that are installed at colab by default.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import collections
import re
import scipy

Access Pf7 Data¶

We use the malariagen data package to load the release data.

release_data = malariagen_data.Pf7()
sample_metadata = release_data.sample_metadata()
# take a glance at the metadata dataframe
sample_metadata.head(3)

	Sample	Study	Country	Admin level 1	Country latitude	Country longitude	Admin level 1 latitude	Admin level 1 longitude	Year	ENA	All samples same case	Population	% callable	QC pass	Exclusion reason	Sample type	Sample was in Pf6
0	FP0008-C	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR1081237	FP0008-C	AF-W	82.16	True	Analysis_set	gDNA	True
1	FP0009-C	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR1081238	FP0009-C	AF-W	88.85	True	Analysis_set	gDNA	True
2	FP0010-CW	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR2889621	FP0010-CW	AF-W	86.46	True	Analysis_set	sWGA	False

We will only examine the QC pass samples in the analysis of these notebooks.

# Retain QC pass samples
qc_sample_metadata = sample_metadata.loc[sample_metadata['QC pass']]

Access DR Classification and Genotype Data¶

We can access inferred resistance status classifications of QC-pass Pf7 samples from the MalariaGEN web resources.

This dataset includes the samples that are resistant to 10 drugs or combinations of drugs and to rapid diagnostic tests (RDT) detection: chloroquine, pyrimethamine, sulfadoxine, mefloquine, artemisinin, piperaquine, sulfadoxine- pyrimethamine for treatment of uncomplicated malaria, sulfadoxine- pyrimethamine for intermittent preventive treatment in pregnancy, artesunate-mefloquine, dihydroartemisinin-piperaquine, hrp2 and hrp3 gene deletions.

# Read the data from the MalariaGEN.net resources website
resistance_classification_fn = pd.read_csv('https://www.malariagen.net/wp-content/uploads/2023/11/Pf7_inferred_resistance_status_classification.txt', sep='\t')

# Print the first rows
resistance_classification_fn.head()

	Sample	Chloroquine	Pyrimethamine	Sulfadoxine	Mefloquine	Artemisinin	Piperaquine	SP (uncomplicated)	SP (IPTp)	AS-MQ	DHA-PPQ	HRP2	HRP3	HRP2 and HRP3
0	FP0008-C	Undetermined	Undetermined	Undetermined	Sensitive	Sensitive	Sensitive	Sensitive	Sensitive	Sensitive	Sensitive	nodel	nodel	nodel
1	FP0009-C	Resistant	Resistant	Sensitive	Sensitive	Sensitive	Sensitive	Resistant	Sensitive	Sensitive	Sensitive	nodel	nodel	nodel
2	FP0010-CW	Undetermined	Resistant	Resistant	Sensitive	Sensitive	Sensitive	Resistant	Sensitive	Sensitive	Sensitive	uncallable	uncallable	uncallable
3	FP0011-CW	Undetermined	Resistant	Undetermined	Sensitive	Undetermined	Sensitive	Resistant	Sensitive	Sensitive	Sensitive	uncallable	uncallable	uncallable
4	FP0012-CW	Resistant	Resistant	Sensitive	Sensitive	Sensitive	Sensitive	Resistant	Sensitive	Sensitive	Sensitive	nodel	nodel	nodel

We will also use the genotypes utilised in drug resistance status classification containing amino acid and copy number genotypes at six loci: crt, dhfr, dhps, mdr1, kelch13, plasmepsin 2-3. This dataset is also available witin the MalariaGEN web resources.

# Read the data
drm_calls_fn = pd.read_csv('https://www.malariagen.net/wp-content/uploads/2023/11/Pf7_drug_resistance_marker_genotypes.txt', sep='\t', )

# Rename the first column as 'Sample'
drm_calls_fn = drm_calls_fn.rename(columns={drm_calls_fn.columns[0]: 'Sample'})

# Print the first rows
drm_calls_fn.head()

	Sample	crt_76[K]	crt_72-76[CVMNK]	dhfr_51[N]	dhfr_59[C]	dhfr_108[S]	dhfr_164[I]	dhps_437[G]	dhps_540[K]	dhps_581[A]	dhps_613[A]	kelch13_349-726_ns_changes	mdr1_breakpoint	pm2_breakpoint
0	FP0008-C	T,K	CVIET,CVMNK	N	C	S,N	I	A,G	K	A	A	NaN	NaN	NaN
1	FP0009-C	T	CVIET	I	R	N	I	A	K	A	A	NaN	NaN	NaN
2	FP0010-CW	T,K	CVIET,CVMNK	I	R	N	I	G	K	A	A	NaN	NaN	NaN
3	FP0011-CW	T,K	CVIET,CVMNK	I	R	N	I	G,A	K	A	A	-	NaN	NaN
4	FP0012-CW	T	CVIET	I	R	N	I	A	K	A	A	NaN	NaN	NaN

Load all data into single DataFrame¶

We can merge these 3 datasets (metadata, drug resistance genotype and classification) to facilitate streamlined analysis.

# Merge the dataframes on "Sample" column
df_all_sample_metadata = pd.merge(pd.merge(qc_sample_metadata,drm_calls_fn,on='Sample'),resistance_classification_fn,on='Sample')

# Print first 3 rows
df_all_sample_metadata.head(3)

	Sample	Study	Country	Admin level 1	Country latitude	Country longitude	Admin level 1 latitude	Admin level 1 longitude	Year	ENA	...	Mefloquine	Artemisinin	Piperaquine	SP (uncomplicated)	SP (IPTp)	AS-MQ	DHA-PPQ	HRP2	HRP3	HRP2 and HRP3
0	FP0008-C	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR1081237	...	Sensitive	Sensitive	Sensitive	Sensitive	Sensitive	Sensitive	Sensitive	nodel	nodel	nodel
1	FP0009-C	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR1081238	...	Sensitive	Sensitive	Sensitive	Resistant	Sensitive	Sensitive	Sensitive	nodel	nodel	nodel
2	FP0010-CW	1147-PF-MR-CONWAY	Mauritania	Hodh el Gharbi	20.265149	-10.337093	16.565426	-9.832345	2014.0	ERR2889621	...	Sensitive	Sensitive	Sensitive	Resistant	Sensitive	Sensitive	Sensitive	uncallable	uncallable	uncallable

3 rows × 45 columns

Sulfadoxine-Pyrimethamine (SP) is used for the treatment of uncomplicated cases and Intermittent Preventive Treatment in Pregnancy (IPTp). To make these two status clear in the dataset, we rename the columns.

# Define a dictionary of old&new column names
column_name_changes = collections.OrderedDict()
column_name_changes['SP (uncomplicated)'] = 'SP (treatment)'
column_name_changes['SP-IPTp'] = 'SP (IPTp)'

# Rename the columns using the dictionary
df_all_sample_metadata.rename(columns=column_name_changes, inplace=True)

Table setups¶

We proceed by defining a dictionary where each key represents a drug along with its associated gene, mutation, column name in the dataset, and sensitivity/resistance indicators.

For details on the heuristics employed to map genetic markers to resistance status classification, please refer to here.

drm_dict = collections.OrderedDict()

drm_dict['Chloroquine'] = {
    'gene': 'crt',
    'mutation': '76T',
    'column_name': 'crt_76[K]',
    'sensitive': 'K',
    'resistant': 'T',
}

drm_dict['Pyrimethamine'] = {
    'gene': 'dhfr',
    'mutation': '108N',
    'column_name': 'dhfr_108[S]',
    'sensitive': 'S',
    'resistant': 'N',
}

drm_dict['Sulfadoxine'] = {
    'gene': 'dhps',
    'mutation': '437G',
    'column_name': 'dhps_437[G]',
    'sensitive': 'A',
    'resistant': 'G',
}

drm_dict['Mefloquine'] = {
    'gene': 'mdr1',
    'mutation': '2+ copies',
    'column_name': 'dup_mdr1',
    'sensitive': 0,
    'resistant': 1,
}

drm_dict['Artemisinin'] = {
    'gene': 'kelch13',
    'mutation': 'WHO list',
    'column_name': None,
}

drm_dict['Piperaquine'] = {
    'gene': 'plasmepsin 2-3',
    'mutation': '2+ copies',
    'column_name': 'dup_pm2',
    'sensitive': 0,
    'resistant': 1,
}

drm_dict['SP (treatment)'] = {
    'gene': 'dhfr',
    'mutation': 'triple mutant',
    'column_name': None
}

drm_dict['SP (IPTp)'] = {
    'gene': 'dhfr and dhps',
    'mutation': 'sextuple mutant',
    'column_name': None
}

drm_dict['AS-MQ'] = {
    'gene': 'kelch13 and mdr1',
    'mutation': '',
    'column_name': None
}

drm_dict['DHA-PPQ'] = {
    'gene': 'kelch13 and plasmepsin 2-3',
    'mutation': '',
    'column_name': None
}

To elaborate on population names in the final table, we define a dictionary listing ten populations in Pf7 with the following abbreviations:

populations = collections.OrderedDict()
populations['SA']       = "South America"
populations['AF-W']     = "Africa - West"
populations['AF-C']     = "Africa - Central"
populations['AF-NE']    = "Africa - Northeast"
populations['AF-E']     = "Africa - East"
populations['AS-S-E']  = "Asia - South - East"
populations['AS-S-FE']  = "Asia - South - Far East"
populations['AS-SE-W'] = "Asia - Southeast - West"
populations['AS-SE-E'] = "Asia - Southeast - East"
populations['OC-NG']    = "Oceania - New Guinea"

Summary of drug resistance sample sizes across populations¶

We want to create a table summarizing the number of samples by population.

To do this, we will write a function that counts number of samples that are resistant and sensitive for a given drug.

def n_agg(x):
    """
    Aggregate function to calculate the count of non-undetermined samples for each drug in a given population.

    Parameters:
    - x: DataFrame

    Returns:
    - pd.Series: Counts of sensitive samples that are resistant to each drug.
    """

    # Initialize an ordered dictionary to store drug names and their corresponding counts
    names = collections.OrderedDict()

    # Iterate through each drug in drm_dict
    for drug in drm_dict:
        # Count the number of non-undetermined samples for the current drug
        n = np.count_nonzero((x[drug] != 'Undetermined'))

        # Store the drug name and its count in the dictionary
        names[drug] = n

    # Return the result as a pandas Series
    return pd.Series(names)

As an addition to table, we would like to include minimum and maximum drug resistance sample size in the population names.

# Calculate counts of non-undetermined samples for each drug in each population
df_drm_n_table = (
    df_all_sample_metadata
    .groupby('Population')
    .apply(n_agg)
    .rename_axis(None)
    .transpose()
    .loc[:, populations.keys()]
    .reset_index()
)

# Calculate the minimum and maximum sample sizes across populations
min_n = df_drm_n_table.min()
max_n = df_drm_n_table.max()

# Customize the index to include gene and mutation information for each drug
df_drm_n_table.index = ["%s %s" % (drm_dict[drug]['gene'], drm_dict[drug]['mutation']) for drug in drm_dict]
df_drm_n_table.index.names = ['Marker']

# Rename the columns to indicate their association with drug resistance and sample counts
df_drm_n_table.rename(columns={'index': 'Associated with resistance to'}, inplace=True)

# Update column names to include population names and corresponding sample size ranges
for population in populations:
    new_column_name = f'{populations[population]} (n={min_n[population]}-{max_n[population]})'
    df_drm_n_table.rename(columns={population: new_column_name}, inplace=True)

# Display the final drug resistance count table
df_drm_n_table

	Associated with resistance to	South America (n=154-158)	Africa - West (n=5234-6233)	Africa - Central (n=397-520)	Africa - Northeast (n=120-170)	Africa - East (n=1373-1532)	Asia - South - East (n=164-233)	Asia - South - Far East (n=1212-1369)	Asia - Southeast - West (n=1657-1876)	Asia - Southeast - East (n=2059-3684)	Oceania - New Guinea (n=298-341)
Marker
crt 76T	Chloroquine	155	5660	397	157	1388	217	1326	1871	3665	333
dhfr 108N	Pyrimethamine	154	5589	517	170	1476	201	1369	1876	3684	333
dhps 437G	Sulfadoxine	154	5529	501	162	1424	220	1291	1875	3609	333
mdr1 2+ copies	Mefloquine	158	5515	478	123	1509	164	1268	1782	3461	314
kelch13 WHO list	Artemisinin	158	5595	505	144	1513	189	1341	1768	3475	305
plasmepsin 2-3 2+ copies	Piperaquine	158	5464	519	120	1512	172	1272	1790	3428	298
dhfr triple mutant	SP (treatment)	158	5234	440	167	1373	230	1212	1833	3619	339
dhfr and dhps sextuple mutant	SP (IPTp)	158	6233	510	170	1446	233	1217	1657	2059	341
kelch13 and mdr1	AS-MQ	158	5915	519	150	1532	203	1354	1798	3495	324
kelch13 and plasmepsin 2-3	DHA-PPQ	158	5823	520	148	1526	199	1355	1829	3442	316

Supplementary Table. Numbers of samples used to determine proportions in the main table.

To save the table:

df_drm_n_table.to_excel("DRM_table_sample_numbers.xlsx")

Summary of drug resistance frequencies across populations¶

Now, we would like to calculate drug resistance proportions in each population.

We can easily adjust the function to count number of samples that are resistant and sensitive for a given drug, and calculate the proportion.

def proportion_agg(x):
    """
    Aggregate function to calculate the proportion of resistance for each drug in a given population.

    Parameters:
    - x: DataFrame

    Returns:
    - pd.Series: Proportions of resistance for each drug.
    """

    # Initialize an ordered dictionary to store drug names and their corresponding proportions
    names = collections.OrderedDict()

    # Iterate through each drug in drm_dict
    for drug in drm_dict:
        # Count the number of non-undetermined samples for the current drug
        n = np.count_nonzero((x[drug] != 'Undetermined'))

        # Check if there are no non-undetermined samples for the drug
        if n == 0:
            proportion = np.nan  # Set proportion to NaN to avoid division by zero
        else:
            # Calculate the proportion of resistant samples for the drug
            proportion = round(np.count_nonzero(
                (x[drug] == 'Resistant')
            ) / np.count_nonzero(
                (x[drug] != 'Undetermined')
            )*100)

        # Store the drug name and its proportion in the dictionary
        names[drug] = proportion

    # Return the result as a pandas Series
    return pd.Series(names)

Let’s apply this function and create a table summarizing drug resistance proportions by population.

# Create a table summarizing drug resistance proportions by population

# Group the DataFrame by 'Population' and apply the 'proportion_agg' function to calculate resistance proportions
df_drm_table = (
    df_all_sample_metadata
    .groupby('Population')
    .apply(proportion_agg)
    .rename_axis(None)
    .transpose()
    .loc[:, populations.keys()]
    .reset_index()
)

# Customize the index to include gene and mutation information for each drug
df_drm_table.index = ["%s %s" % (drm_dict[drug]['gene'], drm_dict[drug]['mutation']) for drug in drm_dict]
df_drm_table.index.names = ['Marker']

# Rename the columns to indicate their association with drug resistance
df_drm_table.rename(columns={'index': 'Associated with resistance to'}, inplace=True)

# Update column names to include population names and corresponding sample size ranges
for population in populations:
    new_column_name = f'{populations[population]} (n={min_n[population]}-{max_n[population]})'
    df_drm_table.rename(columns={population: new_column_name}, inplace=True)

# Add '%' symbol to all values in the table
df_drm_table = df_drm_table.applymap(lambda x: f"{int(x)}%" if isinstance(x, (int, float)) else x)

# Display the final drug resistance table
df_drm_table

	Associated with resistance to	South America (n=154-158)	Africa - West (n=5234-6233)	Africa - Central (n=397-520)	Africa - Northeast (n=120-170)	Africa - East (n=1373-1532)	Asia - South - East (n=164-233)	Asia - South - Far East (n=1212-1369)	Asia - Southeast - West (n=1657-1876)	Asia - Southeast - East (n=2059-3684)	Oceania - New Guinea (n=298-341)
Marker
crt 76T	Chloroquine	100%	29%	61%	40%	24%	31%	94%	99%	95%	96%
dhfr 108N	Pyrimethamine	64%	87%	100%	98%	96%	64%	100%	100%	99%	99%
dhps 437G	Sulfadoxine	60%	78%	97%	82%	83%	8%	89%	100%	83%	69%
mdr1 2+ copies	Mefloquine	0%	0%	0%	0%	0%	0%	0%	29%	5%	1%
kelch13 WHO list	Artemisinin	0%	0%	0%	0%	0%	0%	0%	36%	58%	1%
plasmepsin 2-3 2+ copies	Piperaquine	0%	0%	0%	0%	0%	0%	0%	0%	37%	0%
dhfr triple mutant	SP (treatment)	0%	77%	85%	61%	80%	1%	46%	86%	88%	0%
dhfr and dhps sextuple mutant	SP (IPTp)	0%	0%	2%	2%	9%	0%	13%	79%	14%	0%
kelch13 and mdr1	AS-MQ	0%	0%	0%	0%	0%	0%	0%	10%	4%	0%
kelch13 and plasmepsin 2-3	DHA-PPQ	0%	0%	0%	0%	0%	0%	0%	0%	34%	0%

Table Legend. Frequency of different sets of polymorphisms associated with drug resistance in samples from different geographical regions. All samples were classified into different types of drug resistance based on published genetic markers, and represent best attempt based on the available data. Each type of resistance was considered to be either present, absent or unknown for a given sample. For each resistance type, the table reports: the genetic markers considered; the drug they are associated with; the proportion of samples in each major sub-population classified as resistant out of the samples where the type was not unknown. The number of samples classified as either resistant or not resistant varies for each type of resistance considered (e.g. due to different levels of genomic accessibility); numbers in brackets report the minimum and maximum number analysed while the exact numbers considered are reported in Supplementary Table 4. SP: sulfadoxine-pyrimethamine; treatment: SP used for the clinical treatment of uncomplicated malaria; IPTp: SP used for intermittent preventive treatment in pregnancy; AS-MQ: artesunate + mefloquine combination therapy; DHA-PPQ: dihydroartemisinin + piperaquine combination therapy. dhfr triple mutant refers to having all three of 51I, 59R and 108N in dhfr. dhfr and dhps sextuple mutant refers to having all five of 51I, 59R and 108N in dhfr and 437G and 540E in dhps, plus one of dhfr:164L, dhps:581G, dhps:613S or dhps:613T. Full details of the rules used to infer resistance status from genetic markers can be found on the resource page at https://www.malariagen.net/resource/34.

To save the table:

df_drm_table.to_excel("Main_DRM_table.xlsx")

Summary of drug resistance frequencies between time periods across populations¶

We are interested to compare the drug resistance between periods of 2001-2014 and 2015-2019.

To compare drug resistance frequencies between 2001-2014 and 2015-2019, a new column (Year group) has been added to categorize samples accordingly.

# Create a new column
# Catagorize samples collected in 2001-2014
df_all_sample_metadata.loc[( df_all_sample_metadata['Year'] >= 2001 ) & ( df_all_sample_metadata['Year'] <= 2014 ), 'Year group'] = "2001-2014"

# Catagorize samples collected in 2015-2019
df_all_sample_metadata.loc[( df_all_sample_metadata['Year'] >= 2015 ) & ( df_all_sample_metadata['Year'] <= 2019 ), 'Year group'] = "2015-2019"

# Convert 'Year' column into integer format
df_all_sample_metadata['Year'] = pd.to_numeric(df_all_sample_metadata['Year'], downcast = 'integer')

Next, we adjust the function to count samples within each group. As an addition, the resulting table will provide the proportions and counts for both year groups combined in each cell value.

def my_agg(x):
    """
    Aggregate function to calculate proportions of resistance and counts for each drug in two year groups.

    Parameters:
    - x: DataFrame

    Returns:
    - pd.Series: Series containing proportions and counts of resistance for each drug in two year groups.
    """

    # Initialize an ordered dictionary to store drug names and their corresponding proportions and counts
    names = collections.OrderedDict()

    # Iterate through each drug in drm_dict
    for drug in drm_dict:
        # Calculate counts and proportions for Year Group 1 (2001-2014)
        n_year_group_1 = np.count_nonzero((x[drug] != 'Undetermined') & (x['Year group'] == '2001-2014'))
        proportion_year_group_1 = (
            np.count_nonzero((x[drug] == 'Resistant') & (x['Year group'] == '2001-2014')) /
            np.count_nonzero((x[drug] != 'Undetermined') & (x['Year group'] == '2001-2014'))
        ) if n_year_group_1 != 0 else np.nan

        # Calculate counts and proportions for Year Group 2 (2015-2019)
        n_year_group_2 = np.count_nonzero((x[drug] != 'Undetermined') & (x['Year group'] == '2015-2019'))
        proportion_year_group_2 = (
            np.count_nonzero((x[drug] == 'Resistant') & (x['Year group'] == '2015-2019')) /
            np.count_nonzero((x[drug] != 'Undetermined') & (x['Year group'] == '2015-2019'))
        ) if n_year_group_2 != 0 else np.nan

        # Store drug name, proportions, and counts in the dictionary
        names[drug] = f"{proportion_year_group_1:.2f} (n={n_year_group_1})   {proportion_year_group_2:.2f} (n={n_year_group_2})"

    # Return the result as a pandas Series
    return pd.Series(names)

Let’s create the table in a similar way to before.

# Create a table summarizing proportions and counts of resistance for each drug in two year groups by population

df_drm_table_with_n = (
    df_all_sample_metadata
    .groupby('Population')
    .apply(my_agg)
    .rename_axis(None)
    .transpose()
    .loc[:, populations.keys()]
    .reset_index()
)

# Customize the index to include gene and mutation information for each drug
df_drm_table_with_n.index = ["%s %s" % (drm_dict[drug]['gene'], drm_dict[drug]['mutation']) for drug in drm_dict]
df_drm_table_with_n.index.names = ['Marker']

# Rename the columns to indicate their association with drug resistance and sample counts
df_drm_table_with_n.rename(columns={'index': 'Associated with resistance to'}, inplace=True)

# Update column names to include population names
for population in populations:
    new_column_name = populations[population]
    df_drm_table_with_n.rename(columns={population: new_column_name}, inplace=True)

# Display the final drug resistance table with sample counts
df_drm_table_with_n

	Associated with resistance to	South America	Africa - West	Africa - Central	Africa - Northeast	Africa - East	Asia - South - East	Asia - South - Far East	Asia - Southeast - West	Asia - Southeast - East	Oceania - New Guinea
Marker
crt 76T	Chloroquine	1.00 (n=44) 1.00 (n=111)	0.42 (n=2500) 0.19 (n=3008)	0.65 (n=292) 0.52 (n=105)	0.19 (n=78) 0.61 (n=79)	0.18 (n=1244) 0.05 (n=19)	nan (n=0) 0.31 (n=217)	0.93 (n=74) 0.94 (n=1252)	1.00 (n=1158) 0.99 (n=713)	0.94 (n=1427) 0.95 (n=2232)	0.99 (n=195) 0.91 (n=138)
dhfr 108N	Pyrimethamine	0.93 (n=44) 0.53 (n=110)	0.85 (n=2547) 0.92 (n=2902)	1.00 (n=383) 0.99 (n=134)	0.99 (n=86) 0.96 (n=84)	0.99 (n=1346) 1.00 (n=19)	nan (n=0) 0.64 (n=201)	1.00 (n=77) 1.00 (n=1292)	1.00 (n=1160) 1.00 (n=716)	0.99 (n=1438) 0.99 (n=2240)	0.99 (n=200) 0.98 (n=133)
dhps 437G	Sulfadoxine	0.35 (n=43) 0.69 (n=111)	0.75 (n=2513) 0.84 (n=2866)	0.97 (n=371) 0.96 (n=130)	0.99 (n=85) 0.64 (n=77)	0.92 (n=1265) 1.00 (n=19)	nan (n=0) 0.08 (n=220)	0.97 (n=63) 0.89 (n=1228)	1.00 (n=1160) 0.99 (n=715)	0.88 (n=1377) 0.81 (n=2226)	0.61 (n=197) 0.81 (n=136)
mdr1 2+ copies	Mefloquine	0.00 (n=44) 0.00 (n=114)	0.00 (n=2735) 0.00 (n=2644)	0.00 (n=349) 0.00 (n=129)	0.00 (n=86) 0.00 (n=37)	0.00 (n=1339) 0.00 (n=14)	nan (n=0) 0.00 (n=164)	0.00 (n=77) 0.00 (n=1191)	0.43 (n=1128) 0.06 (n=654)	0.12 (n=1334) 0.01 (n=2122)	0.01 (n=198) 0.02 (n=116)
kelch13 WHO list	Artemisinin	0.00 (n=44) 0.00 (n=114)	0.00 (n=2781) 0.00 (n=2668)	0.00 (n=374) 0.00 (n=131)	0.00 (n=84) 0.00 (n=60)	0.00 (n=1341) 0.00 (n=20)	nan (n=0) 0.00 (n=189)	0.00 (n=78) 0.00 (n=1263)	0.28 (n=1103) 0.49 (n=665)	0.48 (n=1290) 0.64 (n=2179)	0.00 (n=199) 0.02 (n=106)
plasmepsin 2-3 2+ copies	Piperaquine	0.00 (n=44) 0.00 (n=114)	0.00 (n=2769) 0.00 (n=2558)	0.00 (n=385) 0.00 (n=134)	0.00 (n=86) 0.00 (n=34)	0.00 (n=1337) 0.00 (n=19)	nan (n=0) 0.00 (n=172)	0.00 (n=78) 0.00 (n=1194)	0.00 (n=1139) 0.00 (n=651)	0.19 (n=1296) 0.49 (n=2128)	0.00 (n=201) 0.00 (n=97)
dhfr triple mutant	SP (treatment)	0.00 (n=44) 0.00 (n=114)	0.76 (n=2425) 0.83 (n=2657)	0.83 (n=319) 0.89 (n=121)	0.95 (n=82) 0.28 (n=85)	0.87 (n=1235) 1.00 (n=18)	nan (n=0) 0.01 (n=230)	0.43 (n=65) 0.46 (n=1147)	0.89 (n=1123) 0.81 (n=710)	0.91 (n=1390) 0.86 (n=2223)	0.00 (n=201) 0.00 (n=138)
dhfr and dhps sextuple mutant	SP (IPTp)	0.00 (n=44) 0.00 (n=114)	0.00 (n=2898) 0.00 (n=3183)	0.01 (n=379) 0.02 (n=131)	0.00 (n=84) 0.05 (n=86)	0.10 (n=1271) 0.00 (n=19)	nan (n=0) 0.00 (n=233)	0.19 (n=68) 0.13 (n=1149)	0.81 (n=981) 0.78 (n=676)	0.18 (n=1037) 0.09 (n=1016)	0.00 (n=201) 0.00 (n=140)
kelch13 and mdr1	AS-MQ	0.00 (n=44) 0.00 (n=114)	0.00 (n=2848) 0.00 (n=2919)	0.00 (n=385) 0.00 (n=134)	0.00 (n=86) 0.00 (n=64)	0.00 (n=1356) 0.00 (n=20)	nan (n=0) 0.00 (n=203)	0.00 (n=78) 0.00 (n=1276)	0.15 (n=1121) 0.03 (n=677)	0.09 (n=1338) 0.01 (n=2151)	0.00 (n=201) 0.00 (n=123)
kelch13 and plasmepsin 2-3	DHA-PPQ	0.00 (n=44) 0.00 (n=114)	0.00 (n=2839) 0.00 (n=2836)	0.00 (n=386) 0.00 (n=134)	0.00 (n=86) 0.00 (n=62)	0.00 (n=1350) 0.00 (n=20)	nan (n=0) 0.00 (n=199)	0.00 (n=78) 0.00 (n=1277)	0.00 (n=1150) 0.00 (n=679)	0.16 (n=1302) 0.45 (n=2134)	0.00 (n=201) 0.00 (n=115)

Save table as:

df_drm_table_with_n.to_excel("DRM_before_after_table.xlsx")

CRT 72-76 haplotypes table¶

The final table will feature crt mutations in codons 72-76 which were associated with a particular resistance profile, with the crt 76T allele a very reliable marker of chloroquine resistance.

Let’s look at the distribution of haplotypes with T and K alleles.

# Rename haplotype column for a better representation
df_all_sample_metadata['crt_72-76'] = df_all_sample_metadata['crt_72-76[CVMNK]']

# Group and count haplotypes
df_all_sample_metadata.groupby(['crt_76[K]', 'crt_72-76']).size()

crt_76[K]  crt_72-76   
-          -                 26
K          -                  3
           CVMNK           5784
           CVMNK,CVMKK        1
           CVMNK,YVMNK        1
K,P        !,CVIEP            1
K,T        CVIET,CVMNK        7
           CVMNK,CVIDT        9
           CVMNK,CVIDT*       1
           CVMNK,CVIET      410
           CVMNK,CVIET*       5
           CVMNK,CVMNT        1
           CVMNK,SVMNT*       4
T          -                  2
           CIIET,CVIET        1
           CVIDT            535
           CVIDT,!            1
           CVIDT,CVIET       33
           CVIDT,CVIET*      16
           CVIET           8252
           CVIET,!            1
           CVIET,CIIET        1
           CVIET,CVI*         2
           CVIET,CVIDT       33
           CVIET,CVIDT*      15
           CVIET,CVMNT        1
           CVIET,SVMNT*       1
           CVIET,YVIET        3
           CVMET            105
           CVMET,CVMNT        2
           CVMNT             33
           CVMNT,CVMET        2
           SVMNT            337
           SVMNT,CVMET*       1
           SVMNT,CVMNT        1
           YVIET              2
T,K        CVIDK,CVIET        1
           CVIDT,CVMNK        7
           CVIET,!            3
           CVIET,CVIDK        1
           CVIET,CVIEK        4
           CVIET,CVMNK      519
           CVIET,CVMNK*      17
           CVMNK,CVIET       12
           CVMNT,CVI*         1
           SVMNT,CVMNK*       4
T,Q*       CVIET,!*           1
dtype: int64

We can see that there are heterozygous and ambiguous calls which we will exclude from the frequency estimation. Only exception is 72-76T haplotypes that contains heterozygous and ambiguous calls, since 76T call is reliable marker and they represent a bigger group.

# Set a group name for 76T haplotypes that contains ambiguous characters
df_all_sample_metadata.loc[
    df_all_sample_metadata['crt_72-76'].str.contains(',|-|X|\?') & ( df_all_sample_metadata['crt_76[K]'] == 'T' ),
    'crt_72-76'
] = 'Other T'

# Exclude het calls and haplotypes that contains ambiguous characters
analysis_samples = (
    ~df_all_sample_metadata['crt_72-76'].astype(str).str.contains(',|-|X|\?'))

# Look at the counts of haplotypes with 76K and 76T alleles
df_all_sample_metadata.loc[analysis_samples].groupby(['crt_76[K]', 'crt_72-76']).size()

crt_76[K]  crt_72-76
K          CVMNK        5784
T          CVIDT         535
           CVIET        8252
           CVMET         105
           CVMNT          33
           Other T       116
           SVMNT         337
           YVIET           2
dtype: int64

Now we can find the distribution of haplotypes in the populations.

# Create a pivot table with haplotypes as index, populations as columns, and counts as values
df_crt = pd.pivot_table(
    df_all_sample_metadata.loc[analysis_samples],
    index='crt_72-76',
    columns='Population',
    values='Study',
    aggfunc=len,
    fill_value=0
).loc[:, populations.keys()]

df_crt

Population	SA	AF-W	AF-C	AF-NE	AF-E	AS-S-E	AS-S-FE	AS-SE-W	AS-SE-E	OC-NG
crt_72-76
CVIDT	0	0	0	0	0	0	0	0	535	0
CVIET	0	1622	244	63	329	60	1241	1858	2835	0
CVMET	105	0	0	0	0	0	0	0	0	0
CVMNK	0	4035	152	94	1059	150	83	10	187	14
CVMNT	33	0	0	0	0	0	0	0	0	0
Other T	5	0	0	0	0	0	2	3	105	1
SVMNT	12	0	0	0	0	7	0	0	0	318
YVIET	0	0	0	0	0	0	0	0	2	0

As a last step we want to sort the order of the haplotypes. The wild type haplotype will be shown first, followed by other haplotypes sorted by penalty score and then by total number of haplotypes in descending order.

# Add a row_total column representing the total count for each haplotype
df_crt['row_total'] = df_crt.sum(1)

# Sort the table based on penalty score
df_crt['hamming_dist'] = [10 if x in ['Other T'] else scipy.spatial.distance.hamming(list('CVMNK'), list(str(x))) * len('CVMNK') for x in df_crt.index]
sort_order = df_crt.sort_values(['hamming_dist', 'row_total'], ascending=[True, False]).index

# Get the numbers of samples for each population
numbers_of_samples = df_crt.sum(0)

# Return proportions
df_crt_final = round(df_crt.loc[sort_order].divide(list(numbers_of_samples)),3)
df_crt_final

Population	SA	AF-W	AF-C	AF-NE	AF-E	AS-S-E	AS-S-FE	AS-SE-W	AS-SE-E	OC-NG	row_total	hamming_dist
crt_72-76
CVMNK	0.000	0.713	0.384	0.599	0.763	0.691	0.063	0.005	0.051	0.042	0.381	0.00
CVMNT	0.213	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.002	0.04
SVMNT	0.077	0.000	0.000	0.000	0.000	0.032	0.000	0.000	0.000	0.955	0.022	0.08
CVMET	0.677	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.007	0.08
CVIET	0.000	0.287	0.616	0.401	0.237	0.276	0.936	0.993	0.774	0.000	0.544	0.12
CVIDT	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.146	0.000	0.035	0.12
YVIET	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.001	0.000	0.000	0.16
Other T	0.032	0.000	0.000	0.000	0.000	0.000	0.002	0.002	0.029	0.003	0.008	0.40

We will add the population sizes to the column names and remove the columns used for sorting.

# Create final column names with population names and sample sizes
final_column_names = collections.OrderedDict()
for column_name in list(df_crt_final):
    final_column_names[column_name] = "%s (n=%d)" % (column_name, numbers_of_samples[column_name])

# Rename columns in the DataFrame
df_crt_final = df_crt_final.rename(columns=final_column_names)

# Drop row_total and hamming_dist columns
df_crt_final = df_crt_final.drop(['row_total (n=15164)', 'hamming_dist (n=25)'], axis=1)
df_crt_final

Population	SA (n=155)	AF-W (n=5657)	AF-C (n=396)	AF-NE (n=157)	AF-E (n=1388)	AS-S-E (n=217)	AS-S-FE (n=1326)	AS-SE-W (n=1871)	AS-SE-E (n=3664)	OC-NG (n=333)
crt_72-76
CVMNK	0.000	0.713	0.384	0.599	0.763	0.691	0.063	0.005	0.051	0.042
CVMNT	0.213	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
SVMNT	0.077	0.000	0.000	0.000	0.000	0.032	0.000	0.000	0.000	0.955
CVMET	0.677	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
CVIET	0.000	0.287	0.616	0.401	0.237	0.276	0.936	0.993	0.774	0.000
CVIDT	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.146	0.000
YVIET	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.001	0.000
Other T	0.032	0.000	0.000	0.000	0.000	0.000	0.002	0.002	0.029	0.003

Save the table as:

df_crt.to_excel("table_crt_haplotypes.xlsx")

MalariaGEN parasite data user guide

Drug Resistance Frequencies Across Pf7

Contents