Create a Neighbour Joining Tree

This notebook creates a neighbour-joining tree figure to display the population structure of the Pf8 data release.

The notebook uses a new Python library, anjl, which is a fast implementation of neighbour-joining algorithm and provides interactive plots for further exploration.

This notebook should take approximately five minutes to run.

Neighbour Joining Trees

A neighbour joining tree, or NJT, is a type of distance-based tree used to visualise population structure for a set of samples. A tree will plot more closely related samples on nearby ‘branches’, while more distantly related samples appear farther apart in a tree.

A NJT makes use of a distance matrix, which contains information on how near/distant samples are in an evolutionary sense. In the case of the Pf8 data, this distance matrix was calculated using high-quality bi-allelic single nucleotide polymorphisms (SNPs) from across the Pf genome. A NJT is created by joining the closest samples (or neighbours) over and over until the final tree is built.

For this use case, a NJT is a good choice because it is less computationally expensive than other trees such as Maximum Likelihood while still being a robust method for analysing genetic structure. You can read the original paper on the method here and a more user-friendly guide to the concept here

Setup

Install necessary packages, including the malariaGEN Python package

!pip install malariagen_data -q --no-warn-conflicts
!pip install anjl s3fs kaleido -q --no-warn-conflicts
  Installing build dependencies ... ?25l?25hdone
  Getting requirements to build wheel ... ?25l?25hdone
  Preparing metadata (pyproject.toml) ... ?25l?25hdone

Load the required Python libraries

import malariagen_data
import anjl
import s3fs
from botocore.config import Config
import numpy as np
import pandas as pd
from google.colab import drive

Access Pf8 Data

We use the malariagen data package to load the release data.

sample_data = malariagen_data.Pf8()
sample_metadata = sample_data.sample_metadata()

Then we load the file containing mean pairwise genetic distances for all Pf8 samples

# Cloud path to Pf8 distance matrix
s3_path = "s3://pf8-release/Pf8_mean_genotype_distance.npy"

config = {'signature_version': 's3',
    's3': { 'addressing_style': 'virtual' }
}

fs = s3fs.S3FileSystem(
    anon=True,
    endpoint_url='https://cog.sanger.ac.uk',
    config_kwargs=config
)

# Open and load the .npy file
with fs.open(s3_path, 'rb') as f:
    mean_distance_matrix = np.load(f)
    print('Distance matrix loaded successfully.')
Distance matrix loaded successfully.

Subset the data

We only want to include samples which have passed quality control (QC). There should be 24,409 such samples in Pf8.

# Make subsets of the metadata dataframe
df_qc = sample_metadata.loc[sample_metadata['QC pass'] == True]
df_qc.shape
(24409, 17)
# Apply the same filtering to the distance matrix
qc_samples = (sample_metadata['QC pass'] == True).values
qc_matrix = mean_distance_matrix[qc_samples][:,qc_samples]
qc_matrix.shape
(24409, 24409)
# We fill the diagonal with zeros to ensure the tree is built correctly
np.fill_diagonal(qc_matrix, 0)

Generate the NJT

# First, create a dictionary of colours for each of the ten subpopulations in Pf8
# This maps a population acronym (the dictionary key) to a 6-digit HEX code, which corresponds to a colour (the dictionary value)
colour_dict = {
"SA": "#4daf4a", # South America
"AF-W":"#e31a1c", # West Africa
"AF-C":"#fd8d3c", # Central Africa
"AF-NE": "#bb8129", # Northeast Africa
"AF-E":"#fecc5c", # East Africa
"AS-S-E":"#dfc0eb", # Eastern South Asia
"AS-S-FE": "#984ea3", # Far-Eastern South Asia
"AS-SE-W": "#9ecae1", # Western Southeast Asia
"AS-SE-E":"#3182bd", # Eastern Southeast Asia
"OC-NG":"#f781bf" # Oceania / New Guinea
}

We then use anjl to generate a NJT using the dynamic_nj method. You can read more on this and other available implementations of the tree building algorithm here, but in short: dynamic_nj is the fastest way to generate a NJT.

# Create the tree object
Z = anjl.dynamic_nj(qc_matrix, disallow_negative_distances = True)

Now create the figure, specifying parameters for the tree’s appearance and the information displayed in the interactive features.

# Create the figure object
fig = anjl.plot(
    Z,
    leaf_data=df_qc.reset_index().fillna('???'), # the input dataframe
    color= "Population", # how the tree branches should be coloured
    color_discrete_map = colour_dict, # the colour dictionary we just created
    hover_name="Sample", # what the title of the hover text should be
    hover_data=["Country"], # this can be a list
    line_width = 1,
    marker_size = 2,
    width = 1500,
    height = 1500,
    distance_sort = False,
    count_sort = True,
    render_mode="svg"
)

fig.show()

Figure Legend. Neighbour joining tree of 24,409 QC+ samples in Pf8. The tree is built from mean genetic distances between all sample pairs. Mean genetic distances were calculated as the distance between alternative allele frequencies, based on all bi-allelic coding SNPs passing quality filters. Tree branches are coloured according to ten subpopulations:  SA: South America, AF-W: West Africa, AF-C: Central Africa, AF-NE: Northeast Africa, AF-E: East Africa, AS-S-E: eastern South Asia, AS-S-FE: far-eastern South Asia, AS-SE-W: western Southeast Asia, AS-SE-E: eastern Southeast Asia, and OC-NG: Oceania / New Guinea. Hovering over branch tips will reveal the sample ID, Population, and Country of a sample.

Here is the NJT for 24,409 QC+ samples in Pf8, coloured according to ten major subpopulations.

Make use of anjl’s interactive features:

  • hover over a branch to see more information about the sample

  • drag and draw a box to zoom in on a part of the tree (double click to reset the zoom)

  • click on populations in the legend to toggle their display on/off

Save the figure

drive.mount('/content/drive')
Mounted at /content/drive
# This will send the file to your Google Drive, where you can download it from if needed
# Change the file path if you wish to send the file to a specific location
# Change the file name if you wish to call it something else

fig.write_image('/content/drive/My Drive/njt_pf8.png', scale=2)