{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Principal Coordinate Analysis" ] }, { "cell_type": "markdown", "metadata": { "id": "fman9VUtezVB" }, "source": [ "## Introduction\n", "\n", "In this notebook we will recreate Panels B and C from Figure 1 of the [Pf7 Paper](https://wellcomeopenresearch.org/articles/8-22/v1).\n", "\n", "Both panels show a **Principal Coordinate Analysis** (PCoA), used to identify patterns of geographic and genetic structure within the dataset.\n", "\n", "**This notebook should take approximately 15 minutes to run.**\n", "\n", "### **Principal Coordinate Analysis - PCoA**\n", "PCoA belongs to a family of analyses that perform \"*dimensionality reduction*\". Dimensionality reduction is used in data analysis to simplify complex data by reducing the number of variables while still retaining meaningful properties of the original dataset.\n", "\n", "The PCoA we will calculate is based on a large distance matrix of pairwise genetic distances between 16,203 *Plasmodium falciparum* samples. The distances were generated using high-quality, bi-allelic coding single nucleotide polymorphisms (SNPs) from throughout the Pf genome.\n", "\n", "PCoA seeks to represent the samples in a lower-dimensional space while preserving the pairwise distances between them as accurately as possible.\n", "\n", "This lets us visualise the relative genetic distances between samples from different geographical locations in a simple plot." 
] }, { "cell_type": "markdown", "metadata": { "id": "nhA0ZmATgTRw" }, "source": [ "## Setup\n", "\n", "Install the MalariaGEN data package:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "B1mxjiN2gaDn" }, "outputs": [], "source": [ "!pip install -q --no-warn-conflicts malariagen_data" ] }, { "cell_type": "markdown", "metadata": { "id": "wlgVe1xmoqu0" }, "source": [ "Install another required package, scikit-bio:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "RdHCd4wUoX4f" }, "outputs": [], "source": [ "!pip install -q --no-warn-conflicts scikit-bio" ] }, { "cell_type": "markdown", "metadata": { "id": "IiHNfbirgsFO" }, "source": [ "Load the required Python libraries:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "3ZsjmeigezVE" }, "outputs": [], "source": [ "import malariagen_data\n", "from ftplib import FTP\n", "import scipy.stats\n", "import scipy.spatial.distance\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import collections\n", "import skbio\n", "from skbio.stats.ordination import pcoa\n", "from google.colab import drive" ] }, { "cell_type": "markdown", "metadata": { "id": "A1GE4_3Eg8yn" }, "source": [ "## Data Access\n", "\n", "First load the Pf7 metadata:\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 295 }, "id": "ASL3m473hDN1", "outputId": "965f5dbd-f08f-4be1-feea-a53fed8f8bef" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "|  | Sample | Study | Country | Admin level 1 | Country latitude | Country longitude | Admin level 1 latitude | Admin level 1 longitude | Year | ENA | All samples same case | Population | % callable | QC pass | Exclusion reason | Sample type | Sample was in Pf6 |\n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n", "
| 0 | FP0008-C | 1147-PF-MR-CONWAY | Mauritania | Hodh el Gharbi | 20.265149 | -10.337093 | 16.565426 | -9.832345 | 2014.0 | ERR1081237 | FP0008-C | AF-W | 82.16 | True | Analysis_set | gDNA | True |\n", "
| 1 | FP0009-C | 1147-PF-MR-CONWAY | Mauritania | Hodh el Gharbi | 20.265149 | -10.337093 | 16.565426 | -9.832345 | 2014.0 | ERR1081238 | FP0009-C | AF-W | 88.85 | True | Analysis_set | gDNA | True |\n", "
| 2 | FP0010-CW | 1147-PF-MR-CONWAY | Mauritania | Hodh el Gharbi | 20.265149 | -10.337093 | 16.565426 | -9.832345 | 2014.0 | ERR2889621 | FP0010-CW | AF-W | 86.46 | True | Analysis_set | sWGA | False |\n", "