Investigating population structure with PCA

This notebook illustrates how to investigate population structure by running principal components analysis (PCA) using genetic variation data from Ag3. As usual this notebook is executable, click the launch icon above () to try running it for yourself.

Preamble: what is population structure?

In Ag3 we sequenced 2,784 wild-caught mosquitoes collected from 19 countries in sub-Saharan Africa. This includes mosquitoes from three known species within the Anopheles gambiae species complex. When we have a cohort of mosquitoes from different species and/or geographical locations, we would expect that not all mosquitoes are equally closely related to each other. These non-random patterns of relatedness are known as population structure.

For example, we might expect that any two mosquitoes from the same species will be more closely related than two mosquitoes from different species. This will hopefully be intuitively obvious, because mosquitoes will prefer to mate with individuals of the same species, even if they live alongside mosquitoes from other species and can hybridise. This is also known as reproductive isolation.

Similarly, we might expect that two mosquitoes sampled from the same location will be more closely related than two mosquitoes sampled from distant locations. This is based on the assumption that mosquitoes may travel to find a mate, but may only be willing or able to travel a certain distance. Mosquitoes may also be impeded from travelling in certain directions by natural physical barriers, such as a region of elevated terrain, or by variation in the availability of suitable habitat. These limits and barriers to movement can thus generate population structure, also known as geographical isolation.

For both of these causes of population structure – reproductive isolation and geographical isolation – we may have some idea of what to expect based on our knowledge of previous studies, but we may be surprised.

For example, the Anopheles gambiae species complex continues to be unravelled, and there may be cryptic species that we were not previously aware of (e.g., see Crawford et al. (2016) and Tennessen et al. (2020)).

Also, recent studies have suggested that some malaria mosquitoes may engage in long-distance migration (Huestis et al. 2019), challenging the previous view that mosquitoes generally don’t travel more than a few kilometres in their lifetimes (Service 1997). But we still don’t know to what extent long-distance migration occurs, or whether the rate or range of migration varies between geographical regions and/or mosquito species. We would also like to learn more about how ecological and landscape variation affects the movement and interbreeding of mosquito populations in different regions of Africa.

In short, we would like to investigate population structure, and we can use genetic variation data to do that. There are various methods for analysing population structure, but this notebook will focus on just one of these methods, principal components analysis (PCA). PCA is attractive because it is model-free, meaning that you don’t need to specify any model of population structure before-hand, you can just run PCA and start exploring the results. On the other hand, interpreting PCA requires some care, which we’ll expand on a little below. If you’re interested in reading more deeply about using PCA for investigating population structure, Patterson et al. (2006), Novembre and Stephens (2008) and McVean (2009) are all great resources.


Set up the environment by installing and importing some packages.

# install packages - will take a few minutes to compile scikit-allel
!pip install -U zarr==2.6.1 fsspec==0.8.7 gcsfs==0.7.2 dask==2021.03.0 plotly scikit-allel malariagen_data
     |██▍                             | 10kB 15.7MB/s eta 0:00:01
     |████▊                           | 20kB 17.3MB/s eta 0:00:01
     |███████▏                        | 30kB 9.0MB/s eta 0:00:01
     |█████████▌                      | 40kB 8.5MB/s eta 0:00:01
     |████████████                    | 51kB 4.9MB/s eta 0:00:01
     |██████████████▎                 | 61kB 5.4MB/s eta 0:00:01
     |████████████████▋               | 71kB 5.5MB/s eta 0:00:01
     |███████████████████             | 81kB 5.7MB/s eta 0:00:01
     |█████████████████████▍          | 92kB 5.5MB/s eta 0:00:01
     |███████████████████████▉        | 102kB 6.0MB/s eta 0:00:01
     |██████████████████████████▏     | 112kB 6.0MB/s eta 0:00:01
     |████████████████████████████▌   | 122kB 6.0MB/s eta 0:00:01
     |███████████████████████████████ | 133kB 6.0MB/s eta 0:00:01
     |████████████████████████████████| 143kB 6.0MB/s 
     |████████████████████████████████| 112kB 22.9MB/s 
     |████████████████████████████████| 942kB 40.4MB/s 
     |████████████████████████████████| 13.2MB 253kB/s 
     |████████████████████████████████| 10.8MB 42.9MB/s 
     |████████████████████████████████| 5.8MB 32.7MB/s 
     |████████████████████████████████| 1.3MB 34.8MB/s 
     |████████████████████████████████| 143kB 51.8MB/s 
     |████████████████████████████████| 296kB 42.0MB/s 
?25h  Building wheel for scikit-allel ( ... ?25l?25hdone
  Building wheel for asciitree ( ... ?25l?25hdone
# import packages
import os
import bisect
import hashlib
import json
import allel
import numpy as np
import dask
import dask.array as da
from dask.diagnostics import ProgressBar
# quieten dask warnings about large chunks
dask.config.set(**{'array.slicing.split_large_chunks': False})
import pandas as pd
import malariagen_data
import as px
# register a progress bar for longer-running dask computations
# N.B., run this cell only once!
# setup access to malariagen data in google cloud
ag3 = malariagen_data.Ag3("gs://vo_agam_release/")

Aside: saving PCA results

Some PCA runs may take a while to complete if you’re running this code on a service with modest computational resources such as Google Colab or MyBinder, because genotype calls from tens of millions of SNPs may need to be scanned to identify and extract the data for PCA. The run time will depend on the number of samples included in the analysis, but may take 20 mins or more for larger analyses.

To avoid having to rerun these analyses, we’ll save the results so we can come back to them later. If you’re running this notebook on Google Colab, you can save results to your Google Drive, which will mean you don’t lose results even if you leave the notebook and come back several days later. To mount your Google Drive, run the following code, and follow the authorization instructions:

# mount Google Drive if running on Google Colab
from google.colab import drive
Mounted at /content/drive

Now create a directory to hold PCA results:

# create a directory to hold PCA results - change this to something suitable
# if not running on Google Colab
results_dir = "drive/MyDrive/Colab Data/ag3-pca-results"
os.makedirs(results_dir, exist_ok=True)

PCA functions

Now define some functions for running a PCA.

def hash_params(*args, **kwargs):
    """Helper function to hash analysis parameters."""
    o = {
        'args': args,
        'kwargs': kwargs
    s = json.dumps(o, sort_keys=True).encode()
    h = hashlib.md5(s).hexdigest()
    return h

def run_pca(
    """Main function to run a PCA.
    contig : str
        Chromosome arm, e.g., '3L'.
    region_start : int, optional
        Start position of contig region to use.
    region_stop : int, optional
        Stop position of contig region to use.
    sample_sets : str or list of str, optional
        Sample sets to analyse.
    sample_query : str, optional
        A pandas query string to select specific samples.
    site_mask : {'gamb_colu_arab', 'gamb_colu', 'arab'}
        Which site mask to apply.
    min_minor_ac : int
        Minimum minor allele count.
    max_an_missing : int
        Maximum number of missing allele calls.
    n_snps : int
        Approximate number of SNPs to use.
    snp_offset : int
        Offset when thinning SNPs.
    n_components : int
        Number of PCA components to retain.

    data : pandas DataFrame
        Data frame with one row per sample, including columns "PC1", "PC2", etc.
    evr : numpy array
        Explained variance ratio per principal component.
    # construct a key to save the results under
    results_key = hash_params(

    # define paths for results files
    data_path = f'{results_dir}/{results_key}-data.csv'
    evr_path = f'{results_dir}/{results_key}-evr.npy'

        # try to load previously generated results
        data = pd.read_csv(data_path)
        evr = np.load(evr_path)
        return data, evr
    except FileNotFoundError:
        # no previous results available, need to run analysis
        print(f'running analysis: {results_key}')
    print('setting up inputs')

    # load sample metadata
    df_samples = ag3.sample_metadata(sample_sets=sample_sets)
    # access SNP genotypes
    gt = ag3.snp_genotypes(contig=contig, sample_sets=sample_sets, site_mask=site_mask)

    if region_start or region_stop:
        # locate region within contig
        pos = ag3.snp_sites(contig=contig, field='POS', site_mask=site_mask).compute()
        loc_region = slice(
            bisect.bisect_left(pos, region_start) if region_start else None,
            bisect.bisect_right(pos, region_stop) if region_stop else None,
        gt = gt[loc_region]
    if sample_query:
        # locate selected samples
        loc_samples = df_samples.eval(sample_query).values
        df_samples = df_samples.loc[loc_samples, :]
        gt = da.compress(loc_samples, gt, axis=1)
    print('locating segregating sites within desired frequency range')

    # perform allele count
    ac = allel.GenotypeDaskArray(gt).count_alleles(max_allele=3).compute()
    # calculate some convenience variables
    n_chroms = gt.shape[1] * 2
    an_called = ac.sum(axis=1)
    an_missing = n_chroms - an_called
    min_ref_ac = min_minor_ac
    max_ref_ac = n_chroms - min_minor_ac

    # here we choose biallelic sites involving the reference allele
    loc_seg = np.nonzero(ac.is_biallelic() & 
                         (ac[:, 0] >= min_ref_ac) & 
                         (ac[:, 0] <= max_ref_ac) & 
                         (an_missing <= max_an_missing))[0]
    print('preparing PCA input data')

    # thin SNPs to approximately the desired number
    snp_step = loc_seg.shape[0] // n_snps
    loc_seg_ds = loc_seg[snp_offset::snp_step]

    # subset genotypes to selected sites
    gt_seg = da.take(gt, loc_seg_ds, axis=0)
    # convert to genotype alt counts
    gn_seg = allel.GenotypeDaskArray(gt_seg).to_n_alt().compute()
    # remove any edge-case variants where all genotypes are identical
    loc_var = np.any(gn_seg != gn_seg[:, 0, np.newaxis], axis=1)
    gn_var = np.compress(loc_var, gn_seg, axis=0)

    print('running PCA')

    # run the PCA
    coords, model = allel.pca(gn_var, n_components=n_components)
    # add PCs to dataframe
    data = df_samples.copy()
    for i in range(n_components):
        data[f'PC{i+1}'] = coords[:, i]
    # save results
    evr = model.explained_variance_ratio_
    data.to_csv(data_path, index=False), evr)
    print(f'saved results: {results_key}')
    return data, evr

Plotting functions

Before we starting running PCAs, define some plotting functions to help visualise the results.

def plot_variance(evr, **kwargs):
    """Plot a bar chart showing variance explained by each principal
    # prepare variables
    y = evr * 100
    x = [str(i+1) for i in range(len(y))]
    # setup plotting options
    plot_kwargs = dict(
            'x': 'Principal component',
            'y': 'Explained variance (%)',
    # apply any user overrides

    # make a bar plot
    fig =, y=y, **plot_kwargs)
def jitter(a, f):
    r = a.max() - a.min()
    return a + f * np.random.uniform(-r, r, a.shape)

def plot_coords(

    # setup data
    data = data.copy()
    # apply jitter if desired - helps spread out points when tightly clustered
    if jitter_frac:
        data[x] = jitter(data[x], jitter_frac)
        data[y] = jitter(data[y], jitter_frac)
    # convenience variables
    data['country_location'] = data['country'] + ' - ' + data['location']
    data['size'] = 1  # hack to allow us to control marker size
    # setup plotting options
    plot_kwargs = dict(
    # apply any user overrides

    # 2D scatter plot
    fig = px.scatter(data, x=x, y=y, **plot_kwargs)

def plot_coords_3d(

    # setup data
    data = data.copy()
    # apply jitter if desired - helps spread out points when tightly clustered
    if jitter_frac:
        data[x] = jitter(data[x], jitter_frac)
        data[y] = jitter(data[y], jitter_frac)
        data[z] = jitter(data[z], jitter_frac)
    # convenience variables
    data['country_location'] = data['country'] + ' - ' + data['location']
    # setup plotting options
    plot_kwargs = dict(
    # apply any user overrides

    # 3D scatter plot
    fig = px.scatter_3d(data, x=x, y=y, z=z, **plot_kwargs)
# choose colours for species
species_palette = px.colors.qualitative.Plotly
species_color_map = {
    'gambiae': species_palette[0],
    'coluzzii': species_palette[1],
    'arabiensis': species_palette[2],
    'intermediate_gambiae_coluzzii': species_palette[3],
    'intermediate_arabiensis_gambiae': species_palette[4],

Analysis: Central African Republic

Now run some analyses, investigating population structure within some of the Ag3 sample sets.

Let’s begin by running a PCA using only samples from the Central African Republic and SNPs from chromosome arm 3L.

data, evr = run_pca(

Let’s look at what the run_pca() function returns.

sample_id partner_sample_id contributor country location year month latitude longitude sex_call sample_set release aim_fraction_colu aim_fraction_arab species_gambcolu_arabiensis species_gambiae_coluzzii species PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
0 BK0001-C RCA_1 Alessandra della Torre Central African Republic Bangui 1993 12 4.367 18.583 F AG1000G-CF v3 0.029 0.002 gamb_colu gambiae gambiae -54.966766 -38.828590 23.306778 -18.807184 7.509643 -58.887413 52.629280 -12.944190 8.555619 -53.326637
1 BK0002-C RCA_2 Alessandra della Torre Central African Republic Bangui 1993 12 4.367 18.583 F AG1000G-CF v3 0.956 0.002 gamb_colu coluzzii coluzzii 152.585720 1.706959 8.435535 1.094421 -4.127664 -5.787193 -9.662916 -20.674557 -16.575504 0.090932
2 BK0003-C RCA_3 Alessandra della Torre Central African Republic Bangui 1993 12 4.367 18.583 F AG1000G-CF v3 0.964 0.002 gamb_colu coluzzii coluzzii 150.370160 -5.296166 -1.884464 -6.347249 -4.915325 -1.825995 -23.605349 -9.248346 33.025055 -13.820646
3 BK0005-C RCA_5 Alessandra della Torre Central African Republic Bangui 1993 12 4.367 18.583 F AG1000G-CF v3 0.023 0.002 gamb_colu gambiae gambiae -54.385426 -8.639940 55.502890 1.536387 32.557390 -47.155390 -0.826889 206.070590 73.425740 115.960180
4 BK0006-C RCA_6 Alessandra della Torre Central African Republic Bangui 1993 12 4.367 18.583 F AG1000G-CF v3 0.964 0.002 gamb_colu coluzzii coluzzii 161.375430 3.376961 -13.414440 -10.664361 17.075394 -11.291839 -15.304790 10.595125 5.892384 20.023388

The data variable is a pandas DataFrame, with one row per sample, and including columns “PC1”, “PC2”, etc., with the coordinates of the principal components.

array([0.03751405, 0.01718241, 0.01698252, 0.01615884, 0.01554069,
       0.01534222, 0.01509024, 0.01486337, 0.01475849, 0.01461872],

The evr variable is a numpy array holding the explained variance ratio for each of the principal components (we’ll look again at this in a minute).

Let’s make a scatter plot the PCA results for the first two components:

title = 'Central African Republic (3L)'
plot_coords(data, x='PC1', y='PC2',

What does this show? Each marker represents an individual mosquito. There are two main groups of inviduals, separated along PC1. The markers are coloured according to their species, and we can see that the An. gambiae individuals are on the left, whereas the An. coluzzii individuals are on the right. This is what we would expect from a set of individuals from two different mosquito species, all collected at the same time and place. I.e., we have reproductive isolation here, corresponding to the two known species An. gambiae and An. coluzzii.

Aside: how many principal components should I look at?

Notice that some of the An. gambiae samples spread out along PC2. How do you know whether this is meaningful or just random noise? One way to get an intuition for this is to examine the variance explained by each of the components:

plot_variance(evr, title=title)

The first principal component explains around 3.5% of the variance, whereas all of the subsequent principal components explain around 1.7% of the variance. The absolute magnitude of these values is less important, but what is more important is the fact that PC1 explains much more variance than PC2, and PC2 explains a similar amount of variance to all the subsequent PCs. In other words, there is a big step down from PC1 to PC2, then the variance explained flattens off. This is a good indication that PC1 is capturing some real structure in the data, and the rest of the PCs are probably random noise.

There are various statistical methods for formally testing whether a principal component conveys a real signal of population structure or is just random noise (see, e.g., Patterson et al. (2006) and Forkman et al. (2019)). However, for exploratory analyses, looking at the differences in variance explained between the adjacent PCs, and ignoring the tail of PCs where the variance flattens off, is not a bad rule of thumb.

Aside: PCA and genetic distance

In the above scatter plot of PC1 and PC2, the An. coluzzii individuals all cluster close together. Similarly, most of the An. gambiae individuals cluster close together. It can be tempting to interpret this in the following way, which would be wrong:

“The individuals within each cluster are genetically nearly identical to each other.”

Similarly, it can be tempting to draw the following conclusion, which would also be wrong:

“The two clusters of individuals are genetically very different from each other.”

What is wrong with these statements?

In short, PCA does not tell you anything about the absolute magnitude of genetic distance between individuals. All it can tell you is something about relative genetic distance. I.e., the PCA tells us that the An. coluzzii individuals are more closely related to each other than to the An. gambiae individuals, and vice versa. However, it does not tell us how much more, and there may still be a lot of genetic diversity within each of these clusters. The key point is that all individuals within each cluster are related (or unrelated) to each other by a similar degree.

Analysis: Uganda

Let’s explore another sample set, this time from Uganda.

title = 'Uganda (3L)'
data, evr = run_pca(
plot_variance(evr, title=title)
plot_coords(data, x='PC1', y='PC2',

Here we have a similar result to Central African Republic, in that PC1 explains much more variance than the other components, and separates the mosquitoes into two groups by species. In this case, however, we have An. gambiae and An. arabiensis.

We also have an interesting individual sitting in the middle of PC1. The fact that this mosquito sits almost perfectly half-way between the two species groups, and also gets called as intermediate arabiensis/gambiae by our species assignment method, suggest it is a hybrid, probably an F1 (progeny of parents of different species).

One other point to note, the An. gambiae mosquitoes from Uganda were sampled from two different locations, Kihihi in the south west of the country, and Nagongera in the east. Let’s explore this further by rerunning PCA with just the Ugandan An. gambiae. Let’s also exclude two An. gambiae samples which occurred as outliers on the previous PCA.

title = 'Uganda <i>An. gambiae</i> (3L)'
data, evr = run_pca(
        "species == 'gambiae' and "
        "sample_id not in ['AC0223-C', 'AC0240-C']"
plot_variance(evr, title=title)
plot_coords(data, x='PC1', y='PC2',

Now we can see there is some weak geographical isolation, with Nagongera individuals towards the left of PC1 and Kihihi towards the right. The isolation here is “weak” because there is no clear separation between the two groups of samples, rather there is a continuous spread with some overlap between the two locations.

Aside: interactive scatter plots

To make the scatter plot above we used a library called plotly express. You’ve probably figured out by now that these plots are interactive. For example, you can:

  • Hover over points in the plot to see more details.

  • Click and drag an area of the plot to zoom in.

  • Click on the legend to hide and show groups of points.

The advantage of using an interactive plotting library like this is that it helps in various ways when exploring the PCA results. For example, sometimes an individual sample or pair of samples will appear as outliers on a PCA. This can occur for a variety of reasons, such as cryptic kinship, and we generally want to remove outlier samples from the analysis. Using the interactive plot it’s straightforward to identify outlier samples by hovering over them to see their sample ID, then rerun the PCA without them.

Analysis: Tanzania

Now we have seen both reproductive isolation between species and geographical isolation in action, it’s time to tackle a more complex situation: Tanzania.

title = 'Tanzania (3L)'
data, evr = run_pca(
plot_variance(evr, title=title)

Here the first three principal components explain more variance than the others. A perfect excuse to plot the results in 3D.

plot_coords_3d(data, x='PC1', y='PC2', z='PC3', 

What’s going on here? Well, there are clearly four distinct clusters.

One cluster contains all the An. arabiensis.

Two further clusters each contain exclusively An. gambiae. If you hover over the points, you’ll see that one cluster contains only mosquitoes from Muheza, which is on the east coast, while the other cluster contains only mosquitoes from Muleba, in the north west of the country. This is a clear case of geographical isolation.

The final cluster is puzzling, because it contains mostly mosquitoes that our species assignment method calls as being intermediate between An. gambiae and An. coluzzii. The species An. coluzzii is not thought to occur in East Africa, and so these cannot be recent hybrids between the two species. Another possibility is that mosquitoes in this group represent a cryptic taxon, i.e., a previously unknown group of mosquitoes that is reproductively isolated from both An. arabiensis and An. gambiae. For now this remains a hypothesis, but this is a good example of how PCA can alert us to previously hidden complexities in malaria mosquito populations.

Analysis: East African An. arabiensis

We can also combine data from multiple countries to perform a broader analysis of population structure. To finish off with, let’s run a PCA of all the An. arabiensis mosquitoes from East Africa. This is a larger set of samples, so may take a little longer to run.

title = 'East African <i>An. arabiensis</i> (3L)'
data, evr = run_pca(
        "species == 'arabiensis' and "
        "country in ['Uganda', 'Tanzania', 'Kenya', 'Malawi']"
plot_variance(evr, title=title)
plot_coords(data, x='PC1', y='PC2',

Here we can see different degrees of geographical isolation between mosquitoes from different countries. The strongest isolation is between Malawi and the more northerly countries, separated on PC1. Then there is somewhat weaker isolation between Tanzania and Uganda, separated on PC2. The few Kenyan individuals cluster together with Tanzanian individuals from coastal sampling locations, which is unsurprising given the close geographical proximity.

Further reading

Here are those PCA references again, in case you feel like taking a deeper dive:

Hopefully this notebook has been a useful introduction to PCA. If you have any comments, questions or suggestions, please get in touch.

This notebook is part of the Ag1000G phase 3 user guide. See the menu at the left for more documentation and example analyses.