Exploring copy number variation (CNV)

In this notebook we will explore copy number variation (CNV) within the Ag1000G phase 3 samples, focusing on some specific genome regions of interest. We will use the copy number state data for each individual sample, inferred from fitting an HMM to windowed coverage data, to visualise evidence for copy number variation. We will also estimate the frequency of copy number variation in different genes and populations.

As usual this notebook is executable, click the launch icon above () to run it for yourself. Note in particular that, although this notebook investigates the Cyp9k1 and Cyp6p/aa genome regions as examples, if you run this notebook yourself you can change this to investigate any genome region you are interested in.


Install the packages we’ll need. If you’re working on MyBinder, you can skip this step, packages are already installed.

# recommended package installs for Google Colab
!pip install -q malariagen_data bokeh

Import packages.

import malariagen_data
from bisect import bisect_left, bisect_right
import numpy as np
import dask.array as da
from dask.diagnostics import ProgressBar
import bokeh.plotting as bkplt
import bokeh.models as bkmod
import bokeh.layouts as bklay
import bokeh.io as bkio
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In this notebook we’ll be making some interactive plots using the bokeh package. Set up plotting with bokeh in the notebook.


Finally, setup access to Ag3 data in Google Cloud.

ag3 = malariagen_data.Ag3("gs://vo_agam_release/")

Introduction to copy number variation

Mosquitoes are diploid organisms, like us, meaning that each individual mosquito inherits two complete genome copies, one from each parent. Thus, for any given gene within the mosquito genome sequence, an individual will normally carry two copies.

Of course, genomes can vary. One common type of genome variation is single nucleotide polymorphism (SNP), which we’ve previously explored in the analysis notebooks on SNPs and SNP effects. Another way that genomes can vary is where a gene, or a genome region containing multiple genes, becomes copied two or more times within the same chromosome. Genes or genome regions can also sometimes be deleted. This is known as copy number variation (CNV).

Previously, the Ag1000G Consortium analysed copy number variation within the Ag1000G phase 2 data resource (Lucas et al. 2019). A key finding was that, although CNVs can occur throughout the mosquito genome, there are several genome hotspots where CNVs occur much more frequently. Several of these hotspots occur in genome regions containing genes involved in metabolic resistance to insecticides, such as cytochrome P450 and glutathione S-transferase genes. It has been known for several decades that copy number variation within these gene families likely plays an important role in insecticide resistance. More copies of a gene means more of the corresponding protein will be expressed, and increasing expression of genes that can metabolise insecticides leads to increased ability to tolerate the insecticide (Adolfi et al. 2019).

For the Ag1000G phase 3 data resource, we have estimated the copy number state for each individual mosquito throughout the entire genome sequence, compared to the AgamP4 reference sequence. For each mosquito we’ve counted the number of aligned sequence reads within fixed-size 300 bp windows, then normalised this coverage data and fitted a hidden Markov model. Further details are given in the section on CNV calling in the methods page.

The results of these CNV HMMs provide a rich dataset for exploring copy number variation in any given genome region of interest.

Accessing CNV HMM data

The CNV HMM results for a given chromosome and given sample sets can be accessed using the cnv_hmm method. Let’s access the data for chromosome arm 2R, for two of the Ag3 sample sets from Burkina Faso.

ds_hmm = ag3.cnv_hmm(contig="2R", sample_sets=["AG1000G-BF-A", "AG1000G-BF-B"])
Dimensions:           (samples: 283, variants: 205151)
    variant_position  (variants) int32 dask.array<chunksize=(205151,), meta=np.ndarray>
    variant_end       (variants) int32 dask.array<chunksize=(205151,), meta=np.ndarray>
    variant_contig    (variants) uint8 dask.array<chunksize=(205151,), meta=np.ndarray>
    sample_id         (samples) object dask.array<chunksize=(181,), meta=np.ndarray>
Dimensions without coordinates: samples, variants
Data variables:
    call_CN           (variants, samples) int8 dask.array<chunksize=(205151, 181), meta=np.ndarray>
    call_RawCov       (variants, samples) int32 dask.array<chunksize=(131072, 181), meta=np.ndarray>
    call_NormCov      (variants, samples) float32 dask.array<chunksize=(131072, 181), meta=np.ndarray>
    contigs:  ('2R', '2L', '3R', '3L', 'X')

The ds_hmm variable is an xarray dataset. This is simply a container for several data arrays. In particular, we’ll be most interested in the variant_position and variant_end arrays, which provide the genomic coordinates for the window start and stop positions; and the call_CN array, which provides the inferred copy number state across samples and windows. Let’s look at each of these in turn.

# window start coordinates (1-based)
array([       1,      301,      601, ..., 61544401, 61544701, 61545001],
# window end coordinates (1-based, inclusive)
array([     300,      600,      900, ..., 61544700, 61545000, 61545105],
# copy number state predictions from the HMMs
<xarray.DataArray 'call_CN' (variants: 205151, samples: 283)>
dask.array<concatenate, shape=(205151, 283), dtype=int8, chunksize=(205151, 181), chunktype=numpy.ndarray>
    variant_position  (variants) int32 dask.array<chunksize=(205151,), meta=np.ndarray>
    variant_end       (variants) int32 dask.array<chunksize=(205151,), meta=np.ndarray>
    variant_contig    (variants) uint8 dask.array<chunksize=(205151,), meta=np.ndarray>
    sample_id         (samples) object dask.array<chunksize=(181,), meta=np.ndarray>
Dimensions without coordinates: variants, samples

Notice that the call_CN array has 205,151 rows and 283 columns. That’s because there are 205,151 windows on chromosome arm 2R, and we’ve requested data for 283 individual mosquitoes. For example, here are the copy number state predictions for the first 5 genome windows, for the first sample:

cn = ds_hmm['call_CN'].values
cn[:5, 0]
array([7, 4, 4, 8, 8], dtype=int8)

The numbers 7, 4, 4, 8 and 8 are copy number states. Above I said that the normal copy number state is 2, so what’s going on here? Well, these values are from the beginning of the chromosome arm, and the ends of the chromosome can exhibit a lot of copy number variation. Let’s gain some more intuition by making a histogram of the call_CN values across all windows and individuals.

fig, ax = plt.subplots()
ax.hist(cn.flatten(), bins=np.arange(-1, 13, 1))
ax.set_xlabel('Copy number state')
ax.set_title('Copy number histogram')
ax.set_xticks(np.arange(-1, 13, 1) + .5)
ax.set_xticklabels(np.arange(-1, 13, 1));

Above we can see that by far the most common copy number state is 2, as expected. A copy number state less than 2 implies a deletion relative to the reference genome. A copy number state greater than 2 implies an amplification relative to the reference genome. The special value -1 means that the copy number state could not be inferred because of unreliable sequence read alignments.

OK, now we have some sense of what the data look like, let’s explore some specific genome regions of interest.

Visualising amplifications and deletions

There are a number of ways of plotting the results of the CNV HMMs in order to visually search for the presence or absence of CNVs in a genome region of interest, and explore their locations and frequencies in different populations. Here we’ll make a plot that shows, for each genomic window, the number of individuals with an amplification (CN > 2) or a deletion (0 <= CN < 2). Let’s make some plotting functions.

def plot_genes(contig, width, height, x_range=None):

    df_geneset = ag3.geneset(attributes=["ID", "Parent", "Name", "description"]).set_index("ID")

    # select the gene rows within the given contig
    data = df_geneset.query(f"type == 'gene' and contig == '{contig}'").copy()

    # plot each gene as a rectangle - add some columns to define rectangle
    # coordinates
    data['left'] = data['start'] 
    data['right'] = data['end']
    data['bottom'] = np.where(data['strand'] == '+', 1, 0)
    data['top'] = data['bottom'] + 0.8

    # tidy up some columns for presentation
    data['Name'].fillna('', inplace=True)
    data['description'].fillna('', inplace=True)

    # determine how long the contig is
    contig_length = len(ag3.genome_sequence(contig))

    # define tooltips for hover
    tooltips = [
        ("ID", '@ID'),
        ("Name", '@Name'),
        ("Description", '@description'),

    # make a figure
    fig = bkplt.figure(

    # add functionality to click through to vectorbase
    url = f'https://vectorbase.org/vectorbase/app/record/gene/@ID'
    taptool = fig.select(type=bkmod.TapTool)
    taptool.callback = bkmod.OpenURL(url=url)

    # now plot the genes as rectangles
    fig.quad(bottom='bottom', top='top', left='left', right='right',
             source=data, line_width=.5, fill_alpha=.5)

    # tidy up the plot
    if x_range is None:
        fig.x_range = bkmod.Range1d(0, contig_length/1e6, bounds='auto')
    fig.xaxis.axis_label = f'Contig {contig} position (bp)'
    fig.y_range = bkmod.Range1d(-.5, 2.3)
    fig.ygrid.visible = False
    yticks = [0.4, 1.4]
    yticklabels = ['rev', 'fwd']
    fig.yaxis.ticker = yticks
    fig.yaxis.major_label_overrides = {k: v for k, v in zip(yticks, yticklabels)}
    fig.yaxis.axis_label = f'Strand'

    return fig

def plot_hmm_cnv(contig, start, stop, sample_sets, sample_query=None, width=700, height=200):

    # access HMM data
    ds_hmm = ag3.cnv_hmm(contig=contig, sample_sets=sample_sets)

    # apply samples query if given
    if sample_query is not None:
        df_samples = ag3.sample_metadata(sample_sets=sample_sets)
        loc_samples = df_samples.eval(sample_query)
        ds_hmm = ds_hmm.isel(samples=loc_samples)

    # access data variables 
    vpos = ds_hmm['variant_position'].values
    vend = ds_hmm['variant_end'].values
    cn = ds_hmm['call_CN'].data

    # restrict CNV data to selected genome region
    loc_region = slice(bisect_left(vpos, start), bisect_right(vend, stop))
    vpos = vpos[loc_region]
    vend = vend[loc_region]
    cn = cn[loc_region].compute()

    # setup plotting variables
    x = ((vpos + vend) / 2)  # window midpoints 
    n_del = np.sum((cn >= 0) & (cn < 2), axis=1)
    n_amp = np.sum(cn > 2, axis=1)
    n_samples = ds_hmm.dims['samples']

    # plot amplifications
    fig1 = bkplt.figure(
        title='Amplification (CN > 2)',
        y_range=(0, n_samples),
    fig1.varea(x=x, y1=np.zeros(len(x)), y2=n_amp)
    fig1.x_range = bkmod.Range1d(start, stop, bounds=(start, stop))
    fig1.yaxis.axis_label = f'No. individuals'

    # plot deletions
    fig2 = bkplt.figure(
        title='Deletion (CN < 2)',
        y_range=(0, n_samples),
    fig2.varea(x=x, y1=np.zeros(len(x)), y2=n_del)
    fig2.yaxis.axis_label = 'No. individuals'

    # plot genes
    fig3 = plot_genes(contig=contig, width=width, height=120, x_range=fig1.x_range)

    # combine plots
    grid = bklay.gridplot(
        [fig1, fig2, fig3],

Cyp9k1 region

In our previous analysis of Ag1000G phase 2, we found that the region surrounding the Cyp9k1 gene on the X chromosome was a CNV hotspot. Let’s explore that in the Ag3 data resource.

plot_hmm_cnv(contig="X", start=15_230_000, stop=15_260_000, 
             sample_sets=["AG1000G-BF-A", "AG1000G-BF-B"], 
             sample_query="species == 'gambiae' and sex_call == 'F'")

In the above plots, we can see that more than 100 of the An. gambiae individuals have a copy number amplification spanning the Cyp9k1 gene.

Let’s compare with the An. coluzzii.

plot_hmm_cnv(contig="X", start=15_230_000, stop=15_260_000, 
             sample_sets=["AG1000G-BF-A", "AG1000G-BF-B"], 
             sample_query="species == 'coluzzii' and sex_call == 'F'")

Apparently, amplifications spanning Cyp9k1 in the An. coluzzii in these sample sets are much less common, although there is a slightly odd spike of amplifications immediately upstream of the gene which probable needs further investigation.

Cyp6p/aa region

For comparison, let’s also visualise CNVs in the Cyp6p/aa region on chromosome arm 2R, another known CNV hotspot.

plot_hmm_cnv(contig="2R", start=28_450_000, stop=28_530_000, 
             sample_sets=["AG1000G-BF-A", "AG1000G-BF-B"], 
             sample_query="species == 'gambiae' and sex_call == 'F'")