Exploring copy number variation (CNV)
Contents
Exploring copy number variation (CNV)¶
In this notebook we will explore copy number variation (CNV) within the Ag1000G phase 3 samples, focusing on some specific genome regions of interest. We will use the copy number state data for each individual sample, inferred from fitting an HMM to windowed coverage data, to visualise evidence for copy number variation. We will also estimate the frequency of copy number variation in different genes and populations.
As usual this notebook is executable, click the launch icon above () to run it for yourself. Note in particular that, although this notebook investigates the Cyp9k1 and Cyp6p/aa genome regions as examples, if you run this notebook yourself you can change this to investigate any genome region you are interested in.
Setup¶
Install the packages we’ll need. If you’re working on MyBinder, you can skip this step, packages are already installed.
!pip install -qU malariagen_data bokeh
Import packages.
import malariagen_data
from bisect import bisect_left, bisect_right
import numpy as np
import dask.array as da
from dask.diagnostics import ProgressBar
import bokeh.io as bkio
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
In this notebook we’ll be making some interactive plots using the bokeh package. Set up plotting with bokeh in the notebook.
bkio.output_notebook()
Finally, setup access to Ag3
data in Google Cloud.
ag3 = malariagen_data.Ag3()
ag3
MalariaGEN Ag3 data resource API | |
---|---|
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact data@malariagen.net. See also the Ag3 API docs. | |
Storage URL | gs://vo_agam_release/ |
Data releases available | 3.0 |
Cohorts analysis | 20211101 |
Species analysis | aim_20200422 |
Site filters analysis | dt_20200416 |
Software version | 3.0.0 |
Introduction to copy number variation¶
Mosquitoes are diploid organisms, like us, meaning that each individual mosquito inherits two complete genome copies, one from each parent. Thus, for any given gene within the mosquito genome sequence, an individual will normally carry two copies.
Of course, genomes can vary. One common type of genome variation is single nucleotide polymorphism (SNP), which we’ve previously explored in the analysis notebooks on SNPs and SNP effects. Another way that genomes can vary is where a gene, or a genome region containing multiple genes, becomes copied two or more times within the same chromosome. Genes or genome regions can also sometimes be deleted. This is known as copy number variation (CNV).
Previously, the Ag1000G Consortium analysed copy number variation within the Ag1000G phase 2 data resource (Lucas et al. 2019). A key finding was that, although CNVs can occur throughout the mosquito genome, there are several genome hotspots where CNVs occur much more frequently. Several of these hotspots occur in genome regions containing genes involved in metabolic resistance to insecticides, such as cytochrome P450 and glutathione S-transferase genes. It has been known for several decades that copy number variation within these gene families likely plays an important role in insecticide resistance. More copies of a gene means more of the corresponding protein will be expressed, and increasing expression of genes that can metabolise insecticides leads to increased ability to tolerate the insecticide (Adolfi et al. 2019).
For the Ag1000G phase 3 data resource, we have estimated the copy number state for each individual mosquito throughout the entire genome sequence, compared to the AgamP4 reference sequence. For each mosquito we’ve counted the number of aligned sequence reads within fixed-size 300 bp windows, then normalised this coverage data and fitted a hidden Markov model. Further details are given in the section on CNV calling in the methods page.
The results of these CNV HMMs provide a rich dataset for exploring copy number variation in any given genome region of interest.
Accessing CNV HMM data¶
The CNV HMM results for a given chromosome and given sample sets can be accessed using the cnv_hmm
method. Let’s access the data for chromosome arm 2R, for two of the Ag3
sample sets from Burkina Faso.
ds_hmm = ag3.cnv_hmm(region="2R", sample_sets=["AG1000G-BF-A", "AG1000G-BF-B"])
ds_hmm
<xarray.Dataset> Dimensions: (samples: 283, variants: 205151) Coordinates: variant_position (variants) int32 dask.array<chunksize=(65536,), meta=np.ndarray> variant_end (variants) int32 dask.array<chunksize=(65536,), meta=np.ndarray> variant_contig (variants) uint8 dask.array<chunksize=(205151,), meta=np.ndarray> sample_id (samples) object dask.array<chunksize=(181,), meta=np.ndarray> Dimensions without coordinates: samples, variants Data variables: call_CN (variants, samples) int8 dask.array<chunksize=(65536, 64), meta=np.ndarray> call_RawCov (variants, samples) int32 dask.array<chunksize=(65536, 64), meta=np.ndarray> call_NormCov (variants, samples) float32 dask.array<chunksize=(65536, 64), meta=np.ndarray> sample_coverage_variance (samples) float32 dask.array<chunksize=(181,), meta=np.ndarray> sample_is_high_variance (samples) bool dask.array<chunksize=(181,), meta=np.ndarray> Attributes: contigs: ('2R', '2L', '3R', '3L', 'X')
- samples: 283
- variants: 205151
- variant_position(variants)int32dask.array<chunksize=(65536,), meta=np.ndarray>
Array Chunk Bytes 820.60 kB 262.14 kB Shape (205151,) (65536,) Count 5 Tasks 4 Chunks Type int32 numpy.ndarray - variant_end(variants)int32dask.array<chunksize=(65536,), meta=np.ndarray>
Array Chunk Bytes 820.60 kB 262.14 kB Shape (205151,) (65536,) Count 5 Tasks 4 Chunks Type int32 numpy.ndarray - variant_contig(variants)uint8dask.array<chunksize=(205151,), meta=np.ndarray>
Array Chunk Bytes 205.15 kB 205.15 kB Shape (205151,) (205151,) Count 1 Tasks 1 Chunks Type uint8 numpy.ndarray - sample_id(samples)objectdask.array<chunksize=(181,), meta=np.ndarray>
Array Chunk Bytes 2.26 kB 1.45 kB Shape (283,) (181,) Count 6 Tasks 2 Chunks Type object numpy.ndarray
- call_CN(variants, samples)int8dask.array<chunksize=(65536, 64), meta=np.ndarray>
Array Chunk Bytes 58.06 MB 4.19 MB Shape (205151, 283) (65536, 64) Count 42 Tasks 20 Chunks Type int8 numpy.ndarray - call_RawCov(variants, samples)int32dask.array<chunksize=(65536, 64), meta=np.ndarray>
Array Chunk Bytes 232.23 MB 16.78 MB Shape (205151, 283) (65536, 64) Count 42 Tasks 20 Chunks Type int32 numpy.ndarray - call_NormCov(variants, samples)float32dask.array<chunksize=(65536, 64), meta=np.ndarray>
Array Chunk Bytes 232.23 MB 16.78 MB Shape (205151, 283) (65536, 64) Count 42 Tasks 20 Chunks Type float32 numpy.ndarray - sample_coverage_variance(samples)float32dask.array<chunksize=(181,), meta=np.ndarray>
Array Chunk Bytes 1.13 kB 724 B Shape (283,) (181,) Count 6 Tasks 2 Chunks Type float32 numpy.ndarray - sample_is_high_variance(samples)booldask.array<chunksize=(181,), meta=np.ndarray>
Array Chunk Bytes 283 B 181 B Shape (283,) (181,) Count 6 Tasks 2 Chunks Type bool numpy.ndarray
- contigs :
- ('2R', '2L', '3R', '3L', 'X')
The ds_hmm
variable is an xarray dataset. This is simply a container for several data arrays. In particular, we’ll be most interested in the variant_position
and variant_end
arrays, which provide the genomic coordinates for the window start and stop positions; and the call_CN
array, which provides the inferred copy number state across samples and windows. Let’s look at each of these in turn.
# window start coordinates (1-based)
ds_hmm['variant_position'].values
array([ 1, 301, 601, ..., 61544401, 61544701, 61545001],
dtype=int32)
# window end coordinates (1-based, inclusive)
ds_hmm['variant_end'].values
array([ 300, 600, 900, ..., 61544700, 61545000, 61545105],
dtype=int32)
# copy number state predictions from the HMMs
ds_hmm['call_CN']
<xarray.DataArray 'call_CN' (variants: 205151, samples: 283)> dask.array<concatenate, shape=(205151, 283), dtype=int8, chunksize=(65536, 64), chunktype=numpy.ndarray> Coordinates: variant_position (variants) int32 dask.array<chunksize=(65536,), meta=np.ndarray> variant_end (variants) int32 dask.array<chunksize=(65536,), meta=np.ndarray> variant_contig (variants) uint8 dask.array<chunksize=(205151,), meta=np.ndarray> sample_id (samples) object dask.array<chunksize=(181,), meta=np.ndarray> Dimensions without coordinates: variants, samples
- variants: 205151
- samples: 283
- dask.array<chunksize=(65536, 64), meta=np.ndarray>
Array Chunk Bytes 58.06 MB 4.19 MB Shape (205151, 283) (65536, 64) Count 42 Tasks 20 Chunks Type int8 numpy.ndarray - variant_position(variants)int32dask.array<chunksize=(65536,), meta=np.ndarray>
Array Chunk Bytes 820.60 kB 262.14 kB Shape (205151,) (65536,) Count 5 Tasks 4 Chunks Type int32 numpy.ndarray - variant_end(variants)int32dask.array<chunksize=(65536,), meta=np.ndarray>
Array Chunk Bytes 820.60 kB 262.14 kB Shape (205151,) (65536,) Count 5 Tasks 4 Chunks Type int32 numpy.ndarray - variant_contig(variants)uint8dask.array<chunksize=(205151,), meta=np.ndarray>
Array Chunk Bytes 205.15 kB 205.15 kB Shape (205151,) (205151,) Count 1 Tasks 1 Chunks Type uint8 numpy.ndarray - sample_id(samples)objectdask.array<chunksize=(181,), meta=np.ndarray>
Array Chunk Bytes 2.26 kB 1.45 kB Shape (283,) (181,) Count 6 Tasks 2 Chunks Type object numpy.ndarray
Notice that the call_CN
array has 205,151 rows and 283 columns. That’s because there are 205,151 windows on chromosome arm 2R, and we’ve requested data for 283 individual mosquitoes. For example, here are the copy number state predictions for the first 5 genome windows, for the first sample:
cn = ds_hmm['call_CN'].values
cn[:5, 0]
array([7, 4, 4, 8, 8], dtype=int8)
The numbers 7, 4, 4, 8 and 8 are copy number states. Above I said that the normal copy number state is 2, so what’s going on here? Well, these values are from the beginning of the chromosome arm, and the ends of the chromosome can exhibit a lot of copy number variation. Let’s gain some more intuition by making a histogram of the call_CN
values across all windows and individuals.
fig, ax = plt.subplots()
ax.hist(cn.flatten(), bins=np.arange(-1, 13, 1))
ax.set_xlabel('Copy number state')
ax.set_ylabel('Frequency')
ax.set_title('Copy number histogram')
ax.set_xticks(np.arange(-1, 13, 1) + .5)
ax.set_xticklabels(np.arange(-1, 13, 1));

Above we can see that by far the most common copy number state is 2, as expected. A copy number state less than 2 implies a deletion relative to the reference genome. A copy number state greater than 2 implies an amplification relative to the reference genome. The special value -1 means that the copy number state could not be inferred because of unreliable sequence read alignments.
OK, now we have some sense of what the data look like, let’s explore some specific genome regions of interest.
Visualising amplifications and deletions¶
There are a number of ways of plotting the results of the CNV HMMs in order to visually search for the presence or absence of CNVs in a genome region of interest, and explore their locations and frequencies in different populations.
Cyp9k1 region¶
In our previous analysis of Ag1000G phase 2, we found that the region surrounding the Cyp9k1 gene on the X chromosome was a CNV hotspot. Let’s explore that in the Ag3
data resource.
ag3.plot_cnv_hmm_heatmap(
region="X:15,233,000-15,260,000",
sample_sets=["AG1000G-BF-A", "AG1000G-BF-B"],
sample_query="taxon == 'gambiae' and sex_call == 'F'"
);
In the above plots, we can see that many of the An. gambiae individuals have a copy number amplification spanning the Cyp9k1 gene.
Let’s compare with the An. coluzzii.
ag3.plot_cnv_hmm_heatmap(
region="X:15,233,000-15,260,000",
sample_sets=["AG1000G-BF-A", "AG1000G-BF-B"],
sample_query="taxon == 'coluzzii' and sex_call == 'F'"
);
Apparently, amplifications spanning Cyp9k1 in the An. coluzzii in these sample sets are much less common.
Cyp6p/aa region¶
For comparison, let’s also visualise CNVs in the Cyp6p/aa region on chromosome arm 2R, another known CNV hotspot.
ag3.plot_cnv_hmm_heatmap(
region="2R:28,450,000-28,580,000",
sample_sets=["AG1000G-BF-A", "AG1000G-BF-B"],
sample_query="taxon == 'gambiae'"
);
Again, let’s compare with the An. coluzzii from the same sample sets.
ag3.plot_cnv_hmm_heatmap(
region="2R:28,450,000-28,580,000",
sample_sets=["AG1000G-BF-A", "AG1000G-BF-B"],
sample_query="taxon == 'coluzzii'"
);
In this genome region, amplifications appear to be much more common among the An. coluzzii in these sample sets than the An. gambiae. This is an indication that, within Burkina Faso, different cytochrome P450 genes may be playing a role in pyrethroid insecticide resistance in different species.
Summarising CNV frequencies by gene¶
In addition to visualising the data, it can also be useful to summarise the HMM results by gene. In this case, we might use the modal copy number state within the windows spanning a gene to estimate the gene copy number within an individual mosquito. The gene_cnv_frequencies
method provides an implementation of this.
Let’s illustrate this by computing CNV frequencies for the genes in the Cyp6p/aa cluster, again using two Burkina Faso sample sets. These two sample sets provide data from two different species and time points, which can be intersting to compare.
Compute CNV frequencies in genes in the Cyp6aa/p cluster.
df_cnv_cyp6aap = ag3.gene_cnv_frequencies(
region="2R:28,450,000-28,510,000",
cohorts="admin1_year",
sample_sets=["AG1000G-BF-A", "AG1000G-BF-B"]
)
df_cnv_cyp6aap
gene_strand | gene_description | contig | start | end | frq_BF-09_colu_2012 | frq_BF-09_colu_2014 | frq_BF-09_gamb_2012 | frq_BF-09_gamb_2014 | max_af | windows | label | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gene_id | gene_name | cnv_type | ||||||||||||
AGAP002862 | CYP6AA1 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28480576 | 28482637 | 0.9125 | 0.811321 | 0.041237 | 0.065217 | 0.912500 | 8 | AGAP002862 (CYP6AA1) amp |
AGAP013128 | CYP6AA2 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28483301 | 28484921 | 0.8250 | 0.792453 | 0.041237 | 0.065217 | 0.825000 | 6 | AGAP013128 (CYP6AA2) amp |
AGAP002863 | COEAE6O | amp | - | carboxylesterase alpha esterase [Source:VB Com... | 2R | 28485262 | 28487080 | 0.6000 | 0.509434 | 0.010309 | 0.021739 | 0.600000 | 7 | AGAP002863 (COEAE6O) amp |
AGAP002864 | CYP6P15P | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28487640 | 28489092 | 0.5250 | 0.528302 | 0.010309 | 0.021739 | 0.528302 | 6 | AGAP002864 (CYP6P15P) amp |
AGAP002865 | CYP6P3 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28491415 | 28493141 | 0.0375 | 0.075472 | 0.010309 | 0.021739 | 0.075472 | 7 | AGAP002865 (CYP6P3) amp |
del | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28491415 | 28493141 | 0.0000 | 0.000000 | 0.113402 | 0.000000 | 0.113402 | 7 | AGAP002865 (CYP6P3) del | ||
AGAP002866 | CYP6P5 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28494017 | 28495645 | 0.0375 | 0.056604 | 0.020619 | 0.021739 | 0.056604 | 6 | AGAP002866 (CYP6P5) amp |
AGAP002867 | CYP6P4 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28497087 | 28498674 | 0.0375 | 0.075472 | 0.010309 | 0.021739 | 0.075472 | 6 | AGAP002867 (CYP6P4) amp |
AGAP002868 | CYP6P1 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28499251 | 28500900 | 0.0375 | 0.056604 | 0.010309 | 0.021739 | 0.056604 | 6 | AGAP002868 (CYP6P1) amp |
AGAP002869 | CYP6P2 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28501033 | 28502910 | 0.0375 | 0.056604 | 0.010309 | 0.021739 | 0.056604 | 7 | AGAP002869 (CYP6P2) amp |
AGAP002870 | CYP6AD1 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28504248 | 28505816 | 0.0375 | 0.056604 | 0.010309 | 0.021739 | 0.056604 | 6 | AGAP002870 (CYP6AD1) amp |
Finally, let’s make this table a little more readable by removing some columns and highlighting frequencies.
ag3.plot_frequencies_heatmap(df_cnv_cyp6aap)
Here we can see that the most commonly amplified gene is Cyp6aa1, which is amplified in 90% of An. coluzzii individuals in 2012 and 81% of An. coluzzii individuals in 2014.
Further reading¶
Hopefully this notebook has been a useful introduction to exploring CNVs in the Ag3
data resource. Don’t forget you can use this notebook to analyse any genome region of interest. If you have any comments, questions or suggestions, please feel free to get in touch.
This notebook is part of the Ag1000G phase 3 user guide. See the menu at the left for more documentation and example analyses.
If you’re interested to read more about CNVs in general and in Anopheles mosquitoes, check out the following links:
Lucas et al. (2019) Whole-genome sequencing reveals high complexity of copy number variation at insecticide resistance loci in malaria mosquitoes
Adolfi et al. (2019) Functional genetic validation of key genes conferring insecticide resistance in the major African malaria vector, Anopheles gambiae