Structural variants and the limits of genome sequences

Sequence is everything.

Part of the DNA sequence from a human genome. Photo courtesy Ian Glover, Flickr

Or not.

While genomics never held a genome’s sequence to be the whole story, there was a time around the turn of the century when it seemed like base pair sequences were all we sought. As the Human Genome Project ramped up, then neared completion, the furor around it made it seem like the foundations of wellness and disease, longevity, athletic talent and much more all resided in linear readouts of ACGTs.

Now we know that sequence tells only part of the genomics story, and that part is smaller than initially thought. After the human reference genome was published over a decade ago, researchers started to find a lot of noise in the sequence signal. While human genomic variation was primarily thought of in terms of SNPs (single nucleotide polymorphisms, the natural variation between single bases in the population), researchers like Charles Lee (now with JAX-Genomic Medicine), Michael Wigler (Cold Spring Harbor) and Stephen Scherer (Hospital for Sick Children, Toronto) noticed additional variation in the form of copy number variants (CNVs). That is, instead of all genes having two copies (one from each parent) in a standard diploid genome, the gene numbers varied more than expected. There might be one copy, or three or more. And these CNVs occurred with some frequency in the healthy population, not only in people with severe disease, contributing more to genomic variation between individuals than SNPs do.

As it turns out, CNVs are only one type of what are now called structural variants (SVs). Genomes are riddled with them, including CNVs, insertions and deletions of DNA segments (indels), inversions (sections where the sequence is correct but the order is reversed), and more. Standard short-read sequencing methods, where the genome is sheared into 250-bp or even shorter segments, often miss these variations. Putting the genome back together depends on assembling the fragments using the reference sequence as a template, which loses structural differences. In other words, when you chop the genome into small pieces, it is then difficult to tell whether a sequence is present in higher or lower frequency, whether it has been inserted elsewhere in the genome, in which direction it is read, and so on.
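To make the point about short reads concrete, here is a minimal toy sketch in Python. It is not a real aligner or any published pipeline: the random 200-base "reference", the inversion coordinates, the 20-base read length and every function name are assumptions made up for illustration, and exact string matching on either strand stands in for read alignment. It inverts one segment of the reference, chops the result into short reads, and reports which reads fail to match the reference.

```python
import random

# Lookup table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence as read from the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


def invert_segment(genome, start, end):
    """Simulate an inversion: the segment [start, end) is flipped in place."""
    return genome[:start] + reverse_complement(genome[start:end]) + genome[end:]


def short_reads(genome, read_len):
    """Chop a genome into every overlapping read of length read_len."""
    return [(i, genome[i:i + read_len]) for i in range(len(genome) - read_len + 1)]


def maps_to_reference(read, reference):
    """Toy stand-in for alignment: exact match on either strand."""
    return read in reference or reverse_complement(read) in reference


def unmapped_read_positions(sample, reference, read_len):
    """Start positions of sample reads that do not match the reference."""
    return [pos for pos, read in short_reads(sample, read_len)
            if not maps_to_reference(read, reference)]


# Hypothetical toy data: a random reference and a sample carrying one inversion.
random.seed(0)
reference = "".join(random.choice("ACGT") for _ in range(200))
START, END, READ_LEN = 80, 140, 20
sample = invert_segment(reference, START, END)

unmapped = unmapped_read_positions(sample, reference, READ_LEN)
total = len(sample) - READ_LEN + 1
print(f"{total - len(unmapped)} of {total} reads match the reference")
print("reads that fail to match start at:", unmapped)
```

In this toy model, every read that lies entirely inside or entirely outside the inverted segment still matches the reference, because a read from the flipped segment is simply a read from the other strand. Only the handful of reads that span a breakpoint carry any trace of the rearrangement, and in practice those are the reads most likely to be discarded or misplaced.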

A couple of recent papers looking into very different research areas have underscored the importance of structural variation. In the first paper, which appeared in Genome Research earlier in the summer and about which I’ve already written in more detail, a team led by Laura Reinholdt at The Jackson Laboratory used an analytics pipeline optimized for mouse exome data and a powerful exome variation database. They found the likely disease-causing mutations in 53% of the mouse strains they examined, with 11% of the mutations discovered in novel genes. This was a success, but what about the other 47%? They dug deeper to find out.

What they found were copy number variants (where instead of two copies of a gene there is only one or there are three or more) or structural mutations, where sequences are added to or deleted from a gene. Both elude typical approaches for analyzing exome sequencing data and even most whole genome sequencing analyses. By adding a protocol that detects structural anomalies, in many cases they were able to find the copy number variation or structural mutation that caused the disease phenotype in the mice. This is of crucial importance, because the current success rate for finding sequence mutations in rare disease patients hovers at around 25%. Analyses that better assess structural variations hold the potential to find far more of the causative changes for these patients.

The other paper, from Cell Reports, looked at gastric cancers, which occur at relatively high frequencies in Southeast Asian populations. In this case the causal changes can occur in areas where chromosomes exchange sequences, creating a combined “fusion” protein that initiates cancer. Again, the correct sequences are all there, so when sequencing is done with short reads and assembled against a reference genome, all looks normal. So a team led by Yijun Ruan from The Jackson Laboratory for Genomic Medicine instead used a sequencing technique that detects and characterizes genomic structural rearrangements to analyze the gastric cancer cells.

They found five recurrent fusion proteins and calculated that these occurred at frequencies of 2%-5% in gastric cancers. They also investigated one particular fusion protein, a combination of CLDN18 and ARHGAP26, and determined that it impaired cell adhesion, leading to gastritis, a risk factor for gastric cancer. While more research is necessary, it is clear that standard sequencing protocols don’t tell the whole story in many cancers.

This all makes sense at a certain level, because genomes are not linear. They are highly complicated and convoluted structures, and they are far more dynamic and intertwined than linear sequences might have us believe. Keep an eye on structural variants, because the more we learn about genomes, the better we’ll understand the roles they play. And it’s clear now that they contribute far more to health and disease than we could have imagined in the “sequence is everything” days a mere 15 years ago.


Mark Wanner followed graduate work in microbiology with more than 25 years of experience in book publishing and scientific writing. His work at The Jackson Laboratory focuses on making complex genetic, genomic and technical information accessible to a variety of audiences. Follow Mark on Twitter at @markgenome.