1. Complete doesn’t necessarily mean complete.
The original “complete” human genome sequence — and the vast majority of the ones that have followed — actually omits nearly 10% of the full sequence. There are parts of chromosomes that are very difficult to sequence using standard methods, such as the centromeres — the middle of the “x” shape where chromosomes condense — and telomeres at the ends of each chromosome. There are also other, highly repetitive regions that are impossible to accurately align against a reference sequence, so they are omitted. Roughly 20 years ago, when the first sequence was reported as finished, there was a lot of debate about the importance and possible function of the 98.5% of the genome that doesn’t code for proteins, so leaving certain non-coding regions out seemed reasonable. We now know that these sequences can be quite important indeed, and recent efforts such as the T2T Consortium have used improved long-read sequencing technologies (see #4 below) to, at long last, fill in the gaps.
2. One genome doesn’t teach us nearly as much as we thought it would.
The first human genome sequence was generated from a sample provided by a male resident of northern European descent from the Buffalo, New York, region of the United States. It was lauded as the “blueprint of life,” and it was thought at the time that it would be sufficient to provide significant insights into health and disease, even before more sequences were completed and analyzed. That has not turned out to be the case, for reasons that seem rather obvious in hindsight. The single data set leaves out one of the two sexes entirely, of course, as well as every other human population on Earth. It’s been known for a long time that the two sexes as well as different ethnic groups carry different disease susceptibilities and risks, but it took a while to hammer home the message that genomic data sets need to diversify. And while the situation has improved, most sequenced genomes are still of northern European ancestry: in short, much work remains to be done.
3. Loss of function doesn’t always lead to adverse effects.
Ah, it looked so simple two decades ago. Sequence the genome, look at the coding sequences for mutations or variants associated with disease, and the roadmap to health and wellness would be clear. Well, maybe the thinking wasn’t quite that simplistic, but it was close. The research since has wiped such assumptions completely away, with layers of added complexity uncovered with each new discovery. An important finding in human datasets revealed that we all carry genetic variants that cause loss of function in many genes; the average healthy person carries about 100 such dysfunctional genes. Research with mice has muddied the waters further, as knocking out the same gene in multiple inbred (that is, genetically pretty much identical) mice can result in highly variable outcomes between individuals. Sometimes the difference is extreme, with some mice not even surviving until birth and others not only living but appearing essentially normal. The effect of a genetic variant therefore depends on the context of each individual’s genome, and fully penetrant mutations, those that consistently lead to disease or other effects, are the exception rather than the rule.
4. Most genomic variation is at the structural level.
Although human genomes are highly similar on a percentage basis, they still vary naturally at the single-base level at millions of places along the ~3.2 billion base pair sequence. An adenine in my genome might be a guanine in yours, leading to the subtle changes that, in sum, make us unique. Little did we know 20 years ago that such differences, known to scientists as single nucleotide polymorphisms (SNPs), tell only part of the story. Other, larger variants, called structural variants (SVs), lurk in the genome, but they were rarely detectable by short-read sequencing methods. Those methods break the genome into small fragments and reassemble them against a reference, and a fragment that lies within a duplicated or rearranged stretch can look perfectly normal on its own. SVs involve sequence deletions, duplications, inversions and insertions within the genome, and in total they affect more base pairs than SNPs do. Advanced long-read sequencing methods, which can cover hundreds of thousands and even millions of bases in a single read, are now being used to detect and characterize these SVs, which are also being associated with various diseases.
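To make the arithmetic concrete, here is a minimal Python sketch of how a handful of large variants can touch far more bases than a larger number of single-base changes. The variant list is entirely made up for illustration; the names `VARIANTS` and `summarize` are hypothetical and do not come from any real variant-calling tool.

```python
from collections import defaultdict

# Hypothetical variant calls for illustration only: (kind, bases affected).
VARIANTS = [("SNP", 1)] * 8 + [
    ("deletion", 250),
    ("duplication", 1200),
    ("inversion", 3000),
]


def summarize(variants):
    """Tally how many variants, and how many bases, fall under SNPs vs. SVs."""
    counts = defaultdict(int)
    bases = defaultdict(int)
    for kind, length in variants:
        # Anything that is not a single-base change is grouped as structural.
        group = "SNP" if kind == "SNP" else "SV"
        counts[group] += 1
        bases[group] += length
    return dict(counts), dict(bases)


counts, bases = summarize(VARIANTS)
print(counts)  # {'SNP': 8, 'SV': 3}
print(bases)   # {'SNP': 8, 'SV': 4450}
```

In this toy data set SNPs outnumber SVs, yet the SVs account for nearly all of the variable bases. Real genomes differ in scale, but the same logic applies.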
5. The road from gene to messenger RNA to protein is long and winding.
While in college and graduate school too many years ago, I learned the Central Dogma of Life. Genes, made of A, T, C and G nucleotides, are transcribed from their DNA templates to produce pre-messenger RNAs (pre-mRNAs). The pre-mRNAs are processed, so that sequences known as introns are edited out and the remaining exons are stitched together into mature mRNAs. The mRNAs then travel from the nucleus, which houses the genome in each cell, to the cytoplasm, where ribosomes are located. Each three-nucleotide sequence in the mRNA, known as a codon, codes for a specific amino acid. The amino acids are added one by one at the ribosome in a process called translation, until voilà, a protein is produced that will join the vast array of other proteins to carry out the functions needed by the cell. Except it’s nowhere near that simple. There is a dizzying regulatory network that determines when a gene is transcribed and how much of it is transcribed. The processing of pre-mRNAs is not uniform, and alternative splicing can create many different proteins, known as isoforms, from the same gene. Problems with any step of the process, from aberrant DNA transcription, to splicing defects, to translation misfires, can lead to dysfunction and disease.
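The textbook version of that pathway, plus one wrinkle from alternative splicing, can be sketched in a few lines of Python. This is a toy model under loud assumptions: the gene in `SEGMENTS` is invented, `CODON_TABLE` holds only the handful of codons the example needs (the real genetic code has 64), and transcription is simplified to copying the coding strand with T replaced by U. All function names are hypothetical.

```python
# Partial codon table: only the codons used by the toy gene below.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the start codon
    "GCU": "A",  # alanine
    "GGA": "G",  # glycine
    "AAA": "K",  # lysine
    "UUU": "F",  # phenylalanine
    "UAA": "*",  # stop
}

# Made-up gene: three exons separated by two introns (coding strand, 5' to 3').
SEGMENTS = [
    ("exon", "ATGGCT"),
    ("intron", "GTAAGTTTCAG"),
    ("exon", "GGAAAA"),
    ("intron", "GTCCCAG"),
    ("exon", "TTTTAA"),
]
GENE = "".join(seq for _, seq in SEGMENTS)


def exon_coordinates(segments):
    """Return (start, end) positions of each exon within the joined gene."""
    coords, pos = [], 0
    for kind, seq in segments:
        if kind == "exon":
            coords.append((pos, pos + len(seq)))
        pos += len(seq)
    return coords


def transcribe(coding_strand):
    """Simplified transcription: the pre-mRNA matches the coding strand, with U for T."""
    return coding_strand.replace("T", "U")


def splice(pre_mrna, exons):
    """Keep only the listed exon ranges and join them into a mature mRNA."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(mrna):
    """Read codons from the first AUG until a stop codon is reached."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


EXONS = exon_coordinates(SEGMENTS)
pre_mrna = transcribe(GENE)

full_isoform = translate(splice(pre_mrna, EXONS))
skipped_isoform = translate(splice(pre_mrna, [EXONS[0], EXONS[2]]))  # exon 2 skipped

print(full_isoform)     # MAGKF
print(skipped_isoform)  # MAF
```

The same gene yields two different proteins depending on which exons are kept, which is the essence of an isoform. Everything the sketch leaves out (regulation, real splice-site recognition, the full codon table) is where much of the complexity described above lives.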