G***** Maps

Google Maps as an analogy to read mapping.


Deoxyribonucleic acid (DNA) is the molecule that carries most of the genetic information used by the cell. Nucleotides (conventionally represented by the letters A, C, G or T) are the basic units of DNA molecules. DNA sequencing is the process of reading the nucleotides that comprise a DNA molecule (e.g. “GCAAACCAAT” is a 10-nucleotide DNA string). The genome is the complete set of DNA molecules in a cell and knowing its nucleotide sequence is key to understand how the cell works. In most organisms genomes are organized into one or more relatively long DNA molecules known as chromosomes, each containing thousands to millions of nucleotides, also named base pairs (bp). For instance, the human genome comprises ~3 billion bp [1]. Even the smallest genome, that of the bacteria Nasuia deltocephalinicola, has a size of 112,000 bp [2], so reading the sequence of a genome or part of it requires the experimental capability to sequence thousands of nucleotides.


Fortunately, current sequencing technologies allow such a high throughput in a reasonable time and at a relatively low cost. Genomes can nowadays be sequenced in days for less than $10,000, and sequencing companies keep shortening the sequencing time and are quickly reaching the $1,000 genome barrier. Although this may seem costly, it represents a substantial price drop compared to the first initiative to sequence the human genome [1], which was completed in a decade and had a cost of about $2.7 billions.


Our capability to sequence entire genomes comes at a price: current sequencers produce millions of relatively short genome readouts (or simply called “reads”) for which their nucleotide composition is known but their original location in the genome is not. Typical read lengths are 50–250 nucleotides long but novel sequencing strategies able to read DNA molecules with lengths larger than thousands of nucleotides are under development. Most genomic applications rely on the ability to match each bit of sequenced DNA to a specific position in the genome of reference (e.g. human genome), a process known as alignment or mapping –without this process the raw output of sequencers is merely millions of unsorted 4-letters strings! I was recently trying to explain this process to my wife when I came out with an analogy that hopefully will illustrate it well enough.


One can think of the mapping to the genome as the task of finding an address in Google Maps. Imagine we get a handwritten address of the restaurant we are meeting for dinner: “Al Mahara”. Without more information, knowing where we will have a delicious meal tonight seems as difficult as placing the DNA string “AAACGTTGGCA” in its correct position in the genome. A quick search using Google Maps will find the address of the restaurant (city, street and number) and show it on the map. Similarly, sequence aligners or mappers will (try to) find the genomic coordinates of our DNA string; in this way, we can find that “AAACGTTGGCA” corresponds to the position number 33,031,597 on chromosome 21.


Just as it could happen in a Google Maps search, there are several factors that can complicate the mapping of sequencing reads to the genome.


For one, the read we get from the sequencing machine may contain errors and thus not fully match the sequence of the string of DNA whose coordinates we want to find. Typical sequencing errors include missing nucleotides and miscalls. A missing nucleotide, represented by an “N”, is included in the output read when the sequencer fails to identify the actual nucleotide (e.g. “AAACGTTGNNA”); it is the sequencer’s way to say “I just don’t know”. Just as reading the name of the restaurant from someone with a terrible handwriting complicates its search in Google Maps, read aligners may have a harder time when trying to map reads with missing information. Miscalls occur when, for instance, the read sequence contains a T where the actual DNA string has an A. Trying to align reads containing such errors to the genome reference can be as misleading as googling “Al Mahtra” instead of “Al Mahara” (certainly a luxury restaurant in Dubai is not the same as a tiny village in Estonia).


Similarly, the read of the length may be too short to be conclusively mapped to the genome. For instance, trying to find the location of a 10-bp DNA sequence may be as naive as hoping that Google Maps will find a single place when fed with “Café”. This is further complicated by the presence of DNA chunks in the genome that are not unique but repeated multiple times, just as in many parts of the world stores of a given brand are everywhere around the corner. Trying to map a DNA fragment derived from such repetitive regions back to the genome will likely produce many possible locations and thus be as little conclusive as writing “McDonald’s” in Google Maps. So, the longer the read the more information is provided to the mapper and the higher the chance that a single address in the genome pops up. This especially important for mapping repetitive genomic regions: longer reads equal to giving more context information to the mapper, just as one could add the city and the neighborhood to the search of a multihit restaurant.


The ingredients for an effective search into the map of the genome may not be very surprising after all: minimize missing letters and typos and maximize the input information. Fair enough. Do you feel like start searching genome addresses?



Bibliography

[1] Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. doi:10.1038/35057062.

[2] Bennett GM, Moran NA. Small, smaller, smallest: the origins and evolution of ancient dual symbioses in a Phloem-feeding insect. Genome Biol Evol. 2013;5:1675–88. doi:10.1093/gbe/evt118.

Written on July 18, 2017