It is a big era for sequencing. About forty years ago, Frederick Sanger and colleagues brought Sanger sequencing technique to the world. During the past four decades, tremendous progress has been made regarding speed, read length and throughput, along with a sharp reduction of per-base cost. Many sequencing techniques appeared in this megatrend as well as multiple outstanding companies. No matter what exciting techniques have been developed, these three main generations will not be ignored in this revolution: Sanger, NGS, and SMS. Here I will present some knowledge about the difference between them.
Sanger sequencing, which is used by the Human Genome Project (HGP), has finished its sequencing phase in 2003 from 1984. This project took hundreds of labs working together, and cost of around $100 million.
- High precision 「error rate: 0.001% - 1%」
- Long read
- Low throughput
- High cost
Next-Generation Sequencing (NGS)
With its unprecedented throughput, scalability, and speed, next-generation sequencing enables researchers to study biological systems at a level never been possible. Illumina NGS utilizes a fundamentally different approach from the classic Sanger chain-termination method. It leverages sequencing by synthesis (SBS) technology - tracking the addition of labeled nucleotides as the DNA chain is copied - in a massively parallel fashion. Besides, the cost of sequencing a single human genome is under $1000, which means detecting SNPs and structural variants or construct big databases will not only just a dream anymore.
- Low cost
- High throughput
- Decent Precision「error rate: 0.46% - 2.4%」
- Short read (which will lead to a low precision when performing sequence assembly)
Single Molecule Sequencing (SMS)
This DNA sequencing technology is done on a chip that contains many ZMWs (zero-mode waveguide), a structure that creates an illuminated observation volume that is small enough to observe only a single nucleotide of DNA being incorporated by DNA polymerase. Inside each ZMW, a single molecule of single-stranded DNA template is immobilized to the bottom through which light can penetrate and create a visualization chamber that allows monitoring of the activity of the DNA polymerase at a single molecule level. The signal from a phospho-linked nucleotide incorporated by the DNA polymerase is detected as the DNA synthesis proceeds which result in the DNA sequencing in real time. Plus, it cost under $100 for genome sequencing.
- Long read
- High throughput
- Indel (insert & delete)
- Low precision 「error rate: 11% - 14%」
When it mentions sequencing, the ultimate goal is to get a human genome directly. If there is a machine able to sequence the human genome directly, the following discussion will make no sense.
But it is not. We still need sequence multiple reads and assemble them. Theoretically, small reads (30-40bp) can construct a genome very well if there has a circumstance without repeats or heterozygous. Back to reality, we need to confront the fact that the problems are still here that wait to be settled. Lots of assembly algorithms are dealing with these two tricky problems: de Bruijn graph, Newbler, AMOS, and CAP3, etc.
NGS is a significant trend, sanger is “Gold standard”, while SMS is the future. Even though the SMS has an extremely high error rate right now, but it will come into a blossom era when genius scientists keep moving on. There are two major classes of assembly algorithms: overlap-layout-consensus (OLC) and de-Bruijn-graph (DBG). Considering the computational consumption of time and memory, the OLC algorithm is more suitable for the low-coverage long reads, whereas the DBG algorithm is more suitable for high-coverage short reads and especially for large genome assembly. In the next decades, it may be a glorious era for OLC.
Entering the second decade of the 21st century, high-quality genome sequences for many species are still in high demand by the genomics field. Besides the second-generation sequencing technologies, there are many other new technologies helpful for de novo sequencing, including the single molecular sequencing PacBio (www.pacificbiosciences.com) and the Optical Mapping environmental technology OpGen (www.opgen.com), which has recently entered the market. PacBio produces extreme long reads (1–10 kb) but with a high error rate (15–20%), whereas OpGen can generate 100 kb–1 Mb length physically linked markers, which facilitates the physical map construction. Driven by the development of these technologies, we see a bright future for de novo assembly of large genomes, with the standard likely to be raised from draft sequence to chromosome level sequence and from haploid sequence to diploid sequence. The development of assembly algorithms is tied closely to the development of sequencing technologies. The history has turned from OLC for Sanger sequencing, to DBG for second-generation sequencing, and the future will likely lead back to OLC for long reads sequencing. In the next few years, short reads assembly and long reads assembly may co-exist, and both the OLC and DBG algorithms will be improved continuously. Small, simple genomes can be assembled well with pure short reads; average difficulty genomes can use the hybrid assembly method using both long and short reads, whereas large complex genomes will rely more on long reads assembly. As outlined here, it is clear that sequencing technologies and assembly algorithms will change rapidly over the next few years, and construction will get more comfortable and better as technologies continue to improve.