3. Genome sequencing

3.2. Genome sequencing technologies

3.2.1. Sanger sequencing

Until the mid-2000s, DNA sequencing relied primarily on the Sanger technique, which was invented in the 1970s and partially automated in the late 1980s with the introduction of sequencing machines. These first-generation sequencers represented great progress at the time and were based on the size fractionation of DNA fragments by electrophoresis, with laser detection of the four possible bases, each labeled with a different fluorochrome. However, a separate sequencing reaction, based on copying a template DNA, had to be performed for each 500 to 1,000 bp read produced. This technique has become obsolete in favor of next-generation sequencing, but it is still used for sequencing polymerase chain reaction (PCR) amplicons when targeting specific regions of a honey bee genome.

3.2.2. Next generation sequencing

The next major genome sequencing breakthrough was the advent of next-generation sequencing (NGS) techniques in the mid-2000s. Although these were initially offered by three companies (Roche, Applied Biosystems, and Solexa/Illumina), today the dominant platform is Illumina. The breakthrough came from the fact that sequencing reactions are no longer performed individually, but simultaneously on a surface, or flow cell. This allows millions of DNA fragments to be amplified in parallel, with fluorescently labeled nucleotides added and detected sequentially. Parallel sequencing has a very high throughput and can currently produce up to billions of reads per run. However, these reads are short (150-250 bp, depending on the technology used), which can be a major limitation, especially for de novo sequencing. This limitation is partially overcome by the fact that two 150 bp reads (a read pair) can be produced, one from each end of a DNA fragment. Before the advent of long-read sequencing, read pairs separated by up to 10 kb (mate pairs) could be produced to help with sequence assembly and scaffolding. Parallel sequencing is used, for instance, in population genomics or for generating a very high density of unbiased markers in genome-wide association studies.
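To make the read-pair idea concrete, the short Python sketch below steps through a paired-end run as (R1, R2) read pairs. The file names and the plain four-line FASTQ layout are assumptions about a typical Illumina delivery; real files are usually gzip-compressed and are handled with dedicated tools rather than ad hoc scripts.

from itertools import zip_longest

def fastq_records(path):
    """Yield (header, sequence, quality) tuples from a four-line FASTQ file."""
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:
                return                       # end of file
            seq = handle.readline().rstrip()
            handle.readline()                # '+' separator line
            qual = handle.readline().rstrip()
            yield header, seq, qual

# Hypothetical file names for the two ends of each fragment (read 1 and read 2).
for r1, r2 in zip_longest(fastq_records("sample_R1.fastq"),
                          fastq_records("sample_R2.fastq")):
    if r1 is None or r2 is None:
        raise ValueError("R1 and R2 files contain different numbers of reads")
    # The two records describe both ends of one DNA fragment and share a read ID.
    read_id = r1[0].split()[0]
    print(read_id, len(r1[1]), len(r2[1]))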

3.2.3. Long-read sequencing

Long-read sequencing, pioneered by Pacific Biosciences (PacBio) and Oxford Nanopore, is the newest sequencing approach. These technologies can produce reads longer than 10 kb, although until recently at the cost of a high sequence error rate. At the time of writing, both parallel and long-read sequencing are the technologies of choice and are used either independently or in combination. Long-read sequencing is often used to produce new genome assemblies and to detect structural variants (SVs). For a detailed discussion of using long-read sequencing for transcriptomics, please refer to [Section 6.2.2].
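As a small illustration of how long-read runs are typically summarized, the Python sketch below computes the read-length N50, i.e., the length L such that reads of at least L together contain half of the sequenced bases; the read lengths used here are invented for the example.

def read_length_n50(lengths):
    """Length L such that reads of length >= L contain half of all sequenced bases."""
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    return 0

# Invented read lengths (in bp) standing in for one long-read run.
read_lengths = [22_000, 15_000, 9_500, 8_000, 3_200, 1_100]
print(read_length_n50(read_lengths))   # 15000: half of the bases are in reads >= 15 kb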

Today, sequencing is done by dedicated core facilities or private companies. Users submit their samples or DNA and, in return, receive the sequencing files along with a quality assessment of the sequenced data. Most of the work then consists of analyzing the data to extract biological meaning. However, depending on the biological question, and perhaps also on budget considerations, the sequencing strategy (e.g., read depth, platform) has to be defined in advance, and at least some basic knowledge of the advantages and limits of the current technologies is required. Sequencing platforms often provide tools to guide the new user (e.g., coverage calculators, sample pooling normalization calculators), but we do recommend consulting sequencing specialists as a very first step.
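As an illustration of the arithmetic behind such a coverage calculator, the Python sketch below uses the usual back-of-envelope relation, expected depth = (number of reads × read length) / genome size. The honey bee genome size of roughly 225 Mb and the run parameters are approximate values chosen for the example; real calculators also account for factors such as duplication and mapping rates.

GENOME_SIZE_BP = 225_000_000   # approximate Apis mellifera genome size (assumption)

def expected_depth(n_reads, read_length_bp, genome_size_bp=GENOME_SIZE_BP):
    """Average sequencing depth expected from n_reads reads of a given length."""
    return n_reads * read_length_bp / genome_size_bp

def reads_needed(target_depth, read_length_bp, genome_size_bp=GENOME_SIZE_BP):
    """Number of reads (or read pairs) needed to reach a target average depth."""
    return target_depth * genome_size_bp / read_length_bp

# A 2 x 150 bp paired-end run contributes 300 bp per read pair.
pairs_for_30x = reads_needed(target_depth=30, read_length_bp=2 * 150)
print(f"{pairs_for_30x:,.0f} read pairs for about 30x depth")                    # 22,500,000
print(f"{expected_depth(25_000_000, 2 * 150):.1f}x from 25 million read pairs")  # 33.3x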