Next generation sequencing
Introduction
The sequencing of the human genome was completed in 2003, after 13 years of international collaboration and investment of USD 3 billion. The Human Genome Project used Sanger sequencing (albeit heavily optimized), the principal method of DNA sequencing since its invention in the 1970s.
Today, the demand for sequencing is growing exponentially, with large amounts of genomic DNA needing to be analyzed quickly, cheaply, and accurately. Thanks to new sequencing technologies known collectively as Next Generation Sequencing, it is now possible to sequence an entire human genome in a matter of hours.
Sanger sequencing and Next-generation sequencing
The principle behind Next Generation Sequencing (NGS) is similar to that of Sanger sequencing, which relies on capillary electrophoresis. The genomic strand is fragmented, and the bases in each fragment are identified by the signals emitted as a complementary strand is built up against each fragment, which acts as the template. The Sanger method required separate steps for sequencing, separation (by electrophoresis) and detection, which made sample preparation difficult to automate and limited throughput, scalability and resolution. NGS uses array-based sequencing, which combines the techniques developed in Sanger sequencing to process millions of reactions in parallel, resulting in very high speed and throughput at a reduced cost. Genome sequencing projects that took many years with Sanger methods can now be completed in hours with NGS, although with shorter read lengths (the number of bases that are sequenced at a time) and lower accuracy.
Next generation methods of DNA sequencing have three general steps:
- Library preparation: libraries are created using random fragmentation of DNA, followed by ligation with custom linkers
- Amplification: the library is amplified using clonal amplification methods and PCR
- Sequencing: DNA is sequenced using one of several different approaches
Library preparation
Firstly, DNA is fragmented either enzymatically or by sonication (excitation using ultrasound) to create smaller strands. Adaptors (short, double-stranded pieces of synthetic DNA) are then ligated to these fragments with the help of DNA ligase, an enzyme that joins DNA strands. The adaptors enable the sequence to become bound to a complementary counterpart.
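The fragmentation and ligation steps described above can be sketched in a few lines of Python. This is only a toy model on single-strand text: the adaptor sequences, the function names and the mean fragment length are all hypothetical values chosen for illustration, and real fragmentation is not uniformly random.

```python
import random

# Hypothetical adaptor sequences, for illustration only
LEFT_ADAPTOR = "AATGATACGG"
RIGHT_ADAPTOR = "CCGTATCATT"

def fragment(dna, mean_length, rng):
    """Cut dna at random positions so fragments average roughly mean_length bases."""
    n_cuts = max(len(dna) // mean_length - 1, 0)
    # Distinct cut points strictly inside the sequence, so no fragment is empty
    cuts = sorted(rng.sample(range(1, len(dna)), n_cuts))
    bounds = [0] + cuts + [len(dna)]
    return [dna[a:b] for a, b in zip(bounds, bounds[1:])]

def ligate_adaptors(fragments, left, right):
    """Model ligation by flanking every fragment with the two adaptors."""
    return [left + f + right for f in fragments]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
genome = "".join(rng.choice("ACGT") for _ in range(200))
fragments = fragment(genome, 50, rng)
library = ligate_adaptors(fragments, LEFT_ADAPTOR, RIGHT_ADAPTOR)

for molecule in library:
    print(len(molecule), molecule[:25] + "...")
```

Because every library molecule now starts and ends with a known sequence, the same primers can be used downstream regardless of what the fragment in the middle contains.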
Adaptors are synthesised so that one end is 'sticky' whilst the other is 'blunt' (non-cohesive) with the view to joining the blunt end to the blunt ended DNA. This could lead to the potential problem of base pairing between molecules and therefore dimer formation. To prevent this, the chemical structure of DNA is utilised, since ligation takes place between the 3'-OH and 5'-P ends. By removing the phosphate from the sticky end of the adaptor and therefore creating a 5'-OH end instead, the DNA ligase is unable to form a bridge between the two termini (Figure 1).
In order for sequencing to be successful, the library fragments need to be spatially clustered in PCR colonies or 'polonies' as they are conventionally known, which consist of many copies of a particular library fragment. Since these polonies are attached in a planar fashion, the features of the array can be manipulated enzymatically in parallel. This method of library construction is much faster than the previous labour-intensive procedure of colony picking and E. coli cloning used to isolate and amplify DNA for Sanger sequencing; however, this comes at the expense of the read length of the fragments.
Amplification
Library amplification is required so that the signal received by the sequencer is strong enough to be detected accurately. With enzymatic amplification, phenomena such as 'biasing' and 'duplication' can occur, leading to preferential amplification of certain library fragments. Instead, there are several types of amplification process which use PCR to create large numbers of DNA clusters.
Emulsion PCR
Emulsion oil, beads, PCR mix and the library DNA are mixed to form an emulsion which leads to the formation of micro wells (Figure 2).
In order for the sequencing process to be successful, each micro well should contain one bead with one strand of DNA (approximately 15% of micro wells are of this composition). The PCR then denatures the library fragment, leading to two separate strands, one of which (the reverse strand) anneals to the bead. The annealed DNA is amplified by polymerase starting from the bead towards the primer site. The original reverse strand then denatures and is released from the bead, only to re-anneal to the bead to give two separate strands. These are both amplified to give two DNA strands attached to the bead. The process is then repeated over 30-60 cycles, leading to clusters of DNA. Despite its extensive use in many of the NGS platforms, this technique has been criticised for its time-consuming nature, since it requires many steps (forming and breaking the emulsion, PCR amplification, enrichment etc.). It is also relatively inefficient, since only around two thirds of the emulsion micro reactors will actually contain one bead. An extra step is therefore required to separate out the empty systems, which introduces more potential inaccuracies.
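The effect of repeating the cycle can be illustrated with the standard idealised PCR relationship, in which each cycle multiplies the number of copies by (1 + efficiency). The sketch below is an assumption-laden simplification: the function name, the efficiency parameter and the optional capacity cap (standing in for the finite number of primer sites on a bead) are not taken from the text.

```python
def copies_after(cycles, efficiency=1.0, start=1, capacity=None):
    """Idealised PCR yield: each cycle multiplies the copy number by (1 + efficiency).

    efficiency=1.0 means perfect doubling; capacity (optional) caps the yield,
    e.g. to mimic a bead running out of primer sites. Both are assumptions.
    """
    copies = start * (1.0 + efficiency) ** cycles
    if capacity is not None:
        copies = min(copies, capacity)
    return copies

for n in (10, 20, 30):
    print(n, "cycles:", copies_after(n), "copies (perfect doubling)")
```

In practice the yield levels off well before the theoretical figure, because reagents are consumed and the bead surface saturates, which is why the capped form is likely to be the more realistic of the two.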
Bridge PCR
The surface of the flow cell is densely coated with primers that are complementary to the primers attached to the DNA library fragments (Figure 3). The DNA is then attached to the surface of the cell at random where it is exposed to reagents for polymerase based extension. On addition of nucleotides and enzymes, the free ends of the single strands of DNA attach themselves to the surface of the cell via complementary primers, creating bridged structures. Enzymes then interact with the bridges to make them double stranded, so that when the denaturation occurs, two single stranded DNA fragments are attached to the surface in close proximity. Repetition of this process leads to clonal clusters of localised identical strands. In order to optimise cluster density, concentrations of reagents must be monitored very closely to avoid overcrowding.
Sequencing
Several competing methods of Next Generation Sequencing have been developed by different companies.
454 Pyrosequencing
Pyrosequencing is based on the 'sequencing by synthesis' principle, where a complementary strand is synthesised in the presence of polymerase enzyme (Figure 4). In contrast to using dideoxynucleotides to terminate chain amplification (as in Sanger sequencing), pyrosequencing instead detects the release of pyrophosphate when nucleotides are added to the DNA chain. It initially uses the emulsion PCR technique to construct the polonies required for sequencing and removes the complementary strand. Next, a ssDNA sequencing primer hybridizes to the end of the strand (primer-binding region), and the four different dNTPs are then sequentially made to flow in and out of the wells over the polonies. When the correct dNTP is enzymatically incorporated into the strand, it causes release of pyrophosphate. In the presence of ATP sulfurylase and adenosine 5'-phosphosulfate (APS), the pyrophosphate is converted into ATP. This ATP molecule is used for luciferase-catalysed conversion of luciferin to oxyluciferin, which produces light that can be detected with a camera. The relative intensity of light is proportional to the number of bases added (i.e. a peak of twice the intensity indicates two identical bases have been added in succession).
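The relationship between flow order, light intensity and sequence can be shown with a small simulation. This is a sketch under stated assumptions: the flow order "TACG" and the function names are illustrative choices, the signal is taken to be noise-free, and the input is the strand being synthesised rather than the template.

```python
def flowgram(read, flow_order="TACG", max_flows=400):
    """Return (base, intensity) pairs for each dNTP flow over the growing strand.

    Intensity is the number of identical bases incorporated during that flow,
    mirroring the proportional light signal described in the text.
    """
    signals = []
    pos = 0
    flow = 0
    while pos < len(read) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        run = 0
        # Incorporate every consecutive base that matches the dNTP in this flow
        while pos < len(read) and read[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
        flow += 1
    return signals

def decode(signals):
    """Rebuild the sequence by repeating each base round(intensity) times."""
    return "".join(base * round(intensity) for base, intensity in signals)

signals = flowgram("TTAGC")
print(signals)
print(decode(signals))
```

A flow that produces no light (intensity 0) simply means the next base did not match that dNTP, while an intensity of 2 corresponds to the double-height peak mentioned above.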
Pyrosequencing, developed by 454 Life Sciences, was one of the early successes of Next-generation sequencing; indeed, 454 Life Sciences produced the first commercially available Next-generation sequencer. However, the method was eclipsed by other technologies and, in 2013, new owners Roche announced the closure of 454 Life Sciences and the discontinuation of the 454 pyrosequencing platform.
Ion torrent semiconductor sequencing
Ion torrent sequencing uses a "sequencing by synthesis" approach, in which a new DNA strand, complementary to the target strand, is synthesized one base at a time. A semiconductor chip detects the hydrogen ions produced during DNA polymerization (Figure 5).
Following polony formation using emulsion PCR, the DNA library fragment is flooded sequentially with each deoxynucleoside triphosphate (dNTP), as in pyrosequencing. The dNTP is then incorporated into the new strand if complementary to the nucleotide on the target strand. Each time a nucleotide is successfully added, a hydrogen ion is released, and it is detected by the sequencer's pH sensor. As in the pyrosequencing method, if more than one of the same nucleotide is added, the change in pH/signal intensity is correspondingly larger.
Ion torrent sequencing is the first commercial technique not to use fluorescence and camera scanning; it is therefore faster and cheaper than many of the other methods. Unfortunately, it can be difficult to enumerate the number of identical bases added consecutively. For example, it may be difficult to differentiate the pH change for a homorepeat of length 9 from that of one of length 10, making it difficult to decode repetitive sequences.
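The homorepeat problem follows from simple arithmetic if the signal is assumed to scale linearly with the number of bases added (an idealisation; the function name is hypothetical). Going from n to n + 1 identical bases raises the signal by a fraction 1/n, so the steps become harder to tell apart as the run gets longer.

```python
def relative_step(n):
    """Fractional signal increase on going from a run of n bases to n + 1,
    assuming the signal is exactly proportional to the run length."""
    return 1.0 / n

for n in (1, 2, 5, 9):
    print(f"{n} -> {n + 1} bases: signal rises by {relative_step(n):.1%}")
```

Under this assumption the first extra base doubles the signal, whereas the step from 9 to 10 bases changes it by only about a ninth, which is easily lost in measurement noise.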
Sequencing by ligation (SOLiD)
SOLiD is an enzymatic method of sequencing that uses DNA ligase, an enzyme used widely in biotechnology for its ability to ligate double-stranded DNA strands (Figure 6). Emulsion PCR is used to immobilise/amplify a ssDNA primer-binding region (known as an adapter) which has been conjugated to the target sequence (i.e. the sequence that is to be sequenced) on a bead. These beads are then deposited onto a glass surface. A high density of beads can be achieved, which in turn increases the throughput of the technique.
Once bead deposition has occurred, a primer of length N is hybridized to the adapter, then the beads are exposed to a library of 8-mer probes which have different fluorescent dyes at the 5' end and a hydroxyl group at the 3' end. Bases 1 and 2 are complementary to the nucleotides to be sequenced whilst bases 3-5 are degenerate and bases 6-8 are inosine bases. Only a complementary probe will hybridize to the target sequence, adjacent to the primer. DNA ligase is then used to join the 8-mer probe to the primer. A phosphorothioate linkage between bases 5 and 6 allows the fluorescent dye to be cleaved from the fragment using silver ions. This cleavage allows fluorescence to be measured (four different fluorescent dyes are used, all of which have different emission spectra) and also generates a 5’-phosphate group which can undergo further ligation. Once the first round of sequencing is completed, the extension product is melted off and then a second round of sequencing is performed with a primer of length N−1. Many rounds of sequencing using shorter primers each time (i.e. N−2, N−3 etc.) and measuring the fluorescence ensure that the target is sequenced.
Due to the two-base sequencing method (since each base is effectively sequenced twice), the SOLiD technique is highly accurate (at 99.999% with a sixth primer, it is the most accurate of the second generation platforms) and also inexpensive. It can complete a single run in 7 days and in that time can produce 30 Gb of data. Unfortunately, its main disadvantage is that read lengths are short, making it unsuitable for many applications.
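The two-base idea can be illustrated with the colour-space scheme commonly described for SOLiD, in which each of the four dyes stands for a set of four dinucleotides. The text above does not spell out the colour code, so the mapping used here (bases numbered A=0, C=1, G=2, T=3 and the colour taken as the XOR of the two numbers) should be read as an assumption, and the function names are hypothetical.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}  # A=0, C=1, G=2, T=3

def encode(seq):
    """Turn a base sequence into colours, one colour per overlapping base pair."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

def decode(first_base, colours):
    """Recover the bases from the colours, given the known first base."""
    out = [first_base]
    for colour in colours:
        out.append(BASES[CODE[out[-1]] ^ colour])
    return "".join(out)

colours = encode("ATGGC")
print(colours)
print(decode("A", colours))
```

Because each base contributes to two neighbouring colours, a genuine single-base change alters two adjacent colours, whereas an isolated measurement error alters only one. This is one way of seeing why interrogating each base twice improves accuracy.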
Reversible terminator sequencing (Illumina)
Reversible terminator sequencing differs from the traditional Sanger method in that, instead of terminating the primer extension irreversibly using dideoxynucleotides, it uses modified nucleotides that terminate extension reversibly. Whilst many other techniques use emulsion PCR to amplify the DNA library fragments, reversible termination uses bridge PCR, improving the efficiency of this stage of the process.
Reversible terminators can be grouped into two categories: 3'-O-blocked reversible terminators and 3'-unblocked reversible terminators.
3'-O-blocked reversible terminators
The mechanism uses a sequencing by synthesis approach, elongating the primer in a stepwise manner. Firstly, the sequencing primers and templates are fixed to a solid support. The support is exposed to each of the four DNA bases, which have a different fluorophore attached (to the nitrogenous base) in addition to a 3’-O-azidomethyl group (Figure 7).
Only the correct base anneals to the target and is subsequently added to the primer by the polymerase. The solid support is then imaged, nucleotides that have not been incorporated are washed away, and the fluorescent branch is cleaved using TCEP (tris(2-carboxyethyl)phosphine). TCEP also removes the 3’-O-azidomethyl group, regenerating 3’-OH, and the cycle can be repeated (Figure 8).
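Since exactly one labelled base is added per cycle, reading the sequence amounts to deciding which of the four fluorophores is brightest in each image. The sketch below assumes one intensity channel per base and made-up intensity values; the function name and data layout are hypothetical and real base callers are considerably more involved.

```python
def call_bases(cycles):
    """Call one base per cycle by picking the channel with the highest intensity.

    cycles is a list of dicts mapping base -> measured intensity for one cluster.
    """
    return "".join(max(intensities, key=intensities.get) for intensities in cycles)

# Invented intensities for three imaging cycles of a single cluster
example_cycles = [
    {"A": 0.91, "C": 0.05, "G": 0.03, "T": 0.02},
    {"A": 0.04, "C": 0.08, "G": 0.85, "T": 0.06},
    {"A": 0.07, "C": 0.10, "G": 0.09, "T": 0.80},
]
print(call_bases(example_cycles))
```

The clear separation between the brightest channel and the others in this example is what degrades over many cycles, as residual signal from imperfect cleavage accumulates.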
3'-unblocked reversible terminators
The reversible termination group of 3'-unblocked reversible terminators is linked to both the base and the fluorescence group, which now acts as part of the termination group as well as a reporter. This method differs from the 3'-O-blocked reversible terminators method in three ways: firstly, the 3’-position is not blocked (i.e. the base has free 3’-OH); secondly, the fluorophore is the same for all four bases; and thirdly, each modified base is flowed in sequentially rather than at the same time.
The main disadvantage of these techniques lies with their poor read length, which can be caused by one of two phenomena. In order to prevent incorporation of two nucleotides in a single step, a block is put in place. However, if a block is missing because of poor synthesis, strands can fall out of phase, creating noise which limits read length. Noise can also be created if the fluorophore is unsuccessfully attached or removed. These problems are prevalent in other sequencing methods and are the main limiting factors to read length.
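The way small per-cycle failures limit read length can be made concrete with a simple model. Assuming (purely for illustration) that each strand has the same independent probability p of falling out of phase in every cycle, the fraction still in phase after n cycles is (1 - p)^n. The function names and the example values of p and the threshold are hypothetical.

```python
import math

def in_phase_fraction(cycles, p_fail):
    """Fraction of strands in a cluster still in phase after the given cycles."""
    return (1.0 - p_fail) ** cycles

def max_cycles(p_fail, min_fraction):
    """Largest number of cycles for which the in-phase fraction stays
    at or above min_fraction."""
    return math.floor(math.log(min_fraction) / math.log(1.0 - p_fail))

p = 0.01  # assumed 1% chance per cycle of a strand dropping out of phase
print(in_phase_fraction(50, p))
print(max_cycles(p, 0.5))
```

Even a failure rate of 1% per cycle leaves only about half of the strands in phase after a few dozen cycles under this model, which shows why the chemistry must be very efficient for long reads.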
This technique was pioneered by Illumina, with their HiSeq and MiSeq platforms. HiSeq is the cheapest of the second generation sequencers with a cost of $0.02 per million bases. It also has a high data output of 600 Gb per run which takes around 8 days to complete.
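These figures can be turned into a rough per-genome estimate. The cost per million bases, output per run and run time are taken from the paragraph above; the human genome size of about 3.2 billion bases and the 30-fold coverage are assumptions added here for illustration, and the result covers sequencing cost only.

```python
COST_PER_MILLION_BASES = 0.02  # USD, figure quoted in the text
OUTPUT_PER_RUN = 600e9         # bases per run (600 Gb), from the text
RUN_DAYS = 8                   # days per run, from the text
GENOME_SIZE = 3.2e9            # ASSUMPTION: approximate human genome size in bases
COVERAGE = 30                  # ASSUMPTION: each base read 30 times on average

bases_needed = GENOME_SIZE * COVERAGE
cost_per_genome = bases_needed / 1e6 * COST_PER_MILLION_BASES
genomes_per_run = OUTPUT_PER_RUN / bases_needed
bases_per_day = OUTPUT_PER_RUN / RUN_DAYS

print(f"Cost per genome: USD {cost_per_genome:,.0f}")
print(f"Genomes per run: {genomes_per_run:.2f}")
print(f"Output per day:  {bases_per_day / 1e9:.0f} Gb")
```

Changing the coverage constant shows how directly the cost scales with sequencing depth: halving the coverage halves the cost and doubles the number of genomes per run.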
Third generation sequencing
A new cohort of techniques has since been developed using single molecule sequencing and single molecule real time sequencing, removing the need for clonal amplification. This reduces errors caused by PCR, simplifies library preparation and, most importantly, gives a much higher read length using higher throughput platforms. Examples include Pacific Biosciences' platform, which uses SMRT (single molecule real time) sequencing to give read lengths of around one thousand bases, and Helicos Biosciences, which utilises single molecule sequencing and therefore does not require amplification prior to sequencing. Oxford Nanopore Technologies are currently developing silicon-based nanopores which are subjected to a current that changes as DNA passes through the pore. This is anticipated to be a high-throughput rapid method of DNA sequencing, although problems such as slowing transportation through the pore must first be addressed.
Sequencing epigenetic modifications
Just as Next generation sequencing was enabling genomic sequencing on a massive scale, it became clear that the genetic code does not contain all the information needed by organisms. Epigenetic modifications to DNA bases, in particular 5-methylcytosine, also convey important information.
All of the second generation sequencing platforms depend, like Sanger sequencing, on PCR and therefore cannot sequence modified DNA bases. In fact, both 5-methylcytosine and 5-hydroxymethylcytosine are treated as cytosine by the enzymes involved in PCR; therefore, epigenetic information is lost during sequencing.
Bisulfite sequencing
Bisulfite sequencing exploits the difference in reactivity of cytosine and 5-methylcytosine with respect to bisulfite: cytosine is deaminated by bisulfite to form uracil (which reads as T when sequenced), whereas 5-methylcytosine is unreactive (i.e. reads as C). If two sequencing runs are done in parallel, one with bisulfite treatment and one without, the differences between the outputs of the two runs indicate methylated cytosines in the original sequence. This technique can also be used for dsDNA, since after treatment with bisulfite, the strands are no longer complementary and can be treated as ssDNA.
5-Hydroxymethylcytosine, another important epigenetic modification, reacts with bisulfite to form cytosine-5-methylsulfonate (which reads as C when sequenced). This complicates matters somewhat, and means that bisulfite sequencing cannot be used as a true indicator of methylation in itself.
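The comparison of a bisulfite-treated run with an untreated run can be sketched as follows. This is a toy model with hypothetical function names: it assumes complete conversion, error-free reads of a single strand, and it represents methylation as a set of positions. As the paragraph above explains, a C that survives treatment could be either 5-methylcytosine or 5-hydroxymethylcytosine.

```python
def bisulfite_read(seq, protected):
    """Sequence read after bisulfite treatment.

    Unmodified C is converted (reads as T); C at a protected position
    (5-methylcytosine or 5-hydroxymethylcytosine) still reads as C.
    """
    return "".join(
        "T" if base == "C" and i not in protected else base
        for i, base in enumerate(seq)
    )

def call_protected(untreated, treated):
    """Positions that read C in both runs, i.e. cytosines that resisted conversion."""
    return {
        i for i, (u, t) in enumerate(zip(untreated, treated))
        if u == "C" and t == "C"
    }

original = "ACGTCCGA"
treated = bisulfite_read(original, protected={1, 5})
print(treated)
print(sorted(call_protected(original, treated)))
```

Positions where the untreated run reads C but the treated run reads T are the unmodified cytosines; every other difference between the two runs would point to an error rather than to methylation.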
Oxidative bisulfite sequencing
Oxidative bisulfite sequencing adds a chemical oxidation step, which converts 5-hydroxymethylcytosine to 5-formylcytosine using potassium perruthenate, KRuO4, before bisulfite treatment. 5-Formylcytosine is deformylated and deaminated to form uracil by bisulfite treatment. Now, three separate sequencing runs are necessary to distinguish cytosine, 5-methylcytosine and 5-hydroxymethylcytosine (see Figure 9).
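The logic of combining the three runs can be written as a small lookup. The read-outs follow the chemistry described above (untreated, bisulfite, and oxidative bisulfite); the function name and the representation of each position as a triple of letters are assumptions made for illustration, and conversion is taken to be complete.

```python
# Expected read-out of each cytosine form in the three runs:
# (untreated, bisulfite, oxidative bisulfite)
READOUT = {
    "C":    ("C", "T", "T"),  # deaminated to uracil in both treated runs
    "5mC":  ("C", "C", "C"),  # unreactive in both treated runs
    "5hmC": ("C", "C", "T"),  # protected from bisulfite unless oxidised first
}

def classify(untreated, bisulfite, ox_bisulfite):
    """Identify the cytosine form at one position from its three read-outs."""
    observed = (untreated, bisulfite, ox_bisulfite)
    for form, expected in READOUT.items():
        if observed == expected:
            return form
    return "unexpected"  # not a cytosine, or an inconsistent combination

print(classify("C", "T", "T"))
print(classify("C", "C", "C"))
print(classify("C", "C", "T"))
```

The third entry is the informative one: a position that reads C after bisulfite alone but T after oxidation followed by bisulfite must have been 5-hydroxymethylcytosine.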
Applications of Next-generation sequencing
Next generation sequencing has enabled researchers to collect vast quantities of genomic sequencing data. This technology has a plethora of applications, such as: diagnosing and understanding complex diseases; whole-genome sequencing; analysis of epigenetic modifications; mitochondrial sequencing; transcriptome sequencing - understanding how altered expression of genetic variants affects an organism; and exome sequencing - the exome is thought to contain up to 90% of the disease-causing mutations in the human genome. DNA techniques have been used to identify and isolate genes responsible for certain diseases, and to provide a correct copy of the defective gene, an approach known as ‘gene therapy’.
A large focus area in gene therapy is cancer treatment. One potential method would be to introduce an antisense RNA (which specifically prevents the synthesis of a targeted protein) against the oncogene that triggers the formation of tumorous cells. Another method, named ‘suicide gene therapy’, introduces genes to kill cancer cells selectively. The genes coding for many toxic proteins and enzymes are known, and introduction of these genes into tumor cells would result in cell death. The difficulty in this method is to ensure a very precise delivery system to prevent killing healthy cells. These methods are made possible by sequencing to analyze tumor genomes, allowing medical experts to tailor chemotherapy and other cancer treatments more effectively to their patients’ unique genetic composition, revolutionizing the diagnostic stages of personalized medicine.
As the cost of DNA sequencing goes down, it will become more widespread, which brings a number of issues. Sequencing produces huge volumes of data, and there are many computational challenges associated with processing and storing the data. There are also ethical issues, such as the ownership of an individual's DNA when the DNA is sequenced. DNA sequencing data must be stored securely, since there are concerns that insurance groups, mortgage brokers and employers may use this data to modify insurance quotes or distinguish between candidates. Sequencing may also help to find out whether an individual has an increased risk of a particular disease, but whether the patient is informed, or whether there is a cure for the disease, is another issue altogether.