Современные методы геномных исследований и секвенирования ДНК Константин Валерьевич Крутовский В.н.с. Лаборатории популяционной генетики Института общей генетики им. Н. И. Вавилова РАН Нayчный pyководитeль Научно-образовательного центра геномных исследований и Лаборатории лесной геномики Сибирского федерального университета Профессор Отделения лесной генетики и селекции Гёттингенского университета, Германия Адьюнкт профессор Отделения по изучению и управлению экосистем Техасского агро-механический университета, США Лекция студентам СФУ 10 июня 2014 г. 1 Learning objectives and outcomes • Brief introduction into genomics (structure, objectives, etc.) • Genome-sequencing methods: ‒ Sanger sequencing ‒ Next-generation sequencing (NGS) technology • de novo sequencing, resequencing, and target sequencing Лекция студентам СФУ 10 июня 2014 г. 2 Что такое «Геномика»? • Термин «геном» (genome) был предложен немецким ботаником проф. Hans Winkler (1877- 1945) в 1920 г. (University of Hamburg), который объединил термины «ген» (“gene”) и «хромосома» (“chromosome”) для обозначения одновременно всех генов во всех хромосомах ядра клетки • Термин «геномика» (genomics) был предложен относительно недавно в 1986 г. Thomas Roderick (Jackson Laboratory, USA) для нового журнала Genomics и описания научной дисциплины связанной с секвенированием, картированием и анализом генома • Геномика более широкое понятие в настоящее время и охватывает сравнение геномов разных видов (comparative genomics), их эволюцию (evolutionary genomics) и функционирование генома в целом (functional genomics) Геномика – это изучение генов и их функций в их полной совокупности и взаимодействии Лекция студентам СФУ 10 июня 2014 г. 3 Основы геномной структуры Ген Строчка в тексте/Предложение (состоящее из 4-х «букв»-нуклеотидов A, T, C и G, и 3-х буквенных «слов»-триплетов или кодонов) Хромосома Глава Геном Генофонд Основная задача геномики - полное секвенирование и расшифровка генома Лекция студентам СФУ 10 июня 2014 г. 4 Современные методы секвенирования ДНК • Выпуск с 2005 г. коммерческих высокопроизводительных секвенаторов ДНК на основе новых технологий массивного параллельного секвенирования компаниями Roche и Illumina • Создание в 2012 г. Научно-образовательного центра геномных исследований Сибирского федерального университета при поддержке Отдела генетики и селекции Центра защиты леса Красноярского края и Лаборатории лесной генетики и селекции Института леса им. В. Н. Сукачева СОРАН (http://genome.sfukras.ru) оборудованного самым производительным секвенатором Illumina HiSeq 2000 Лекция студентам СФУ 10 июня 2014 г. 5 Проблема полногеномного секвенирования с помощью новых секвенаторов: возможность секвенировать только короткими фрагментами 100-150 но Выделение тотальной геномной ДНК Дополнительное фрагментирование ДНК Геном Длинные фрагменты ДНК Биоинформатическая обработка на суперкомпьютерах миллионов «ридов» и сборка их в контиги и скаффолды Короткие фрагменты ДНК Приготовление библиотек для секвенирования и кластеризация (cBot) Секвенирование на HiSeq 2000 и генерация сотен миллионов коротких прочтений («ридов») Референсный Геном Лекция студентам СФУ 10 июня 2014 г. 6 Nitrogenous, Nucleoside and Nucleotide Bases Нуклеотид (Азотистые основания ) (nucleoside with a phosphate at 5’ carbon) пуриновые А.о.: (A) (G) аденин гуанин пиримидиновые А.о. (C) цитозин тимин (T) (U) урацил Nucleoside (nitrogenous base linked to a 2-deoxyD-ribose at 1’ carbon) Нуклеозиды —гликозиламины, содержащие азотистое основание, связанное с сахаром (рибозой или дезоксирибозой). Лекция студентам СФУ 10 июня 2014 г. 7 Synthesis of a Nucleotide: Adenosine monophospate (AMP) Лекция студентам СФУ 10 июня 2014 г. 8 Nucleotide polymer • DNA Polymerase “backbone” цепь “backbone” цепь A T G C Phosphodiester bond C G Лекция студентам СФУ 10 июня 2014 г. 9 DNA Sequencing Methods 1nd generation sequencing methods: ─ chemical degradation of nucleotides method (Allan Maxam and Walter Gilbert, 1977) ─ chain-termination or dideoxy method (Frederick Sanger, 1977) 2nd generation high-throughput massively parallel shotgun sequencing methods: – sequencing-by-synthesis methods: pyrosequencing by Jonathan Rothberg (Marguilis et al. 2005) (454 Life Sciences, a subsidiary of Roche Diagnostics, originally a subsidiary of CuraGen Corporation) based on fixing fragmented (nebulized) and adapter-ligated DNA fragments to small DNA-capture beads PCR amplified in a water-in-oil emulsion in a PicoTiterPlate, and then sequenced using DNA polymerase, ATP sulfurylase, and luciferase (to generate light for detection of the individual nucleotides added to the template DNA) and adding sequentially 4 DNA nucleotides in a fixed order using the Genome Sequencer FLX Instrument (there is also a downscaled Junior version); based on reversible dye-terminators, the DNA are extended one nucleotide at a time followed by image acquisition - the camera takes images of the fluorescently labeled nucleotides, then the dye along with the terminal 3' blocker is chemically removed from the DNA, allowing the next cycle (Solexa, now Illumina) using Illumina Genome Analyzer II, HiSeq and MiSeq instruments; semiconductor sequencing developed by Ion Torrent Systems Inc. founded by Jonathan Rothberg (now Life Technologies), based on the detection of hydrogen ions that are released during the DNA synthesis, as opposed to the optical methods used in other sequencing systems (Ion Torrent and Ion Proton Sequencers - Personal Genome Machines). – sequencing-by-ligation method developed by Applied Biosystems (now Life Technologies), the DNA is amplified by emulsion PCR and then sequenced using the SOLiD Instrument. Лекция студентам СФУ 10 июня 2014 г. 10 DNA Sequencing Methods 3rd generation single molecular (SM) based sequencing methods: – sequencing-by-synthesis methods: SM method by Helicos Biosciences uses DNA fragments with added poly-A tail adapters which are attached to the flow cell surface. The next steps involve extensionbased sequencing with cyclic washes of the flow cell with fluorescently labeled nucleotides (one nucleotide type at a time, as with the Sanger method). The reads are short, up to 55 bp per run, but there are recent improvements; Single molecule real time (SMRT) sequencing by Pacific Biosciences is based on the sequencing by synthesis of the DNA in zero-mode wave-guides (ZMWs) - small welllike containers with the signal capturing tools located at the bottom of the well using DNA polymerase attached to the ZMW bottom and producing reads of up to 15 Kbp, with mean read lengths of 2.5 to 2.9 Kbp; 4th generation single molecular (SM) based sequencing methods: Nanopore DNA sequencing by Oxford Nanopore Technologies is based on the readout of electrical signals occurring at nucleotides passing through pores in membranes created by the pore-forming protein α-hemolysin covalently bound with cyclodextrin within the nanopore that will bind transiently to the DNA molecule being detected. The DNA passing through the nanopore changes its ion current. Each type of the nucleotide blocks the ion flow through the pore for a different period of time (GridION system and miniaturised MinION instrument). Лекция студентам СФУ 10 июня 2014 г. 11 Dideoxy (Sanger) Method Dideoxynucleotides without a hydroxyl group at 3'-end, which prevents strand extension are used together with normal nucleotides in the sequencing reaction: PPP O 5' CH2 O BASE Frederick Sanger 1918 –2013 OH 3' Лекция студентам СФУ 10 июня 2014 г. 12 Dideoxy (Sanger) Method As a result there is a mixture of fragments of different size in a sequencing reaction: Sequencing cycle: • 94°C: DNA denaturing • 45°C: primer annealing A A A T T T • 60-72°C: thermostable DNA polymerase primer extension Repeat 25-35 times ddATP in the reaction: a ddA will be occasionally added to the growing strand whenever there is a T in the template strand Лекция студентам СФУ 10 июня 2014 г. 13 How to visualize DNA fragments? • Radioactivity – radioactive isotope labeled primers (kinase with 32P) – radioactive isotope labeled dNTPs (gamma 35S or 32P) • Fluorescence – ddNTPs chemically synthesized to contain fluorescent dye – each ddNTP fluoresces at a different wavelength allowing identification • Polyacrylamide gel electrophoresis - high resolution of fragments differing by a single dNTP – slab gels – capillary gels: require only a tiny amount of sample to be loaded, run much faster than slab gels, best for high throughput sequencing Лекция студентам СФУ 10 июня 2014 г. 14 Sequencing gel electrophoresis of DNA fragments amplified using radioactive isotope labeled primers or ddNTP • Radioactively labelled primer or dNTPs in sequencing reaction • 4 different ddNTPs used in 4 separate reactions, respectively • Separation of sequencing products by slab gel electrophoresis • Autoradiography Лекция студентам СФУ 10 июня 2014 г. 15 Sl b l An automated sequencer using mixed ddNTPs labelled by 4 different fluorescent dyes Single lane or capillary output: Slab or capillary gel electrophoresis: Лекция студентам СФУ 10 июня 2014 г. 17 Лекция студентам СФУ 10 июня 2014 г. 18 Automated Version of the Dideoxy Method ABI PRISM 3730xLGenetic Analyzer 96 capillaries Лекция студентам СФУ 10 июня 2014 г. 19 Dideoxy (Sanger) Method Advantages: • relatively long fragments (500-750 bp) • low frequency of sequencing errors (“gold standard”) Disadvantages: • expensive • laborious • low productivity ABI PRISM Genetic Analyzers: Лекция студентам СФУ 10 июня 2014 г. 20 Next Generation Sequencing (NGS) Overview of Next generation sequencing (NGS) technologies: second, third and fourth generations Second generations: • Sequencing-by-synthesis using luciferase and light signal detection - Pyrosequencing (454 / Roche) • Sequencing-by-synthesis using reversible nucleotide terminators and fluorescent signal detection - Illumina sequencing technology • Application of Illumina next-generation sequencing for marker discovery and genotyping Whole genome resequencing Whole transcriptome sequencing Genotyping by Sequencing Genomic complexity reduction and target resequencing Лекция студентам СФУ 10 июня 2014 г. 21 Available 2nd-Generation Sequencing Technologies 500-800 bp www.roche-applied-science.com www.illumina.com /2x300 1 x 75 bp 2 x 50 bp www.lifetechnologies.com Niedringhaus et al. (2011) Landscape of Next-Generation Sequencing Technologies. Anal. Chem. 83: 4327–4341 Illumina dominates the market with 71% of the market share (revenue of about $1.6 billion in 2013), followed by ABI Life Technologies at 16%, Roche at 10%, and PacBio at 3%. Лекция студентам СФУ 10 июня 2014 г. 22 2nd Generation Sequencing: Pyrosequencing method – Basic Principle • Pyrosequencing is based on the generation of light signal through release of diphosphate or pyrophosphate (PPi) on nucleotide addition: (NA)n + dNTP (NA)n+1 + PPI • PPi is used to generate ATP from adenosine phosphosulfate (APS): APS + PPI ATP • ATP and luciferase generate light by conversion of luciferin to oxyluciferin. pyrophospate DNA Polymerase I (from E. coli) adenosine phosphosulfate adenosine triphosphate (from fireflies; oxidizes luciferin and emits light) – visible light is generated and is proportional to the number of incorporated nucleotides – 1 pmol DNA = 6×1011 ATP = 6×109 photons at 560 nm Лекция студентам СФУ 10 июня 2014 г. 23 Pyrosequencing: Preparation of DNA library • shearing genomic DNA to small DNA fragments 500-800 bp long • attaching single DNA fragments to very small plastic beads (one fragment per bead) • emulsion-based clonal PCR amplification (emPCR) of the DNA on each bead to cover each bead with a cluster of identical fragments to enhance the light signal • placing each bead in a separate well on a PicoTiterPlate, a fiber optic chip with up to 1.6 million wells (A mix of enzymes such as DNA polymerase, ATP sulfurylase, and luciferase are also packed into the well.) • The PicoTiterPlate is then placed into the GS FLX System for sequencing. Лекция студентам СФУ 10 июня 2014 г. 24 Pyrosequencing – Basic Principle • Sequence-by-synthesis via DNA polymerase directed chain extension, one base at a time in the presence of a reporter (luciferase). Each nucleotide is added separately in a separate cycle. • Only one of four will generate a light signal. Luciferase will emit a photon of light in response to the pyrophosphate (PPi) released upon nucleotide addition by DNA polymerase • The remaining nucleotides are either (1) washed away or (2) removed enzymatically by apyrase (nucleotide degradation enzyme). • The light signal is recorded on a pyrogram: DNA sequence: T A G T CC GG A Nucleotide added: A T C A G C T Лекция студентам СФУ 10 июня 2014 г. 25 Pyrosequencing Results: Height of peak indicates the number of dNTPs added This sequence: TTTGGGGTTGCAGTT Лекция студентам СФУ 10 июня 2014 г. 26 Pyrosequencing 25 Mbp in about 4 hours APS = Adenosine phosphosulfate http://www.youtube.com/watch?v=bFNjxKHP8Jc Лекция студентам СФУ 10 июня 2014 г. 27 Pyrosequencing Sequencing by synthesis • Advantages: – – – – – – accurate - low frequency of sequencing errors relatively long fragments (500-800 bp) parallel processing automated no need for labeled primers and nucleotides no need for gel electrophoresis • Disadvantages: – – – – expensive laborious low productivity nonlinear light response after more than 5-6 identical nucleotides Лекция студентам СФУ 10 июня 2014 г. 28 Illumina: Preparation a DNA library for sequencing - Cluster formation Samples are loaded in a flowcell with 8 lanes and clusters attached to the cell surface are generated and prepared for sequencing using cBot instrument Hybridization and second strand synthesis Flowcell with 8 separate lanes and primers complementary to adapters immobilized on the flowcell surface. Bridge PCR Linearization and primer annealing Лекция студентам СФУ 10 июня 2014 г. cBOT HiSeq2000 29 Illumina: Sequencing by Synthesis (SBS) • Sequencing is performed by addition of all 4 labeled reversible terminators (1 base is added per cycle for all the fragments). • The block and phluorophore are then removed, and another cycle starts up to the desired read length. HiSeq: 120-150 Millions of clusters per lane (8 lanes per flowcell); up to 150 bp × 2 (total output = 600 Gb for a full run with 2 flowcells) MiSeq: 25 Millions of clusters per flowcell with a single lane; up to 300 bp × 2 (total output = 15 Gb for a full run) Лекция студентам СФУ 10 июня 2014 г. 30 Illumina sequencing technology in 12 steps http://www.illumina.com/downloads/SS_DNAsequencing.pdf Лекция студентам СФУ 10 июня 2014 г. 31 1. Fragment genomic DNA and ligate adapters 2. Attach DNA to surface 3. Bridge amplification DNA adapters 4. Fragments become double stranded 5. Denature the doublestranded molecules 6. Complete amplification Randomly fragment genomic DNA and ligate adapters to both ends of the fragments Лекция студентам СФУ 10 июня 2014 г. 32 adapter DNA fragment adapter dense lawn of primers 1. Fragment genomic DNA and ligate adapters 2. Attach DNA fragments to surface 3. Bridge amplification 4. Fragments become double stranded 5. Denature the doublestranded molecules 6. Complete amplification Bind single-stranded fragments randomly to the inside surface of the flow cell channels Лекция студентам СФУ 10 июня 2014 г. 33 1. Fragment genomic DNA and ligate adapters 2. Attach DNA fragments to surface 3. Bridge amplification 4. Fragments become double stranded 5. Denature the doublestranded molecules 6. Complete amplification Add unlabeled nucleotides and enzyme to initiate solid-phase bridge amplification Лекция студентам СФУ 10 июня 2014 г. 34 1. Fragment genomic DNA and ligate adapters 2. Attach DNA fragments to surface attached terminus free terminus attached terminus 3. Bridge amplification 4. Fragments become double stranded 5. Denature the double- stranded molecules 6. Complete amplification The enzyme incorporates nucleotides to build doublestranded bridges on the solid-phase substrate Лекция студентам СФУ 10 июня 2014 г. 35 1. Fragment genomic DNA and ligate adapters 2. Attach DNA fragments to surface attached attached 3. Bridge amplification 4. Fragments become double stranded 5. Denature the doublestranded molecules 6. Complete amplification Denaturation leaves single-stranded templates anchored to the substrate Лекция студентам СФУ 10 июня 2014 г. 36 1. Fragment genomic DNA and ligate adapters 2. Attach DNA fragments to surface 3. Bridge amplification 4. Fragments become double stranded 5. Denature the double- stranded molecules clusters 6. Complete amplification 7. Reversed complement Several million dense clusters of double- strands are cleaved and stranded DNA are generated in each washed away channel of the flow cell Лекция студентам СФУ 10 июня 2014 г. 37 8. First nucleotide base adding and the first nucleotide base synthesis 9. Image first base 10. The block and phluorophore are then removed 11. Second nucleotide base adding and the second nucleotide base synthesis 12. Image second base 13. The block and phluorophore are then removed Laser 14. Sequencing over multiple chemistry cycles The first sequencing cycle begins by adding four labeled reversible terminators, primers, and DNA 15. Align data Лекция студентам СФУ 10 июня 2014 г. polymerase 38 8. First nucleotide base adding and the first nucleotide base synthesis 9. Image first base 11. The block and phluorophore are then removed 12. Second nucleotide base adding and the second nucleotide base synthesis 13. Image second base 14. The block and phluorophore are then removed 15. Sequencing over multiple chemistry cycles After laser excitation, the emitted 16. Align data fluorescence from each cluster is captured and the first base is identified Лекция студентам СФУ 10 июня 2014 г. 39 8. First nucleotide base adding and the first nucleotide base synthesis 9. Image first base 10. The block and phluorophore are then removed 11. Second nucleotide base adding and the second nucleotide base synthesis 12. Image second base 13. The block and phluorophore are then removed Laser 14. Sequencing over multiple chemistry cycles The next cycle repeats the 15. Align data incorporation of four labeled reversible terminators, primers, and DNA polymerase Лекция студентам СФУ 10 июня 2014 г. 40 8. First nucleotide base adding and the first nucleotide base synthesis 9. Image first base 10. The block and phluorophore are then removed 11. Second nucleotide base adding and the second nucleotide base synthesis 12. Image second base 13. The block and phluorophore are then removed 14. Sequencing over multiple chemistry cycles After laser excitation the image is captured as before, and the identity of the second base is recorded. 15. Align data Лекция студентам СФУ 10 июня 2014 г. 41 8. First nucleotide base adding and the first nucleotide base synthesis 9. Image first base 10. The block and phluorophore are then removed 11. Second nucleotide base adding and the second nucleotide base synthesis 12. Image second base 13. The block and phluorophore are then removed 14. Sequencing over multiple chemistry cycles 15. Process sequence reads, generate contigs, map (align) reads or contigs to the reference sequence, if it is available www.youtube.com/watch?v=l99aKKHcxC4 42 The sequencing cycles are repeated to determine the sequence of bases in a fragment, one base at a time. Лекция студентам СФУ 10 июня 2014 г. 8. First nucleotide base adding and the first nucleotide base synthesis 9. Image first base Reference sequence 10. The block and phluorophore are then removed 11. Second nucleotide base adding and the second nucleotide base synthesis Unknown variant identified and called Known SNP called 12. Image second base 13. The block and phluorophore are then removed 14. Sequencing over multiple chemistry cycles 15. Process sequence reads, generate The data are aligned and compared contigs, map (align) reads or contigs to a reference, and sequencing to the reference sequence, if it is differences are identified. available 43 Лекция студентам СФУ 10 июня 2014 г. Single, paired-end and multiplexed sequencing Single read sequencing: Primer 1 Adapter 1 with annealing site for Primer 1 Pair-end sequencing: Primer 1 Adapter 2 Adapter 1 with annealing site for Primer 1 • resequencing • expression quantification Adapter 2 with annealing site for Primer 2 Primer 2 • de novo sequencing for better assembly • allow to resolve better transcript isoforms • can be used to detect gene fusion Multiplexed sequencing: Primer 1 Adapter 1 with annealing site for Primer 1 Adapter 2 with Barcode annealing site for Primer 2 and barcode tag Primer 2 • The sequencing of a 6-nucleotides barcode on one of the adapters allows to identify samples pooled together and sequenced in a single lane. • Up to 96 samples per lane. Лекция студентам СФУ 10 июня 2014 г. 44 Illumina Sequencers and Genotypers Лекция студентам СФУ 10 июня 2014 г. 45 Illumina Sequencers Лекция студентам СФУ 10 июня 2014 г. 46 Illumina HiSeq 1000, 2000 or 2500 HiSeq 2000 Лекция студентам СФУ 10 июня 2014 г. 47 Illumina MiSeq Desktop Sequencer Лекция студентам СФУ 10 июня 2014 г. 48 Third-Fourth Generation Sequencing Technologies • PacificBiosciences (www.pacificbiosciences.com): PacBioRS • Complete Genomics (www.completegenomics.com) (only for human genome) • Ion Torrent and Ion Proton (Life/ABI) (www.iontorrent.com): Ion Personal Genome Machine (PGM™) sequencer • Oxford Nanopore (www.nanoporetech.com/): GridION (Fourth generation) Лекция студентам СФУ 10 июня 2014 г. 49 PacBio Single Molecule Real Time (SMRT) Sequencing ZMW = Zero Mode Waveguide http://www.youtube.com/watch?v=v8p4ph2MAvI Лекция студентам СФУ 10 июня 2014 г. 50 PacBio Sample prep Лекция студентам СФУ 10 июня 2014 г. 51 PacBio SMRT Sequencing Лекция студентам СФУ 10 июня 2014 г. 52 PacBio Base Modification Detection (Application in development) methylated unmethylated A methylated adenine in the template (top) slows the incorporation of a thymine in the replicating strand of DNA. The rate of incorporation can be compared to an unmodified version of the same template (bottom) which has a much faster thymine addition. Differences between the modified and unmodified incorporation rates indicate potential sites of modified bases. These differences often span multiple bases, creating a distinctive signature. Flusberg et al. (2010) Nature Methods 7: 461-465 Лекция студентам СФУ 10 июня 2014 г. 53 Ion Torrent & Proton: Personal Genome Machine (PGM) When a nucleotide is added to a DNA template and is then incorporated into a strand of DNA, a hydrogen ion is released. The charge from that ion changes the pH of the solution, which can be detected by a ion sensor. • reads up to 200-400 (Torrent) or 200 (Proton) bp long • from 30 Mb to >2 Gb per run (different Torrent chips) • 10 Gb (Proton) • max 4-6 M reads (4-7 hours; Torrent), 60-80 M reads (2-4 hours; Proton) • useful for Amplicon-seq, smallRNA-seq, small genomes sequencing (i.e. bacterial, virus) (www.iontorrent.com) http://www.youtube.com/watch?v=yVf2295JqUg Лекция студентам СФУ 10 июня 2014 г. 54 Oxford Nanopore: The GridION system • Electrical single-molecule sequencing • Protein nanopores used as biosensor • Exonuclease sequencing: combining a protein nanopore and processive enzyme for the sequential identification of DNA bases as they pass through the pore • Oxford Nanopore signed a commercialisation agreement with Illumina for this technology, however commercialisation timelines have not been disclosed Лекция студентам СФУ 10 июня 2014 г. 55 The GridION system: Nanopore sensing T C A Exonuclease sequencing: combining a protein nanopore and processive enzyme (e.g., a exonuclease) for the sequential identification of DNA bases as they pass through the pore. Лекция студентам СФУ 10 июня 2014 г. 56 The GridION system: Electrical sequencing trace • The system is designed to give ultra-high read length (tens of Kb) • First generation commercial system is designed to achieve tens of Gb per day Лекция студентам СФУ 10 июня 2014 г. 57 Third-Forth Generation Sequencing Technologies 35 bp 200-400 bp Niedringhaus et al. (2011) Landscape of Next-Generation Sequencing Technologies. Anal. Chem. 83: 4327–4341 Лекция студентам СФУ 10 июня 2014 г. 58 NGS Platform statistics Instrument Illumina HiSeq X (2 flow cells) Illumina HiSeq 2500 - high output v4 Life Technologies SOLiD – 5500xl Illumina NextSeq 500 Oxford Nanopore GridION 8000 Ion Torrent - Proton III Illumina MiSeq v3 Ion Torrent – PGM 318 chip Ion Torrent – PGM 316 chip Oxford Nanopore MinION 454 FLX+ Ion Torrent – PGM 314 chip Pacific Biosciences RS II Amplification Run time BridgePCR BridgePCR emPCR BridgePCR None - SMS emPCR bridgePCR emPCR emPCR None - SMS emPCR emPCR None - SMS 3 days 6 days 8 days 30 hrs. varies 6 hrs. 55 hrs. 7.3 hrs. 4.9 hrs. ≤6 hrs. 20 hrs. 3.7 hrs. 2 hrs. M of Read / run Bases / read 6000 2000 1410 400 10 500 22 4.75 2.5 0.1 1 0.475 0.03 300 250 110 300 10000 175 600 400 400 9000 650 400 3000 Reagent Reagent Reagent Cost Gbp / cost / Gb, Cost / run, $ Cost / Gb, $ /M reads, $ run $ 12750 14950 10503 4000 1000 1000 1442 874 674 900 6200 474 100 7 30 68 33 10 11 109 460 674 1000 9538 2495 1111 2 7 7 10 100 2 66 184 270 9000 6200 998 3333 1800 500 155.1 120 100 87.5 13.2 1.9 1 0.9 0.65 0.19 0.09 7 30 68 33 10 11 109 460 674 1000 9538 2495 1111 ‘Benchtop’ NGS technology: will substitute the capillary electrophoresis (CE) sequencers for common experiments, such as Illumina libraries verification, amplicon sequencing and small genome sequencing. Лекция студентам СФУ 10 июня 2014 г. 59 Overview of genome sequencing analysis Genome whole genome shortgun de novo sequencing whole genome resequencing target sequencing de novo assembling assembling via mapping to a reference sequence assembling via mapping to a target sequence Genome annotation (repetitive DNA elements, proteincoding &other genes) Лекция студентам СФУ 10 июня 2014 г. 60 Applications of Genome Sequencing Purpose Template Example genome sequencing The Genome 10K project (sequencing 10,000 vertebrate species genomes, approximately one for every vertebrate genus); 1K Plant Genomes Project, 10K fungi genomes De novo sequencing Resequencing ancient DNA Extinct Neanderthal genome metagenomics Human guts whole genomes Sequencing 1000 individual human genomes project genomic regions Assessment of genomic rearrangements or diseaseassociated regions somatic mutations Transcriptome full-length transcripts Noncoding RNAs Epigenetics Methylation changes Sequencing mutations in cancer Defining regulated messenger RNA transcripts Identifying and quantifying microRNAs in samples Measuring methylation changes in cancer Table 13.15 in Bioinformatics and Functional Genomics by J. Pevsner (2nd ed., Wiley-Blackwell, 2009) p. 538 Лекция студентам СФУ 10 июня 2014 г. 61 Genome sizes in nucleotide base pairs plasmids viruses bacteria fungi plants algae insects The size of the human genome is ~ 3.2 X 109 bp mollusks bony fish The human genome is thought to contain ~21,000 protein and ~2,000 RNA coding genes. amphibians reptiles birds mammals 104 105 106 107 108 109 conifers 1010 1011 1012 http://www3.kumc.edu/jcalvet/PowerPoint/bioc801b.ppt Лекция студентам СФУ 10 июня 2014 г. 62 Eukaryotic completed genome projects > 2 Gb Genus, species Subgroup Size (Mb) #chr Pinus taeda Land Plants 20150 12 Loblolly pine Picea abies Land Plants 20000 12 Norway spruce Triticum urartu Land Plants 4940 7 wheat A-genome progenitor Aegilops tauschii Land Plants 4360 7 Tausch's goatgrass Macropus eugenii Mammals 3800 8 tammar wallaby Oryctolagus cuniculus Mammals 3500 22 rabbit Land Plants 3480 12 pepper Cavia porcellus Mammals 3400 31 guinea pig Homo sapiens Mammals 3200 23 human Pan troglodytes Mammals 3100 24 chimpanzee Bos taurus Mammals 3000 30 cow Dasypus novemcinctus Mammals 3000 32 nine-banded armadillo Loxodonta africana Mammals 3000 28 African savanna elephant Sorex araneus Mammals 3000 20 European shrew Rattus norvegicus Mammals 2750 21 rat Nicotiana sylvestris Land Plants 2636 12 tobacco Mammals 2400 39 dog Land Plants 2365 10 corn Capsicum annuum Canis familiaris Zea mays Лекция студентам СФУ 10 июня 2014 г. Common name 63 База данных геномных проектов http://www.genomesonline.org • • • • Complete Projects 6366 Permanent Drafts 16884 Incomplete Projects 24508 Targeted Projects 920 Лекция студентам СФУ 10 июня 2014 г. • Organisms 48772 Archaea 873 Bacteria 35373 Eukarya 8155 64 GC content varies across genomes Bacteria Number of species in each GC class 10 5 Plants 5 Invertebrates 3 Vertebrates 10 5 20 30 40 50 60 70 80 GC content (%) Fig. 13.15 in Bioinformatics and Functional Genomics by J. Pevsner (2nd ed., Wiley-Blackwell, 2009) p. 556 Лекция студентам СФУ 10 июня 2014 г. 65 Genomic markers development and genotyping using next generation sequencing bar-coded DNA or mRNA library pools individual genomic or mRNA (tissue-specific or total) target-enriched genomic DNA bar-coded pools individual genomic DNA LOB 2 LOB 3 est9217 20 cM LOB 1 LOB 4 LOB 5 image analysis LOB 6 est9155 est1623 est8569 est9022 est2253 est1576 est9157 est0048 est8972 est8647 est8500 est8513 est8613 est8596 est0893 est8747 est8837 est9092 est8614 est8612 est8939 est1955 LOB 11 LOB 12 est0674 est8580 est9076 est8732 est2615 est8510 est9036 est1626 est1165 est2610 est1950 est1764 est0624 est8436 est1635 est1643 est2290 est0149 est0500 est1750 est8540 est8564 est9044 est9113 estce8 est8702 est8415 est8429 genotyping and phenotyping in mapping or natural population LOB 10 est1956 est9156 est9151 est2781 est1454 est9008 est8560 est8781 LOB 9 est9034 est8473 est0739 est2166 est0066 est9102 est9198 LOB 8 est8907 estce9 est0464 LOB 7 est2358 est8531 est1934 est8843 est9053 est8537 next generation high-throughput massively parallel DNA sequencing est2274 est0606 est8738 est8887 est2009 est8886 est8565 est8725 estc9 est8898 quantitative trait loci (QTL), candidate gene or association mapping high-throughput SNP genotyping SNP marker development Лекция студентам СФУ 10 июня 2014 г. DNA chromatograms sequence processing and analysis 66 Genomic DNA target enrichment for high-throughput massively parallel sequencing using the Agilent's SureSelect Target Enrichment System • Our library contained 647,634 baits (2X coverage) representing 35,550 unigenes • Capture target size ≈ 39 Mb http://www.opengenomics.com/SureSelect_Target_Enrichment_System Лекция студентам СФУ 10 июня 2014 г. 67 Genomic DNA enrichment for 35,550 genes in loblolly pine for highthroughput massively parallel sequencing using bar-coding and the Agilent's SureSelect Target Enrichment System 647,634 oligonucleotide hybridization 120 bp long probes (baits) based on 35,550 unigenes build by Dr. Chun Liang (Miami University) were designed to target 39 Mb of gene space using Agilent Genomic Workbench software to gene enrich DNA libraries for sequencing Example of 113 baits covering unigene #14 Лекция студентам СФУ 10 июня 2014 г. 68 USDA NIFA Climate Change Program 1: Regional Approaches to Climate Change Project: “Integrating research, education and extension for enhancing southern pine climate change mitigation and adaptation” http://www.pinemap.org • 2,8 млн SNPs уже генотипировано в почти 40,000 генах ладанной сосны в моей лаборатории в Texas A&M University в этом проекте путём прямого секвенирования геномной ДНК, обогащённой экзомными районами с помощью гибридизации тотальной ДНК с 600 млн олигонуклетидных проб, представляющих почти полный транскриптом (~40 тыс. экспрессируемых генов) ладанной сосны • более чем 400 деревьях со всего ареала, профенотипированных по большому числу адаптивных и селекционно-ценных признаков, а также изученных по большому числу средовых факторов будут генотипированы по всем обнаруженным SNPs для обнаружения аллелей и гаплотипов связанных с изменчивостью адаптивных признаков, а также с устойчивостью к средовым факторам • фактически, это означает переход от отдельных маркёров к полному генотированию через секвенирование! • эра маркёров заканчивается – наступает эра полногеномного секвенирования! • популяционная геномика вместе с молекулярной экологией (экогеномикой) позволят: − обнаружить гены и аллели ответственные за адаптацию − связать генотипы с адаптивными фенотипами и средой Лекция студентам СФУ 10 июня 2014 г. 69 Спасибо за внимание! По вопросам биоинформатики обращайтесь к Юлии Путинцевой yuliya-putintseva@rambler.ru Лекция студентам СФУ 9 июня 2014 г. 70