Plant Omics: Advances in Big Data Biology
Ebook, 905 pages, 9 hours

About this ebook

This book provides a comprehensive overview of plant omics and big data in the fields of plant and crop biology. It discusses each omics layer individually, including genomics, transcriptomics, and proteomics, and covers both model and non-model species. In a section on advanced topics, it considers developments in each specialized domain, including genome editing and enhanced breeding strategies (such as genomic selection and high-throughput phenotyping), with the aim of providing tools to help tackle global food security issues. The importance of online resources in big data biology is highlighted in a section summarizing both wet- and dry-biological portals. This section introduces biological resources, datasets, online bioinformatics tools and approaches that are in the public domain.

This title:
reviews each omics layer individually;
focuses on new advanced research domains and technology; and
summarizes publicly available experimental and informatics resources.

This book is for students, engineers, researchers and academics in plant biology, genetics, biotechnology and bioinformatics.
Language: English
Release date: Dec 14, 2022
ISBN: 9781789247534

    Book preview

    Plant Omics - Hajime Ohyanagi

    Preface

    I gave her one, they gave him two,

    You gave us three or more;

    They all returned from him to you,

    Though they were mine before.

    from Alice’s Adventures in Wonderland, Lewis Carroll (1865)

    * * *

    The concept and term of genome (gene suffixed by -ome, which refers to a totality or complete set) have been appreciated and exploited for more than a century. With this idea, the further elucidation of biological systems has revealed the extremely complex, stochastic, yet resilient and well-orchestrated nature of biology and has given names to the branches of genomics such as transcriptome, proteome, metabolome, phenome, and so forth, instead of confining the exploration to one gene at a time. As the suffix -ome suggests, each omics is inherently a big data biology whose ultimate goal is to integrate the myriad data into one, as in the above quote from the chapter Alice’s Evidence in Alice’s Adventures in Wonderland. For a long time, addressing the totality of biology was no more than a half-fledged hope, but advances in the technology of molecular biology have given wings to approaching such objectives.

    Among the kingdoms of life, Plantae is essential to humankind and has served as a source of model organisms from early genetics to modern basic science. The goal of this book is to provide students with baseline knowledge as a guide to omics and to present recent advances in selected topics, with a focus on plant omics.

    This book, Plant Omics: Advances in Big Data Biology, has three sections, corresponding to baseline knowledge, advanced topics, and resources. The baseline section covers plant genomics (Chapter 1), transcriptomics (Chapter 2), proteomics (Chapter 3), metabolomics (Chapter 4), phenomics (Chapter 5), non-coding transcriptomics (Chapter 6), epigenomics (Chapter 7), and organellar omics (Chapter 8). In the later chapters, advanced topics such as plant cis-element and transcription factors (Chapter 9), gene expression networks (Chapter 10), hormones (Chapter 11), plant–pathogen interactions (Chapter 12), GWAS (Genome-Wide Association Studies) (Chapter 13), genomic selection (Chapter 14), genome editing (Chapter 15), and deep learning (Chapters 16, 17, and 18) are dissected by cutting-edge plant scientists. In the last couple of chapters, valuable archives for plant experimental resources (Chapter 19) and online omics databases (Chapter 20) are summarized by resource specialists.

    As the editors, we would like to express our sincere gratitude to all the authors for their great contributions to this book. We hope that this book will serve as a guide for students and be an inspiring read for researchers from various fields. We thank Alison Smith, David Hemming, Ali Thompson, Emma McCann, and Marta Patiño of CABI for their continuous guidance and encouragement during all the stages of this project.

    Hajime Ohyanagi

    Eiji Yamamoto

    Ai Kitazumi

    Kentaro Yano

    1 Plant Genomics

    Masaru Bamba¹, Kenta Shirasawa², Sachiko Isobe², Nadia Kamal³, Klaus Mayer³ and Shusei Sato¹*

    ¹Graduate School of Life Sciences, Tohoku University, Japan; ²Laboratory of Plant Genetics and Genomics, Kazusa DNA Research Institute, Japan; ³Plant Genome and Systems Biology, Helmholtz Zentrum München, Munich, Germany

    *Corresponding author: shuseis@ige.tohoku.ac.jp

    © CAB International 2023. Plant Omics: Advances in Big Data Biology (eds H. Ohyanagi et al.)

    DOI: 10.1079/9781789247534.0001

    Abstract

    In this post-genomic era, we now have easy access to the genetic information of entire living organisms, and that information has been essential for biological research. The prosperity of genomics resulted from progress in DNA sequencing technologies, the development of computational analysis environments, and the establishment of biological resources. Plant genomics is one of the research fields that has strongly benefited from these technical advances. This chapter presents the evolution and transition of DNA sequencing technologies and gives concrete examples of the proliferation of plant genomic research.

    1.1 Introduction

    As was the case in other biological lineages, such as Saccharomyces cerevisiae in eukaryotes and Caenorhabditis elegans in multicellular organisms, plant genome analysis started with a single general model, Arabidopsis thaliana, by a large multinational consortium using the Sanger sequencing method (Arabidopsis Genome Initiative, 2000). The obtained Arabidopsis genome information has been used as a solid infrastructure in the plant research community. Along with the progress of DNA sequencing technologies, the target of plant genome analysis shifted toward a wide range of plant species (Fig. 1.1), and a variety of plant genome information has been applied as the basis for integration of the multiple biological omics data (Rai et al., 2017). The development of long-read sequencing technologies made it feasible to sequence not only a single representative genome but also multiple accessions within the same species (Golicz et al., 2020).

    In this chapter, we attempt to shed light on the status of plant genomics by describing the latest genome sequencing technologies and giving concrete examples of genome/pan-genome analysis in two plant taxa (Fabaceae and Poaceae).

    1.2 Advanced Technologies in Plant Genomics

    Genome sequencing technology has advanced dramatically in the 15 years since the appearance of next-generation sequencing (NGS) technologies. The first phase of NGS technology was the development of massively parallel sequencing platforms with read lengths of approximately 50–300 bp, the so-called second-generation sequencing technology, in contrast to the first-generation dideoxy chain termination method (Sanger method). This was followed by the third and fourth generations, in which sequencing of single DNA molecules without amplification was achieved, with median read lengths of approximately 10–20 kbp and some reads longer than 50 kbp. Because NGS platforms have turned over quickly in the past decade, the platforms used at the beginning of the NGS era, such as Roche 454 and ABI SOLiD, have already become obsolete. The Illumina HiSeq, which was a representative short-read sequencing platform in the 2010s, has also recently gone out of production. Despite the frequent updates to these platforms, all of the fundamental technologies in NGS are considered to be already in place, since no new concepts of sequencing strategy have been introduced during the past several years.

    Two illustrations depict phylogenetic relationships among green plants.

    Fig. 1.1. Sequenced plant genomes: important and milestone species in plant genomics. Plants whose whole genomes have been sequenced and that have been chosen as important/milestone species. Phylogenetic relationships (A) among green plants, except for seed plants (based on Wickett et al., 2014); and (B) among seed plants (based on Angiosperm Phylogeny Group (APG) IV). Taxonomic characteristics are shown on the branches. The common name or cultivar follows the scientific name.

    The current NGS platforms can roughly be classified into the following four categories: (i) bench-top short-read sequencing (e.g., Illumina MiSeq, Thermo Fisher Ion Proton); (ii) large-scale short-read sequencing (e.g., Illumina NovaSeq, MGI DNB-Seq); (iii) accurate long-read sequencing (e.g., PacBio Sequel II); and (iv) ultra-long-read sequencing (e.g., Oxford Nanopore Technologies). Short-read sequencing platforms are frequently used for base variant detection (single nucleotide polymorphisms (SNPs) and short insertions and deletions (indels)) by whole-genome shotgun sequencing-based methods such as GBS (Elshire et al., 2011), RAD-Seq (Baird et al., 2008), and GRAS-Di (Miki et al., 2020). Because they produce massive amounts of data, short-read sequencing platforms are also used for gene expression and protein–DNA interaction analyses through RNA sequencing (RNA-seq) and chromatin immunoprecipitation sequencing (ChIP-seq), respectively. Long-read sequencing platforms, on the other hand, are mainly applied to whole-genome assembly and structural variant identification at both the genome and transcriptome levels. In whole-genome assembly and structural variant identification, several other technologies also assist the analysis, such as optical mapping (e.g., Bionano Saphyr, available at https://bionanogenomics.com/, accessed July 2022) and the Hi-C library, which is based on a genome-wide chromatin conformation capture method (Lieberman-Aiden et al., 2009).

    Until long-read sequencing platforms gained popularity, short reads were used for genome assembly (Giani et al., 2020). Because of the highly repetitive nature of many plant genomes, de novo assembly (building a genome from scratch without any reference genome information) of plant genomes using short-read sequences tended to be a time- and labor-consuming process. The problem arose because the genomic origin of short reads derived from repetitive regions could not be identified. Plant genome sequencing projects were therefore carried out focusing on a single representative accession of the target species. The availability of long-read sequencing technologies is expected to help overcome the difficulties of assembling repeat-rich regions. Oxford Nanopore Technologies sequencers produce ultra-long reads of > 100 kb in length, but the sequences are error-prone (Dumschott et al., 2020). Furthermore, ultra-long-read sequencing requires the extraction of intact high-molecular-weight DNA, which is still challenging in plants because of the presence of cell walls and various secondary metabolites (Dumschott et al., 2020). The fourth-generation sequencing platforms thus still have room for improvement in plant genome analysis. Among current practical approaches, HiFi reads generated on the PacBio Sequel II by circular consensus sequencing (CCS), which reads the same molecule multiple times, are considered suitable for plant genome sequencing because of their accuracy (which has improved from 90% to more than 99.9%) at read lengths of 10–20 kb (Hon et al., 2020). Because of this high accuracy, contigs constructed with HiFi reads do not require error correction after assembly. In addition, Hi-C and comparable methods, such as Omni-C, have contributed greatly to constructing proximity maps that generate chromosome-scale scaffolds, although it is recommended that the results be confirmed by comparison with optical and/or linkage mapping (Udall and Dawe, 2018).
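
    The effect of reading one molecule multiple times can be illustrated with a toy calculation. The Python sketch below is not the consensus algorithm used by any sequencing vendor; it assumes, purely for illustration, that each pass calls a given base correctly with the same probability, that errors are independent between passes, and that the consensus is correct whenever a strict majority of passes agree on the true base. The function name majority_vote_accuracy is hypothetical.

```python
from math import comb


def majority_vote_accuracy(per_pass_accuracy: float, passes: int) -> float:
    """Probability that a strict majority of independent passes call a base correctly.

    Toy model only: assumes independent errors with identical per-pass accuracy.
    """
    p = per_pass_accuracy
    needed = passes // 2 + 1  # smallest number of correct passes that forms a strict majority
    return sum(
        comb(passes, k) * p**k * (1 - p) ** (passes - k)
        for k in range(needed, passes + 1)
    )


if __name__ == "__main__":
    # Per-pass accuracy of 90%, as quoted for uncorrected long reads in the text.
    for n in (1, 3, 5, 7, 9):
        print(f"{n} passes: {majority_vote_accuracy(0.90, n):.5f}")
```

    Under these simplifying assumptions, nine passes at 90% per-pass accuracy are enough to exceed 99.9% per-base accuracy, which is consistent in spirit with the improvement described above. Real error profiles are not independent or uniform, so the numbers should be read as an intuition aid only.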

    These technologies allow us to establish high-quality (reference-level) plant genomes and compare them more easily and efficiently. The comparison of many genomes allows us to estimate a plant’s historical trajectories with population genomics approaches and to infer which genetic polymorphisms were responsible for phenotypic variation (Bamba et al., 2019). Furthermore, comparing high-quality genomes removes the limitation of focusing only on differences in the core genome shared among all focal organisms. The progress of sequencing technologies is therefore bringing, and will continue to bring, plant genomics research into the era of pan-genome analysis.

    1.3 Status of Fabaceae Genomics

    Fabaceae (Leguminosae) is the third-largest family of flowering plants, consisting of 751 genera and 19,500 species (Christenhusz and Byng, 2016). The economic value of Fabaceae for human consumption is second only to that of Gramineae, and its most significant ecological characteristic is biological nitrogen-fixing symbiosis with nodule bacteria called rhizobia (Bennett, 2011). Since nitrogen-fixing symbiosis could help reduce the use of chemical fertilizer for plant growth, leguminous plants are drawing attention for sustainable crop production; significant efforts have therefore been made to establish their genomic resources.

    In Fabaceae, a draft genome of Lotus japonicus was published in 2008 ahead of other legumes (Sato et al., 2008), followed by the complete genomes of soybean (Glycine max) (Schmutz et al., 2010) and Medicago truncatula (Young et al., 2011). L. japonicus and M. truncatula are used as model legume species for nitrogen-fixing symbiosis, while soybean is used to study the molecular basis of seed protein and oil production. In addition to these three species, genome analyses of other leguminous plants have been carried out for 12 species, including pigeon pea (Varshney et al., 2012), chickpea (Varshney et al., 2013), mung bean (Kang et al., 2014), common bean (Schmutz et al., 2014), adzuki bean (Kang et al., 2015), hyacinth bean (Chang et al., 2019), white lupin (Hufnagel et al., 2020), pea (Kreplak et al., 2019), bambara groundnut (Chang et al., 2019), cowpea (Lonardi et al., 2019), asparagus bean (Xia et al., 2019), and black lentil (Pootakham et al., 2020). Genomes of three species, lima bean (Wisser et al., 2021), rice bean (Kaul et al., 2019), and cluster bean (Gaikwad et al., 2020), have been posted to preprint servers, and the lentil genome has not been published but is available as pre-released information (KnowPulse, available at https://knowpulse.usask.ca/, accessed July 2022). In other leguminous plants, genomes of 30 species, including clover (Istvánek et al., 2014), lupin (Hane et al., 2017), and peanut (Bertioli et al., 2019), have been published. In total, the genomes of 46 species belonging to 29 genera were available at the time of writing (December 2020), and it can be said that the genomes of all major commercial legumes have been revealed, except for faba bean (Vicia faba). Whole-genome sequencing of faba bean is challenging because of its large genome size (around 13 Gbp); however, advanced sequencing platforms should make it achievable.

    For four crop species (peanut, pigeon pea, soybean, and white lupin) and for M. truncatula, germlines and pan-genomic data that can be used to detect structural variations are available. Soybean has the largest set of germlines, consisting of over 50,000 lines (Liu et al., 2020), 2819 of which have been re-sequenced, and 23 genomes have been assembled to reference-level quality. Additionally, genome data for the wild relative of soybean (Glycine soja) have been established for over 100 accessions. For peanut, re-sequencing data for a large set of germlines (over 10,000) and five high-quality genome assemblies are available. Peanut is an allotetraploid species, and genome information on the predicted progenitor species (Arachis duranensis, A. ipaensis, and A. monticola) is also available (Bertioli et al., 2016; Yin et al., 2018). In pigeon pea and white lupin, germline re-sequencing and pan-genomic data are available, although there is currently no information on wild relatives (Zhao et al., 2020; Hufnagel et al., 2021). In Medicago, re-sequencing data on germlines of M. truncatula and M. sativa are available. The pan-genome of M. truncatula can be used for the detection of structural variations (Zhou et al., 2017), and re-sequencing-level pan-genomes of M. sativa are available (Shen et al., 2020).

    In addition, re-sequencing-level pan-genome information is available in six leguminous species: adzuki bean (Yang et al., 2015), common bean (Lobaton et al., 2018), pea (Kreplak et al., 2019), chickpea (Varshney et al., 2019), L. japonicus (Shah et al., 2020), and black lentil (Pootakham et al., 2020). In L. japonicus, there are pan-genomes and germlines for 136 wild accessions, and these were used to understand the adaptation history of the species in its natural environment (Shah et al., 2020). Chickpea, common bean, and pea have pan-genome information consisting of 429, 35, and 42 lines, respectively (Lobaton et al., 2018; Varshney et al., 2019; Kreplak et al., 2019). For Vigna pan-genomes, 49 and 89 genomes of adzuki bean and black lentil, respectively, are available (Yang et al., 2015; Pootakham et al., 2020); furthermore, the cowpea pan-genome project (CowpeaPan) is in progress.

    1.4 Status of Poaceae Genomics

    In the Poaceae, the rice genome (Oryza sativa subsp. japonica cv. Nipponbare) was determined ahead of those of all other monocots (International Rice Genome Sequencing Project, 2005). This genomic information has been used as a reference for the sequencing of other Poaceae crops, such as maize (Schnable et al., 2009), sorghum (Paterson et al., 2009), barley (International Barley Genome Sequencing Consortium, 2012), and wheat (International Wheat Genome Sequencing Consortium, IWGSC, 2018). Whole-genome variant information for other rice cultivars (indica, Guangluai-3, Nongken-58, and Kasalath) (Sakai et al., 2014) and for the wild species Oryza rufipogon and O. longistaminata, which are candidates for their origin, has also been published. In addition, the rice genome collection so far consists of more than 200 high-quality collections and more than 450 low-quality collections (Huang et al., 2012). The rice genome has therefore become an essential tool for agricultural prosperity with the Poaceae.

    One of the most significant milestones in recent Poaceae genome research is the determination of the cereal crop genomes. In recent years there have been significant breakthroughs in sequencing technologies and in the ability to assemble even the largest and most complex cereal genomes, such as those of bread wheat and barley. For both of these species, reference-quality genome assemblies have been generated, in 2017 for barley (Mascher et al., 2017) and in 2018 for wheat (IWGSC, 2018), using novel computational strategies and genome assembly algorithms. The high repeat content (> 80%), high transposon activity, large genome sizes (e.g., 17 Gb for bread wheat, five times larger than the human genome), and polyploidy complicated the assembly of cereal genomes for a long time. Single reference genomes are an invaluable tool to better understand cereal biology and unlock the gene content as well as regulatory networks. To assess the genetic potential of natural variation in cereal crops, however, multi-genome comparisons become essential. As a consequence, genome projects that include the generation and comparative analysis of multiple reference genome sequences for wheat and barley have started to arise. A pan-genome represents the entire set of genes within a species. Major objectives of these pan-genome projects are to determine the core gene set, i.e., the set of genes shared by all lines, and the dispensable genes, i.e., genes shared by only some lines or found in a single line (singleton genes). Other main areas of study are structural variation, single nucleotide polymorphisms (SNPs), particular genes and quantitative trait loci (QTLs) involved in specific traits, copy number variations (CNVs), presence–absence variations (PAVs), and many more.
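
    The partition of a pan-genome into core and dispensable genes can be sketched with simple set operations. The following Python example is a minimal illustration and not the method of any project cited in this chapter: it assumes that gene presence has already been reduced to one set of gene identifiers per line, and the function name classify_pan_genome, the line names, and the gene identifiers are all hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def classify_pan_genome(gene_sets: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Split the genes observed across several lines into pan-genome categories."""
    n_lines = len(gene_sets)
    if n_lines < 2:
        raise ValueError("at least two lines are needed to separate core from singleton genes")

    # Count in how many lines each gene is present.
    counts = Counter(gene for genes in gene_sets.values() for gene in genes)

    pan = set(counts)                                          # every gene seen in any line
    core = {g for g, c in counts.items() if c == n_lines}      # present in all lines
    singleton = {g for g, c in counts.items() if c == 1}       # present in exactly one line
    shared_by_some = pan - core - singleton                    # present in several, not all
    return {
        "pan": pan,
        "core": core,
        "shared_by_some": shared_by_some,
        "singleton": singleton,
        "dispensable": shared_by_some | singleton,
    }


if __name__ == "__main__":
    # Hypothetical presence sets for three lines.
    example = {
        "line_A": {"g1", "g2", "g3"},
        "line_B": {"g1", "g2", "g4"},
        "line_C": {"g1", "g5"},
    }
    for category, genes in classify_pan_genome(example).items():
        print(category, sorted(genes))
```

    In practice, deciding whether a gene is present in a line is itself the difficult step, because it depends on assembly quality, annotation, and orthology assignment; the sketch only covers the bookkeeping that follows.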

    One recent pan-genome project working on a polyploid species with a large genome is the international 10+ wheat genome project, coordinated by Prof. Curtis Pozniak from University of Saskatchewan. For this project, wheat lines from all around the world were chosen to ensure a maximum of genetic diversity and hence a pan-genome as complete as possible. The selected 10+ bread wheat cultivars were subsequently sequenced using Illumina short-read technology and assembled with NRGene’s DeNovoMagic (NRGene, Ness Ziona, Israel) algorithm, leading to high-quality chromosome-scale genome assemblies. Comparing the gene content of these reference genomes revealed variation in gene content, which likely reflects the complex breeding history of the selected lines as well as adaptation to diverse environments throughout their breeding history. Extensive efforts to improve grain yield and quality and make plants more resistant to pests and diseases are also reflected in the genic space. By comparing the chromosomal structure of the reference sequences, a diversity of structural rearrangements and introgressions from wild relatives could be identified. This also highlights the importance of multiple reference genomes in high-quality genome projects, since this enables the investigation of chromosomal translocations, duplications, and deletions with high accuracy.

    For the barley pan-genome project, 20 diverse barley lines were selected from 22,000 barley accessions that were previously hosted at IPK Gatersleben and have been genetically characterized (Milner et al., 2019; Jayakodi et al., 2020). The selected 20 lines represent the major barley germplasms and include eight cultivars, 11 landraces and one wild barley accession (Hordeum vulgare subsp. spontaneum). High-quality reference genome assemblies were generated for the 20 accessions using either the TRITEX pipeline (Monat et al., 2019) or other short-read assembly algorithms.

    Comparative structural analysis of the 20 barley lines showed that the single-copy barley core genome present in all lines comprised 402.5 Mb and included almost the entirety of the annotated gene space. In contrast, PAV was found in a total of 235.9 Mb of single-copy sequence in the panel of 20 accessions, representing the variable component of the pan-genome.

    In the study by Jayakodi et al. (2020), a method based on chromosome conformation capture sequencing (Hi-C) (Himmelbach et al., 2018) was used to study large chromosomal inversions (> 1 Mb). The genomes of 70 accessions were analyzed, and 42 inversions ranging from 4 Mb to 141 Mb in size were identified. The majority of these inversions were located in the proximal regions of the chromosome arms, which are known for their low recombination rate.

    In summary, the newly sequenced wheat and barley reference genomes provide an unprecedented basis for functional gene discovery and breeding that help to improve cereals. Subsequent project phases include the generation of de novo gene predictions for all assemblies based on extensive transcriptomic data. These data will be the basis for in-depth insights into the functional and regulatory organization of the wheat and barley pan-genomes.

    1.5 Conclusion

    Genome information on plant species has been an important tool for anchoring the extensive datasets produced by related analyses. The high-throughput capacity introduced by NGS technologies has made it feasible, in a wide range of plant species, to apply advanced genetic approaches that use a large number of germline resources, such as population genomics and genomic selection. The cost reduction and enhanced quality of long-read sequencing technology will make complex genomes accessible for whole-genome investigation as well as pan-genome analysis, both of which offer a broader understanding of the genetic diversity of gene pools in the target species. Accumulating comprehensive genome information will continue to be the basis for plant research by integrating the large variety of information provided by advancing plant omics approaches.

    References

    Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408(6814), 796–815. DOI: 10.1038/35048692.

    Baird, N.A., Etter, P.D., Atwood, T.S., Currey, M.C., Shiver, A.L. et al. (2008) Rapid SNP discovery and genetic mapping using sequenced RAD markers. PloS ONE 3(10), e3376. DOI: 10.1371/journal.pone.0003376.

    Bamba, M., Kawaguchi, Y.W. and Tsuchimatsu, T. (2019) Plant adaptation and speciation studied by population genomic approaches. Development, Growth & Differentiation 61(1), 12–24. DOI: 10.1111/dgd.12578.

    Bennett, B.C. (2011) Twenty-five economically important plant families. Encyclopedia of Life Support Systems (EOLSS), Economic Botany. Available at: https://docplayer.net/20954333-Twentyfive-economically-important-plant-families.html (accessed June 2022).

    Bertioli, D.J., Cannon, S.B., Froenicke, L., Huang, G., Farmer, A.D. et al. (2016) The genome sequences of Arachis duranensis and Arachis ipaensis, the diploid ancestors of cultivated peanut. Nature Genetics 48(4), 438–446. DOI: 10.1038/ng.3517.

    Bertioli, D.J., Jenkins, J., Clevenger, J., Dudchenko, O., Gao, D. et al. (2019) The genome sequence of segmental allotetraploid peanut Arachis hypogaea. Nature Genetics 51(5), 877–884. DOI: 10.1038/s41588-019-0405-z.

    Chang, Y., Liu, H., Liu, M., Liao, X., Sahu, S.K. et al. (2019) The draft genomes of five agriculturally important African orphan crops. GigaScience 8(3), giy152. DOI: 10.1093/gigascience/giy152.

    Christenhusz, M.J.M. and Byng, J.W. (2016) The number of known plants species in the world and its annual increase. Phytotaxa 261(3), 201. DOI: 10.11646/phytotaxa.261.3.1.

    Dumschott, K., Schmidt, M.H.-W., Chawla, H.S., Snowdon, R. and Usadel, B. (2020) Oxford nanopore sequencing: new opportunities for plant genomics? Journal of Experimental Botany 71(18), 5313–5322. DOI: 10.1093/jxb/eraa263.

    Elshire, R.J., Glaubitz, J.C., Sun, Q., Poland, J.A., Kawamoto, K. et al. (2011) A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PloS ONE 6(5), e19379. DOI: 10.1371/journal.pone.0019379.

    Gaikwad, K., Ramakrishna, G., Srivastava, H., Saxena, S., Kaila, T. et al. (2020) Chromosome scale reference genome of cluster bean (Cyamopsis tetragonoloba (L.) Taub). [bioRxiv]. DOI: 10.1101/2020.05.16.098434.

    Giani, A.M., Gallo, G.R., Gianfranceschi, L. and Formenti, G. (2020) Long walk to genomics: history and current approaches to genome sequencing and assembly. Computational and Structural Biotechnology Journal 18, 9–19. DOI: 10.1016/j.csbj.2019.11.002.

    Golicz, A.A., Bayer, P.E., Bhalla, P.L., Batley, J. and Edwards, D. (2020) Pangenomics comes of age: from bacteria to plant and animal applications. Trends in Genetics 36(2), 132–145. DOI: 10.1016/j.tig.2019.11.006.

    Hane, J.K., Ming, Y., Kamphuis, L.G., Nelson, M.N., Garg, G. et al. (2017) A comprehensive draft genome sequence for lupin (Lupinus angustifolius), an emerging health food: insights into plant-microbe interactions and legume evolution. Plant Biotechnology Journal 15(3), 318–330. DOI: 10.1111/pbi.12615.

    Himmelbach, A., Ruban, A., Walde, I., Šimková, H., Doležel, J. et al. (2018) Discovery of multi-megabase polymorphic inversions by chromosome conformation capture sequencing in large-genome plant species. The Plant Journal 96(6), 1309–1316. DOI: 10.1111/tpj.14109.

    Hon, T., Mars, K., Young, G., Tsai, Y.-C., Karalius, J.W. et al. (2020) Highly accurate long-read HiFi sequencing data for five complex genomes. Scientific Data 7(1), 399. DOI: 10.1038/s41597-020-00743-4.

    Huang, X., Kurata, N., Wei, X., Wang, Z.-X., Wang, A. et al. (2012) A map of rice genome variation reveals the origin of cultivated rice. Nature 490(7421), 497–501. DOI: 10.1038/nature11532.

    Hufnagel, B., Marques, A., Soriano, A., Marquès, L., Divol, F. et al. (2020) High-quality genome sequence of white lupin provides insight into soil exploration and seed quality. Nature Communications 11(1), 1–12. DOI: 10.1038/s41467-019-14197-9.

    Hufnagel, B., Soriano, A., Taylor, J., Divol, F., Kroc, M. et al. (2021) Pangenome of white lupin provides insights into the diversity of the species. Plant Biotechnology Journal 19(12), 2532–2543. DOI: 10.1111/pbi.13678.

    International Barley Genome Sequencing Consortium (2012) A physical, genetic and functional sequence assembly of the barley genome. Nature 491, 711–716.

    International Rice Genome Sequencing Project (2005) The map-based sequence of the rice genome. Nature 436, 793–800.

    International Wheat Genome Sequencing Consortium (2018) Shifting the limits in wheat research and breeding using a fully annotated reference genome. Science 361, 6403.

    Istvánek, J., Jaros, M., Krenek, A. and Řepková, J. (2014) Genome assembly and annotation for red clover (Trifolium pratense; Fabaceae). American Journal of Botany 101(2), 327–337. DOI: 10.3732/ajb.1300340.

    Jayakodi, M., Padmarasu, S., Haberer, G., Bonthala, V.S., Gundlach, H. et al. (2020) The barley pan-genome reveals the hidden legacy of mutation breeding. Nature 588(7837), 284–289. DOI: 10.1038/s41586-020-2947-8.

    Kang, Y.J., Kim, S.K., Kim, M.Y., Lestari, P., Kim, K.H. et al. (2014) Genome sequence of mungbean and insights into evolution within Vigna species. Nature Communications 5, 5443. DOI: 10.1038/ncomms6443.

    Kang, Y.J., Satyawan, D., Shim, S., Lee, T., Lee, J. et al. (2015) Draft genome sequence of adzuki bean, Vigna angularis. Scientific Reports 5, 1–8. DOI: 10.1038/srep08069.

    Kaul, T., Eswaran, M., Thangaraj, A., Meyyazhagan, A., Nehra, M. et al. (2019) Rice bean (Vigna umbellata) draft genome sequence: unravelling the late flowering and unpalatability related genomic resources for efficient domestication of this underutilized crop. [bioRxiv]. DOI: 10.1101/816595.

    Kreplak, J., Madoui, M.-A., Cápal, P., Novák, P., Labadie, K. et al. (2019) A reference genome for pea provides insight into legume genome evolution. Nature Genetics 51(9), 1411–1422. DOI: 10.1038/s41588-019-0480-1.

    Lieberman-Aiden, E., van Berkum, N.L., Williams, L., Imakaev, M., Ragoczy, T. et al. (2009) Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326(5950), 289–293. DOI: 10.1126/science.1181369.

    Liu, Y., Du, H., Li, P., Shen, Y., Peng, H. et al. (2020) Pan-genome of wild and cultivated soybeans. Cell 182(1), 162–176. DOI: 10.1016/j.cell.2020.05.023.

    Lobaton, J.D., Miller, T., Gil, J., Ariza, D., de la Hoz, J.F. et al. (2018) Resequencing of common bean identifies regions of inter-gene pool introgression and provides comprehensive resources for molecular breeding. The Plant Genome 11(2), 170068. DOI: 10.3835/plantgenome2017.08.0068.

    Lonardi, S., Muñoz-Amatriaín, M., Liang, Q., Shu, S., Wanamaker, S.I. et al. (2019) The genome of cowpea (Vigna unguiculata [L.] Walp.). The Plant Journal 98(5), 767–782. DOI: 10.1111/tpj.14349.

    Mascher, M., Gundlach, H., Himmelbach, A., Beier, S., Twardziok, S.O. et al. (2017) A chromosome conformation capture ordered sequence of the barley genome. Nature 544(7651), 427–433. DOI: 10.1038/nature22043.

    Miki, Y., Yoshida, K., Enoki, H., Komura, S., Suzuki, K. et al. (2020) GRAS-Di system facilitates high-density genetic map construction and QTL identification in recombinant inbred lines of the wheat progenitor Aegilops tauschii. Scientific Reports 10(1), 21455. DOI: 10.1038/s41598-020-78589-4.

    Milner, S.G., Jost, M., Taketa, S., Mazón, E.R., Himmelbach, A. et al. (2019) Genebank genomics highlights the diversity of a global barley collection. Nature Genetics 51(2), 319–326. DOI: 10.1038/s41588-018-0266-x.

    Monat, C., Padmarasu, S., Lux, T., Wicker, T., Gundlach, H. et al. (2019) TRITEX: chromosome-scale sequence assembly of Triticeae genomes with open-source tools. Genome Biology 20(1), 284. DOI: 10.1186/s13059-019-1899-5.

    Paterson, A.H., Bowers, J.E., Bruggmann, R., Dubchak, I., Grimwood, J. et al. (2009) The Sorghum bicolor genome and the diversification of grasses. Nature 457(7229), 551–556. DOI: 10.1038/nature07723.

    Pootakham, W., Nawae, W., Naktang, C., Sonthirod, C., Yoocha, T. et al. (2020) A chromosome-scale assembly of the black gram (Vigna mungo) genome. Molecular Ecology Resources 21(1), 238–250. DOI: 10.1111/1755-0998.13243.

    Rai, A., Saito, K. and Yamazaki, M. (2017) Integrated omics analysis of specialized metabolism in medicinal plants. The Plant Journal 90(4), 764–787. DOI: 10.1111/tpj.13485.

    Sakai, H., Kanamori, H., Arai-Kichise, Y., Shibata-Hatta, M., Ebana, K. et al. (2014) Construction of pseudomolecule sequences of the aus rice cultivar Kasalath for comparative genomics of Asian cultivated rice. DNA Research 21(4), 397–405. DOI: 10.1093/dnares/dsu006.

    Sato, S., Nakamura, Y., Kaneko, T., Asamizu, E., Kato, T. et al. (2008) Genome structure of the legume, Lotus japonicus. DNA Research 15(4), 227–239. DOI: 10.1093/dnares/dsn008.

    Schmutz, J., Cannon, S.B., Schlueter, J., Ma, J., Mitros, T. et al. (2010) Genome sequence of the palaeopolyploid soybean. Nature 463(7278), 178–183. DOI: 10.1038/nature08670.

    Schmutz, J., McClean, P.E., Mamidi, S., Wu, G.A., Cannon, S.B. et al. (2014) A reference genome for common bean and genome-wide analysis of dual domestications. Nature Genetics 46(7), 707–713. DOI: 10.1038/ng.3008.

    Schnable, P.S., Ware, D., Fulton, R.S., Stein, J.C., Wei, F. et al. (2009) The B73 maize genome: complexity, diversity, and dynamics. Science 326(5956), 1112–1115. DOI: 10.1126/science.1178534.

    Shah, N., Wakabayashi, T., Kawamura, Y., Skovbjerg, C.K., Wang, M.-Z. et al. (2020) Extreme genetic signatures of local adaptation during Lotus japonicus colonization of Japan. Nature Communications 11(1), 253. DOI: 10.1038/s41467-019-14213-y.

    Shen, C., Du, H., Chen, Z., Lu, H., Zhu, F. et al. (2020) The chromosome-level genome sequence of the autotetraploid alfalfa and resequencing of core germplasms provide genomic resources for alfalfa research. Molecular Plant 13(9), 1250–1261. DOI: 10.1016/j.molp.2020.07.003.

    Udall, J.A. and Dawe, R.K. (2018) Is it ordered correctly? Validating genome assemblies by optical mapping. The Plant Cell 30(1), 7–14. DOI: 10.1105/tpc.17.00514.

    Varshney, R.K., Chen, W., Li, Y., Bharti, A.K., Saxena, R.K. et al. (2012) Draft genome sequence of pigeonpea (Cajanus cajan), an orphan legume crop of resource-poor farmers. Nature Biotechnology 30(1), 83–89. DOI: 10.1038/nbt.2022.

    Varshney, R.K., Song, C., Saxena, R.K., Azam, S., Yu, S. et al. (2013) Draft genome sequence of chickpea (Cicer arietinum) provides a resource for trait improvement. Nature Biotechnology 31(3), 240–246. DOI: 10.1038/nbt.2491.

    Varshney, R.K., Thudi, M., Roorkiwal, M., He, W., Upadhyaya, H.D. et al. (2019) Resequencing of 429 chickpea accessions from 45 countries provides insights into genome diversity, domestication and agronomic traits. Nature Genetics 51(5), 857–864. DOI: 10.1038/s41588-019-0401-3.

    Wickett, N.J., Mirarab, S., Nguyen, N., Warnow, T., Carpenter, E. et al. (2014) Phylotranscriptomic analysis of the origin and early diversification of land plants. Proceedings of the National Academy of Sciences 111(45), E4859–E4868. DOI: 10.1073/pnas.1323926111.

    Wisser, R.J., Oppenheim, S.J., Ernest, E.G., Mhora, T.T., Dumas, M.D. et al. (2021) Genome assembly of a Mesoamerican derived variety of lima bean: a foundational cultivar in the Mid-Atlantic USA. G3|Genes|Genomes|Genetics 11(11), jkab207. DOI: 10.1093/g3journal/jkab207.

    Xia, Q., Pan, L., Zhang, R., Ni, X., Wang, Y. et al. (2019) The genome assembly of asparagus bean, Vigna unguiculata ssp. sesquipedialis. Scientific Data 6(1), 1–10. DOI: 10.1038/s41597-019-0130-6.

    Yang, K., Tian, Z., Chen, C., Luo, L., Zhao, B. et al. (2015) Genome sequencing of adzuki bean (Vigna angularis) provides insight into high starch and low fat accumulation and domestication. Proceedings of the National Academy of Sciences 112(43), 13213–13218. DOI: 10.1073/pnas.1420949112.

    Yin, D., Ji, C., Ma, X., Li, H., Zhang, W. et al. (2018) Genome of an allotetraploid wild peanut Arachis monticola: a de novo assembly. GigaScience 7(6), 1–9. DOI: 10.1093/gigascience/giy066.

    Young, N.D., Debellé, F., Oldroyd, G.E.D., Geurts, R., Cannon, S.B. et al. (2011) The Medicago genome provides insight into the evolution of rhizobial symbioses. Nature 480(7378), 520–524. DOI: 10.1038/nature10625.

    Zhao, J., Bayer, P.E., Ruperao, P., Saxena, R.K., Khan, A.W. et al. (2020) Trait associations in the pangenome of pigeon pea (Cajanus cajan). Plant Biotechnology Journal 18(9), 1946–1954. DOI: 10.1111/pbi.13354.

    Zhou, P., Silverstein, K.A.T., Ramaraj, T., Guhlin, J., Denny, R. et al. (2017) Exploring structural variation and gene family architecture with de novo assemblies of 15 Medicago genomes. BMC Genomics 18(1), 1–14. DOI: 10.1186/s12864-017-3654-1.

    2 Plant Transcriptomics: Data-driven Global Approach to Understand Cellular Processes and Their Regulation in Model and Non-Model Plants

    Ai Kitazumi¹, Isaiah C.M. Pabuayon¹, Kevin R. Cushman¹, Kentaro Yano² and Benildo G. de los Reyes¹*

    ¹Department of Plant and Soil Science, Texas Tech University, Lubbock, Texas, USA; ²School of Agriculture, Meiji University, Kawasaki, Japan

    *Corresponding author: benildo.reyes@ttu.edu

    © CAB International 2023. Plant Omics: Advances in Big Data Biology (eds H. Ohyanagi et al.)

    DOI: 10.1079/9781789247534.0002

    Abstract

    Under the new paradigms of integrative and network biology, comparing expression changes among a small subset of candidate genes across phenotypic variants is hardly informative or conclusive in the context of cellular response regulation and the underlying genetic mechanisms. Integration of global changes in gene expression under multiple conditions with systematically curated genomic databases is key to the efficient extraction of robust and biologically meaningful patterns and signatures that reflect cellular states. Despite the increasing availability of a wide array of computational tools, the resolution of RNA-seq-based transcriptome profiling is only as good as the experimental design, which determines the window of information revealed relative to the hypothesis being tested, and this intricacy is often underestimated. In this chapter, we discuss the important aspects of data analytics and the basic principles that must be taken into consideration to better bridge the design of the wet-lab experiments with the requirements of a robust dry-lab knowledge dissection and integration. We also highlight the differences in assumptions and requirements between transcriptome experiments conducted using plant genetic models, which have comprehensive and annotated genomes for reference-guided assembly and extraction of biological knowledge, and those using non-model plant species, which rely on de novo assembly of transcriptome datasets followed by homology-based comparison with closely related species that have a reference genome.

    2.1 Introduction

    The inherent potential of every single cell in multicellular organisms such as plants is defined by the same nuclear genome. Multicellularity is achieved because the genome is expressed in many different ways, facilitating differentiation, morphogenesis, growth, and adaptive responses. Intricately regulated expression of the genome in time and space creates a vast, dynamic, and enormously complex transcriptome that varies among cell types in response to intrinsic and extrinsic signals (Araújo et al., 2017). Transcriptome profiles represent different networks of gene induction and repression that uniquely identify every cell type under specific conditions, i.e., spatio-temporal signatures. Thus, transcriptome analysis is a central bridge connecting genotype and phenotype. It also represents a core component of large data-driven exploratory research and hypothesis-driven discovery in integrative plant biology.

    Being sessile organisms, plants exhibit a high degree of plasticity. The core of such inherent adaptive potential is the dynamic changes in the transcriptome, which represent the outcomes of integrating different developmental and environmental signals (Paaby and Rockman, 2014). An important aspect of the dynamic nature of the transcriptome is the contribution of extensive gene duplication, which is an important feature of the genomes of many plant species. Duplicated genes in large families serve as substrates for sub-functionalization through the creation of novel and/or specialized networks comprising distinct subsets of genes with multiple paralogs (Panchy et al., 2016). Differential regulation of individual paralogs and their interaction with their associated genes across the genome contribute to the large permutation of transcriptome configurations that define different adaptive responses in plants (Das et al., 2016; Kitazumi et al., 2018; Pabuayon et al., 2020). The diversity by which the genome could be expressed differentially to configure adaptive transcriptomic responses is the consequence of multiple layers of regulation. Gene expression fluxes are the outcomes of regulation at the level of transcriptional initiation (i.e., cis-regulation by enhancers and silencers, and trans-regulation by transcriptional activators and repressors) (Shlyueva et al., 2014; de los Reyes et al., 2015), post-transcriptional transcript degradation by microRNAs (miRNAs) (Jones-Rhoades et al., 2006; Kitazumi et al., 2015; Pabuayon et al., 2020), and epigenomic or chromatin-level control through DNA methylation, noncoding RNAs (ncRNA), and histone modification (Gibney and Nolan, 2010; Law and Jacobsen, 2010; de los Reyes et al., 2018; de los Reyes, 2019). Collectively, these layers of regulation define the full potential of the genome to configure a vast array of transcriptome states (i.e., spatio-temporal signatures) to account for the complex requirements of multicellularity and adaptation. Understanding the biological implications of spatio-temporal fluxes in the transcriptome, qualitatively and quantitatively, is a critical first step for understanding the intricate mechanisms governing cellular-level and whole-organismal-level responses.

    During the past three decades, we have seen the evolution of the technologies and approaches used for profiling the transcriptome, from the semi-global clone-by-clone sequencing of expressed sequence tags (ESTs) to the global hybridization-based profiling by microarray and later to the first-generation global sequencing-based platforms such as massively parallel signature sequencing (MPSS) (Wang et al., 2009). A more recent innovation was the application of next-generation sequencing (NGS) technology, which led to a paradigm shift that allowed an even more universal scope of profiling the spatio-temporal transcriptome fluxes by direct sampling and deep-sequencing of transcripts (RNA-seq technology), which was not possible with the earlier technologies. The RNA-seq technology not only provided a powerful means for capturing at high resolution and dynamic range the most subtle changes in transcriptome fluxes, but it also allowed the profiling of qualitative changes by revealing the contributions of alternative splicing (Trapnell et al., 2012; Sibley et al., 2016). In effect, the RNA-seq technology afforded a truly comprehensive view of the vast array of expression capacities of the genome, further allowing the interpretation of such changes in the context of regulatory networks and synergistic interactions, which are keys for a meaningful view of the intricacy of cellular and biological functions in relation to genotype and adaptive phenotypes.

    2.2 Overview of RNA-Seq-Based Transcriptome Profiling

    Transcriptomics by RNA-seq facilitates the profiling of transcript abundance for every gene locus and its alternative splicing variants, miRNAs, and all other classes of ncRNAs across the entire euchromatic and heterochromatic regions of the genome (Wang et al., 2009). Each sequence obtained from a reverse-transcribed fragment of the cellular RNA pool is called a read or short read, as opposed to the long reads generated from RNA without fragmentation. Sequence reads are generated from one end (single-end reads) or both ends (paired-end reads) of fragmented RNA molecules. In a process called reference-based mapping, the reads representing the sample are mapped against a reference genome (i.e., a complete or near-complete genomic sequence of the target organism, hereafter referred to as the reference) by finding the best-matching site for each read in that genome. The genomic location and the number of reads per genomic location (i.e., depth) are used to infer the degree of transcriptional activity. Alternatively, de novo assembly is conducted in the absence of a reference genome, which will be discussed in subsequent sections.
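
    The relationship between mapped reads and depth can be illustrated with a minimal sketch. It assumes that each read has already been aligned and is described only by a start coordinate and a length on a single reference sequence; the function names, the toy coordinates, and the locus names geneA and geneB are hypothetical and are not taken from any particular mapping tool.

```python
from collections import Counter


def coverage_depth(alignments, region_start, region_end):
    """Return the number of reads covering each position in [region_start, region_end)."""
    depth = Counter()
    for read_start, read_length in alignments:
        # Clip each read to the region of interest before counting.
        first = max(read_start, region_start)
        last = min(read_start + read_length, region_end)
        for pos in range(first, last):
            depth[pos] += 1
    return [depth[pos] for pos in range(region_start, region_end)]


def reads_per_locus(alignments, loci):
    """Count reads that overlap each locus, given as name -> (start, end), end exclusive."""
    counts = {name: 0 for name in loci}
    for read_start, read_length in alignments:
        read_end = read_start + read_length
        for name, (locus_start, locus_end) in loci.items():
            if read_start < locus_end and read_end > locus_start:
                counts[name] += 1
    return counts


# Hypothetical toy data: (start, length) of four aligned reads and two loci.
alignments = [(0, 5), (2, 5), (3, 5), (20, 5)]
loci = {"geneA": (0, 10), "geneB": (18, 30)}

print(coverage_depth(alignments, 0, 10))
print(reads_per_locus(alignments, loci))
```

    In this toy example, three reads overlap geneA and one overlaps geneB, so the raw read count already suggests higher transcriptional activity at geneA. Real analyses additionally account for locus length, library size, and ambiguous mappings.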

    RNA-seq-based transcriptome analysis has a wide range of applications from the exploratory investigation and comparative quantitative and qualitative analysis of cell-type-, tissue-, organ-, developmental stage-, and/or treatment-specific patterns of transcription, to a more hypothesis-driven investigation and confirmation of downstream target genes of a single mutation event (e.g., overexpression and knockout), and global analysis of large-scale co-activation or co-suppression of genes in a regulatory network. According to the same principle, RNA-seq-based transcriptome profiling can be used to identify the various signals, processing, interaction, targeting, and fate of transcribed protein-coding genes across the genome, including the mapping of transcription start sites (TSS-seq) (Yamashita et al., 2011), detection of selective polyadenylation for maturation of mRNA (3′-Seq) (Sanfilippo et al., 2017), profiling small RNA (miRNA-seq) (Addo-Quaye et al., 2008), detection of RNA–protein interaction (RIP-seq) (Zhao et al., 2010), ribosome-associated mRNA quantification (ribo-seq) (Ingolia et al., 2009), detection of post-transcriptional RNA methylation (Meyer et al., 2012), quantification of RNA stability against degradation (BRIC-seq) (Imamachi et al., 2014), and sequencing upstream for variation in cis-element reporters (CRE-seq) (Kwasnieski et al., 2012). A more recent application of this technology is the profiling of short and/or long noncoding RNAs (ncRNA-seq) with potential transcriptional and post-transcriptional regulatory functions in the cell (Guttman et al., 2009; Ulitsky, 2016).

    RNA-seq-based transcriptome profiling has been increasingly used as a powerful approach for building comprehensive spatio-temporal profiles of all genes (transcriptome roadmap or atlas) across human populations (e.g., Encyclopedia of DNA Elements (ENCODE) project) as well as across the entire spectrum of genetic diversity within species used as models for genetic studies, such as mouse (e.g., functional annotation of mammalian genome (FANTOM) project) (Fantom Consortium and the Riken Genome Exploration Research Group Phase I & II Team, 2002; Encode Project Consortium, 2012). In these recent examples of global-scope transcriptome projects, profiles specific to tissue type, disease state and developmental stages are surveyed in a systematic manner to allow direct comparison of samples or individuals across contrasting biological states. These projects were not only successful in identifying new isoforms and regulatory elements beyond what can be achieved by DNA sequence-based prediction alone (Encode Project Consortium et al., 2020), but also identified a significant number of ncRNA loci and novel gene-coding loci that may have functions in transcriptional regulation and translational modulation, which would not have been identified by conventional ab initio gene prediction. These advances paved the way for investigating the contributions of epigenetic regulation to the dynamic nature of the transcriptome. Therefore, comprehensive profiling of the transcriptome is a crucial step for understanding the functional context of genes in biological processes, an area in which the field of plant biology currently lags behind the more rapid advances in human and animal biology (Klepikova and Penin, 2019; Sjöstedt et al., 2020).

    The resolution and biological interpretability of most typical transcriptome studies are largely dependent on a number of key factors, including the extent of sampling (i.e., time-point, tissue types, and developmental stages) and scope of comparative panel (i.e., few representative genotypes for direct comparison versus larger populations of individuals across genetic populations) for mining of common trends and patterns. Unlike the relatively more straightforward analysis of the genome sequence, the dynamic and stochastic nature of the transcriptome makes it impossible to add more genotypes or to compare different transcriptomes if experiments were not conducted at the same time or in a directly comparable time window. Prior optimization of experimental conditions in relation to reference datasets in existing plant databases (e.g., Plant Omics Data Center: http://plantomics.mind.meiji.ac.jp/podc/; RiceXPro: https://ricexpro.dna.affrc.go.jp/) (all accessed July 2022) is beneficial so that spatial profiles established in large datasets can be used to add meaningful annotations. For this purpose, it is often recommended to include model species with well-investigated transcriptomes to serve as a baseline. The scope of discovery and depth of biological interpretation are also limited by the availability, or lack thereof, of comprehensively annotated reference genomes, which are readily available for plant species used as genetic models but often lacking or inadequate for less investigated non-model crops or orphan plant species. In reference-guided transcriptome analysis, loci that are not represented in the reference genome or have extensive variation in exon structures or open reading frames, large genomic rearrangements (e.g., duplication, inversion, insertion, deletion, and translocation) and transposable elements are major sources of errors in mapping and annotation. In this chapter, we present and discuss the factors that are crucial for a standard transcriptomics experiment (Fig. 2.1). We describe the important aspects of data analytics in the context of the typical bulk RNA-seq approach and the appropriate strategies on how to better bridge the design of the wet-lab experiments with the requirements for a robust dry-lab knowledge dissection and integration. We also highlight the differences in assumptions and requirements between transcriptome experiments conducted using plant genetic models, which have comprehensively annotated reference genome sequences for reference-guided assembly and extraction of biological knowledge, and those using non-model plant species, which rely on de novo assembly of transcriptome datasets followed by homology-based comparison with closely related species where annotated reference genome sequences or assemblies are available.

    2.2.1 Phase-IA: Sampling time-point, replication, and depth of coverage

    The first step in establishing a robust RNA-seq-based transcriptome profiling experiment in plants involves choosing the appropriate tissue or organ as the source of target RNA, determining the biologically meaningful temporal sampling design and methodology, and choosing the appropriate sequencing strategy that will generate the data resolution adequate to the scope and nature of the central biological question (Fig. 2.1, top panel). In eukaryotic cells including plants, the bulk products of synthesis during the process of transcription are derived from housekeeping and maintenance genes such as rRNAs (and tRNAs), overshadowing the representation of protein-coding mRNAs, which make up only around 5% of the total RNA pool (Warner, 1999). To improve the representation of mRNAs in the target sample for RNA-seq library construction, enrichment by poly-A selection is performed to filter out the high-abundance rRNAs (and tRNAs). Alternatively, a procedure for rRNA depletion is performed in cases of reduced full-length mRNA abundance due to degradation. This approach is also effective for capturing target RNAs without poly-A tails, including the non-polyadenylated ncRNAs. Coupled with a procedure for target RNA size fractionation, the RNA-seq technology can also be used to profile the expression of different types of small (21 nt to 27 nt) RNAs, i.e., miRNAs and small interfering RNAs (siRNAs), which are crucial for understanding the epigenomic-level regulation of transcriptome fluxes.
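
    Size fractionation itself is a wet-lab step, but the same 21 nt to 27 nt window is often re-applied to the sequenced reads during analysis. The following is a minimal sketch of such a length filter, assuming that adapter sequences have already been trimmed; the function names and the placeholder reads are hypothetical.

```python
from collections import Counter


def select_small_rna_reads(reads, min_len=21, max_len=27):
    """Keep reads whose length falls within the small RNA window (inclusive)."""
    return [read for read in reads if min_len <= len(read) <= max_len]


def length_distribution(reads):
    """Return a plain dict mapping read length to number of reads, sorted by length."""
    return dict(sorted(Counter(len(read) for read in reads).items()))


# Hypothetical placeholder reads of 18, 21, 24, 27 and 30 nt.
reads = ["A" * n for n in (18, 21, 24, 27, 30)]

small_rna_reads = select_small_rna_reads(reads)
print(length_distribution(small_rna_reads))
```

    Inspecting the length distribution of the retained reads is a simple way to check that the library is dominated by the expected small RNA size classes rather than by degradation fragments.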

    Compared with static DNA information, RNA transcription is inherently dynamic and noisy, and subtle differences in sampling strategies and background conditions can influence the relative abundances and steady-state levels of the final mRNA products. Moreover, the bulk sampling approach (as opposed to the single-cell sampling approach) pools the stochastic cell-to-cell variation of the transcriptome within a tissue or organ, so that the data reflect the sum of multiple nuclei undergoing transcription at slightly variable stages of development and/or physiological status. It is important to acknowledge that outliers across a wide range of expression, which arise from the stochastic nature of gene expression among a group of cells, are most likely lost or averaged out in a bulk sampling experiment, and this reduces the resolution of the data and undermines biological interpretation.
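
    The averaging effect of bulk sampling can be shown with a small numerical sketch. It assumes, purely for illustration, that the bulk signal for a gene is the mean of its per-cell transcript counts; the two cell populations and their counts are hypothetical.

```python
from statistics import mean, pvariance


def bulk_signal(per_cell_counts):
    """Approximate the bulk measurement as the mean transcript count across cells."""
    return mean(per_cell_counts)


# Hypothetical per-cell transcript counts for one gene in two tissues.
uniform_cells = [10, 10, 10, 10]  # every cell expresses the gene at the same level
bursty_cells = [0, 0, 0, 40]      # a single outlier cell accounts for all expression

print(bulk_signal(uniform_cells), pvariance(uniform_cells))
print(bulk_signal(bursty_cells), pvariance(bursty_cells))
```

    Both populations give the same bulk value even though their cell-to-cell variances differ greatly, which is why the outlier cell in the second population would be invisible in a bulk RNA-seq profile.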

    Fig. 2.1. Standard workflow of a typical RNA-seq experiment and analysis (prior to gene network analysis) composed of four major component phases: (I) sampling from the biological experiments and sequencing of fragmented RNA molecules; (II) pre-processing of raw sequence reads; (III) mapping of the pre-processed sequence reads for model species or non-model species; and (IV) detection of differentially expressed genes and the degree of response in both a quantitative and qualitative manner by integrating available knowledge databases. The recommendation for sequencing and mapping strategy is described alongside the expected resolution for each analysis.

    The overall purpose of most transcriptomics experiments is essentially to uncover major shifts in the transcriptional machinery as a consequence of upstream signaling, and to use the transcriptional changes as a means to understand the various components of genetic machinery that contribute to the phenotype. Given this purpose, the sampling time-points for transcriptome profiling should encompass the entire window of the biological processes. For example, upstream signaling cascades involving short-lived molecules such as reactive oxygen species (ROS) could end within a short period, often a few minutes, from the onset of the treatment. Direct responses to these types of signals are often also relatively short-lived and should be captured by narrowly spaced sampling time-points. On the other hand, secondary or tertiary effects of the initial or primary short-lived changes on gene expression occur somewhat later and are often sustained for a much longer period of time (Yun et al., 2010). To minimize the impact of background noise due to the effects of circadian rhythms, it is ideal that the sampling from the control experiment (i.e., Control t0,
