Gene Discovery for Disease Models
Ebook, 1,049 pages, 11 hours

About this ebook

This book provides readers with new paradigms for mutation discovery in the post-genome era. The completion of human and other genome sequencing, along with other new technologies, such as mutation analysis and microarrays, has dramatically accelerated progress in the positional cloning of genes from mutated models. In 2002, the Mouse Genome Sequencing Consortium stated that “The availability of an annotated mouse genome sequence now provides the most efficient tool yet in the gene hunter's toolkit. One can move directly from genetic mapping to identification of candidate genes, and the experimental process is reduced to PCR amplification and sequencing of exons and other conserved elements in the candidate interval. With this streamlined protocol, it is anticipated that many decades-old mouse mutants will be understood precisely at the DNA level in the near future.” The same should hold for the identification of mutated genes in human diseases and other animal models once their genomes have been sequenced. More than five years have passed, but the genes underlying many human diseases and animal models have not yet been identified. In some cases, identification of the mutated gene has been a bottleneck, because the genetic mechanism holds the key to understanding the basis of the disease. However, an integrative strategy that combines genetic mapping, genome resources, bioinformatics tools, and high-throughput technologies has been developed and tested. The classic paradigm of positional cloning has evolved into completely new concepts and protocols of genomic cloning. This book describes new concepts of gene discovery in the post-genome era and the use of streamlined protocols to identify genes of interest. It helps identify not only large insertions/deletions but also single nucleotide mutations or polymorphisms that regulate quantitative trait loci (QTL).
Language: English
Publisher: Wiley
Release date: Mar 3, 2011
ISBN: 9781118002179

    Book preview

    Gene Discovery for Disease Models - Weikuan Gu

    ACKNOWLEDGMENTS

    We thank everyone who contributed to this book for their dedicated work in making it available. This is a unique, international team with strong scientific backgrounds and broad experience. We appreciated the discussions and exchange of ideas during the preparation of each chapter.

    We would like to thank the following people for kindly reviewing the chapters for this book: Beth Bennett, Cong-Yi Wang, Daniel Goldowitz, David C. Airey, Griffin Gibson, Junming Yue, Qing Xiong, and Yan Cui.

    Special thanks to Drs. Bruce Roe, Hongwen Deng, Xinmin Li, Xingen Lei, and Wesley Beamer for their suggestions and kind support during the preparation of this book.

    We appreciate the contributions of David L. Armbruster and Griffin Gibson in editing the chapters. Finally, we would also like to thank Griffin Gibson, XiaoYue Liu, Lishi Wang, and Yue Huang for their assistance in formatting the chapters.

    CONTRIBUTORS

    David C. Airey, Department of Pharmacology, Vanderbilt University School of Medicine, Nashville, TN, United States

    Rudi Alberts, Department of Infection Genetics, Helmholtz Center for Infection Research & University of Veterinary Medicine, Hannover, Germany

    Yun Bai, Department of Medical Genetics, Third Military Medical University, Chongqing, China

    Bo Chang, Jackson Laboratory, Bar Harbor, ME, United States

    Yan Cui, Department of Molecular Sciences, University of Tennessee Health Science Center, Memphis, TN, United States

    Bouchra Edderkaoui, School of Medicine, Loma Linda University, Loma Linda, CA, and Research Scientist, Musculoskeletal Disease Center, JLP Memorial VA Medical Center, Loma Linda, CA, United States

    Hanlin Gao, DNA core, City of Hope National Medical Center, Duarte, CA, United States

    Daniel Goldowitz, Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada

    Jochen Graw, Helmholtz Center Munich, German Research Center for Environmental Health, Institute of Developmental Genetics, Neuherberg, Germany

    Weikuan Gu, Department of Orthopedic Surgery—Campbell Clinic, University of Tennessee Health Science Center, Memphis, TN, United States

    Yulin Jia, USDA-ARS Dale Bumpers National Rice Research Center, University of Arkansas, Stuttgart, AR, United States

    Yan Jiao, Department of Orthopedic Surgery—Campbell Clinic, University of Tennessee Health Science Center, Memphis, TN, United States

    Michal Korostynski, Department of Molecular Neuropharmacology, Institute of Pharmacology Polish Academy of Sciences, Krakow, Poland

    Ching-Wan Lam, Department of Pathology, the University of Hong Kong, Queen Mary Hospital, Hong Kong, China

    Kin-Chong Lau, Department of Pathology, the University of Hong Kong, Queen Mary Hospital, Hong Kong, China

    Chun Li, Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN, United States

    Kai Li, Department of Pharmacology, Suzhou University, Suzhou, Jiangsu, China

    Zhao Long, Respiratory Department, The Second Hospital Affiliated with Dalian Medical University, Dalian, China

    Hector Martinez-Valdez, Department of Immunology, The University of Texas M. D. Anderson Cancer Center, Houston, TX, United States

    Arijit Mukhopadhyay, Genomics & Molecular Medicine, Institute of Genomics & Integrative Biology (CSIR), Delhi, India

    Blanca Ortiz-Quintero, Department of Immunology, The University of Texas M. D. Anderson Cancer Center, Houston, TX, United States

    Wang Qi, Respiratory Department, The Second Hospital Affiliated with Dalian Medical University, Dalian, China

    Kunal Ray, Molecular & Human Genetics Division, Indian Institute of Chemical Biology (CSIR), Kolkata, India

    Klaus Schughart, Department of Infection Genetics, Helmholtz Center for Infection Research & University of Veterinary Medicine, Hannover, Germany

    Mainak Sengupta, Molecular & Human Genetics Division, Indian Institute of Chemical Biology (CSIR), Kolkata, India

    Zhang Shu, The Center for Biomedical Research, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China

    Wanping Sun, Department of Pharmacology, College of Pharmacy, Suzhou University, Suzhou, Jiangsu, China

    Theodore W. Thannhauser, Proteomics and Mass Spectrometry Core Facility, Robert W. Holley Center for Agriculture & Health, USDA-ARS, Cornell University, Ithaca, NY, United States

    Joris A. Veltman, Radboud University Nijmegen Medical Centre, Nijmegen Centre of Molecular Life Sciences, Department of Human Genetics, Nijmegen, The Netherlands

    Lisenka E.L.M. Vissers, Radboud University Nijmegen Medical Centre, Nijmegen Centre of Molecular Life Sciences, Department of Human Genetics, Nijmegen, The Netherlands

    Cong-Yi Wang, The Center for Biomedical Research, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Center for Biotechnology and Genomic Medicine, Medical College of Georgia, Augusta, GA, United States

    Yongjun Wang, Beijing Tiantan Hospital, Capital Medical University, Beijing, China

    Song Wu, Department of Biostatistics, St. Jude Children’s Research Hospital, Memphis, TN, United States

    Gary Guishan Xiao, Hospital Central Laboratory, Nanjing First Hospital, Nanjing Medical University, Nanjing, Jiangsu, China

    Hong-Guang Xie, Hospital Central Laboratory, Nanjing First Hospital, Nanjing Medical University, Nanjing, Jiangsu, China

    Xueqing Xu, Department of Medical Genetics, Third Military Medical University, Chongqing, China

    Ping Yang, The Center for Biomedical Research, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China

    Yong Yang, Proteomics and Mass Spectrometry Core Facility, Robert W. Holley Center for Agriculture & Health, USDA-ARS, Cornell University, Ithaca, NY, United States

    Caroline J. Zeiss, Department of Comparative Medicine, Yale School of Medicine, New Haven, CT, United States

    Jia Zhang, DNA core, GNF Institute, San Diego, CA, United States

    Sheng Zhang, Proteomics and Mass Spectrometry Core Facility, Robert W. Holley Center for Agriculture & Health, USDA-ARS, Cornell University, Ithaca, NY, United States

    Wei Zhao, Department of Biostatistics, St. Jude Children’s Research Hospital, Memphis, TN, United States

    Hongwei Zheng, Beijing Tiantan Hospital, Capital Medical University, Beijing, China

    CHAPTER 1

    Gene Discovery: From Positional Cloning to Genomic Cloning

    WEIKUAN GU and DANIEL GOLDOWITZ

    Contents

    1.1 Concept of Classic Positional Cloning

    1.2 Concept of Gene Discovery in the Post-Genome Era

    1.3 Strategies for Gene Discovery in the Post-Genome Era

    1.4 Future Direction

    1.5 References

    Despite highly significant advances in studying the genetics and genomics of human populations, there are still large gaps in our understanding of the molecular genetic mechanisms involved in the pathogenesis of many human diseases. The mutated genes in many human diseases remain unknown. Identification of these mutations is crucial for correlating disease pathology and biology with the molecular basis of the disease. Discovery of new gene functions depends on the identification of the mutated genes responsible for disease in humans and other species. Positional cloning has often revealed new functions of known genes or new genes for known diseases. The goal of this book is to illustrate the post-genome era strategy for the identification and initial characterization of mutated genes in inherited human diseases and animal models.

    1.1 CONCEPT OF CLASSIC POSITIONAL CLONING

    Positional cloning, also called reverse genetics, is the identification and cloning of a specific gene, with its chromosomal location being the only available information about that gene (Collins, 1990). The identification of the X-linked gene for chronic granulomatous disease in 1986 was the first report employing such a strategy (Baehner et al., 1986; Royer-Pokora et al., 1986). For the past several decades, positional cloning has been widely used in humans, animals, and plants to isolate genes known only by their phenotypic effects. Underlying positional cloning is the assumption that a gene’s location can be pinpointed with sufficient precision to narrow down its location to a DNA segment that is small enough to be sequenced and/or subjected to transformation/complementation experiments.

    The classic procedure for positional cloning usually includes several steps, as shown in Figure 1.1. It starts with phenotype collection from a genetically mappable population. The population genetics necessary for creating the mappable population is beyond the scope of this chapter (Holsinger and Weir, 2009; Zou, 2009). Briefly, however, a mutant phenotype can be genetically mapped when (1) the phenotype shows Mendelian inheritance, (2) the phenotype is differentially distributed among individuals within the population, and (3) the population is large enough to reach statistical significance when the phenotype is analyzed using mapping software. In parallel with phenotype collection, genotype information is collected from the same individuals in the same population. Usually, molecular markers that segregate in the population are analyzed along every chromosome.

    Figure 1.1. Procedure for identifying a mutated gene using the classic positional cloning strategy.

    The phenotype and genotype data collected from the population are used for linkage analysis, performed with one of a variety of software packages, to define the chromosomal regions that the locus is likely to occupy. If a trait is controlled by a single gene or locus, the linkage analysis should point to a single chromosomal region. For traits regulated by multiple genes, multiple loci, or quantitative trait loci, multiple chromosomal regions are identified. To actually identify the gene underlying the trait of interest, fine mapping has to be conducted to narrow down the chromosomal regions so that genomic searching is practical. The next step, then, is to construct a genomic contiguous region (contig), which is defined as a set of overlapping segments of DNA, to connect and cover all the genomic elements in the targeted area. After a precise contig is constructed, it is sequenced and analyzed by a technique termed chromosomal walking. This is a lengthy procedure that involves the recognition of potential genes, noncoding genes, and/or coding and noncoding regions. Finally, potential candidate genes should be confirmed using a variety of genetic and biochemical methods.
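    The linkage analysis step described above can be illustrated with a two-point LOD score, one commonly used linkage statistic. The sketch below is not taken from the chapter: it assumes phase-known meioses that can each be scored as recombinant or non-recombinant, and the function names and counts are hypothetical. The LOD score compares the likelihood of the data at a recombination fraction theta with the likelihood under free recombination (theta = 0.5).

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """Two-point LOD score at recombination fraction theta (0 < theta <= 0.5).

    Assumes every meiosis is phase-known, so the likelihood is
    theta**recombinants * (1 - theta)**(meioses - recombinants).
    """
    if not 0 < theta <= 0.5:
        raise ValueError("theta must be in the interval (0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")
    non_recombinants = meioses - recombinants
    # Work in log10 space to avoid underflow with many meioses.
    log_likelihood_linked = (recombinants * math.log10(theta)
                             + non_recombinants * math.log10(1 - theta))
    log_likelihood_unlinked = meioses * math.log10(0.5)
    return log_likelihood_linked - log_likelihood_unlinked


def best_theta(recombinants: int, meioses: int, steps: int = 50) -> float:
    """Grid search for the theta in (0, 0.5] that maximizes the LOD score."""
    grid = [0.5 * i / steps for i in range(1, steps + 1)]
    return max(grid, key=lambda t: lod_score(recombinants, meioses, t))


# Hypothetical data: 2 recombinants observed in 20 informative meioses.
theta_hat = best_theta(2, 20)
print(theta_hat, round(lod_score(2, 20, theta_hat), 2))
```

    A LOD score of 3 or more is conventionally taken as evidence for linkage. Real mapping software handles unknown phase, missing genotypes, and multipoint analysis, none of which this sketch attempts.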

    Because all of these procedures require a large amount of work, positional cloning typically requires a team effort, and positional cloning projects have been known to take many years. First, the genetic region needs to be narrowed down as precisely as possible by means of initial linkage analysis and fine mapping. Second, linkage analysis requires both the availability of a large pedigree and PCR-based analysis of microsatellite markers in that pedigree to allow a whole-genome search for linkage. Fine mapping is a particularly difficult task that consists of breaking the linkage and identifying useful markers in the targeted region. Contig construction entails identification of a large-insert genomic library, either BAC (bacterial artificial chromosome) or YAC (yeast artificial chromosome), with known markers. Analysis of genetic elements within a contig can be very difficult because of the lack of knowledge of both genes and gene organization.

    However, the recent completion of the human and mouse genome projects (e.g., Mouse Genome Sequencing Consortium, 2002), along with other new technologies, such as mutation analysis and microarrays, allows unprecedented progress in positional cloning of mutant genes. There are five major changes in the technique of positional cloning (Hinkes et al., 2006): (1) Contig construction is no longer needed because whole genomes have already been sequenced. (2) Sequencing of an entire region, usually 10 Mbp of the genome, is no longer necessary, as those sequences are now readily available through public (Ensembl) and private (Celera) databases. (3) Sequence analysis requires much less time and effort since whole genomes have been annotated (e.g., we now know that a large fraction of the mouse genome is made up of repetitive sequences, such as transposons, that are easy to identify and, therefore, can be eliminated from further analysis). (4) Because of the availability of whole genome sequences and high-throughput technologies, we can now work on much larger genomic regions, which eliminates fine mapping. (5) Genome annotation and bioinformatic algorithms have kept pace with the rapid acquisition of genomic data and permit an in silico assessment of candidate genes. This is the major theme of this book. As a result of new high-throughput technologies and whole-genome libraries, a genome-based integrative strategy is the most practical method for gene discovery in our current post-genome era (Gu et al., 2002; Jiao et al., 2005a, 2005b, 2007, 2008).

    Consequently, pure positional cloning in humans, animals, or plants is no longer necessary. Positional cloning is defined as cloning or identifying a gene with a specific function purely according to its position. In humans, mice, and rats, however, it is now rare to localize a mutation to a gene about which nothing else, such as its expression, is known. For example, microarray manufacturers have arrayed every gene on their chips, and microarray analysis of gene expression profiles has become routine in many laboratories. Expression data for every gene in every tissue may therefore soon be available to the public. Thus, for any gene, even if nothing else is known about it, its expression level in a tissue can be assessed. As such, the classic positional cloning method is of little utility in the rapidly evolving arena of functional genomics. A new procedure that integrates both genomic and high-throughput technology has been created and will be, and should be, the next generation’s tool of choice.

    1.2 CONCEPT OF GENE DISCOVERY IN THE POST-GENOME ERA

    The strategy for gene discovery using positional cloning depends on the availability of genetic-based data and technology. The new approach for gene discovery is highly integrative and is based on the availability of genome resources and biotechnology (Rintisch et al., 2008). There are three distinct and significant differences between new gene discovery strategies and classical positional cloning. The first is the elimination of fine mapping. Rather than narrowing down the genomic regions using several approaches, a large number of genomic regions can be searched all at once to discover the genes of interest. The second is the direct investigation of genetic elements within the targeted region, without construction of a contig or sequencing, because genomic sequences and annotations of genomic elements are available. The third is the high-throughput screening of candidates within the targeted region. The high-speed analytical methods include mutation screening, resequencing, gene expression profiling, and functional prediction (Jiao et al., 2008). The following chapters provide detailed information on each of those aspects. The first part of this book introduces the technologies and resources used in gene discovery in our post-genome era. The second part provides experimental procedures and methodologies for gene discovery using both genome resources and high-throughput technologies. The third and final part predicts the future direction of gene discovery based on the elucidation of genomes and developing technologies.

    We are living in an era of both technology explosion and unparalleled expansion of biological resources. The advances in gene discovery, however, are rooted in the technology of genome sequencing. Without the completion of whole genome sequences for humans and other species, gene discovery would still be stuck in the classic positional cloning approach. Therefore, gene discovery in every chapter is based on the fact that genomic sequences are available for the subjects of interest. Parallel with the necessity of completed genomes is the demand for, and rapid development of, the high-throughput technologies necessary for mutation screening, genome analysis, and bioinformatics. Without these tools, there would be no effective way to capitalize on the completion of whole genomes or to support our current rapid methods for gene discovery. Because of their significance, Chapters 2–4 introduce these technologies.

    Chapters 2–6 illustrate a variety of approaches, including SNP analysis, DNA methylation, protein turnover rate measurement, microarray analysis, and bioinformatic tools. Finally, the integrative analysis of data from a variety of procedures provides clues to potential candidate genes for follow-up experiments, such as RT-PCR, DNA sequencing of the potential mutation(s), and/or northern or western blot analysis to determine the significance of the mutated gene.

    An important reminder to readers is that although this book mainly focuses on coding sequences known as genes, mutations in many other genetic elements could be identified using the same or similar technologies or procedures. These non-gene elements of the genome include not only the introns and the 5′ and 3′ ends of genes, but also many others (Chen et al., 2008), such as transcription factor binding sites, microRNAs, cis-acting elements, palindromic motifs, and/or conserved k-tuples (phylogenetic footprints) (Hui and Bindereif, 2010). Readers should keep in mind that gene regulation is a complicated process and that regulators are not necessarily near the genes they influence. Regulators located at long distances from their target genes, such as enhancers, repressors, and silencers, are called distant regulatory elements (REs) (Gotea and Ovcharenko, 2008). In addition, repetitive sequences sometimes play unexpected roles in gene regulation (Hui and Bindereif, 2005).

    1.3 STRATEGIES FOR GENE DISCOVERY IN THE POST-GENOME ERA

    Current experimental strategies for mutation screening have been summarized (Jiao et al., 2008) and are shown in Figure 1.2. Individual chapters in this book focus on one or more steps or different approaches of this strategy. In this introduction, we briefly touch on screening for mutations in DNA, using the mouse as the model. Detailed procedures and methodologies are presented in Chapters 7–13.

    Figure 1.2. Strategy of gene discovery through mutation models.

    The first step is to determine the total number of genes/transcripts within the targeted region. Chapter 7 describes the genetic markers and methods for determining the genomic location of target genetic loci. Any of the many recently developed software programs (see, for example, www.genediscovery.org/pgmapper/index.jsp; Xiong et al., 2008a) can be used to identify every candidate gene from a defined genomic region. The next step is to evaluate candidate genes to reduce the list to a more workable number (Chapters 8–13). At this step, obvious candidate genes are evaluated first. We believe that a large number of differences exist between the gene of interest (GOI) in the mutant and in the wild type (control). Our current knowledge of gene function and bioinformatics should allow us to eliminate most of the unlikely candidate genes. A series of comparisons and functional analyses should be made to rule out variation in intron sequences, if those sequences do not affect the phenotype (Chapters 11–13). At the end, a short list of candidate genes is expected or, in the best-case scenario, only one gene will remain. Finally, mutation evaluation or testing is carried out (Chapters 14–20). This evaluation considers differences between the GOI and control, sequence differences in these genes, potential gene function changes due to these differences, and whether other strains or populations have similar differences. Information on differences is combined with gene expression profiling and possible gene function to determine a list of candidate genes. Finally, selected candidate genes are tested and confirmed using a variety of experimental approaches, such as gene knockout and/or knockin.
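    The first two steps above, listing every gene in a mapped interval and then reducing the list using additional evidence, can be sketched in a few lines. This is a minimal illustration and not the PGMapper implementation: the gene symbols, coordinates, and evidence sets are hypothetical, and the scoring (one point per type of supporting evidence) is an assumption made only for this example.

```python
from typing import Iterable, NamedTuple


class Gene(NamedTuple):
    symbol: str
    chrom: str
    start: int  # 1-based start coordinate
    end: int    # inclusive end coordinate


def genes_in_region(genes: Iterable[Gene], chrom: str,
                    region_start: int, region_end: int) -> list:
    """Return genes overlapping the interval, ordered by start position."""
    hits = [g for g in genes
            if g.chrom == chrom
            and g.start <= region_end
            and g.end >= region_start]
    return sorted(hits, key=lambda g: g.start)


def shortlist(candidates: Iterable[Gene],
              differentially_expressed: set,
              has_coding_variant: set) -> list:
    """Rank candidate symbols by how many evidence types support them.

    Genes with no supporting evidence are dropped from the list.
    """
    scored = []
    for gene in candidates:
        score = ((gene.symbol in differentially_expressed)
                 + (gene.symbol in has_coding_variant))
        if score > 0:
            scored.append((score, gene.symbol))
    # Highest score first; ties broken alphabetically for a stable order.
    return [symbol for _, symbol in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical annotation and a hypothetical 5 Mbp candidate interval.
annotation = [
    Gene("GeneA", "14", 1_000_000, 1_050_000),
    Gene("GeneB", "14", 4_900_000, 5_100_000),
    Gene("GeneC", "14", 9_000_000, 9_020_000),
    Gene("GeneD", "2", 2_000_000, 2_010_000),
]
candidates = genes_in_region(annotation, "14", 5_000_000, 10_000_000)
ranked = shortlist(candidates,
                   differentially_expressed={"GeneC"},
                   has_coding_variant={"GeneB", "GeneC"})
print([g.symbol for g in candidates], ranked)
```

    In practice the annotation would come from a genome database and the evidence sets from expression profiling and resequencing. The ranking only orders candidates for follow-up testing and does not replace experimental confirmation.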

    1.4 FUTURE DIRECTION

    Gene discovery, or mutation identification, has gone through two stages, as we have discussed: the classical era and the post-genome era. The next stage of gene discovery will depend on the development of high-throughput technology and bioinformatic tools. As shown in Figure 1.3, in the first stage, positional cloning of a GOI (the classical approach) has to go through every step, including initial mapping, fine mapping, contig construction, and candidate searching based on genome sequences. Currently, at the second stage, fine mapping and contig construction are in most cases not necessary because information on genomic sequences and genetic elements within the targeted region is already available.

    Figure 1.3. Different stages of positional cloning (from left to right): classic, post-genome era, and future (dashed blue line).

    The next stage of genomic cloning will allow researchers to conduct a search of candidate genes without mapping information (shown as dashed lines in Figure 1.3). At that stage, once a phenotype is found in an animal model or an individual, a search of candidate genes can be done based on the annotation of every gene or regulatory element in the genome. To reach the next stage, two critical improvements in our genomic research are needed. The first is the complete evaluation of the potential function of every gene and regulatory element in the whole genome. This seemingly large amount of work is most likely to be done within a decade or even sooner, as technologies for the analysis of gene function, SNP analysis, and proteomics are rapidly developing. The second is the availability of software for rapid, automatic, high-throughput searching. Currently, some programs, such as PGMapper (Xiong et al., 2008a), have provided the capability to search genome regions of several megabases. The capability of searching whole chromosomes and whole genomes within a reasonable time (under an hour) will follow the development of computational tools in coordination with genome and literature databases.

    1.5 REFERENCES

    Baehner RL, Kunkel LM, Monaco AP, Haines JL, Conneally PM, Palmer C, Heerema N, Orkin SH. (1986). DNA linkage analysis of X chromosome-linked chronic granulomatous disease. Proc Natl Acad Sci U S A 83(10):3398–401.

    Chen HP, Lin A, Bloom JS, Khan AH, Park CC, Smith DJ. (2008). Screening reveals conserved and nonconserved transcriptional regulatory elements including an E3/E4 allele-dependent APOE coding region enhancer. Genomics 92(5):292–300. Epub Sept. 3.

    Collins FS. (1990). Identifying human disease genes by positional cloning. Harvey Lect 86:149–64.

    Gotea V, Ovcharenko I. (2008). DiRE: identifying distant regulatory elements of co-expressed genes. Nucleic Acids Res 36:W133–39. Epub May 17.

    Gu W, Li X, Lau KH, Edderkaoui B, Donahue LR, Rosen CJ, Beamer WG, Shultz KL, Srivastava A, Mohan S, Baylink DJ. (2002). Gene expression between a congenic strain that contains a quantitative trait locus of high bone density from CAST/EiJ and its wild-type strain C57BL/6J. Funct Integr Genomics 1(6):375–86.

    Hinkes B, Wiggins RC, Gbadegesin R, Vlangos CN, Seelow D, Nürnberg G, Garg P, Verma R, Chaib H, Hoskins BE, Ashraf S, Becker C, Hennies HC, Goyal M, Wharram BL, Schachter AD, Mudumana S, Drummond I, Kerjaschki D, Waldherr R, Dietrich A, Ozaltin F, Bakkaloglu A, Cleper R, Basel-Vanagaite L, Pohl M, Griebel M, Tsygin AN, Soylu A, Müller D, Sorli CS, Bunney TD, Katan M, Liu J, Attanasio M, O’toole JF, Hasselbacher K, Mucha B, Otto EA, Airik R, Kispert A, Kelley GG, Smrcka AV, Gudermann T, Holzman LB, Nürnberg P, Hildebrandt F. (2006). Positional cloning uncovers mutations in PLCE1 responsible for a nephrotic syndrome variant that may be reversible. Nat Genet 38(12):1397–405. Epub Nov. 5.

    Holsinger KE, Weir BS. (2009). Genetics in geographically structured populations: defining, estimating and interpreting F(ST). Nat Rev Genet 10(9):639–50.

    Hui J, Bindereif A. (2005). Alternative pre-mRNA splicing in the human system: unexpected role of repetitive sequences as regulatory elements. Biol Chem 386(12):1265–71.

    Jiao Y, Li X, Beamer WG, Yan J, Tong Y, Goldowitz D, Roe B, Gu W. (2005a). Identification of a deletion causing spontaneous fracture by screening a candidate region of mouse chromosome 14. Mamm Genome 16(1):20–31.

    Jiao Y, Yan J, Zhao Y, Donahue LR, Beamer WG, Li X, Roe BA, Ledoux MS, Gu W. (2005b). Carbonic anhydrase-related protein VIII deficiency is associated with a distinctive lifelong gait disorder in waddles mice. Genetics Epub Aug. 22.

    Jiao Y, Yan J, Jiao F, Yang H, Donahue LR, Li X, Roe BA, Stuart J, Gu W. (2007). A single nucleotide mutation in Nppc is associated with a long bone abnormality in lbab mice. BMC Genet 8:16.

    Jiao Y, Jin X, Yan J, Zhang C, Jiao F, Li X, Roe BA, Mount DB, Gu W. (2008). A deletion mutation in Slc12a6 is associated with neuromuscular disease in gaxp mice. Genomics 91(5):407–14.

    Koppel I, Aid-Pavlidis T, Jaanson K, Sepp M, Palm K, Timmusk T. (2010). BAC transgenic mice reveal distal cis-regulatory elements governing BDNF gene expression. Genesis 48(4):214–19.

    Mouse Genome Sequencing Consortium. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–62.

    Rintisch C, Ameri J, Olofsson P, Luthman H, Holmdahl R. (2008). Positional cloning of the Igl genes controlling rheumatoid factor production and allergic bronchitis in rats. Proc Natl Acad Sci U S A 105(37):14005–10. Epub Sept. 8.

    Royer-Pokora B, Kunkel LM, Monaco AP, Goff SC, Newburger PE, Baehner RL, Cole FS, Curnutte JT, Orkin SH. (1986). Cloning the gene for an inherited human disorder—chronic granulomatous disease—on the basis of its chromosomal location. Nature 322(6074):32–38.

    Xiong Q, Qiu Y, Gu W. (2008a). PGMapper: a web-based tool linking phenotype to genes. Bioinformatics 24(7):1011–13. Epub Jan. 18.

    Xiong Q, Jiao Y, Hasty KA, Stuart JM, Postlethwaite A, Kang AH, Gu W. (2008b). Genetic and molecular basis of QTL of rheumatoid arthritis in rat: genes and polymorphisms. J Immunol 181(2):859–64.

    Xiong Q, Jiao Y, Hasty KA, Canale ST, Stuart JM, Beamer WG, Deng HW, Baylink D, Gu W. (2009). Quantitative trait loci, genes, and polymorphisms that regulate bone mineral density in mouse. Genomics 93(5):401–14.

    Zou F. (2009). QTL mapping in intercross and backcross populations. Methods Mol Biol 573:157–73.

    CHAPTER 2

    High-Throughput Gene Expression Analysis and the Identification of Expression QTLs

    RUDI ALBERTS and KLAUS SCHUGHART

    Contents

    2.1 Concepts in High-Throughput Gene Expression Analysis

    2.2 Technologies of High-Throughput Gene Expression Analysis

    2.2.1 Gene Expression Microarrays

    2.2.2 One-Channel Versus Two-Channel Microarrays

    2.2.3 Oligonucleotide Versus Spotted Microarrays

    2.2.4 Whole-Transcript Arrays

    2.2.5 Genome Tiling Arrays

    2.2.6 MicroRNA Arrays

    2.3 Protocols

    2.3.1 Image Analysis

    2.3.2 Normalization

    2.3.3 Quality Control

    2.4 Applications and Limitations

    2.4.1 Identification of Expression QTL and Gene Regulatory Networks

    2.4.2 Identification of Differentially Expressed Genes

    2.4.3 Identification of Cell-Type-Specific Genes

    2.4.4 Determination of the Downstream Effects of a Mutation

    2.4.5 Determination of the Downstream Effects of a Signaling Molecule

    2.4.6 Predicting Vaccine Efficacy

    2.4.7 Determination of Host Responses after Infection

    2.4.8 Limitations

    2.5 Questions and Answers

    2.6 Acknowledgments

    2.7 References

    2.1 CONCEPTS IN HIGH-THROUGHPUT GENE EXPRESSION ANALYSIS

    Many diseases have a genetic basis. Together with influences from the environment, these genetic factors determine whether a certain disease will develop and how severe it will be. In some cases, a disease is determined by only one gene. Sickle-cell disease, for example, is caused by a mutation in the β-globin gene, which encodes a subunit of hemoglobin. This causes red blood cells to adopt an abnormal sickle shape and results in a risk of various complications and a shortened life expectancy. Another example of a single-gene disease is cystic fibrosis. This disease affects the exocrine glands of the lungs, liver, pancreas, and intestines and results in progressive disability and a severely shortened life expectancy. It is caused by a mutation in the cystic fibrosis transmembrane conductance regulator (CFTR) gene.

    However, in most human diseases, multiple genes play a role in the development of the pathological symptoms. Examples of these so-called complex genetic diseases are cancer, obesity, diabetes, hypertension, asthma, and heart disease. Here, each gene contributes to a certain degree to the establishment of the phenotype, and we can assume that the contributing genes and their products operate in regulatory networks. They may enhance or inhibit each other. If multiple genes contribute to the development of a disease and individual contributions of each gene are small, it is a major challenge to identify the causal disease genes and their interactions. The advent of new high-throughput analyses now makes it possible to study such complex genetic interactions and thus unravel the molecular basis of complex genetic diseases in humans. For example, high-throughput gene expression analysis allows one to measure the expression of tens of thousands of genes at the same time. Researchers can now compare complete gene expression profiles for diseased and healthy samples and obtain a direct insight into global gene expression changes. Thus these new technologies allow them to unravel the interplay between genes and to reconstruct gene regulatory networks for biological processes.

    The analysis of gene expression is based on the following basic biological principles: The genetic information of a cell is stored in genes, which are part of the DNA in the nucleus. DNA is transcribed into RNA and then processed to messenger RNA (mRNA), which transfers the information to the cytoplasm. Here the mRNA is translated into protein. Many proteins are enzymes that catalyze biochemical reactions. Other proteins have mechanical or structural functions. But proteins are also important in biological signaling processes, such as growth factor responses, immune responses, cell adhesion, and the cell cycle. Since proteins are major players in living organisms, they are also involved in the development of diseases. Therefore, to gain an understanding of processes that lead to disease, it is of great value to have a global picture of the amount of mRNA of all genes that are expressed in diseased and in healthy subjects. High-throughput gene expression microarrays measure these global changes and differences of mRNA and, therefore, give a very good indication of the processes that are abnormal in disease tissues.

    2.2 TECHNOLOGIES OF HIGH-THROUGHPUT GENE EXPRESSION ANALYSIS

    2.2.1 Gene Expression Microarrays

    Gene expression microarray technology enables the measurement of mRNA abundances in a high-throughput manner. Instead of directly using mRNA, more stable cDNA molecules are used, which are a complementary copy of the RNA. This copy is created by a viral enzyme, reverse transcriptase, in a process called reverse transcription. Microarrays are small glass plates that are subdivided into thousands of spots. Short sequences of the nucleotides A, C, T, and G, commonly referred to as probes, are bound as spots to the glass surface (Fig. 2.1). All probes in one spot have a sequence that is reverse complementary to part of the sequence of the cDNA of a specific gene. The idea is that the cDNA generated from the mRNA that is expressed from this gene will hybridize (bind) to the probes on the specific spot. To make it possible to measure the amount of cDNA hybridized to the microarray, the cDNA is labeled with a fluorescent dye. After hybridization and removal of the cDNA that did not bind, the microarray is inserted into a scanner that reads the amount of fluorescence for each of the spots. These measurements represent the level of gene expression for all genes on the microarray and are generally represented in the manner shown in Figure 2.2. Using specialized software, the intensity of each spot in the image is quantified, providing a quantitative value of the mRNA expression level for each of the genes on the microarray.
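
    The base-pairing rule behind probe design can be illustrated with a minimal Python sketch. The sequence below is made up for illustration, and the function name reverse_complement is our own. Real probe design also has to consider factors such as melting temperature and cross-hybridization, which are ignored here.

```python
# Watson-Crick pairing for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


# Hypothetical stretch of cDNA (illustration only, not a real gene).
cdna_fragment = "ATGGCCAAGT"
probe = reverse_complement(cdna_fragment)
print(probe)  # ACTTGGCCAT
```

    Applying the function twice returns the original sequence, which is a quick sanity check for any probe list.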

    Figure 2.1. Hybridization of labeled cDNAs to a gene expression microarray. The small glass plate contains millions of probes. Fluorescently labeled (spheres) cDNA binds to the probes on the microarray.

    Image courtesy of Affymetrix.

    Figure 2.2. (See color insert.) A microarray produced by a scanner. Each of the spots on the microarray represents a gene, and the color represents the amount of fluorescence that is measured, hence the amount of cDNA that was present in the original sample.

    Reprinted from Reinke (2006).

    2.2.2 One-Channel Versus Two-Channel Microarrays

    There exist different kinds of microarrays. A general distinction that can be made is one-channel versus two-channel microarrays. On a one-channel microarray, one fluorescently labeled sample is hybridized and the resulting expression values are read as absolute expression values for that sample. To compare the expression values between multiple samples, it is necessary to use multiple microarrays. The most widely known provider of one-channel microarrays is Affymetrix (www.affymetrix.com).

    On two-channel microarrays, one can directly compare the gene expression values of two different samples. Each of the samples is fluorescently labeled using a different dye. In most cases a Cy5 (red) dye is used for one sample and Cy3 (green) for the other. This produces images like Figure 2.2 with black, yellow, red, and green spots. A red spot indicates that the sample with the red labeling has a higher expression value (vice versa for green), and a yellow spot indicates that both samples have similar expression values. If the spot remains black, there is no expression in either of the samples. Well-known providers of two-channel microarrays are Agilent (www.agilent.com) and Illumina (www.illumina.com).

    2.2.3 Oligonucleotide Versus Spotted Microarrays

    A second important distinction between microarray setups is that between oligonucleotide arrays and spotted arrays. On oligonucleotide arrays, the probes are attached to the microarray by the manufacturer. For example, Affymetrix uses chemical synthesis and photolithographic masks to build up the probes on the microarray. Here, all probes on the microarray are simultaneously synthesized nucleotide by nucleotide. This results in high-density prefabricated microarrays.

    On the other hand, probes on spotted microarrays are synthesized before they are added (spotted) onto the glass. Such microarrays are sold without probes, and laboratories have to design and fabricate their own probes and fix them onto the microarray. This is often a cheaper solution since the gene density can be much lower. Also, the researcher can customize the microarray to each experiment.

    2.2.4 Whole-Transcript Arrays

    The classical microarrays mentioned earlier interrogate the mRNA at only one specific location. Agilent, for example, uses one 60-mer probe per gene to measure its expression. Affymetrix uses multiple 25-mer probes per gene to measure mRNA abundances, all of them located at the end of the gene, either in the 3′ untranslated region (3′ UTR) or in the last exon or exons (Fig. 2.3).

    Figure 2.3. Probe coverage along the transcript. Gray regions represent exons, and black regions are introns that are removed during splicing. The short dashes underneath the exon regions indicate probes of the exon array and the classical 3′ array setup.

    Recent studies, however, indicate that alternative splicing plays a major role in the generation of proteins and thereby functional diversity in metazoan organisms (Blencowe, 2006). Alternative splicing means that different transcript isoforms are produced from the same gene, by variations in pre-mRNA splicing. It is estimated that 40–60% of human genes have multiple splice forms (Modrek and Lee, 2002). These findings led to the development of a new type of microarray, the whole-transcript array, which is able to measure mRNA levels over the whole length of the gene. As depicted in Figure 2.3, Affymetrix exon arrays cover every exon of a gene with, on average, four probes. By using these microarrays, one can study global gene expression profiles as before but also detect different isoforms of a gene, such as transcripts with alternative 5′ start sites or an undefined 3′ end, nonpolyadenylated messages, or truncated or alternatively spliced transcripts.

    2.2.5 Genome Tiling Arrays

    The design of gene expression arrays and whole-transcript arrays is based on sequence information and annotation of known transcripts. Genome tiling arrays contain probes that are tiled over the whole genome at regular intervals, including both annotated regions of the genome and regions considered to be noncoding. Tiling arrays can thus be used to discover novel transcripts. The Affymetrix Human Tiling 1.0 R array set is a set of 14 microarrays that contain 45 million oligonucleotide probes covering the whole human genome. Probes have a length of 25 nucleotides and are tiled at an average resolution of 35 bp, leaving an average gap of 10 bp between probes.

    2.2.6 MicroRNA Arrays

    MicroRNAs (miRNAs) are single-stranded RNAs of very short size, 21–23 nucleotides in length. They do not code for proteins but are complementary to certain mRNA sequences. Binding to their target mRNA causes its degradation. In this way, miRNAs can regulate gene expression. It has been shown that miRNAs have an effect on various biological processes—for example, the development of cancer (He et al., 2005) and heart disease (Thum et al., 2007). Several commercial products are available for large scale identification of miRNA. Example vendors are Affymetrix, Agilent, Invitrogen, Applied Biosystems, and Exiqon.

    2.3 PROTOCOLS

    2.3.1 Image Analysis

    After the microarrays have been scanned, one obtains an image with thousands of individual spots (Fig. 2.2) representing the mRNA levels for each gene. Image analysis is then needed to quantify the intensity of each spot. Most of the microarray vendors provide software that performs image analysis and outputs quantitative intensity values per gene. Several steps are performed in such an image analysis. First, the image is filtered. This is a cleaning procedure by which small contamination artifacts such as dust particles are removed. Next, the location of the center of each spot is identified. This is called gridding. The next step is segmentation: for each of the pixels in the spot area, it is decided whether it belongs to the signal or to the background (signal detected by the scanner in the area where no hybridization has taken place). Finally, in the quantification step, the pixel values of each spot are summarized into one gene expression value and a background value.
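
    The segmentation and quantification steps can be sketched as follows. This is a deliberately simple illustration, not the algorithm of any vendor software: a fixed intensity threshold (an assumption) separates signal from background pixels, and medians are used as summaries. The pixel values are invented.

```python
from statistics import median


def quantify_spot(pixels, threshold):
    """Segment one spot area by thresholding and summarize it.

    pixels    -- 2D list of scanner intensities around one spot
    threshold -- intensity above which a pixel counts as signal (assumed)
    Returns (signal, background) as medians of the two pixel classes.
    """
    flat = [p for row in pixels for p in row]
    signal_px = [p for p in flat if p > threshold]
    background_px = [p for p in flat if p <= threshold]
    signal = median(signal_px) if signal_px else 0.0
    background = median(background_px) if background_px else 0.0
    return signal, background


# Hypothetical 4 x 4 pixel area with a bright 2 x 2 spot in the middle.
spot_area = [
    [52, 55, 50, 53],
    [54, 910, 880, 51],
    [49, 905, 895, 56],
    [50, 52, 55, 53],
]
signal, background = quantify_spot(spot_area, threshold=200)
print(signal, background)  # 900.0 52.5
```

    Medians are chosen here because they are less sensitive than means to a few stray pixels, such as a dust particle that survived filtering.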

    2.3.2 Normalization

    During a microarray experiment, there can be multiple factors that introduce unwanted variation into the data. For example, if the experiment includes analysis of many samples that cannot be labeled and hybridized in one day, the quality of the labeling may be different on different days. This will lead to global expression differences between different microarrays that are not due to biological differences. The aim of normalization is to remove such unwanted variations from the data so that different samples can be properly compared and real biological differences detected. Several normalization techniques exist; in the following sections, we describe the ones that are most commonly used.

    As a rule of thumb in microarray normalization, Wit and McClure (2004) suggest first normalizing all local features and then gradually progressing to normalizations that deal with several or all microarrays. This procedure involves the following steps.

    2.3.2.1 Spatial Correction

    Since probes are randomly distributed over the microarray, one expects a similar distribution of signals at each location on the array. After performing microarray experiments, however, there might be microarrays where this is not the case. For example, there might be an array where all signals tend to be systematically lower in one corner of the array. Yang et al. (2002) observed that the variation in signal can also be different at different locations on the array.

    The spatial effect can be removed by robust smoothing of the expression data across the array in each channel separately. Here, a smooth surface is fit to the data and subsequently subtracted from the data. To also correct for differences in variation on different locations of the array, one can divide by a location-dependent scale parameter. This parameter is obtained by smoothing the absolute differences between the expression values and the first smoothed surface (Wit and McClure, 2004).
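
    A minimal sketch of the location part of this correction is given below. It is not the smoother used by Wit and McClure (2004): a moving-window median stands in for the robust smooth surface, the window radius is an assumed parameter, and the scale correction is left out. The grid holds log-scale expression values at their physical positions on the array.

```python
from statistics import median


def smooth_surface(grid, radius=1):
    """Moving-window median of a 2D grid (list of rows)."""
    n_rows, n_cols = len(grid), len(grid[0])
    surface = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            window = [
                grid[i][j]
                for i in range(max(0, r - radius), min(n_rows, r + radius + 1))
                for j in range(max(0, c - radius), min(n_cols, c + radius + 1))
            ]
            row.append(median(window))
        surface.append(row)
    return surface


def spatial_correct(grid, radius=1):
    """Subtract the smoothed surface, then restore the overall level."""
    surface = smooth_surface(grid, radius)
    overall = median(v for row in grid for v in row)
    return [
        [v - s + overall for v, s in zip(row, srow)]
        for row, srow in zip(grid, surface)
    ]


# Hypothetical 5 x 5 array whose signal drifts upward from top to bottom.
drifting = [[8.0 + 0.5 * r for _ in range(5)] for r in range(5)]
corrected = spatial_correct(drifting)
```

    On real arrays the window must span many spots; otherwise the smoother would also remove genuine differences between neighboring genes.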

    In cases of very strong abnormal local effects, it may even be best to exclude this array and to repeat the experiment.

    2.3.2.2 Background Correction

    Microarray scanners always detect a background signal, even in places where no true signal is present. To obtain more accurate quantifications of gene expression values, several methods have been proposed to adjust for this background. Some methods work with local background values per spot. These background values are measured directly near the spot. Eisen (1999) simply subtracts the background value from the observed value to obtain a signal value. Kooperberg et al. (2002) apply a Bayesian approach, assuming that the mean of the observed pixel values is the sum of the mean true signal and the mean background signal. Because the background is measured very close to the signal, there is a possibility that the background values are contaminated with true signal. Therefore, several global background correction methods have been proposed that do not use the background values per spot. Wit and McClure (2004) suggest calculating the mean value of all empty spots on the array, subtracting that mean from all measurements, and setting the resulting negative values to zero. Irizarry et al. (2003) propose a probabilistic model that determines the conditional expectation of the true signal given the observed signal, assuming that the observed signal is the sum of the true signal and a background signal and that the spot intensities are drawn from an exponential distribution and the background intensities from a normal distribution. Both methods give similar results.
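
    The two simplest corrections mentioned above, local subtraction per spot and subtraction of the mean of the empty spots, can be sketched as follows. The function names and the example intensities are our own and serve only as illustration; the model-based methods of Kooperberg et al. (2002) and Irizarry et al. (2003) are not shown.

```python
def subtract_local_background(foreground, background):
    """Per-spot correction: observed value minus its local background."""
    return [f - b for f, b in zip(foreground, background)]


def subtract_global_background(intensities, is_empty):
    """Global correction: subtract the mean of the empty spots and
    set negative results to zero."""
    empty = [x for x, e in zip(intensities, is_empty) if e]
    mean_empty = sum(empty) / len(empty)
    return [max(x - mean_empty, 0.0) for x in intensities]


# Hypothetical raw intensities; spots 3 and 4 carry no probe.
raw = [120.0, 950.0, 60.0, 40.0, 300.0]
empty_flags = [False, False, True, True, False]
print(subtract_global_background(raw, empty_flags))
# [70.0, 900.0, 10.0, 0.0, 250.0]
```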

    2.3.2.3 Dye-Effect Correction

    The most commonly used dyes in two-channel microarray experiments are Cy5 (red) and Cy3 (green). Slight differences in the characteristics of these dyes, such as in the size of the molecules, lead to unwanted effects in the observed intensity signals. For Cy5 and Cy3 it was observed that the dyes often have an intensity-dependent effect. That is, for large expression values, one of the dyes tends to give higher expression values, while for small expression values they may give lower expression values (Fig. 2.4a). Yang and Speed (2003) suggest transforming the Cy3 vs. Cy5 scatter plot into an MA plot, which is basically a 45° rotation of the Cy3 vs. Cy5 scatter plot (Fig. 2.4b). The values of M and A are calculated as follows:

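    (2.1) $M = \log_2(\mathrm{Cy5}) - \log_2(\mathrm{Cy3})$

    (2.2) $A = \tfrac{1}{2}\left[\log_2(\mathrm{Cy5}) + \log_2(\mathrm{Cy3})\right]$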

    Figure 2.4. (a) Scatterplot of Cy3 versus Cy5 signal. (b) MA plot. Gray curve fitted by loess. (c) Normalized MA plot. (d) Normalized Cy3 versus Cy5 plot.

    Then, using a function such as loess, a smooth curve is fitted through all data points, and a normalized MA plot is created by subtracting the fitted curve from each M value (Fig. 2.4c). The MA plot is transformed back into a normalized Cy5 vs. Cy3 scatter plot by applying the inverse of equations 2.1 and 2.2 to the normalized data (Fig. 2.4d):

    (2.3) $\log_2(\mathrm{Cy5}) = A + \tfrac{1}{2}M$

    (2.4) $\log_2(\mathrm{Cy3}) = A - \tfrac{1}{2}M$
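
    The whole dye-effect correction can be sketched in a few functions. The sketch assumes the usual definitions, M = log2(Cy5) − log2(Cy3) and A = ½[log2(Cy5) + log2(Cy3)], for equations 2.1 and 2.2. The smoother local_linear_fit is a simplified stand-in for loess (tricube-weighted local lines, no robustness iterations), and its frac parameter is an assumed tuning value.

```python
import math


def to_ma(cy5, cy3):
    """Equations 2.1 and 2.2: log-ratio M and mean log-intensity A."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(cy5, cy3)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(cy5, cy3)]
    return m, a


def from_ma(m, a):
    """Equations 2.3 and 2.4: back to log2(Cy5) and log2(Cy3)."""
    log_cy5 = [ai + mi / 2 for mi, ai in zip(m, a)]
    log_cy3 = [ai - mi / 2 for mi, ai in zip(m, a)]
    return log_cy5, log_cy3


def local_linear_fit(x, y, frac=0.5):
    """Loess-like smoother: tricube-weighted local linear regression."""
    n = len(x)
    k = min(n, max(3, math.ceil(frac * n)))  # points per neighborhood
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to k-th neighbor
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def normalize_dye_effect(cy5, cy3, frac=0.5):
    """Remove the intensity-dependent dye trend; returns log2 values."""
    m, a = to_ma(cy5, cy3)
    trend = local_linear_fit(a, m, frac)
    m_norm = [mi - ti for mi, ti in zip(m, trend)]
    return from_ma(m_norm, a)


# Hypothetical intensities with a dye bias that grows with intensity.
cy3 = [2.0 ** k for k in range(4, 14)]
cy5 = [2.0 ** (1.1 * k + 0.3) for k in range(4, 14)]
log_cy5, log_cy3 = normalize_dye_effect(cy5, cy3)
```

    In this artificial example the two channels differ only by the dye trend, so after normalization the log2 values of both channels coincide. Real data additionally contain differentially expressed genes, which remain as deviations from M = 0.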

    2.3.2.4 Normalization between Arrays

    The global range of gene expression values can differ between arrays from one experiment to the next. These global changes are often the result of slight variations during the process of sample preparation, labeling, microarray hybridization, and washes. The aim of normalization between arrays is to remove global expression differences so that multiple samples can be properly compared and real biological differences detected.

    The most straightforward way to normalize between microarrays is to equalize the median or mean value for each of the arrays and to adjust the scale to some fixed value. A disadvantage of this method is that it performs a linear scaling, which is not optimal if the distributions of the expression values differ.

    Therefore, another method called quantile normalization was proposed by several authors (e.g., Bolstad et al., 2003). This method equalizes the distributions of the expression values of all microarrays. The procedure is as follows:

    1. Given n arrays of length g, form matrix A of dimension g × n where each array is a column and each gene is a row.

    2. Sort each column of A to give Asort.

    3. Take the means across rows of Asort and assign this mean to each element in the row to get Asortmean.

    4. Get A normalized by rearranging each column of Asortmean to have the same ordering as the original A.

    Figure 2.5 shows the distribution of all gene expression values before and after the quantile normalization procedure has been applied. After normalization, the distributions of values are all equal.
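
    Steps 1 to 4 of the procedure translate almost directly into code. The following sketch works on plain Python lists, with one list of g expression values per array; ties are broken by position, which is a simplification, and the example values are invented.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays (each a list of g values)."""
    n, g = len(arrays), len(arrays[0])
    # Step 2: sort each array (each column of A).
    sorted_cols = [sorted(col) for col in arrays]
    # Step 3: mean across arrays at each rank.
    rank_means = [sum(col[i] for col in sorted_cols) / n for i in range(g)]
    # Step 4: put each rank mean back at the gene's original position.
    normalized = []
    for col in arrays:
        order = sorted(range(g), key=lambda i: col[i])
        out = [0.0] * g
        for rank, gene_index in enumerate(order):
            out[gene_index] = rank_means[rank]
        normalized.append(out)
    return normalized


# Three hypothetical arrays measuring the same four genes.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
```

    After the call, every array contains the same set of values, and within each array the genes keep their original rank order.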

    Figure 2.5. Quantile normalization. (a) Box plots of the log2 signals for four microarrays before normalization, showing differences in the distributions. (b) The distributions of the log2 signals for the same microarrays after quantile normalization.

    2.3.3 Quality Control

    The performance of microarray experiments involves many steps, and there are many stages where things can go wrong. Here we describe the most common procedures for quality control and explain how they can be used to inspect the quality of the data.

    2.3.3.1 Inspection of Signal Plots

    As a first quality control measure one can make a signal plot of all measured signals for all microarrays before and after normalization. If probes are randomly distributed over the microarrays, there should not be any patterns visible in these plots. Visual inspection of these images may reveal cases in which, for example, a hair or pieces of dust disturb the signals. Also, one can detect if spatial effects are properly corrected by the normalization methods.

    2.3.3.2 Dissimilarity Measures

    To detect deviating microarrays, Wit and McClure (2004) suggest calculating dissimilarity measures between all pairs of microarrays. They consider two types of measures: absolute dissimilarities, indicating whether genes have similar levels over different arrays, and correlations, indicating coordinated changes of genes between arrays. As absolute dissimilarity measures they use power distances

    (2.5) $d_p(x, y) = \left( \sum_{i=1}^{g} |x_i - y_i|^p \right)^{1/p}$

    and use both the Manhattan distance d1 and the Euclidean distance d2. Investigation of the dissimilarity matrices directly identifies microarrays in which processing problems may have occurred.
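
    A small sketch shows how such a dissimilarity matrix can point to a deviating array. It assumes that equation 2.5 has the usual power-distance form, the p-th root of the summed p-th powers of the absolute differences between two arrays. The helper most_deviating and the example data are our own illustration.

```python
def power_distance(x, y, p):
    """Power distance between two arrays; p=1 Manhattan, p=2 Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def distance_matrix(arrays, p):
    """All pairwise power distances between arrays."""
    n = len(arrays)
    return [
        [power_distance(arrays[i], arrays[j], p) for j in range(n)]
        for i in range(n)
    ]


def most_deviating(arrays, p):
    """Index of the array with the largest summed distance to the others."""
    dist = distance_matrix(arrays, p)
    totals = [sum(row) for row in dist]
    return totals.index(max(totals))


# Hypothetical log2 expression values: the last array behaves differently.
arrays = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [1.0, 2.2, 3.1],
    [3.0, 0.5, 5.0],
]
print(most_deviating(arrays, p=1), most_deviating(arrays, p=2))  # 3 3
```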

    2.3.3.3 Dimensionality Reduction

    Another way to check for possible problems in the data is to perform a dimensionality reduction. A popular method is principal component analysis (PCA), which transforms a number of possibly correlated variables (gene expression values in this case) into a smaller number of uncorrelated variables called principal components. The first component accounts for as much of the variability in the data as possible, and each following component accounts for as much of the remaining variability as possible.
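
    For quality control, the score of each array on the first principal component is often sufficient. The sketch below obtains it by power iteration on the array-by-array matrix of inner products, which avoids any dependency on a linear algebra library. This choice of algorithm, the iteration count, and the example data are assumptions made for illustration; statistical packages normally use a singular value decomposition.

```python
import math


def first_principal_component(arrays, n_iter=200):
    """Scores of each array on PC1 and the fraction of variance explained.

    arrays -- list of arrays, each a list of g expression values
    """
    n, g = len(arrays), len(arrays[0])
    # Center every gene across the arrays.
    gene_means = [sum(arr[j] for arr in arrays) / n for j in range(g)]
    centered = [[arr[j] - gene_means[j] for j in range(g)] for arr in arrays]
    # n x n matrix of inner products between centered arrays.
    gram = [[sum(a * b for a, b in zip(ci, cj)) for cj in centered]
            for ci in centered]
    # Power iteration; the start vector must not be constant.
    v = [float(i + 1) for i in range(n)]
    for _ in range(n_iter):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)
    )
    total = sum(gram[i][i] for i in range(n))
    scores = [math.sqrt(max(eigenvalue, 0.0)) * x for x in v]
    explained = eigenvalue / total if total > 0 else 0.0
    return scores, explained


# Hypothetical data: three similar arrays and one deviating array.
arrays = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.0, 3.1, 3.9],
    [0.9, 2.1, 2.9, 4.1],
    [4.0, 1.0, 0.0, 2.0],
]
scores, explained = first_principal_component(arrays)
```

    An array whose score lies far from all others on the first component, as the last array does here, deserves closer inspection.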

    A quicker method for dimensionality reduction is Sammon mapping (Sammon, 1969). Instead of using the whole gene expression data matrix, it uses the distance matrix between arrays. It aims to find a representation of the arrays in a lower-dimensional space in such a way that the distances between the arrays are closest to the distances in the original matrix. Inspection of a two-dimensional (2D) Sammon mapping can indicate whether samples have been swapped or specific arrays have a deviating behavior. For example, the Sammon mapping in Figure 2.6 indicates that samples A.2.3 and B.2.3 have probably been swapped.

    Figure 2.6. Sammon mapping of microarray data of two strains of mice (A and B) infected with a virus, measured 3 days after the infection. Each measurement has been performed in three replicates: A.1.3 means mouse A, day 1 postinfection, replicate 3.

    2.3.3.4 Pairwise Scatter Plots

    Another good way to detect deviating microarrays is to inspect scatter plots of all expression values for all possible pairs of microarrays. Normally, the number of differentially expressed genes between two experimental conditions is small relative to the total number of genes on the microarray used. A figure in which the expression values of all genes in the two conditions are plotted against each other should reveal a cloud of points on the diagonal with relatively few points off the diagonal. Comparing the scatter plots of all possible pairs of microarrays might reveal a single microarray that shows deviating scatter plots with all other microarrays, for example, scatter plots in which the cloud on the diagonal is broader than in the other pairwise plots. This would indicate that the microarray shows many more and larger changes in gene expression compared to other samples than do other comparisons. If these changes are not expected from a biological point of view, there might have been a technical problem causing these changes, and it will be better to repeat the microarray experiment for this sample.

    2.3.3.5 Sex-Specific Gene Expression

    If the experiment involves both male and female samples, mislabelings of the microarrays can be detected by comparing the expression values for sex-specific genes. Xist is one example of a female-specific gene. It should be expressed only in female samples. In this way, samples that have been mixed up can be easily identified.
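
    Such a check is easy to automate. In the sketch below, the sample names, the Xist values, and the log2 threshold of 8.0 are all hypothetical; in practice the threshold should be chosen from the observed distribution of Xist signals, which normally falls into two clearly separated groups.

```python
def flag_sex_mismatches(xist_log2, recorded_sex, threshold=8.0):
    """Return samples whose Xist level contradicts the recorded sex.

    xist_log2    -- dict: sample name -> log2 Xist expression
    recorded_sex -- dict: sample name -> "F" or "M" from the sample sheet
    threshold    -- assumed cut-off separating expressed from not expressed
    """
    flagged = []
    for sample, value in xist_log2.items():
        inferred = "F" if value > threshold else "M"
        if inferred != recorded_sex[sample]:
            flagged.append(sample)
    return flagged


# Hypothetical experiment in which two samples appear to be swapped.
xist = {"s1": 11.2, "s2": 4.1, "s3": 10.8, "s4": 4.5}
sheet = {"s1": "F", "s2": "M", "s3": "M", "s4": "F"}
print(flag_sex_mismatches(xist, sheet))  # ['s3', 's4']
```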

    2.4 APPLICATIONS AND LIMITATIONS

    2.4.1 Identification of Expression QTL and Gene Regulatory Networks

    Combining gene expression profiles with genetic information represents a new, powerful approach for identifying genes in disease models. This approach is generally referred to as the identification of expression quantitative trait loci (eQTL) (Rockman and Kruglyak, 2006) or genetical genomics (Jansen and Nap, 2001).

    A quantitative trait locus is a specific region on the genome where one or more genes are located that most likely regulate a phenotypic trait. By making use of specific populations of organisms that are genetically related and by measuring the trait values for many individuals of the population and combining them with the genetic information of the individual organisms, one can identify the locations on the genome regulating the trait.

    Recombinant inbred lines (RIL) are often used for QTL analysis. In mice, these lines are obtained by breeding two genetically different inbred parental lines and by performing brother–sister mating from a large number of F1 hybrid pairs for about 20 generations. As can be seen from Figure 2.7, the parental strains produce F1 offspring that are heterozygous. The F1 offspring are mated to produce F2 animals. After many generations of brother–sister matings that start from a given F2 pair, recombinant inbred lines have evolved whose genomes represent a fixed mixture of the parental genomes and in which all individuals are again homozygous at every location. The fixed parental genome mixture is, however, different from one line (starting with a given F2 pair) to another line (starting with a different F2 pair). The genetic makeup of each RIL is then determined using molecular markers. RILs thus allow one to measure phenotypes and subsequently relate them to the genotypes.

    Figure 2.7. Generation of recombinant inbred lines.

    For the identification of eQTL, global gene expression profiles are determined by gene expression microarrays for each RIL. Subsequently, the gene expression profile for each gene is taken as a quantitative trait and is compared in a genome scan with the distribution of molecular markers (Fig. 2.8). For each molecular marker, the expression trait values are divided into two groups, according to the alleles that the individuals carry for that marker (Fig. 2.8c). Then a statistical test is performed that determines whether the means of both groups differ significantly. If they do, an eQTL is declared at that marker, as shown for the second marker in Figure 2.8. This result indicates that, with a high probability, there are one or several factors (genes) at that genomic location that regulate the expression of the target gene, since all individuals carrying the one allele have a low expression value and all individuals carrying the other allele have a high expression value. Once a QTL is identified, one can compare the location of the QTL with the location of the gene. If they coincide, the QTL is referred to as a cis-QTL or local QTL; otherwise it is called a trans-QTL or distant QTL.

    Figure 2.8. (See color insert.) (a) The genomewide genotypes of eight recombinant inbred lines generated from a cross between two homozygous parents (A and B). Each row indicates the genome of a single RIL. The light or dark gray color in each of the RILs indicates whether that part of its genome was inherited from parent A or B. (b) Gene expression values are determined by microarrays. Four values are shown for each parent and one value for each of the RILs. (c) For three molecular markers, the gene expression values of the RILs are dissected into two groups, according to the allele they carry for that molecular marker (light or dark gray). A statistical test of each marker location calculates whether the means of both groups differ. The significances of the tests are plotted in a genomewide plot as a QTL plot. Here, a QTL peak is found for the second marker. Triangles in the QTL plot indicate the position of the gene, whose expression was used. If the gene coincides with the QTL peak, the QTL is referred to as a cis-QTL, otherwise, it is called a trans-QTL.

    Adapted from Alberts et al. (2005) by permission of Oxford University Press.

    There exist several methods for the identification of quantitative trait loci. The most straightforward method is called single-marker analysis. Here, a genomewide scan is performed and at each molecular marker a regression test determines whether a QTL is present. In a second method, called interval mapping, the QTL likelihood is determined at locations in between markers. At fixed genomic intervals and by making use of the information for the surrounding markers, this method is able to calculate QTL scores at the markers themselves and at places in between. Based on the idea that multiple QTL can regulate a quantitative trait, Jansen (1993) and Zeng (1993) proposed the method of composite interval mapping (multiple QTL mapping). Here, the existence of multiple QTLs regulating the expression of one trait is modeled. This allows identifying epistatic QTL—that is, multiple QTL regions that regulate the trait by interacting with each other. Also, it allows for the identification of multiple linked QTL.
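
    Single-marker analysis, in the two-group form described earlier in this section, can be sketched as follows. The sketch uses Welch's t statistic as the test of whether the group means differ, which is one possible choice and not necessarily the regression test used by any particular QTL package. The expression values, the genotypes, and the 10 Mb window for calling a QTL cis are all assumptions made for illustration.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for the difference of two group means."""
    se = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    diff = mean(group_a) - mean(group_b)
    if se == 0.0:
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se


def single_marker_scan(expression, genotypes):
    """Absolute t statistic per marker.

    expression -- one expression value per RIL for the gene of interest
    genotypes  -- one list per marker with the allele ("A" or "B") per RIL
    """
    scores = []
    for marker in genotypes:
        a = [e for e, allele in zip(expression, marker) if allele == "A"]
        b = [e for e, allele in zip(expression, marker) if allele == "B"]
        if len(a) < 2 or len(b) < 2:
            scores.append(0.0)  # too few lines in one group to test
            continue
        scores.append(abs(welch_t(a, b)))
    return scores


def classify_qtl(qtl_chr, qtl_mb, gene_chr, gene_mb, window_mb=10.0):
    """Call a QTL 'cis' if it lies within window_mb of the gene."""
    if qtl_chr == gene_chr and abs(qtl_mb - gene_mb) <= window_mb:
        return "cis"
    return "trans"


# Hypothetical data for eight RILs and three markers.
expression = [2.1, 2.3, 7.9, 8.2, 2.0, 8.1, 7.8, 2.2]
genotypes = [
    ["A", "B", "A", "B", "A", "B", "A", "B"],
    ["A", "A", "B", "B", "A", "B", "B", "A"],
    ["A", "A", "A", "A", "B", "B", "B", "B"],
]
scores = single_marker_scan(expression, genotypes)
best_marker = scores.index(max(scores))  # the second marker (index 1)
```

    Because tens of thousands of genes are each tested at many markers, a real analysis must also correct the significance threshold for multiple testing, for example by permutation.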

    The GeneNetwork (www.genenetwork.org) has been established as a rich resource for systems genetics. It contains a large collection of genotypes, phenotypes, and gene expression profiles for multiple organisms and genetic reference populations. It offers good tools for QTL and correlation analysis and the identification of QTL genes and gene networks.

    The identification of a trans-QTL means that the location probably regulates the expression of another gene, the target gene. Furthermore, several genes may map to the same trans-QTL, which indicates that all these genes appear to have a common regulator. By following links between genes, revealed by trans-QTLs, one can build up gene regulatory networks. These regulatory networks have the potential to explain the complex interplay of genes and their products affecting complex traits and diseases.

    Ferrara et al. (2008) demonstrated how eQTL can be used to reconstruct networks. They used an F2 intercross between a diabetes-resistant and a diabetes-susceptible mouse strain and identified expression QTLs (eQTLs) as well as metabolite QTLs (mQTLs). mQTLs were determined by taking metabolite abundances as quantitative traits. For one metabolite, glutamate, they identified an mQTL interval that also contains eQTLs and transcripts with eQTLs elsewhere. Using this information, they reconstructed a regulatory network, demonstrating the validity of the network by showing that the genes respond to changes in glutamate.

    Crawford et al. (2008) described how eQTLs can be used to derive a transcriptional network that predicts breast cancer survival. In previous work, it was shown that extracellular matrix (ECM) gene dysregulation predicts both mouse mammary tumorigenesis and human breast cancer. They identified three reproducible eQTLs that regulate ECM gene expression. By correlation analyses and known association with metastasis, they identified seven candidate genes. Six out of the seven candidates appeared to suppress metastasis.

    2.4.2 Identification of Differentially Expressed Genes

    Microarrays are most often used for the identification of differentially expressed genes. In disease gene discovery, healthy samples and disease samples are compared and genes that are differentially expressed are identified. Thuong et al. (2008), for example, compared gene expression profiles of macrophages from individuals with different clinical manifestations of Mycobacterium tuberculosis infection. For three clinical phenotypes (latent, pulmonary, and meningeal tuberculosis), they identified lists of differentially expressed genes. Comparing the three phenotypes, they identified 261 genes having a greater than fivefold change in expression between any of the three conditions. Pennings et al. (2008) compared multiple microarray studies on acute lung inflammation models. The models included air pollutants; bacterial, viral, and parasitic infections; and allergic asthma models. They identified a cluster of 383 genes with an expression response that was common to all pulmonary diseases.

    2.4.3 Identification of Cell-Type-Specific Genes

    Another application of microarrays is to identify genes that are expressed in specific cell types. Sugimoto et al. (2006), for example, compared gene expression profiles in CD25+CD4+ regulatory T cells and CD25−CD4+ naive T cells. They found multiple genes that were expressed in a pattern that is specific for regulatory T cells. These genes are thought to be involved in differentiation and homeostasis of regulatory T cells.

    2.4.4 Determination of the Downstream Effects of a Mutation

    Von Bernuth et al. (2008) used microarrays to determine the downstream effects of a MyD88 mutation in humans. Nine patients with MyD88 deficiency suffered from life-threatening, often recurrent pyogenic bacterial infections. The authors identified the functional pathways in healthy fibroblasts that were regulated after treatment with interleukin 1β (IL-1β), tumor necrosis factor α (TNF-α), or Poly(IC) and compared them to the expression levels obtained from cells derived from patients. They identified a complete, specific lack of response to IL-1β as a defining characteristic of MyD88 deficiency.

    2.4.5 Determination of the Downstream Effects of a Signaling Molecule

    Type 1 interferon (IFN) contributes significantly to innate immune responses. Malakhova et al. (2006) reported that UBP43 is highly expressed in macrophages and inhibits type 1 IFN signaling. To understand the effect of UBP43 and type 1 IFN signaling, Zou et al. (2007) analyzed the genomewide gene expression profiles of IFN-β-stimulated genes in wild type and UBP43−/− bone marrow–derived macrophages (BMMs). They identified 749 genes that were uniquely upregulated in UBP43−/− BMMs, including a large number of previously unidentified IFN-stimulated genes.

    2.4.6 Predicting Vaccine Efficacy

    Another application of microarrays is the identification of gene signatures that have predictive value for a biological response. For example, Querec et al. (2009) used this approach to predict vaccine efficacy. They vaccinated humans with the yellow fever vaccine YF-17D and performed microarray experiments at 0, 1, 3, 7, and 21 days after vaccination in two independent trials. Using the DAMIP classification model (Lee 2007, Brooks and Lee 2008), they identified innate immune signatures that could predict subsequent adaptive immune responses. One signature predicted YF-17D CD8+ T cell responses with up to 90% accuracy, and another predicted the neutralizing antibody response with up to 100% accuracy.

    2.4.7 Determination of Host Responses after Infection

    Microarrays have been successfully used to characterize host responses after infection. For example, Kash et al. (2006) infected mice with a contemporary human influenza A/Texas/36/91 H1N1 virus (Tx91) and with a reconstructed recombinant 1918 H1N1 virus (r1918), modeled on the pandemic strain that caused about 50 million deaths worldwide. They found that mice infected with the r1918 virus showed a much stronger inflammatory response. As another example, Ding et al. (2008) found differences between mouse strains in the response to influenza A infection.

    2.4.8 Limitations

    A limitation of using microarrays is that they measure mRNA abundances and not protein levels. Posttranscriptional modifications or mRNA degradation might cause actual protein levels to be different from gene expression levels measured with microarrays. In these situations, the transcriptional profiles obtained by microarrays do not fully correspond to the proteome within the cell.

    2.5 QUESTIONS AND ANSWERS

    Q1. Why are multiple microarrays in an experiment normalized?

    Q2. What is an expression QTL (eQTL)? And how can it be used to discover gene-interaction networks?

    A1. Multiple microarrays are normalized to remove nonbiological (technical) variation between arrays, so that the differences that remain reflect biological variation.
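
    As an illustration of A1, the following is a minimal sketch of quantile normalization, one of the methods compared by Bolstad et al. (2003). It forces every array to share the same intensity distribution. The toy matrix is hypothetical, and tied values are ordered arbitrarily rather than averaged, which is a simplification.

```python
import numpy as np

def quantile_normalize(expr):
    """Quantile-normalize a (n_probes, n_arrays) matrix column by column."""
    expr = np.asarray(expr, dtype=float)
    # For each array (column), the probe indices from lowest to highest value.
    order = np.argsort(expr, axis=0, kind="stable")
    sorted_vals = np.take_along_axis(expr, order, axis=0)
    # Mean intensity at each rank, averaged across arrays.
    rank_means = sorted_vals.mean(axis=1)
    normalized = np.empty_like(expr)
    # Write the rank means back to each probe's original position.
    np.put_along_axis(normalized, order, rank_means[:, np.newaxis], axis=0)
    return normalized

# Hypothetical toy data: 4 probes (rows) measured on 3 arrays (columns).
raw = np.array([
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 5.0, 6.0],
    [4.0, 2.0, 8.0],
])
normalized = quantile_normalize(raw)
print(normalized)
```

    After this step each column contains the same set of values, while the rank order of probes within each array is unchanged.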

    A2. An eQTL is an expression quantitative trait locus: a genomic region that very likely contains one or more genes regulating the expression of another gene. Trans-eQTLs are genomic regions that influence the expression of a gene located elsewhere in the genome. A trans-eQTL region will very likely contain genes that directly influence the expression of the target gene(s). By relating multiple trans-eQTLs to multiple target genes, one can obtain valuable hypotheses about gene–gene regulatory interactions.
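
    The idea in A2 can be sketched as follows, under stated assumptions. An eQTL is labeled "trans" when the locus lies on a different chromosome from the target gene or farther away than a chosen window; the 5 Mb window, the class and function names, and the gene names are all hypothetical choices for illustration. Target genes are then grouped by trans-eQTL locus, and a locus shared by several targets becomes a candidate location for a common regulator.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EQTL:
    target_gene: str     # gene whose expression is the mapped trait
    gene_chrom: str      # chromosome of the target gene
    gene_pos_mb: float   # position of the target gene (Mb)
    locus_chrom: str     # chromosome of the eQTL locus
    locus_pos_mb: float  # position of the eQTL locus (Mb)

def classify(eqtl, cis_window_mb=5.0):
    """Label an eQTL 'cis' if the locus is near its target gene, else 'trans'.

    The 5 Mb default is an assumed cutoff, not a standard value.
    """
    same_chrom = eqtl.gene_chrom == eqtl.locus_chrom
    if same_chrom and abs(eqtl.gene_pos_mb - eqtl.locus_pos_mb) <= cis_window_mb:
        return "cis"
    return "trans"

def trans_targets_by_locus(eqtls, cis_window_mb=5.0):
    """Group target genes by the trans-eQTL locus they map to."""
    groups = {}
    for e in eqtls:
        if classify(e, cis_window_mb) == "trans":
            key = (e.locus_chrom, e.locus_pos_mb)
            groups.setdefault(key, []).append(e.target_gene)
    return groups

# Hypothetical toy data.
eqtls = [
    EQTL("gene1", "chr1", 10.0, "chr1", 11.0),  # locus close to gene: cis
    EQTL("gene2", "chr2", 50.0, "chr7", 30.0),  # different chromosome: trans
    EQTL("gene3", "chr5", 20.0, "chr7", 30.0),  # different chromosome: trans
    EQTL("gene4", "chr7", 90.0, "chr7", 30.0),  # same chromosome, far: trans
]
groups = trans_targets_by_locus(eqtls)
print(groups)
```

    Each group is only a hypothesis: the genes located within the shared locus are candidate regulators that still need to be tested experimentally.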

    2.6 ACKNOWLEDGMENTS

    We would like to thank Dr. Robert Geffers for fruitful discussions.

    This work was supported by intramural grants from the Helmholtz-Association (Program Infection and Immunity) and a research grant for the virtual institute GeNeSys (German Network for Systems Genetics, No VH-VI-242) from the Helmholtz Association.

    2.7 REFERENCES

    Alberts R, Fu J, Swertz MA, Lubbers LA, Albers CJ, Jansen RC. (2005). Combining microarrays and genetic analysis. Brief Bioinform 6(2):135–45.

    Blencowe BJ. (2006). Alternative splicing: new insights from global analyses. Cell 126:37–47.

    Bolstad BM, Irizarry RA, Astrand M, Speed TP. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2):185–93.

    Brooks JP, Lee EK. (2008). Analysis of the consistency of a mixed integer programming based multi-category constrained discriminant model. Ann Oper Res 164:1–20.

    Crawford NP, Walker RC, Lukes L, Officewala JS, Williams RW, Hunter KW. (2008). The Diasporin pathway: a tumor progression-related transcriptional network that predicts breast cancer survival. Clin Exp Metastasis 25(4):357–69.

    Ding M, Lu L, Toth LA. (2008). Gene expression in lung and basal forebrain during influenza infection in mice. Genes Brain Behav 7(2):173–83.

    Eisen M. (1999). ScanAlyze User Manual. Available at http://rana.lbl.gov/manuals/ScanAlyzeDoc.pdf.

    Ferrara CT, Wang P, Neto EC, Stevens RD, Bain JR, Wenner BR, Ilkayeva OR, Keller MP, Blasiole DA, Kendziorski C, Yandell BS, Newgard CB, Attie AD. (2008). Genetic networks of liver metabolism revealed by integration of metabolic and transcriptional profiling. PLoS Genet 4(3):e1000032.

    He L, Thomson JM, Hemann MT, Hernando-Monge E, Mu D, Goodson S, Powers S, Cordon-Cardo C, Lowe SW, Hannon GJ, Hammond SM. (2005). A microRNA polycistron as a
