Statistics for Bioinformatics: Methods for Multiple Sequence Alignment
Ebook, 193 pages, 2 hours

About this ebook

Statistics for Bioinformatics: Methods for Multiple Sequence Alignment provides an in-depth introduction to the most widely used methods and software in the bioinformatics field. With the ever increasing flood of sequence information from genome sequencing projects, multiple sequence alignment has become one of the cornerstones of bioinformatics. Multiple sequence alignments are crucial for genome annotation, as well as the subsequent structural, functional, and evolutionary studies of genes and gene products. Consequently, there has been renewed interest in the development of novel multiple sequence alignment algorithms and more efficient programs.

  • Explains the dynamics that animate health systems
  • Explores tracks to build sustainable and equal architecture of health systems
  • Examines the advantages and disadvantages of the different approaches to care integration and the management of health information

Language: English
Release date: Nov 24, 2016
ISBN: 9780081019610

Author

Julie Thompson

Julie Dawn Thompson is a Senior Scientist at the French National Center for Scientific Research with expertise in theoretical bioinformatics, data mining, knowledge engineering, integrative bioinformatics and genomics, (LBGI) Stochastic Optimization and Nature inspired Computing (SONIC)

    Book preview

    Statistics for Bioinformatics

    Methods for multiple sequence alignment

    Julie Dawn Thompson

    Statistics for Bioinformatics Set

    coordinated by

    Guy Perrière

    Table of Contents

    Cover image

    Title page

    Copyright

    Preface

    Part 1: Fundamental Concepts

    1: Introduction

    Abstract

    1.1 Biological sequences: DNA/RNA/proteins

    1.2 From DNA to RNA and proteins

    1.3 RNA sequence, structure and function

    1.4 Protein sequence, structure and function

    1.5 Sequence evolution

    1.6 MSA: basic concepts

    1.7 Multiple sequence alignment applications

    Part 2: Traditional Multiple Sequence Alignment Methods

    Introduction

    2: Heuristic Sequence Alignment Methods

    Abstract

    2.1 Optimal sequence alignment

    2.3 Iterative alignment

    2.4 Consistency-based alignment

    2.5 Cooperative alignment strategies

    3: Statistical Alignment Approaches

    Abstract

    3.1 Probabilistic models of sequence evolution

    3.2 Profile HMM-based alignment

    3.3 Simulated annealing

    3.4 Genetic algorithms

    4: Multiple Alignment Quality Control

    Abstract

    4.1 Objective scoring functions

    4.2 Determination of reliable regions

    4.3 Estimation of homology

    5: Benchmarking

    Abstract

    5.1 Criteria for benchmark construction

    5.2 Multiple alignment benchmarks

    5.3 Comparison of multiple alignment benchmarks

    Part 3: Large-scale Multiple Sequence Alignment Methods

    Introduction

    6: Whole Genome Alignment

    Abstract

    6.1 Pairwise genome alignment

    6.2 Progressive methods for multiple genome alignment

    6.3 Graph-based methods for multiple genome alignment

    6.4 Meta-aligners for multiple genome alignment

    6.5 Accuracy measures for genome alignment methods

    6.6 Benchmarking genome alignment

    7: Multiple Alignment of Thousands of Sequences

    Abstract

    7.1 Extension of the progressive alignment approach

    7.2 Meta-aligners for large numbers of sequences

    7.3 Extending seed alignments

    7.4 Benchmarking large numbers of sequences

    8: Future Perspectives: High-Performance Computing

    Abstract

    8.1 Coarse-grain parallelism: grid computing

    8.2 Fine-grain parallelism: GPGPU

    8.3 MSA in the cloud

    Bibliography

    Index

    Copyright

    First published 2016 in Great Britain and the United States by ISTE Press Ltd and Elsevier Ltd

    Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

    ISTE Press Ltd

    27-37 St George’s Road

    London SW19 4EU

    UK

    www.iste.co.uk

    Elsevier Ltd

    The Boulevard, Langford Lane

    Kidlington, Oxford, OX5 1GB

    UK

    www.elsevier.com

    Notices

    Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

    Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

    To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

    For information on all our publications visit our website at http://store.elsevier.com/

    © ISTE Press Ltd 2016

    The rights of Julie Dawn Thompson to be identified as the author of this work have been asserted by her in accordance with the Copyright, Designs and Patents Act 1988.

    British Library Cataloguing-in-Publication Data

    A CIP record for this book is available from the British Library

    Library of Congress Cataloging in Publication Data

    A catalog record for this book is available from the Library of Congress

    ISBN 978-1-78548-216-8

    Printed and bound in the UK and US

    Preface

    Julie Dawn Thompson

    In the past 10 years, biology has been transformed by the development of new genome sequencing technologies known as next-generation sequencing (NGS). This has led to a rapid reduction in the cost of generating genomic data and has made DNA sequencing, RNA-seq and high-throughput screening an increasingly important part of biological and biomedical research. However, the completion of the genome sequences is just a first step toward deciphering the meaning of the genetic instruction book. The bottleneck in genome analysis has now shifted to finding efficient and effective ways to analyze the new data in order to leverage their ability to generate insights into the function of biological systems. Whole-genome sequencing is commonly associated with sequencing human genomes, where the genetic data represent a treasure trove for discovering how genes contribute to our health and well-being. However, the scalable, flexible nature of NGS technology makes it equally useful for sequencing any species, such as agriculturally important livestock, plants or disease-related microbes.

    The major challenge today is to understand how the genetic information encoded in the genome sequence is translated into the complex processes involved in the organism and the effects of environmental factors on these processes. Bioinformatics plays a crucial role in the systematic interpretation of genome sequence information in association with data from other high-throughput experimental techniques, such as structural genomics, proteomics or transcriptomics. One of the cornerstones of bioinformatics, since its beginnings in the 1980s, has been the comparative analysis of sequences from different organisms known as multiple sequence comparison or multiple sequence alignment (MSA). A variety of computational algorithms have been applied to the sequence alignment problem in diverse domains, most notably in natural language processing. Nevertheless, the alignment of biological sequences involves more than abstract string parsing, since the string of bases or amino acids is a result of complex molecular and evolutionary processes. This book aims to describe the methods that are designed to capture some of this complexity by modeling macromolecular sequences and taking into account their three-dimensional (3D) structures, their cellular functions and their evolution.

    The comparison of biological sequences is used to reveal the regions that are conserved in all members of a family of genetic material (genome, gene, RNA, protein, promoter, etc.). This allows identification of regions that have been selected in different organisms during evolution and which are therefore potentially essential for the function at the molecular, cellular or organism levels. As a result, the comparison of nucleic acid or protein sequences has had a major impact on our understanding of the relationships between sequence, structure, function and evolution [LEC 01]. Multiple sequence comparisons or alignments were originally used in evolutionary analyses to explore the phylogenetic relationships between organisms [MOR 06]. Later, new sequence database search methods exploited multiple alignments to detect more and more distant homologues [ALT 97]. MSAs of nucleic acid or protein sequences are also used to highlight conserved functional features and to identify major evolutionary events, such as duplications, recombinations or mutations. They have led to a significant improvement in predictions of both 3D fold [MOU 05] and function [WAT 05]. Of course, in the current era of complete genome sequences, it is now possible to perform comparative multiple sequence analysis at the genome level [DEW 06].
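
    As a rough illustration of how an alignment reveals conserved regions, the sketch below scans a small, made-up protein alignment column by column and reports the positions where every sequence carries the same residue. The function name `conserved_columns`, the gap symbol `-` and the toy sequences are assumptions for illustration only; they are not taken from the book, and real conservation measures are usually more graded than this all-or-nothing test.

```python
def conserved_columns(alignment, gap="-"):
    """Return 0-based indices of columns in which all aligned sequences
    share the same residue and none of them has a gap."""
    if not alignment:
        return []
    length = len(alignment[0])
    # Rows of an alignment must all have the same length (gaps included).
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for i in range(length):
        column = {row[i] for row in alignment}
        if len(column) == 1 and gap not in column:
            conserved.append(i)
    return conserved


# Hypothetical toy alignment of three short protein fragments.
toy_alignment = [
    "MKV-LAG",
    "MKI-LSG",
    "MRVALAG",
]

print(conserved_columns(toy_alignment))  # -> [0, 4, 6]
```

    In this toy case only the first, fifth and last columns are fully conserved; the remaining columns contain either a substitution or a gap.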

    Such studies have important implications in numerous fields in biology. Nucleic acid divergence is used as a molecular clock to study organism divergence under the evolutionary forces of natural selection, genetic drift, mutation and migration [FEL 04], with applications from the scientific classification or taxonomy of species to genetic fingerprinting. Conserved sequence features or markers are used to characterize groups of individuals in population genetics [SCH 15]. Genotype/phenotype correlations can reveal candidate genes associated with a particular trait (e.g. plant height) or inherited disease, such as schizophrenia or asthma [MOR 12]. In drug discovery, a protein family perspective can identify specific structural or functional features that facilitate protein–ligand interaction studies for high-throughput virtual compound screening methods [LEN 00]. Thus, multiple alignments now play a fundamental role in most of the computational methods used in genomic or proteomic projects for gene identification and the functional characterization of the gene products.

    The first part of this book will introduce the fundamental concepts required to understand the development of MSA methods, including a description of the main characteristics of biological sequences and a more complete definition of what a multiple sequence alignment is and why it is so important. The second part of the book will then describe the traditional methods that are most widely used for the construction and analysis of MSAs. The literature is vast, and hence our presentation of these topics is necessarily selective. We will address the problems of alignment construction and survey the range of practical techniques for computing MSAs, with a focus on practical methods that have demonstrated good performance on real-world benchmarks. The third part of the book will then introduce the new bioinformatics approaches that are being developed in order to manage and extract pertinent information from the mass of data generated by the new high-throughput genome sequencing technologies.

    September 2016

    Part 1

    Fundamental Concepts

    1

    Introduction

    Abstract

    Some basic concepts in biology are necessary for understanding almost any part of this book, so this chapter represents a brief primer on the key ideas and concepts. For many readers, this will be familiar territory and in this case, they may want to skip this section and go directly to section 1.2.

    Keywords

    Alignment; DNA; Drug discovery; Gene prediction and validation; Genetics; Interaction networks; MSA; Proteins; RNA; Sequence evolution

    1.1 Biological sequences: DNA/RNA/proteins

    Some basic concepts in biology are necessary for understanding almost any part of this book, so this chapter represents a brief primer on the key ideas and concepts. For many readers, this will be familiar territory and in this case, they may want to skip this section and go directly to section 1.2.

    A genome is the genetic material of an organism. Each genome contains the entire set of hereditary instructions needed to build that organism and allow it to grow and develop. The instructions in the genome are encoded in very long DNA molecules, organized into pairs of chromosomes. The chromosomes are made up of chains of four nucleotide bases, adenine (A), guanine (G), thymine (T) and cytosine (C). The human genome, for example, contains 23 pairs of chromosomes and has more than 3 billion base pairs. The chromosomes can be further broken down into smaller pieces of code called genes, including over 20,000 protein-coding genes and many thousands of non-coding RNA (ncRNA) genes.
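
    Because DNA can be written as a chain over the four-letter alphabet A, C, G and T, a sequence is commonly handled in software as a plain string. The following minimal sketch, assuming a short invented fragment and a hypothetical helper named `base_composition`, checks that a string uses only these four letters and counts how often each base occurs.

```python
from collections import Counter

# The four DNA bases named in the text.
DNA_ALPHABET = frozenset("ACGT")


def base_composition(sequence):
    """Count each of the four bases in a DNA string (case-insensitive).

    Raises ValueError if the string contains any other symbol.
    """
    seq = sequence.upper()
    unexpected = set(seq) - DNA_ALPHABET
    if unexpected:
        raise ValueError(f"not a DNA base: {sorted(unexpected)}")
    counts = Counter(seq)
    # Report all four bases, including those that do not occur.
    return {base: counts[base] for base in "ACGT"}


# Invented 14-base fragment, for illustration only.
fragment = "ATGGCGTACGTTAG"
print(base_composition(fragment))  # -> {'A': 3, 'C': 2, 'G': 5, 'T': 4}
```

    The same representation extends to RNA by swapping T for U in the alphabet.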

    RNA is another molecule consisting of chains of four nucleotide bases, in this case adenine (A), cytosine (C), guanine (G) or uracil (U). RNA plays a key role in all steps of gene expression as an intermediate carrier
