Patterns for Fault Tolerant Software
About this ebook

Software patterns have revolutionized the way developers and architects think about how software is designed, built, and documented.

This new title in Wiley's prestigious Series in Software Design Patterns presents proven patterns and techniques for fault tolerant software. It is a key reference for experts seeking to select a technique appropriate for a given system.

Readers are guided from concepts and terminology, through common principles and methods, to advanced techniques and practices in the development of software systems. References provide access points to the key literature, including descriptions of exemplar applications of each technique.

The book is organized as a collection of software techniques, so a specific technique can easily be found, with enough detail to allow an appropriate choice for the system being designed.

Language: English
Publisher: Wiley
Release date: Jul 12, 2013
ISBN: 9781118351543

    Book preview

    Patterns for Fault Tolerant Software - Robert S. Hanmer

    Introduction

    An Imperfect World

    We live in an imperfect world. The things we make break when we least expect them to. This includes computer programs and the systems that we build from computers. Even the things that we think of as being the most reliable are occasionally unavailable because they’ve broken. This book is about how to make these systems of software (and hardware) work even though they might break occasionally.

    Consider the United States' manned space flight program: Apollo 13 had a dramatic failure that almost killed the three-person crew as they were heading toward the moon. Think also of the failures detected during space shuttle assembly that delay a launch for days. These space systems, which are highly complicated systems of hardware and software components, were designed to operate flawlessly, and yet failures happened.

    Consider also the WYSIWYG document editing program that just won't let you number the first page of a document, such as this book manuscript, as page one. Page numbering is a feature that both the program's creators and its users expect to work flawlessly.

    Or consider systems such as telephone switching equipment or web-based e-commerce systems or automatic teller machines (ATMs). These are expected to work flawlessly and continuously. They are built of combinations of hardware and software components that work together to provide the desired service.

    This book is about what to design into software to make these complicated systems of software (and hardware) tolerate an occasional error in the software so that they can provide service without failures being perceived.

    CHAPTER 1

    Introduction to Fault Tolerance

    Like any subject of study, fault tolerance has a specialized language associated with it. This chapter introduces these terms.

    The focus of this book is on ‘Fault Tolerance’ in general and in particular on things that can be done during the design of software to support fault tolerant operation. A system of software or hardware and software that is fault tolerant is able to operate even though some part is no longer performing correctly. Thus the focus of this book is on the software structures and mechanisms that can be designed into a system to enable its continued operation, even though a different part isn’t working correctly. This book describes practices to improve the reliability and availability of software systems. These practices are currently in use in a variety of software application domains.

    The next few sections define the vocabulary needed to discuss fault tolerance.

    Fault → Error → Failure

    The terms fault, error and failure have very specific meanings.

    A system failure occurs when the delivered service no longer complies with the specification, the latter being an agreed description of the system’s expected function and/or service. An error is that part of the system state that is liable to lead to subsequent failure; an error affecting the service is an indication that a failure occurs or has occurred. The adjudged or hypothesized cause of an error is a fault. [Lap91, p. 4]

    Every fault tolerant system composed of software and hardware must have a specification that describes what it means for that system to operate without failure. The system's specification defines its expected behavior, such as being available 99.999% of the time. When the system doesn't behave in the manner specified in its requirements, it has failed. The term failure refers to system behavior that does not conform to the system's specification.

    These are examples of failures: the system crashes to a stop when it shouldn't; the system computes an incorrect result; the system is not available for service; the system is unable to respond to user interaction. Whenever the system does the wrong thing, it has failed.

    Failures are detected by the observers and users of the system.

    Failures are dependent upon the requirements and the definition of agreed-upon correct operation of the system. If there is no specification of what the system should do, there cannot be a failure.

    Failures are caused by errors.

    An error is the incorrect system behavior from which a failure may occur. Errors can be categorized into two types: timing and value. Errors that manifest as value errors might be incorrect discrete values or incorrect system state. Timing errors can include total non-performance (the time was infinite).

    Some common examples of errors include:

    Timing or Race conditions: communicating processes get out of synchronization and a race for resources occurs.

    Infinite Loops: continuous execution of a tight loop without pausing and without acknowledging the requests of others for shared resources.

    Protocol Errors: errors in the messaging stream because of non-conformance with the protocol in use: unexpected messages sent to other parts of the system, messages sent at inappropriate times, or messages sent out of sequence.

    Data inconsistency: Data may be different between two locations, for example memory and disk, or between different elements in a network.

    Failure to Handle Overload conditions: the system is unable to handle the workload.

    Wild Transfer or Wild Write: data is written to an incorrect location of memory, or control transfers to an incorrect location, because of a fault in the system.

    Any of these example errors could be failures if they deviate from the system’s specification.

    Errors are important when talking about fault tolerant systems because errors can be detected before they become failures. Errors are the manifestation of faults, and errors are the way that we can look into the system to discover if faults are present.

    A fault is the defect that is present in the system that can cause an error. It is the actual deviation from correctness. In a computer program it is the misplaced comma or period, or the missing break statement in a C++ switch statement. Colloquially the fault is often called a ‘bug’, but that word will not appear elsewhere in this book.
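    As a concrete illustration of a latent fault, consider the missing break statement just mentioned. The sketch below is not from the book, and the command names are hypothetical. The fault sits harmlessly in the code until a STOP command arrives; its activation produces the error (the reset logic also runs), which an observer may then perceive as a failure.

```cpp
#include <iostream>

// Hypothetical command dispatcher with a latent fault: the missing
// `break` after case STOP.
enum Command { START, STOP, RESET };

void dispatch(Command c) {
    switch (c) {
        case START:
            std::cout << "starting\n";
            break;
        case STOP:
            std::cout << "stopping\n";
            // fault: missing `break` falls through into RESET
        case RESET:
            std::cout << "resetting\n";
            break;
    }
}

int main() {
    dispatch(START);  // behaves correctly: the fault stays latent
    dispatch(STOP);   // fault activates: prints "stopping" then "resetting"
}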

    The fault might be a latent software defect, or it might be a garbled message received on a communications channel, or a variety of other things. In general, neither the software nor the observers are aware of the presence of a fault until an error occurs.

    A number of causes lead to the introduction of a fault into software. These include:

    Incorrect Requirement Specification: Sometimes the software designers and coders were told to build the wrong thing.

    Incorrect Designs: Translating system requirements into a working software design is a complicated process that sometimes results in incorrect designs. The design might not be workable from a pure software standpoint, or it might not be an accurate translation of the requirements. In either case it is faulty.

    Coding Errors: Translating the design into working code can also introduce faults into the system. Compilers, interpreters, and code examination tools can catch some faults, but a fault can also produce syntactically correct code that simply does not perform the specified task.

    Faults are present in every system. When a fault is lying dormant and not causing any mischief it is said to be latent. When the circumstances arise that the latent fault causes something incorrect to happen it is said to become active. A fault’s activation results in an error.

    Examples of Fault → Error → Failure

    To help make these very important definitions clear, here are a few examples.

    A misrouted telephone call is an example of a failure. Telephone system requirements specify that calls should be delivered to the correct recipient. When a faulty system prevents them from being delivered correctly, the system has failed. In this case the fault might have been incorrect call routing data stored in the system. The error occurs when the incorrect data is accessed and an incorrect network path is computed with that incorrect data.

    A robotic arm used to drill a part in a manufacturing environment provides another example. Consider the fault of a misplaced decimal point in a data constant used in the computation of the rotation of the robot's arm. The data constant might be the number of steps required to rotate the robotic arm one degree. The error might be that the arm rotates in the wrong direction because of the erroneous computation made with the faulty constant. The arm fails by lowering its drill at the wrong location.

    The preparation of an incorrect bill for service is another example of a failure. The system requirements specify that the customer will be accurately charged for service received. A faulty identification received in a message by a billing system can result in the charges being erroneously applied to the wrong account. The fault in this case might have been in the communications channel (a garbled message), or in the system component that prepares the message for transmission. The error was applying the charges to the wrong account. The fact that the customer receives an incorrect charge is the failure, since they agreed with the carrier to pay for the service that they used and not for unused service.

    Consider a spacecraft that is given an updated set of program instructions by the Earth station controlling it. Someone designing the update incorrectly computed the memory range to be updated. The new program was written to this incorrect range, which corrupted another part of the programming. The corrupted instructions caused the spacecraft's antenna to point away from Earth, breaking off communications between Earth and the spacecraft, which led to the mission being considered a failure. The initial fault was the computation of the incorrect memory range; the error was the corruption of memory outside the intended range.

    Banking systems fail when they do not safeguard funds. An example of failure is when a bank's automatic teller machine (ATM) dispenses too much cash to a customer. Several errors might lead to this failure. One error is that the machine counted out more bills than it should have. In this case the fault might be an incorrect computation module, or a faulty currency sorting mechanism. A different error that can result in the same failure is that the bills were loaded incorrectly into the ATM. The fault was that the courier who loaded the machine put money in the wrong dispensers, i.e. $20 bills were placed in the $5 storage location and vice versa.

    The last example illustrates how the same failure might result from different faults as shown in Figure 2.

    Figure 2 Multiple faults create the same error

    Another example is the failure of the first Ariane 5 rocket from the European Space Agency. Flight 501 veered off its intended course, broke up, and exploded shortly after the start of the flight. The inertial reference system for the Ariane 5 was reused from the Ariane 4. During the initial period of flight, the Ariane 5's flight path differed enough from the Ariane 4's for the inertial reference system to encounter errors in the horizontal velocity calculations. These errors resulted in the failure of the backup inertial reference system, followed by a failure of the active inertial reference system. The loss of the inertial reference systems resulted in a large deviation from the desired flight path, which resulted in a mechanical failure that triggered the self-destruct circuitry. The fault in this case can be traced to a change in the requirements between Ariane 4 and Ariane 5 that allowed a more rapid buildup of horizontal velocity in Ariane 5. The error that resulted from the horizontal velocity increasing too rapidly led to the failure. [ESA96]

    Failure Perception [Lap91][Kop97]

    A fail-silent failure is one in which the failing unit either presents the correct result or no result at all. A crash failure is one where the unit stops after the first fail-silent failure. When a crash failure is visible to the rest of the system, it is called a fail-stop failure.

    A set-top entertainment system computer fails quietly, without announcing to the world that it has failed. When it fails it just stops providing service. The computer in the Voyager spacecraft fails in a crash failure mode after it detects its first failure; the crash is detected by the backup computer, which assumes primary control. [Tom88]

    Failures can be categorized as either consistent or inconsistent. Consistency refers to whether the failure appears the same each time it is observed. The failure is examined from the viewpoint of the user: the person or other system that determines that the failing system did not conform to its specifications. Consistent failures are seen as the same kind of failure by all users or observers of a system. An example of failing consistently is reporting '1' in response to all questions that the system is asked.

    Inconsistent failures are ones that appear different to different observers. These are sometimes called two-faced failures, malicious failures or Byzantine failures. These are the most difficult to isolate and correct because the failure is presenting multiple faces to the error detection, processing, and fault treatment phases of recovery.

    An example of an inconsistent failure is to respond with ‘1’ to questions asked by one peer and ‘2’ to questions from all other peers. Another example is when the failing system misroutes all network traffic to a certain network address, and not to other network addresses. The observers of the system, the network peers, see one of two behaviors: either they see a complete absence of network traffic, or they see a flood of network traffic of which most of it is incorrect and should not have been received. This failure is inconsistent because the perception of whether the system is sending traffic or not sending traffic depends on which peer is the observer.

    Inconsistent failures are very hard to detect and to correct because they appear different to each observer. In particular they might appear correct to the part that would detect a failure and incorrect to all other parts of the system. To counter the risk of the failure appearing differently to different observers, fault tolerant design attempts to turn the potentially inconsistent failures into consistent failures. This is accomplished by creating boundaries around failing functionalities, and transforming all failures into fail-silent failures.
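    The sketch below illustrates one way such a boundary might look in code. It is a minimal sketch of my own, not a pattern from this book: a wrapper runs an acceptance check on every result and, once a check fails, latches into a silent state, returning nothing rather than propagating a value that different observers might see differently.

```cpp
#include <optional>

// A fail-silent boundary (illustrative sketch): deliver either a checked,
// correct result or no result at all. After the first detected error the
// boundary latches silent, giving crash-failure semantics.
template <typename T>
class FailSilentBoundary {
public:
    using Check = bool (*)(const T&);
    explicit FailSilentBoundary(Check check) : check_(check) {}

    std::optional<T> deliver(const T& result) {
        if (failed_) return std::nullopt;  // already silent
        if (!check_(result)) {             // error detected at the boundary
            failed_ = true;                // stop after the first error
            return std::nullopt;
        }
        return result;                     // correct result, or nothing at all
    }

private:
    Check check_;
    bool failed_ = false;
};

// Usage sketch:
//   FailSilentBoundary<double> guard(+[](const double& v) { return v >= 0.0; });
//   auto out = guard.deliver(compute());  // value, or nothing once failed
```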

    Fail-silent failures are the easiest type of failure to tolerate, because the observed failure is simply that the failing unit has stopped working. The reason for the failure is unclear, but the failing element is identified and the failure is contained, not spreading throughout the system.

    Single Faults

    Much of the fault tolerant design over the years has been created to handle only one error at a time. The assumption is that only one error will occur at a time and that recovery from it will have completed before another error occurs. A further assumption is that errors are independent of each other.

    While this is a common design principle, in real life many failures have occurred when this assumption was invalid.

    To understand why this is a valuable assumption, consider Table 1.1. It shows the theoretical results that indicate how many redundant units are required to tolerate independent faults of three kinds: fail-silent, consistent, and malicious (inconsistent). The type of failure tolerated influences the number of components required. From this table, most designers will see that the most desirable situation is to have the failing unit fail silently, because that requires only two units to tolerate a single failure.

    Table 1.1 Minimum number of components to tolerate failures [Kop97, p. 121]

    Failure mode to tolerate      Components required for k failures
    Fail-silent                   k + 1
    Consistent                    2k + 1
    Malicious (inconsistent)      3k + 1

    To gain perspective on the ramifications of Table 1.1, consider that the computer control system in the Space Shuttle is designed to tolerate two simultaneous failures, which must be consistent but need not be silent; as a result, it has five general purpose computers. [Skl76] A typical telephone switching system is designed to tolerate single failures. Many components are duplicated because two units are all that are required to tolerate single failures.
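    These minimums are easy to encode. The function below is a small sketch of my own construction that returns the component counts from Table 1.1 for each failure mode; for example, tolerating two simultaneous consistent failures requires 2 × 2 + 1 = 5 components, matching the Shuttle's five computers.

```cpp
#include <cstdio>

// Minimum number of components needed to tolerate k independent
// failures of each kind, following Table 1.1 [Kop97, p. 121].
enum class FailureMode { FailSilent, Consistent, Malicious };

int minComponents(FailureMode mode, int k) {
    switch (mode) {
        case FailureMode::FailSilent: return k + 1;      // stop on failure: duplication suffices
        case FailureMode::Consistent: return 2 * k + 1;  // majority voting
        case FailureMode::Malicious:  return 3 * k + 1;  // Byzantine agreement
    }
    return 0;  // unreachable: all enumerators handled above
}

int main() {
    // Two simultaneous consistent failures: 2*2 + 1 = 5.
    std::printf("%d\n", minComponents(FailureMode::Consistent, 2));
}
```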

    Examples of How Vocabulary Makes a Difference

    When debugging failures it is very useful to determine what the fault is, what the error is, and what the failure is. Here are a few examples. These also show that the terms, while specific, depend on the viewpoint and the depth of examination.

    Consider the robotic arm failure presented above. Was the fault that the arm software rotated the arm in the wrong direction, or was it the incorrect data that drove the state change? Knowing which was the fault helps us know what to fix.

    As another example, consider the Ariane 5 failure mentioned earlier. Was the fault that the specification didn’t reflect the expected flight path? Or was the fault that the reused component was insufficiently tested to detect the fault? Was the error that the incorrect specification was used, or was the error that the flight path deviated from the Ariane 4 flight path? Identifying and correctly labeling faults and errors simplifies the fault treatment.

    Coverage

    The coverage factor is an important metric of a system’s fault tolerance. Highly reliable and highly available systems strive for high coverage factors, 95% or higher.

    The coverage is the conditional probability that the system will recover automatically within the required time interval given that an error has occurred.

    C = P(system recovers within the required time | an error has occurred)

    In the Space Shuttle avionics nearly perfect coverage is attained in a complex of four off-the-shelf processors by comparing the output of simultaneous computations in each of the processors. Each Shuttle processor is equipped with a small amount of redundancy management hardware to manage the receipt of the values to be compared. Through the use of this hardware the processor can identify with certainty which of its peers computed an incorrect value. The coverage was increased to 100% through the additional technique of placing a timer on the buses used to communicate between the processors. [Skl76]

    Coverage can be computed from the probability associated with detection and recovery.

    C = P(error detected) × P(successful recovery | error detected)

    Obtaining the probabilities used to compute the coverage factor is difficult. Extensive stability testing and fault insertion testing are required to obtain these values.
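    As a small worked example (the probabilities here are hypothetical, not from the book), the coverage factor follows directly from the two probabilities above:

```cpp
#include <cstdio>

// Worked example: coverage as the product of the detection and
// recovery probabilities (both values hypothetical).
int main() {
    double pDetect  = 0.98;   // P(error is detected in time)
    double pRecover = 0.97;   // P(automatic recovery succeeds | detected)
    double coverage = pDetect * pRecover;
    std::printf("coverage = %.3f\n", coverage);  // 0.951, just above the 95% goal
}
```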

    Reliability

    A system’s reliability is the probability that it will perform without deviations from agreed-upon behavior for a specific period of time. In other words, that there will be no failures during a specified time.

    The parameters used to describe reliability are Mean Time To Failure (MTTF) and Mean Time To Repair (MTTR). The Mean Time To Failure is the average time from the start of operation until the time when the first failure occurs. The Mean Time To Repair is a measure of the average time required to restore a failing component to operation; in the case of hardware this includes the time to travel to the site in addition to the time to replace the faulty hardware component. The Mean Time Between Failures, or MTBF, is similar to MTTF but reflects the time from the start of operation until the component is restored to operation after repair; MTBF is the sum of MTTF and MTTR. MTBF is used in situations where the system is repairable, and MTTF is used when it cannot be repaired. The start of operations for both MTTF and MTBF refers to when normal operations are resumed, either after initial startup or after recovery has completed.

    The reliability can be computed with the following equation.

    R(t) = e^(−t/MTTF) = e^(−λt)

    Failure rate is the inverse of MTTF. A commonly used measurement of failure rate is FITs, or Failures in Time. FITs are the number of failures in 1 × 10⁹ hours.
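    A minimal sketch of these relationships, using a hypothetical MTTF (the numbers are illustrative, not from the book), and tying back to the airplane example's five-hour flight:

```cpp
#include <cmath>
#include <cstdio>

// Reliability R(t) = exp(-t / MTTF), failure rate = 1 / MTTF,
// and FITs = failures per 10^9 hours. MTTF is a hypothetical value.
int main() {
    double mttfHours = 50000.0;                    // hypothetical MTTF
    double lambda    = 1.0 / mttfHours;            // failure rate per hour
    double fits      = lambda * 1e9;               // 20,000 FITs
    double r5        = std::exp(-5.0 / mttfHours); // P(no failure in a 5-hour flight)
    std::printf("failure rate = %g/h (%.0f FITs)\n", lambda, fits);
    std::printf("R(5h) = %.5f\n", r5);             // ~0.99990
}
```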

    Reliability Examples

    Mars Landers

    The Mars Exploration Rovers, Spirit and Opportunity, had a design duration of 90 days. The reliability of these two Mars explorers has been so good that they lasted more than 1000 days. However, note that this refers only to complete system failures. There have been partial failures requiring workarounds or fault treatment, such as finding a way to keep the rover Spirit operating on only five of its six wheels. [NASA04][NASA06]

    Airplane Navigation System

    Many modern airplanes rely extensively on computers to control critical systems. While the aircraft is in the air, the navigational computers must operate failure-free. On a flight from Chicago to Los Angeles, the navigation system must be failure-free for between four and five hours. The MTTF during the operational phase of the system must be greater than five hours; if it were less, the flight crew could expect at least one failure on their flight. If the navigational system fails while the airplane is at the gate on the ground, repairs can return it to operational status before its next flight. Before or after a flight it is still a failure, but it might not be counted in the system's reliability computations. The MTTR must be low because airlines require their planes to be highly available in order to maximize their return on investment.

    Measuring Reliability

    There are two primary methods of determining the reliability of a system. The first is to watch the system for a long time and calculate the probability of failure over that period. The other is to predict the number of faults and, from that number, to predict the probability of failures (both the number of failures and their durations). Software Reliability Engineering focuses on measuring and predicting reliability.

    Availability

    A system’s availability is the percentage of time that it is able to perform its designed function. Uptime is when the system is available, downtime is when it is not. A common way to express availability is in terms of a number of nines, as indicated in Table 1.2.

    Table 1.2 Availability as a number of nines

    Nines       Availability    Downtime per year
    three 9s    99.9%           ~8.8 hours
    four 9s     99.99%          ~53 minutes
    five 9s     99.999%         ~5.3 minutes
    six 9s      99.9999%        ~32 seconds

    Availability is computed as:

    Availability = MTTF / (MTTF + MTTR) = Uptime / (Uptime + Downtime)

    Availability and Reliability are two concepts that are easy to get confused. Availability is concerned with what percentage of time the system can perform its function. Reliability is concerned with the probability that the system will perform failure-free for a specified period of time.

    Availability Examples

    The 4ESS™ Switch from Alcatel-Lucent had an explicit requirement when it was designed in the 1970s of two hours of downtime every 40 years. This equates to an unavailability of three minutes per year, which is slightly better than five 9s. The 5ESS® Switch from Alcatel-Lucent has achieved six 9’s of availability for a number of years.
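    The 4ESS figure can be checked directly with the availability equation; a minimal sketch:

```cpp
#include <cstdio>

// Check the 4ESS figure from the text: two hours of downtime in 40 years
// is three minutes of downtime per year, slightly better than five 9s.
int main() {
    double downtimeMinPerYear = 2.0 * 60.0 / 40.0;      // 3 minutes/year
    double minutesPerYear     = 365.25 * 24.0 * 60.0;   // 525,960 minutes
    double availability       = 1.0 - downtimeMinPerYear / minutesPerYear;
    std::printf("availability = %.7f\n", availability); // ~0.9999943
}
```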

    Dependability

    Dependability is a measure of a system's trustworthiness to be relied upon to perform the desired function. The attributes of dependability are reliability, availability, safety, and security. Safety refers to the non-occurrence of catastrophic failures, whose consequences are much greater than the potential benefit. Security refers to the prevention of unauthorized access to or handling of information. Since dependability includes both reliability and availability, the correctness of the result is important. [Lap91]

    Hardware Reliability

    Unlike software faults, hardware faults can be analyzed statistically, based upon behavior and occurrence as well as the physics of materials. The reliability of hardware has been studied for a long time and covered in great depth. Hardware reliability includes the study of the physics and the materials, as well as the way things wear out. There is an array of technical conferences and journals that address this topic, such as the International Reliability Physics Symposium, the Electronic Components and Technology Conference, and the IEEE journals Device and Materials Reliability, Advanced Packaging, and Solid-State Circuits.

    Reliability Engineering and Analysis

    Software Reliability Engineering is the practice of monitoring and managing the reliability of a system. By collecting fault, error, and failure statistics during development, testing, and field operation, monitoring and managing the parameters of reliability and availability is possible. The Handbook of Software Reliability Engineering [Lyu96] contains a number of articles on topics related to Software Reliability Engineering.

    A widely used technique is Reliability Growth Modeling, which graphs the cumulative number of faults corrected versus time. Prediction methods calculate the cumulative number of faults expected, which enables comparison with the measured results. This, in turn, enables the determination of the number of faults remaining in the system.

    Markov modeling of systems (including software components) is another technique useful for predicting the reliability of a system. These models enable analysis of redundancy techniques and prediction of MTTF.

    Markov models are constructed by defining the possible system states. Transitions between the states are defined and assigned a probability factor. The probability indicates the likelihood that the transition will occur. An important aspect of the model is that the probability of a state transition depends only on the current state; history is not considered. Figure 3 shows a simple Markov model for a duplex system in which either system may fail with probability λ and be restored to service with probability µ and a coverage factor c. The failure rate (λ) is the inverse of the MTTF, and the repair rate (µ) is the inverse of the MTTR.
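    The sketch below is a discrete-time approximation of such a duplex model. It is my own construction with hypothetical rates, not the book's Figure 3: states are the number of working units, a covered failure while duplex drops the system to simplex operation, and an uncovered failure takes it straight down.

```cpp
#include <cstdio>

// Discrete-time Markov model of a duplex system.
// States: p[2] = both units working, p[1] = one working, p[0] = down.
// Per-unit failure rate lambda, repair rate mu, coverage factor c.
int main() {
    const double lambda = 1.0 / 5000.0;  // hypothetical failure rate (per hour)
    const double mu     = 1.0 / 4.0;     // hypothetical repair rate (per hour)
    const double c      = 0.95;          // coverage factor
    const double dt     = 0.01;          // time step, hours

    double p[3] = {0.0, 0.0, 1.0};       // start with both units up
    for (long step = 0; step < 10000000; ++step) {  // simulate 100,000 hours
        double f2 = 2.0 * lambda * p[2] * dt;  // a failure while duplex
        double f1 = lambda * p[1] * dt;        // the surviving unit fails
        double r1 = mu * p[1] * dt;            // repair while simplex
        double r0 = mu * p[0] * dt;            // repair while down
        p[2] += -f2 + r1;
        p[1] += c * f2 - f1 - r1 + r0;         // covered failures land here
        p[0] += (1.0 - c) * f2 + f1 - r0;      // uncovered failures go straight down
    }
    std::printf("steady state: P(down) = %.2e\n", p[0]);  // ~8e-05 with these rates
}
```

    Note how strongly the result depends on the coverage factor c: with these rates, the uncovered-failure term (1 − c) · 2λ dominates the probability of being down, which is why highly available systems strive for coverage of 95% or higher.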
