Building Dependable Distributed Systems
Ebook · 496 pages · 6 hours


About this ebook

This book covers the most essential techniques for designing and building dependable distributed systems. Instead of covering a broad range of research works for each dependability strategy, the book focuses on a selected few (usually the most seminal works, the most practical approaches, or the first publication of each approach), which are explained in depth, usually with a comprehensive set of examples. The goal is to dissect each technique thoroughly so that readers who are not familiar with dependable distributed computing can actually grasp the technique after studying the book.

The book contains eight chapters. The first chapter introduces the basic concepts and terminologies of dependable distributed computing, and also provides an overview of the primary means for achieving dependability. The second chapter describes in detail the checkpointing and logging mechanisms, which are the most commonly used means to achieve a limited degree of fault tolerance. Such mechanisms also serve as the foundation for more sophisticated dependability solutions. Chapter three covers the works on recovery-oriented computing, which focus on practical techniques that reduce the fault detection and recovery times for Internet-based applications. Chapter four outlines the replication techniques for data and service fault tolerance. This chapter also pays particular attention to optimistic replication and the CAP theorem. Chapter five explains a few seminal works on group communication systems. Chapter six introduces the distributed consensus problem and covers a number of Paxos family algorithms in depth. Chapter seven introduces the Byzantine generals problem and its latest solutions, including the seminal Practical Byzantine Fault Tolerance (PBFT) algorithm and a number of its derivatives. The final chapter covers the latest research results on application-aware Byzantine fault tolerance, which is an important step toward the practical use of Byzantine fault tolerance techniques.

Language: English
Publisher: Wiley
Release date: March 6, 2014
ISBN: 9781118912638


    Book preview

    Building Dependable Distributed Systems - Wenbing Zhao

    Preface

    Distributed computing systems are playing an increasingly important role in all aspects of our society, for governments, businesses, and individuals alike. Such systems are behind many services on which we depend on a daily basis, such as financial services (e.g., online banking and stock trading), e-commerce (e.g., online shopping), civil infrastructure (e.g., the electric power grid and traffic control), entertainment (e.g., online gaming and multimedia streaming), and personal data storage (e.g., cloud services such as Dropbox, Google Drive, and SkyDrive). The dependability of these systems no longer matters only to businesses; it matters to every one of us as well.

    The cost of system failures is enormous. If a data center is brought down by a system failure, the average cost of downtime may range from $42,000 to about $300,000 per hour [1, 3]. The cost can be estimated by summing the wasted expenses and the loss of revenue. While the labor cost of downtime may be estimated relatively easily (roughly, wasted expenses per hour = number of employees × average salary per hour) [7], it is much harder to estimate the loss of revenue, especially the damage to the reputation of the business and to the loyalty of its potential customers [1].

    Of course, ensuring high availability of distributed systems is not cheap. In [4], the cost of a data center is estimated to range from $450 per square foot for 99.671% availability (i.e., 28.8 hours of downtime per year) to $1,100 per square foot for 99.995% availability (i.e., 0.4 hours of downtime per year). That is perhaps one reason why about 59% of Fortune 500 companies suffer 1.6 hours or more of downtime per week [1]. To reduce the cost of building and maintaining highly dependable systems, I believe an effective way is to train more experts who know how to design, implement, and maintain dependable distributed systems. I hope this book helps achieve this goal.
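    As a back-of-the-envelope illustration of these figures, the sketch below converts an availability percentage into expected downtime per year and applies the wasted-expenses formula above. The staffing and salary numbers are hypothetical, chosen only to show the arithmetic:

        # Back-of-the-envelope downtime cost estimate (hypothetical numbers).
        HOURS_PER_YEAR = 365 * 24  # 8760

        def downtime_hours_per_year(availability: float) -> float:
            """Convert an availability fraction into expected downtime per year."""
            return (1.0 - availability) * HOURS_PER_YEAR

        def labor_cost_of_downtime(num_employees: int, avg_hourly_salary: float,
                                   downtime_hours: float) -> float:
            """Wasted expenses ~ employees x average hourly salary x downtime hours."""
            return num_employees * avg_hourly_salary * downtime_hours

        print(downtime_hours_per_year(0.99671))        # ~28.8 hours per year
        print(downtime_hours_per_year(0.99995))        # ~0.4 hours per year
        print(labor_cost_of_downtime(500, 40.0, 1.0))  # $20,000 for one hour of downtime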

    In this book, I cover the most essential techniques for designing dependable distributed systems (according to my subjective judgment, of course). To keep the book concise, I chose not to cover a broad range of research works for each dependability technique. Instead, only a selected few (usually the most well-known works, or the first publication of each approach) are included and explained in depth, usually with a comprehensive set of examples. The goal is to dissect each technique thoroughly so that readers who are not familiar with dependable distributed computing can actually grasp the technique after studying the book. Should I have missed any important work that has immediate practical implications (almost inevitable), I would love to hear from the readers and will be happy to include the work in the next edition of the book.

    In Chapter 1, we introduce the basic concepts and terminologies of dependable distributed computing, as well as the primary means to achieve dependability.

    In Chapter 2, we describe the checkpointing and logging mechanisms, which are widely used in practice to achieve some degree of fault tolerance (they enable the recoverability of the application but do not prevent service disruption). The biggest advantages of this approach are that it is relatively simple to implement and understand, and that it incurs minimal runtime overhead while demanding very modest extra resources (only stable storage). Furthermore, checkpointing and logging also serve as the foundation for more sophisticated dependability techniques. The disadvantage of this approach, if used alone, is that it cannot prevent service disruption from happening. Hence, it is not suitable on its own for applications that demand high reliability.

    In Chapter 3, we cover research works on recovery-oriented computing, including fault detection and diagnosis, microreboot, and system-level undo and redo. Recovery-oriented computing aims to facilitate faster recovery after a system failure and thereby improve the availability of the system. Similar to checkpointing and logging, the mechanisms for recovery-oriented computing do not prevent service disruption; hence, it is a promising approach for many e-commerce applications, but not suitable for applications that require high reliability.

    In Chapter 4, we outline the replication techniques for data and service fault tolerance. Replication is the fundamental technique for ensuring high reliability. Through active replication (i.e., the use of multiple redundant copies of the application processes), the system is able to mask the failure of a replica and continue to process clients' requests (this is not entirely true; as we will show in later chapters, some failures may cause extended periods of unavailability). With replication comes the complexity of maintaining consistency. Ideally, the replicas should always remain consistent with each other. However, doing so may incur too much runtime overhead to be acceptable for some applications, or may cause extended periods of system unavailability. Hence, strict consistency may have to be compromised, either for better performance or for better availability.

    In Chapter 5, we explain group communication systems, which can be used to implement active replication. A group communication system typically offers a totally ordered reliable multicast service for messages, a membership server, and a view synchrony service. This set of services helps the replicas maintain consistency even in the presence of failures, which reduces the development cost of building dependable systems with active replication. In the chapter, we describe in detail several well-known research works that construct group communication systems using different approaches.

    In Chapter 6, we discuss the distributed consensus problem and describe several Paxos algorithms, including Classic Paxos, Dynamic Paxos, Cheap Paxos, and Fast Paxos. Distributed consensus is perhaps the most fundamental problem in distributed computing. It is easy for a group of processes to agree on the same value if all processes can communicate with each other promptly and none of them fails. However, distributed consensus becomes an incredibly hard problem when processes might fail and messages might be delayed for extended periods. The Classic Paxos algorithm solves the consensus problem (under the non-malicious fault model) in a very elegant and efficient manner by separating the safety concern from the liveness concern [5]. Additional Paxos algorithms have been developed to minimize the resources required (Cheap Paxos) and to reduce the latency of achieving consensus by using a higher redundancy level (Fast Paxos).

    In Chapter 7, we introduce the problem of Byzantine fault tolerance. A Byzantine fault is synonymous with a malicious fault. Because a maliciously faulty component may choose to behave like any of the non-malicious faults, the Byzantine fault model encompasses arbitrary faults. The distributed consensus problem under the Byzantine fault model was first studied several decades ago by Lamport, Shostak, and Pease [6]. A much more efficient algorithm for achieving fault tolerance under the Byzantine fault model (referred to as Byzantine fault tolerance) was proposed by Castro and Liskov in 1999 [2]. Since then, research on Byzantine fault tolerance has exploded. With the pervasiveness of cyber attacks and espionage, tolerating malicious faults has become an urgent concern rather than the far-fetched problem it seemed several decades ago. In this chapter, we explain in detail several seminal works on this topic.

    In Chapter 8, we document a few research works on the design of customized Byzantine fault tolerance solutions that exploit application semantics. In a general-purpose Byzantine fault tolerance algorithm, all requests are totally ordered and executed sequentially in that total order. This imposes severe restrictions on the types of applications the algorithm can support. By exploiting application semantics, the general-purpose algorithm can be customized to enable the partitioning of requests and the identification of independent, read-only, and commutative requests, all of which facilitate concurrent execution of multiple requests. Furthermore, by enabling concurrent execution of selected requests based on application semantics, potential deadlocks can be prevented.

    References

    1. A. Arnold. Assessing the financial impact of downtime, April 2010. http://www.businesscomputingworld.co.uk/assessing-the-financial-impact-of-downtime/.

    2. M. Castro and B. Liskov. Practical Byzantine fault tolerance. In Proceedings of the Third Symposium on Operating Systems Design and Implementation, OSDI '99, pages 173–186, Berkeley, CA, USA, 1999. USENIX Association.

    3. Channel Insider. Unplanned IT outages cost more than $5,000 per minute: report, May 2011. http://www.channelinsider.com/c/a/Spotlight/Unplanned-IT-Outages-Cost-More-than-5000-per-Minute-Report-105393/.

    4. J. Clark. The price of data center availability, October 2011. http://www.datacenterjournal.com/design/the-price-of-data-center-availability/.

    5. L. Lamport. Paxos made simple. ACM SIGACT News (Distributed Computing Column), 32(4):18–25, December 2001.

    6. L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems, 4:382–401, 1982.

    7. T. Pisello and B. Quirk. How to quantify downtime, January 2004. http://www.networkworld.com/careers/2004/0105man.html.

    Chapter 1

    Introduction to Dependable Distributed Computing

    Distributed systems bring many benefits to us. For example, we can share resources such as data storage and processing cycles much more easily; we can collaborate on projects efficiently even if the team members span the planet; and we can solve challenging problems by utilizing the vast aggregated computing power of large-scale distributed systems. However, if not designed properly, distributed systems may appear to be less dependable than standalone systems. As Leslie Lamport pointed out: "You know you have one (a distributed system) when the crash of a computer you've never heard of stops you from getting any work done" [9]. In this book, we introduce various dependability techniques that can be used to address the issue brought up by Lamport. In fact, with sufficient redundancy in the system, a distributed system can be made significantly more dependable than a standalone system, because such a distributed system can continue providing services to its users even when a subset of its nodes have failed.

    In this chapter, we introduce the basic concepts and terminologies of dependable distributed computing, and outline the primary approaches to achieving dependability.

    1.1 Basic Concepts and Terminologies

    The term dependable systems has been used widely in many different contexts and often means different things. In the context of distributed computing, dependability refers to the ability of a distributed system to provide correct services to its users despite various threats to the system such as undetected software defects, hardware failures, and malicious attacks.

    To reason about the dependability of a distributed system, we need to clearly model the system itself as well as the threats to the system [2]. We also define common attributes of dependable distributed systems and metrics for evaluating the dependability of a distributed system.

    1.1.1 System Models

    A system is designed to provide a set of services to its users (often referred to as clients). Each service has an interface that a client can use to request the service. What the system should do for each service is defined as a set of functions in a functional specification for the system. The status of a system is determined by its state. The state of a practical system is usually very complicated. A system may consist of one or more processes spanning one or more nodes, and each process might consist of one or more threads. The state of the system is determined collectively by the states of the processes and threads in the system. The state of a process typically consists of the values of its registers, stack, heap, file descriptors, and kernel state. Part of the state might become visible to the users of the system via information contained in the responses to the users' requests. Such state is referred to as external state and is normally an abstract state defined in the functional specification of the system. The remaining part of the state, which is not visible to users, is referred to as internal state. A system can be recovered to where it was before a failure if its state was captured and not lost due to the failure (for example, if the state is serialized and written to stable storage).
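    As a minimal sketch of the last point, assuming a process whose recoverable state fits in a single Python object, the state can be serialized and written to stable storage so that it survives a crash. The file name and state layout here are hypothetical:

        import os
        import pickle

        CHECKPOINT_FILE = "service.ckpt"  # hypothetical stable-storage location

        def save_checkpoint(state: dict) -> None:
            """Serialize the state and atomically replace the previous checkpoint."""
            tmp = CHECKPOINT_FILE + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
                f.flush()
                os.fsync(f.fileno())  # force the bytes onto stable storage
            os.replace(tmp, CHECKPOINT_FILE)  # atomic rename on POSIX systems

        def restore_checkpoint() -> dict:
            """Recover the state captured before the failure."""
            with open(CHECKPOINT_FILE, "rb") as f:
                return pickle.load(f)

        save_checkpoint({"balance": 100, "last_request_id": 42})
        print(restore_checkpoint())  # {'balance': 100, 'last_request_id': 42}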

    From the structural perspective, a system consists of one or more components (such as nodes or processes), and a system always has a boundary that separates the system from its environment. Here, environment refers to all other systems that the current system interacts with. Note that what we refer to as a system is always relative to the current context: a component of a larger system is itself a system when we want to study its behavior, and it may in turn have its own internal structure.

    1.1.2 Threat Models

    Whether or not a system is providing correct services is judged by whether or not the system is performing the functions defined in the functional specification for the system. When a system is not functioning according to its functional specification, we say a service failure (or simply failure) has occurred. The failure of a system is caused by part of its state taking wrong values, i.e., by errors in its state. We hypothesize that the errors are caused by some faults [6]. Therefore, the threats to the dependability of a system are modeled as various faults.

    A fault might not always exhibit itself and cause an error. In particular, a software defect (often referred to as a software bug) might not be revealed until the code that contains the defect is exercised under certain conditions. For example, if a shared variable is not protected by a lock in a multithreaded application, the fault (often referred to as a race condition) does not exhibit itself unless two or more threads try to update the shared variable concurrently. As another example, if there is no boundary check on access to an array, the fault does not show up until a process tries to access the array with an out-of-bound index. When a fault does not exhibit itself, we say the fault is dormant. When the triggering condition is met, the fault is activated.
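    The race condition just described can be reproduced in a few lines. The sketch below (an illustration, not taken from the book) leaves a shared counter unprotected, so the read-modify-write sequences of the two threads can interleave and updates can be lost:

        import threading

        counter = 0  # shared variable with no lock protecting it

        def increment(n: int) -> None:
            global counter
            for _ in range(n):
                counter += 1  # read-modify-write is not atomic, so updates can be lost

        threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(2)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        # Expected 200000; may print less when the dormant fault is activated
        # by an unlucky interleaving of the two threads.
        print(counter)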

    When a fault is activated, it initially causes an error in the component that encompasses the defective area (in the programming code). When the component interacts with other components of the system, the error may propagate to those components. When the errors propagate to the interface of the system and cause the service provided to a client to deviate from the specification, a service failure occurs. Due to the recursive nature of common system composition, the failure of one system may cause a fault in a larger system when the former constitutes a component of the latter, as shown in Figure 1.1. This relationship between fault, error, and failure is referred to as the chain of threats in [2]. Hence, in the literature, the terms faults and failures are often used interchangeably.

    Figure 1.1 An example of a chain of threats with two levels of recursion.

    Of course, not all failures can be analyzed with the above chain of threats. For example, a power outage of the entire system would immediately cause the failure of the system.

    Faults can be classified based on different criteria; the most common classifications include:

    Based on the source of the faults, faults can be classified as:

    Hardware faults, if the faults are caused by hardware problems such as power outages, hard drive failures, and bad memory chips.

    Software faults, if the faults are caused by software bugs such as race conditions and no-boundary-checks for arrays.

    Operator faults, if the faults are caused by the operator of the system, for example, misconfiguration, wrong upgrade procedures, etc.

    Based on the intent of the faults, faults can be classified as:

    Non-malicious faults, if the faults are not caused by a person with malicious intent. For example, naturally occurring hardware faults and some remnant software bugs such as race conditions are non-malicious faults.

    Malicious faults, if the faults are caused by a person with intent to harm the system, for example, to deny services to legitimate clients or to compromise the integrity of the service. Malicious faults are often referred to as commission faults, or Byzantine faults [5].

    Based on the duration of the faults, faults can be classified as:

    Transient faults, if such a fault is activated momentarily and becomes dormant again. For example, a race condition might often show up as a transient fault because once the threads stop accessing the shared variable concurrently, the fault appears to have disappeared.

    Permanent faults, if once a fault is activated, the fault stays activated unless the faulty component is repaired or the source of the fault is addressed. For example, a power outage is considered a permanent fault because unless the power is restored, a computer system will remain powered off. A specific permanent fault is the (process) crash fault. A segmentation fault could result in the crash of a process.

    Based on how a fault in a component reveals to other components in the system, faults can be classified as:

    Content faults, if the values passed on to other components are wrong due to the faults. A faulty component may always pass on the same wrong values to other components, or it may return different values to different components that it interacts with. The latter is specifically modeled as Byzantine faults [5].

    Timing faults, if the faulty component either returns a reply too early or too late after receiving a request from another component. An extreme case is when the faulty component stops responding altogether (i.e., it takes an infinite amount of time to return a reply), e.g., when the component crashes, or hangs due to an infinite loop or a deadlock.

    Based on whether or not a fault is reproducible or deterministic, faults (primarily software faults) can be classified as:

    Reproducible/deterministic faults. The fault happens deterministically and can be easily reproduced. Accessing a null pointer is an example of a deterministic fault, which often leads to the crash of the system. This type of fault can be easily identified and repaired.

    Nondeterministic faults. The fault appears to happen nondeterministically and is hard to reproduce. For example, if a fault is caused by a specific interleaving of several threads accessing a shared variable, it is going to be hard to reproduce. Such software faults are also referred to as Heisenbugs to highlight their uncertainty.

    Given a number of faults within a system, we can classify them based on their relationship:

    Independent faults, if there is no causal relationship between the faults, e.g., given fault A and fault B, B is not caused by A, and A is not caused by B.

    Correlated faults, if the faults are causally related, e.g., given fault A and fault B, either B is caused by A, or A is caused by B. If multiple components fail due to a common reason, the failures are referred to as common mode failures.

    When the system fails, it is desirable to avoid catastrophic consequences, such as the loss of life. The consequences of a system failure can be alleviated by incorporating dependability mechanisms into the system such that when it fails, it stops responding to requests (such systems are referred to as fail-stop systems); if this is impossible, it should return consistent wrong values, rather than inconsistent values, to all components that it interacts with. If the failure of a system does not cause great harm either to human life or to the environment, we call such a system a fail-safe system. Usually, a fail-safe system defines a set of safe states. When a fail-safe system can no longer operate according to its specification due to faults, it transitions to one of the predefined safe states. For example, the computer system used to control a nuclear power plant must be a fail-safe system.

    Perhaps counterintuitively, it is often desirable for a system to halt its operation immediately when it is in an error state or encounters an unexpected condition. The software engineering practice that ensures such behavior is called fail fast [8]. The benefit of the fail-fast practice is that it enables early detection and diagnosis of software faults: once a fault has propagated to many other components, it is a lot harder to pinpoint the source of the problem.
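    A minimal illustration of the fail-fast practice (the function and its checks are hypothetical): validate inputs and state at the earliest opportunity and halt with a clear error, instead of letting a corrupted value propagate to other components.

        def apply_withdrawal(balance: float, amount: float) -> float:
            # Fail fast: reject impossible states immediately, so the fault
            # is detected at its source rather than after it propagates.
            if amount <= 0:
                raise ValueError(f"withdrawal amount must be positive, got {amount}")
            if amount > balance:
                raise ValueError(f"insufficient balance: {balance} < {amount}")
            return balance - amount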

    1.1.3 Dependability Attributes and Evaluation Metrics

    A dependable system has a number of desirable attributes, and some of these attributes can be used as evaluation metrics for the system. We classify the attributes into two categories: (1) those that are fundamental to, and of immediate concern to, all distributed systems, including availability, reliability, and integrity; and (2) those that are secondary and may not be of immediate concern to, or applicable to, all systems, such as maintainability and safety.

    The availability and reliability of a system can be used as evaluation metrics. Other attributes are normally not used as evaluation metrics because it is difficult to quantify the integrity, maintainability, and safety of a distributed system.

    1.1.3.1 Availability

    Availability is a measure of the readiness of a dependable system at a point in time, i.e., the probability that, when a client needs to use a service provided by the system, the system is there to provide the service. The availability of a system is determined by two factors:

    Mean time to failure (MTTF). It characterizes how long the system can run without a failure.

    Mean time to repair (MTTR). It characterizes how long it takes to repair the system and recover it to be fully functional again.

    Availability is defined as MTTF/(MTTF + MTTR). Hence, the larger the MTTF, the higher the availability of a system. Similarly, the smaller the MTTR, the higher the availability of the system.

    The availability of a system is typically represented in terms of the number of 9s. For example, if a system is claimed to offer five 9s availability, the system will be available with a probability of 99.999%, i.e., the system has a probability of 10^-5 of being unavailable when a client wants to access the service at any time, which means that the system may have at most 5.256 minutes of downtime per year.
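    The definition translates directly into code. The sketch below uses hypothetical MTTF and MTTR values chosen to land on five 9s:

        def availability(mttf_hours: float, mttr_hours: float) -> float:
            """Availability = MTTF / (MTTF + MTTR)."""
            return mttf_hours / (mttf_hours + mttr_hours)

        # Hypothetical numbers: a failure every 1000 hours on average, each
        # taking 0.01 hours (36 seconds) to repair, gives five 9s availability.
        a = availability(1000.0, 0.01)
        print(a)                                  # ~0.99999
        print((1 - a) * 365 * 24 * 60, "min/yr")  # ~5.26 minutes of downtime per year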

    1.1.3.2 Reliability

    Reliability is a measure of the system's capability of providing correct services continuously for a period of time. It is often represented as the probability for the system to do so for a given period of time t, i.e., Reliability = R(t). The larger the t, the lower the reliability value. The reliability of a system is closely tied to its MTTF; the relationship can be represented as MTTF = ∫₀^∞ R(t) dt. Reliability is very different from availability. If a system fails frequently but can recover very quickly, the system may have high availability. However, such a system would have very low reliability.
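    To make the relationship concrete, a common (though by no means the only) model assumes exponentially distributed failures, R(t) = e^(-t/MTTF). Numerically integrating R(t) then recovers the MTTF, as the sketch below checks with a hypothetical MTTF of 1000 hours:

        import math

        MTTF = 1000.0  # hypothetical mean time to failure, in hours

        def reliability(t: float) -> float:
            """Exponential failure model: probability of no failure up to time t."""
            return math.exp(-t / MTTF)

        # Numerically integrate R(t) from 0 to infinity (truncated at 20*MTTF);
        # the result should approximate MTTF.
        dt = 0.5
        estimate = sum(reliability(i * dt) * dt for i in range(int(20 * MTTF / dt)))
        print(estimate)  # ~1000: the integral of R(t) over [0, inf) equals MTTF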

    1.1.3.3 Integrity

    Integrity refers to the capability of a system to protect its state from being compromised under various threats. In dependable computing research, integrity is typically translated into the consistency of server replicas, if redundancy is employed. As long as the number of faulty replicas does not exceed a pre-defined threshold, the consistency of the remaining replicas would naturally imply the integrity of the system.

    1.1.3.4 Maintainability

    Maintainability refers to the capability of a system to evolve after it is deployed. Once a software fault is detected, it is desirable to be able to apply a patch that repairs the system without having to uninstall the existing system and then reinstall an updated system. The same patching/software update mechanism may be used to add new features or improve the performance of the existing system. Ideally, we want to be able to perform the software update without having to shut down the running system (often referred to as live upgrade or live update), which is already a standard feature of many operating systems for patching non-kernel components. Live upgrade has also been achieved via replication in some distributed systems [10].

    1.1.3.5 Safety

    Safety means that when a system fails, it does not cause catastrophic consequences, i.e., the system must be fail-safe. Systems that are used to control operations that may cause catastrophic consequences, such as nuclear power plants, or endanger human lives, such as hospital operation rooms, must bear the safety attribute. The safety attribute is not important for systems that are not operating in such environments, such as for e-commerce.

    1.2 Means to Achieve Dependability

    There are three primary approaches to improving the dependability of distributed systems: (1) fault avoidance: build and use high-quality software components and hardware that are less prone to failures; (2) fault detection and diagnosis: while crash faults are trivial to detect, components in a practical system might fail in various ways other than crashing, and if such failures are not detected, the integrity of the system cannot be guaranteed; and (3) fault tolerance: a system can recover from various faults without service interruption if it employs sufficient redundancy to mask the failures of a portion of its components, or with minimal service interruption if it uses less costly dependability means such as logging and checkpointing.

    1.2.1 Fault Avoidance

    For software components, fault avoidance aims to ensure correct design specification and correct implementation before a distributed system is released. This objective can be achieved by employing standard software engineering practices, for example:

    More rigorous software design using techniques such as formal methods. Formal methods mandate the use of a formal language to facilitate the validation of a specification.

    More rigorous software testing to identify and remove software bugs caused by remnant design deficiencies or introduced during implementation.

    For some applications, it may be impractical to employ formal methods, in which case it is wise to design for testability [2], for example, by making extensive use of the unit testing support available for many modern programming languages such as Java and C#.

    1.2.2 Fault Detection and Diagnosis

    Fault detection is a crucial step in ensuring the dependability of a system. Crash faults are relatively trivial to detect, for example, we can periodically probe each component to check on its health. If no response is received after several consecutive probes, the component may be declared as having crashed. However, components in a system might fail in various ways and they might respond promptly to each probe after they have failed. It is nontrivial to detect such faults, especially in a large distributed system. Diagnosis is required to determine that a fault indeed has occurred and to localize the source of the fault (i.e., pinpoint the faulty component). To accomplish this, the distributed system is modeled, and sophisticated statistical tools are often used [3]. Some of the approaches in fault detection and diagnosis are introduced in Chapter 3.
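    A minimal crash-fault detector along the lines described above might look as follows. The probe period, the miss threshold, the use of a TCP connection attempt as the probe, and the target address are all hypothetical choices:

        import socket
        import time

        PROBE_INTERVAL = 1.0   # seconds between probes (hypothetical)
        MAX_MISSES = 3         # consecutive missed probes before declaring a crash

        def probe(address: tuple[str, int], timeout: float = 0.5) -> bool:
            """Return True if the component accepts a TCP connection (appears healthy)."""
            try:
                with socket.create_connection(address, timeout=timeout):
                    return True
            except OSError:
                return False

        def monitor(address: tuple[str, int]) -> None:
            misses = 0
            while misses < MAX_MISSES:
                misses = 0 if probe(address) else misses + 1
                time.sleep(PROBE_INTERVAL)
            print(f"{address} declared crashed after {MAX_MISSES} missed probes")

        # monitor(("10.0.0.5", 8080))  # hypothetical component address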

    A lot of progress has been made in modern programming language design to include some forms of software fault detection and handling, such as for unexpected inputs or states. The most notable example is exception handling. A block of code can be enclosed with a try-catch construct. If an error condition occurs during the execution of the code, the catch block is executed automatically. Exceptions may also be propagated upward through the calling chain. If an exception occurs and is not handled by any developer-supplied code, the language runtime usually terminates the process.
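    In Python, for instance, the construct is try/except rather than try-catch, but the behavior is the same. The sketch below (with a hypothetical request parser) shows an error condition being detected and handled locally instead of propagating up the calling chain and terminating the process:

        def parse_request(raw: str) -> int:
            return int(raw)  # raises ValueError on unexpected input

        try:
            request_id = parse_request("not-a-number")
        except ValueError as e:
            # The error condition is detected and handled here instead of
            # propagating up the calling chain and killing the process.
            print(f"rejecting malformed request: {e}")
            request_id = -1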

    The recovery block method, which is designed for software fault tolerance [7], may be considered an extension of the programming-language exception handling mechanism. An important step in recovery blocks is acceptance testing, which is a form of fault detection. A developer is supposed to supply an acceptance test for each module of the system. When the acceptance test fails, a software fault is detected. Subsequently, an alternate block of code is executed, after which the acceptance test is evaluated again. Multiple alternate blocks of code may be provided to increase the robustness of the system.
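    The recovery block structure can be sketched as follows; the module, its alternate, and the acceptance test here are all hypothetical, and a full implementation would also restore the pre-block state before running an alternate:

        def acceptance_test(result: float) -> bool:
            """Developer-supplied check: the result must lie within known bounds."""
            return 0.0 <= result <= 100.0

        def primary(values: list[float]) -> float:
            return sum(values) / len(values)         # primary implementation: mean

        def alternate(values: list[float]) -> float:
            return sorted(values)[len(values) // 2]  # simpler fallback: median

        def recovery_block(values: list[float]) -> float:
            for block in (primary, alternate):
                result = block(values)
                if acceptance_test(result):          # the fault detection step
                    return result
            raise RuntimeError("all alternates failed the acceptance test")

        print(recovery_block([10.0, 20.0, 30.0]))  # primary passes the test: 20.0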

    1.2.3 Fault Removal

    Once a fault is detected and localized, it should be isolated and removed from the system. Subsequently, the faulty component is either repaired or replaced. A repaired or replaced component can be readmitted to the system. To accommodate these steps, the system often needs to be reconfigured. In a distributed system, it is often necessary to have a notion of membership, i.e., each component is aware of a list of components that are considered part of the system and their roles. When a faulty component is removed from the system, a reconfiguration is carried out and a new membership is formed with the faulty component excluded. When the component is repaired or replaced, and readmitted to the system, it becomes part of the membership again.
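    The membership bookkeeping described above can be sketched in a simplified, single-node view (in a real distributed system, the membership change itself must be agreed upon by the remaining components, e.g., via a group communication system as discussed in Chapter 5):

        class Membership:
            """Simplified local view of which components are part of the system."""

            def __init__(self, members: set[str]):
                self.members = set(members)

            def remove(self, faulty: str) -> None:
                # Reconfiguration: form a new membership excluding the faulty component.
                self.members.discard(faulty)

            def readmit(self, repaired: str) -> None:
                # The repaired or replaced component rejoins the membership.
                self.members.add(repaired)

        view = Membership({"node-a", "node-b", "node-c"})
        view.remove("node-b")    # node-b detected as faulty and excluded
        view.readmit("node-b")   # node-b repaired and readmitted
        print(sorted(view.members))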

    A special case of fault removal is software patching and updates. Software faults and vulnerabilities may be removed via a software update when the original system is patched. Virtually all modern operating systems and software packages include the software update capability.

    1.2.4 Fault Tolerance

    Robust software by itself is normally insufficient to deliver high dependability because of the possibility of hardware failures. Unless a distributed system is strictly stateless, simply restarting the system after a failure will not automatically restore its state to what it was before the failure. Hence, fault tolerance techniques are essential for improving the dependability of distributed systems to the next level.

    There are different fault tolerance techniques that cater to different levels of dependability requirements. For applications that need high availability, but not necessarily high reliability, logging and checkpointing (the topic of Chapter 2), which incur minimal runtime overhead and use minimal extra resources, might be sufficient. More demanding applications could adopt recovery-oriented computing techniques (the topic of Chapter 3). Both types of fault tolerance techniques rely on rollback recovery: after a failed system is restarted, the most recent correct state of the system (referred to as a checkpoint) is located in the log and the system is restored to this correct state.

    An example scenario of rollback recovery is illustrated in Figure 1.2. When a system fails, it takes some time to detect the failure. Subsequently, the system is restarted and the most recent checkpoint in the log is used to recover the system back to that checkpoint. If there are logged requests, these requests are re-executed by the system, after which the recovery is completed. The system then resumes handling new requests.

    Figure 1.2 Rollback recovery is enabled by periodically taking checkpoints and, usually, also logging the requests received.
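    The rollback procedure just described can be sketched in a few lines. The request format, the file layout, and the helper names below are hypothetical; this is an illustration of the recovery flow, not the book's implementation:

        import pickle

        def recover(checkpoint_path: str, log_path: str) -> dict:
            """Roll back to the last checkpoint, then replay the logged requests."""
            with open(checkpoint_path, "rb") as f:
                state = pickle.load(f)  # the most recent correct state

            with open(log_path, "rb") as f:
                while True:
                    try:
                        request = pickle.load(f)  # one logged request at a time
                    except EOFError:
                        break
                    apply_request(state, request)  # re-execute to reach the pre-failure state
            return state  # recovery complete; resume handling new requests

        def apply_request(state: dict, request: dict) -> None:
            # Hypothetical request format: {"op": "deposit", "amount": 10}
            if request["op"] == "deposit":
                state["balance"] = state.get("balance", 0) + request["amount"]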

    For a distributed system that requires high reliability, i.e., continuous correct services, redundant instances of the system must be used so that the system can continue operating correctly even if a portion of the redundant copies (referred to as replicas) fail. Using redundant replicas also makes it possible to tolerate malicious faults, provided that the replicas fail independently. When a failed replica is repaired, it can be incorporated back into the system by rolling its state forward to the current state of the other replicas. This recovery strategy is called rollforward recovery.

    An example scenario of rollforward recovery is shown in Figure 1.3. When the failure of the replica is detected and the replica is restarted (possibly after being repaired). To readmit the restarted
