SQL Server 2019 AlwaysOn: Supporting 24x7 Applications with Continuous Uptime
Ebook · 379 pages · 3 hours


About this ebook

Get a fast start to using AlwaysOn, the SQL Server solution for high availability and disaster recovery. This third edition is newly updated to cover the 2019 editions of both SQL Server and Windows Server and includes strong coverage of implementing AlwaysOn Availability Groups on both Windows and Linux operating systems. The book provides a solid and accurate understanding of how to implement systems requiring consistent and continuous uptime, as well as how to troubleshoot those systems in order to keep them running and reliable. This edition is updated to account for all major new functionality and also includes coverage of implementing atypical configurations, such as clusterless and domain-independent Availability Groups, distributed Availability Groups, and Availability Groups on Azure.

The book begins with an introduction to high-availability and disaster recovery concepts such as Recovery Point Objectives (RPOs), Recovery Time Objectives (RTOs), availability levels, and the cost of downtime. You’ll then move into detailed coverage of implementing and configuring the AlwaysOn feature set in order to meet the business objectives set by your organization. Content includes coverage of implementing clusters, building AlwaysOn failover clustered instances, and configuring AlwaysOn Availability Groups.

SQL Server 2019 AlwaysOn is chock-full of real-world advice on how to build and configure the most appropriate topology to meet the high-availability and disaster recovery requirements you are faced with, as well as how to use AlwaysOn Availability Groups to scale out read-only workloads. This is a practical and hands-on book to get you started quickly in using one of the most talked-about SQL Server feature sets.

What You Will Learn

  • Understand high availability and disaster recovery in SQL Server 2019
  • Build and configure a Windows Cluster in Windows Server 2019
  • Create and configure an AlwaysOn failover clustered instance
  • Implement AlwaysOn Availability Groups and appropriately configure them
  • Implement AlwaysOn Availability Groups on Linux servers
  • Configure Availability Groups on Azure IaaS
  • Administer AlwaysOn technologies post implementation
  • Understand atypical configurations, such as clusterless and distributed Availability Groups

Who This Book Is For

Microsoft SQL Server database administrators who are interested in growing their knowledge and skills in SQL Server’s high-availability and disaster recovery feature set.

Language: English
Publisher: Apress
Release date: November 11, 2020
ISBN: 9781484264799

    Book preview

    SQL Server 2019 AlwaysOn - Peter A. Carter

    © Peter A. Carter 2020

    P. A. Carter, SQL Server 2019 AlwaysOn, https://doi.org/10.1007/978-1-4842-6479-9_1

    1. High Availability and Disaster Recovery Concepts

    Peter A. Carter, Botley, UK

    In today’s 24×7 environments that are running mission-critical applications, businesses rely heavily on the availability of their data. Although servers and their software are generally reliable, there is always the risk of a hardware failure or a software bug, each of which could bring a server down. To mitigate these risks, business-critical applications often rely on redundant hardware to provide fault tolerance. If the primary system fails, then the application can automatically fail over to the redundant system. This is the underlying principle of high availability (HA).

    Even with the implementation of HA technologies, there is always a small risk of an event that causes the application to become unavailable. This could be due to a major incident, such as the loss of a data center caused by a natural disaster or an act of terrorism. It could also be caused by data corruption or human error, resulting in the application’s data becoming lost or damaged beyond repair.

    In these situations, some applications may rely on restoring the latest backup to recover as much data as possible. However, more critical applications may require a redundant server to hold a synchronized copy of the data in a secondary location. This is the underpinning concept of disaster recovery (DR). This chapter discusses the concepts behind HA and DR.

    Level of Availability

    The amount of time that a solution is available to end users is known as the level of availability, or uptime. To provide a true picture of uptime, a company should measure the availability of a solution from a user’s desktop. In other words, even if your SQL Server has been running uninterrupted for over a month, users may still experience outages to their solution caused by other factors. These factors can include network outages or an application server failure.

    In some instances, however, you have no choice but to measure the level of availability at the SQL Server level. This may be because you lack holistic monitoring tools within the enterprise. Most often, however, the requirement to measure the level of availability at the instance level is political, as opposed to technical. In the IT industry, it has become a trend to outsource the management of data centers to third-party providers. In such cases, the provider responsible for managing the SQL servers may not necessarily be the provider responsible for the network or application servers. In this scenario, you need to monitor uptime at the SQL Server level to accurately judge the performance of the service provider.
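    As a minimal illustration of measuring at the instance level, the sketch below reads the instance start time from a system DMV and derives uptime in hours; it assumes you hold the VIEW SERVER STATE permission, and of course it reflects only the uptime of the SQL Server process itself, not of the wider solution.

    --Read the instance start time and derive uptime in hours
    SELECT sqlserver_start_time,
           DATEDIFF(HOUR, sqlserver_start_time, SYSDATETIME()) AS UptimeHours
    FROM sys.dm_os_sys_info ;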

    The level of availability is measured as a percentage of the time that the application or server is available. Companies often strive to achieve 99%, 99.9%, 99.99%, or 99.999% availability. As a result, the level of availability is often referred to in 9s. For example, five 9s of availability means 99.999% uptime and three 9s means 99.9% uptime.

    Table 1-1 details the amount of acceptable downtime per week, per month, and per year for each level of availability.

    Table 1-1  Levels of Availability

    Level of Availability   Downtime per Week   Downtime per Month   Downtime per Year
    99%                     1h 40m 48s          7h 18m 17s           3d 15h 39m 29s
    99.9%                   10m 4s              43m 49s              8h 45m 56s
    99.99%                  1m 0s               4m 22s               52m 35s
    99.999%                 6s                  26s                  5m 15s

    All values are rounded down to the nearest second.

    To calculate the acceptable downtime for other levels of availability, you can use the script in Listing 1-1. Before running this script, set the value of @Uptime to the level of uptime that you wish to calculate for. You should also set the value of @UptimeInterval to reflect uptime per week, month, or year.

    DECLARE @Uptime DECIMAL(5,3) ;

    --Specify the uptime level to calculate
    SET @Uptime = 99.9 ;

    DECLARE @UptimeInterval VARCHAR(5) ;

    --Specify WEEK, MONTH, or YEAR
    SET @UptimeInterval = 'YEAR' ;

    DECLARE @SecondsPerInterval FLOAT ;

    --Calculate seconds per interval
    SET @SecondsPerInterval =
    (
        SELECT CASE
                   WHEN @UptimeInterval = 'YEAR'  THEN 60*60*24*365.243
                   WHEN @UptimeInterval = 'MONTH' THEN 60*60*24*30.437
                   WHEN @UptimeInterval = 'WEEK'  THEN 60*60*24*7
               END
    ) ;

    DECLARE @UptimeSeconds DECIMAL(12,4) ;

    --Calculate the allowable downtime, in seconds
    SET @UptimeSeconds = @SecondsPerInterval * (100-@Uptime) / 100 ;

    --Format results
    SELECT CONVERT(VARCHAR(12), FLOOR(@UptimeSeconds /60/60/24))   + ' Day(s), '
         + CONVERT(VARCHAR(12), FLOOR(@UptimeSeconds /60/60 % 24)) + ' Hour(s), '
         + CONVERT(VARCHAR(12), FLOOR(@UptimeSeconds /60 % 60))    + ' Minute(s), '
         + CONVERT(VARCHAR(12), FLOOR(@UptimeSeconds % 60))        + ' Second(s).' ;

    Listing 1-1  Calculating the Level of Availability
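    With the values shown (99.9% uptime, measured over a year), the script should return: 0 Day(s), 8 Hour(s), 45 Minute(s), 56 Second(s).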

    Actual Availability

    Now that we understand how to calculate the level of availability required by an application, we should also understand how to calculate the actual availability of an application. We can do this by discovering the MTBF (Mean Time Between Failures) and MTTR (Mean Time to Recover) metrics.

    The MTBF metric describes the average length of time between failures. For example, imagine that we were reviewing the service logs for the past week and discovered that the Foo application has suffered three outages. There are 168 hours in a week and there have been three failures. We simply need to divide the number of hours within our time period by the number of failures. This will produce our MTBF. In this case, we have an MTBF of 56 hours.

    MTTR can have two different meanings: Mean Time to Recover or Mean Time to Repair. When there is no HA or DR in place, the MTTR metric describes the average length of time it takes for something that is broken to be repaired. This could potentially be an extended period of time, as we may need to wait for a service engineer to replace faulty hardware. When thinking about HA and DR, however, we use the MTTR metric to mean Mean Time to Recover, recording the duration of the outage. For example, if we have a three-node cluster and one of the nodes experiences a hardware failure, we of course need to fix that server; from an application downtime perspective, however, the outage lasts only the seconds or minutes it takes for the service to fail over. Because we still have resilience, the application’s nonfunctional requirements are not impacted. Therefore, in this example, I will assume that MTTR refers to Mean Time to Recover.

    We can calculate the MTTR by taking a sum of the total downtime duration within our period and dividing it by the number of failures. In this case, during our 168-hour period, we have had three failures and the total duration of downtime has been 12 minutes. Therefore, the MTTR for our Foo application is 4 minutes.

    Now that we know our application’s MTBF and MTTR, we can use these metrics to calculate the actual availability of our application. The formula for this is (MTBF/(MTBF+MTTR))*100. In this case, we first need to convert our MTTR value to hours, so that we have hours as our consistent unit. Four minutes is 0.06667 hours. Therefore, our calculation is (56/(56+0.06667))*100, which makes our actual application availability 99.8811%.
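    The following T-SQL sketch reproduces this calculation; the @Outages table variable and its three rows are hypothetical figures matching the Foo example above.

    --Hypothetical outage log for the Foo application
    DECLARE @Outages TABLE (DowntimeMinutes DECIMAL(8,2)) ;
    INSERT INTO @Outages VALUES (5), (4), (3) ;  --three failures, 12 minutes in total

    DECLARE @PeriodHours DECIMAL(10,4) = 168 ;   --one week
    DECLARE @Failures INT, @DowntimeHours DECIMAL(10,4) ;

    SELECT @Failures = COUNT(*),
           @DowntimeHours = SUM(DowntimeMinutes) / 60.0
    FROM @Outages ;

    DECLARE @MTBF DECIMAL(10,4) = @PeriodHours / @Failures ;     --56 hours
    DECLARE @MTTR DECIMAL(10,4) = @DowntimeHours / @Failures ;   --0.0667 hours (4 minutes)

    --Actual availability: (MTBF / (MTBF + MTTR)) * 100
    SELECT (@MTBF / (@MTBF + @MTTR)) * 100 AS ActualAvailabilityPct ;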

    Service-Level Agreements and Service-Level Objectives

    When a third-party provider is responsible for managing servers, the contract usually includes service-level agreements (SLAs). These SLAs define many parameters, including how much downtime is acceptable, the maximum length of time a server can be down in the event of failure, and how much data loss is acceptable if failure occurs. Normally, there are financial penalties for the provider if these SLAs are not met.

    In the event that servers are managed in-house, DBAs still have the concept of customers. These are usually the end users of the application, with the primary contact being the business owner. An application’s business owner is the stakeholder within the business who commissioned the application and who is responsible for signing off on funding enhancements, among other things.

    In an in-house scenario, it is still possible to define SLAs, and in such a case, the IT Infrastructure or Platform departments may be liable for charge-back to the business teams if these SLAs are not met. However, in internal scenarios, it is much more common for IT departments to negotiate service-level objectives (SLOs) with the business teams, as opposed to SLAs. SLOs are very similar in nature to SLAs, but their use implies that the business does not impose financial penalties on the IT department in the event that they are not met.

    Proactive Maintenance

    It is important to remember that downtime is not only caused by failure but also by proactive maintenance. For example, if you need to patch the operating system, or SQL Server itself, with the latest service pack, then you must have some downtime during installation.

    Depending on the upgrade you are applying, the downtime in such a scenario could be substantial – several hours for a stand-alone server. In this situation, high availability is essential for many business-critical applications – not to protect against unplanned downtime, but to avoid prolonged outages during planned maintenance.

    Recovery Point Objective and Recovery Time Objective

    The recovery point objective (RPO) of an application indicates how much data loss is acceptable in the event of a failure. For a data warehouse that supports a reporting application, for example, this may be an extended period, such as 24 hours, given that it may only be updated once per day by an ETL process and all other activity is read-only reporting. For highly transactional systems, however, such as an OLTP database supporting trading platforms or web applications, the RPO will be zero. An RPO of zero means that no data loss is acceptable.

    Applications may have different RPOs for high availability and for disaster recovery. For example, for reasons of cost or application performance, an RPO of zero may be required for a failover within the site. If the same application fails over to a DR data center, however, five or ten minutes of data loss may be acceptable. This is because of differences in the technologies used to implement intra-site availability and inter-site recovery.

    The recovery time objective (RTO) for an application specifies the maximum amount of time an application can be down before recovery is complete and users can reconnect. When calculating the achievable RTO for an application, you need to consider many aspects. For example, it may take less than a minute for a cluster to fail over from one node to another and for the SQL Server service to come back up; however, it may take far longer for the databases to recover. The time it takes for databases to recover depends on many factors, including the size of the databases, the quantity of databases within an instance, and how many transactions were in-flight when the failover occurred. This is because all noncommitted transactions need to be rolled back.
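    One rough way to observe this recovery phase in practice, assuming sufficient permissions, is via sys.dm_exec_requests, where database recovery after a failover appears under the DB STARTUP command:

    --Watch recovery progress of databases starting up after a failover
    SELECT session_id,
           command,
           percent_complete,
           estimated_completion_time / 60000 AS EstimatedMinutesRemaining  --milliseconds to whole minutes
    FROM sys.dm_exec_requests
    WHERE command = 'DB STARTUP' ;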

    Just like RPO, it is common for there to be different RTOs depending on whether you have an intra-site or inter-site failover. Again, this is primarily due to differences in technologies, but it also factors in the amount of time you need to bring up the entire estate in the DR data center if the primary data center is lost.

    The RPO and RTO of an application may also vary in the event of data corruption. Depending on the nature of the corruption and the HA/DR technologies that have been implemented, data corruption may result in you needing to restore a database from a backup.

    If you must restore a database, the worst-case scenario is that the achievable point of recovery may be the time of the last backup. This means that you must factor a hard business requirement for a specific RPO into your backup strategy. If only part of the database is corrupt, however, you may be able to salvage some data from the live database and restore only the corrupt data from the restored database.
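    To make the backup dependency concrete, a point-in-time restore follows the familiar pattern sketched below; the database name, file paths, and STOPAT time are placeholders for illustration.

    --Restore the last full backup, leaving the database able to accept log restores
    RESTORE DATABASE Foo
        FROM DISK = N'C:\Backups\Foo_Full.bak'
        WITH NORECOVERY, REPLACE ;

    --Roll forward to just before the corruption occurred, then recover
    RESTORE LOG Foo
        FROM DISK = N'C:\Backups\Foo_Log.trn'
        WITH RECOVERY, STOPAT = '2020-11-01T09:55:00' ;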

    Data corruption is also likely to have an impact on the RTO. One of the biggest influencing factors is if backups are stored locally on the server, or if you need to retrieve them from tape. Retrieving backup files from tape, or even from off-site locations, is likely to add significant time to the recovery process.

    Note

    Backups directly to tape from SQL Server are deprecated. When this section refers to retrieving backups from tape, it assumes a tape drive acting as the target for an enterprise backup solution, to which your database backups have been offloaded.

    Another influencing factor is what caused the corruption. If it was caused by a faulty IO subsystem, then you may need to factor in time for the Windows administrators to run the check disk command (CHKDSK) against the volume and potentially more time for disks to be replaced. If the corruption was caused by a user accidentally truncating a table or deleting a data file, however, then this is not a concern.
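    While CHKDSK validates the volume, suspected corruption within a database itself is normally confirmed from the SQL Server side; a minimal check, against a hypothetical Foo database, looks like this:

    --Check the database for corruption, reporting every error found
    DBCC CHECKDB (N'Foo') WITH NO_INFOMSGS, ALL_ERRORMSGS ;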

    Cost of Downtime

    If you ask any business owners how much downtime is acceptable for their applications and how much data loss is acceptable, the answers invariably come back as zero and zero, respectively. Of course, it is never possible to guarantee zero downtime, and once you begin to explain the costs associated with the different levels of availability, it starts to get easier to negotiate a mutually acceptable level of service.

    The key factor in deciding how many 9s you should try to achieve is the cost of downtime. Two categories of cost are associated with downtime: tangible costs and intangible costs. Tangible costs are usually fairly straightforward to calculate. Let’s use a sales application as an example. In this case, the most obvious tangible cost is lost revenue because the sales staff cannot take orders. Intangible costs are more difficult to quantify but can be far more expensive. For example, if a customer is unable to place an order with your company, they may place their order with a rival company and never return. Other intangible costs can include loss of staff morale, which leads to higher staff turnover, or even loss of company reputation. Because intangible costs, by their very nature, can only be estimated, the industry rule of thumb is to multiply the tangible costs by three and use this figure to represent your intangible costs.

    Once you have an hourly figure for the total cost of downtime for your application, you can scale this figure out, across the predicted life cycle of your application, and compare the costs of implementing different availability levels. For example, imagine that you calculate that your total cost of downtime is $2,000/hour and the predicted life cycle of your application is three years. Table 1-2 illustrates the cost of downtime for your application, comparing the costs that you have calculated for implementing each level of availability, after you have factored in hardware, licenses, power, cabling, additional storage, and additional supporting equipment, such as new racks, administrative costs, and so on. This is known as the total cost of ownership (TCO) of a solution.

    Table 1-2  Cost of Downtime over a Three-Year Period, at $2,000/Hour

    Level of Availability   Downtime per Year   Cost of Downtime (Three Years)
    99%                     87.66 hours         $525,950
    99.9%                   8.77 hours          $52,595
    99.99%                  52.6 minutes        $5,259
    99.999%                 5.26 minutes        $526
