Designing Active-Active and Disaster Recovery Data Centers


This webinar covers typical design scenarios encountered when building a disaster recovery data center or deploying multiple data centers in an active-active configuration.

Last modified on 2024-03-12 (release notes)



13:57 Introduction

In the first section of this webinar we'll try to figure out why we'd want to migrate application workload between data centers, and define a few useful terms like RTO, RPO, MTTR, and MTTI.

Introduction and Definitions 13:57 2017-03-29

49:45 Typical Challenges (free items)

There are four typical reasons why you'd want to migrate application servers between data centers: data center migration, disaster recovery, disaster avoidance, and load balancing across data centers.

Disaster Recovery 6:50 2017-03-29
Disaster Avoidance 29:46 2017-03-29
Data Center Migration 2:47 2017-03-29
Load Balancing Across Data Centers and Cloudbursting 10:22 2017-03-29

28:17 Limitations and Considerations

A number of factors limit our ability to deploy servers across multiple data centers: latency, bandwidth limitations, and data gravity.

Latency 8:09 2017-03-29
Limited Bandwidth 10:55 2017-03-29
Storage considerations 9:13 2017-03-29
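
To get a feel for why latency limits stretched deployments, here is a minimal back-of-the-envelope sketch. It assumes light travels roughly 200 km per millisecond in optical fiber and ignores queuing, serialization, and device delays, so real-world numbers will be worse. The function names and the sample distances are illustrative assumptions, not something taken from the webinar.

```python
# Rough propagation-delay estimate for traffic between two data centers.
# Assumption: light covers about 200 km per millisecond in optical fiber
# (roughly two thirds of its speed in a vacuum). Queuing and processing
# delays are ignored, so these are best-case numbers.
FIBER_KM_PER_MS = 200.0


def round_trip_ms(distance_km: float) -> float:
    """Best-case round-trip time over a fiber path of the given length."""
    return 2.0 * distance_km / FIBER_KM_PER_MS


def added_response_time_ms(distance_km: float, sequential_round_trips: int) -> float:
    """Extra response time when a request needs several back-to-back
    round trips (for example, database queries) to the other site."""
    return round_trip_ms(distance_km) * sequential_round_trips


if __name__ == "__main__":
    # Hypothetical example: web server in one site, database 100 km away,
    # 50 sequential queries per page.
    print(round_trip_ms(100))               # 1.0
    print(added_response_time_ms(100, 50))  # 50.0
```

Even a modest distance becomes visible to users once an application performs many sequential round trips, which is one reason chatty application stacks are usually kept within a single site.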

16:12 Typical Solutions

Well-designed active-active applications use "swimlanes" - a concept where multiple copies of an application stack reside in different locations.

Parallel Application Stacks (Swimlanes) 16:12 2017-03-29
Describing Fault Domains

A great introduction to fault domains, fault levels, cascading failures, and fault hierarchy.

31:34 Long-Distance VM Mobility Challenges (free items)

Instead of redesigning applications to make them work across multiple data centers, enterprise environments typically try to solve the challenges within the infrastructure, sometimes even moving running servers between data centers. This section describes the most obvious drawbacks of that idea.

Inter-DC vMotion Bandwidth 6:04 2017-05-03
Large Layer-2 Domains 9:27 2017-05-03
Ingress and Egress Traffic Flows 16:03 2017-05-03

42:37 Summary & Questions

Time for a wrap-up. We'll discuss the right way of doing things, surviving infrastructure failures, and typical real-life designs.

Surviving the Failures 15:37 2017-05-03
The Right Way of Doing Things 10:12 2017-05-03
Typical Real-Life Designs 9:05 2017-05-03
Summary and Questions 7:43 2017-05-03

1:27:00 Lessons Learned Operating Active-Active Data Centers

Networking and virtualization vendors keep proposing crazier and crazier ideas that are supposed to allow you to run active-active data centers without touching the application architecture. Not surprisingly, most of them fail disastrously under the right failure conditions.

If you want to have a highly-available application, there's simply no substitute for good design including global and local load balancing. In his presentation, Ethan Banks described the architecture he used when running multiple data centers for a large credit card payment processor, and lessons learned while operating them.

Definitions and Typical Setup 7:44 2016-10-09
Internet Edge, DNS, and BGP 16:08 2016-10-09
Firewalls 11:15 2016-10-09
Load Balancers 14:07 2016-10-09
Core Network 20:22 2016-10-09
High-Level Comments and Conclusions 17:24 2016-10-09

Slide Deck

Designing Active-Active and Disaster Recovery Data Centers 11M 2015-11-07

36:07 From the Design Clinic

Migrating Application Stacks into Public Clouds 16:36 2021-12-27
Running Applications in Multi-Cloud Environment 19:31 2022-05-30

Additional Resources

The blog posts, articles, and books collected in this section might help you get a broader perspective on high-availability application architectures.

Application Design and Operations

Scalability Rules: Principles for Scaling Web Sites (2nd Edition)

A must-read book for anyone interested in robust high-availability application design.

Systems Design for Advanced Beginners
Site Reliability Engineering: How Google Runs Production Systems
More Site Reliability Engineering (SRE) resources

Disaster Recovery in AWS

High availability concepts don't change just because you're deploying your workloads in a public cloud. If anything, public clouds require cleaner architectures as they don't support enterprise kludges like layer-2 DCI. It's therefore worth reading the series of articles describing disaster recovery solutions within AWS.

Architecture and Patterns
Backup and Restore
Pilot Light and Warm Standby
Multi-site Active/Active
Implementing Multi-Region Disaster Recovery Using Event-Driven Architecture

Disaster Recovery with AWS Services

AWS published several blog posts describing how you could use AWS services in a disaster recovery process. These documents are obviously self-serving, but you might find them valuable should you decide to deploy your workload on AWS, or you could use the same concepts when implementing disaster recovery in a different environment.

Disaster Recovery with AWS Managed Services (Single Region)
Multi-Region Backup and Restore

AWS Multi-Region Application Architecture with AWS Services

Part 1: Compute, Networking, and Security
Part 2: Data and Replication
Part 3: Application Management and Monitoring
Minimizing Dependencies in a Disaster Recovery Plan

Load Balancing and Service Discovery

Load balancing in Google network
Building a billion user load balancer (Facebook)
Ananta: Cloud Scale Load Balancing (Microsoft Azure)
GitHub Load Balancer
A quick intro to Consul
DNS-based Load Balancing with NSONE (podcast)

Redundancy and Resiliency

Redundant network designs usually use 1+1 redundancy. Applications (at least the database layer) are usually no better. However, 1+1 redundancy might not be good enough, and too much redundancy might decrease the overall availability.

1+1 Redundancy Just Isn’t Good Enough
Gray failures: the Achilles’ heel of cloud-scale systems
Why Shared Mutable State Is the Root of All Evil
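
The trade-off described above can be illustrated with a small calculation. This is a simplified sketch, not a model taken from any of the linked articles: it assumes replicas fail independently, and it introduces a hypothetical `coordination_risk` parameter standing in for the chance that the machinery shared between replicas (failover logic, state replication) brings everything down. Both function names are made up for this example.

```python
def parallel_availability(a: float, n: int) -> float:
    """Availability of n replicas when any single one is sufficient.
    Assumption: replicas fail independently, which is rarely true in practice."""
    return 1.0 - (1.0 - a) ** n


def with_coordination_risk(a: float, n: int, coordination_risk: float) -> float:
    """Hypothetical model: every replica beyond the first adds a
    `coordination_risk` probability that the shared failover or
    state-replication mechanism takes the whole system down."""
    return parallel_availability(a, n) * (1.0 - coordination_risk) ** (n - 1)


if __name__ == "__main__":
    # Illustrative numbers only: 99% available components,
    # 0.1% coordination risk per additional replica.
    for n in range(1, 5):
        print(n, parallel_availability(0.99, n), with_coordination_risk(0.99, n, 0.001))
```

With these illustrative numbers, the second replica helps a lot, while the third and fourth lower the overall result because the assumed coordination risk grows faster than the benefit of another copy. The independence assumption is also the weak point of plain 1+1 designs: failures that hit both components at once are not covered by the simple formula.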

Testing Resilient Application Stacks

Resilience Engineering: Learning to Embrace Failure
The Netflix Simian Army
Simian Army source code on GitHub
Testing in Production: Yes, You Can
AWS Fault Injection Simulator
Toxiproxy: a Framework for Simulating Network Conditions