Amazon Web Services (AWS) has published a post-mortem of its Easter cloud outage that gives the IT industry a rare opportunity to study the technologies the world's largest cloud computing provider uses to keep its services resilient.
The post-mortem was published to provide a detailed breakdown of what caused major outages on April 21 that took down many popular websites, including Reddit and Foursquare.
The announcement also provided a detailed description of how the AWS cloud is designed, what went wrong during the outage, and what the industry can learn to better prepare for similar events in the future.
The trigger for the outage - referred to on Amazon's service status page only as a "network event" - was a mistake made during a scheduled upgrade of capacity on the primary network for Amazon's Elastic Block Store (EBS) service, which underpins AWS.
The mistake caused all traffic that would normally use the primary, high-capacity network to instead use a second, lower-capacity network designed for reliable communications and overflow capacity. The secondary network was quickly overwhelmed, which triggered a cascade of other issues, resulting in a "re-mirroring storm".
EBS consists of clusters of storage nodes, connected in a peer-to-peer fashion, with each node storing a replica of EBS data "volumes". These volumes are used for data read and write operations.
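The cascade described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not AWS's actual design: the function name, the traffic units, and the rule that each unit of dropped traffic spawns extra re-mirroring traffic are all hypothetical and chosen purely to show how an undersized fallback network can feed a self-reinforcing storm.

```python
def simulate_remirror_storm(ticks, base_demand, secondary_capacity, remirror_cost):
    """Toy model of traffic failing over to a lower-capacity network.

    All parameters are illustrative assumptions, not AWS figures:
      base_demand        -- normal traffic per tick, in arbitrary units
      secondary_capacity -- traffic the fallback network can carry per tick
      remirror_cost      -- extra traffic generated per unit of dropped traffic,
                            standing in for nodes that lost contact with a
                            replica and try to create a new one
    Returns the amount of traffic dropped at each tick.
    """
    demand = base_demand
    dropped_per_tick = []
    for _ in range(ticks):
        # Anything above the fallback network's capacity is dropped.
        dropped = max(0, demand - secondary_capacity)
        dropped_per_tick.append(dropped)
        # Dropped traffic triggers re-mirroring attempts, which add to
        # the next tick's demand on the same overloaded network.
        demand = base_demand + dropped * remirror_cost
    return dropped_per_tick


if __name__ == "__main__":
    # Fallback network smaller than normal demand: losses grow every tick.
    print(simulate_remirror_storm(3, base_demand=100, secondary_capacity=40, remirror_cost=2))
    # Fallback network large enough for normal demand: nothing is dropped.
    print(simulate_remirror_storm(3, base_demand=100, secondary_capacity=100, remirror_cost=2))
```

In this model the storm only starts when the fallback network is smaller than normal demand; once it starts, each round of failed requests makes the next round worse.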
--------------------------------------------
Source: itnews.com.au
By: Justin Warren
