Preparing for a Cloud Outage: AWS Serves as a Lesson to Future Ventures
In the early hours of September 20th, 2015, Amazon observed something going awry in one of its Web Services data centers. Error rates for the company’s NoSQL database, DynamoDB, began escalating in its US-East-1 region in Northern Virginia. The database that managed the metadata had gone haywire, affecting the service’s partitions and tables. After almost two full hours, the company rectified the issue, but by then it was too late: 34 services monitored by the AWS Service Health Dashboard had already been affected. This incident, which was not the first of its kind, shows that cloud outages occur in spite of the precautions AWS takes, and it calls on organizations to be more vigilant about their cloud services.
It is important to note that every cloud service, regardless of its size and credibility, will at some point face an outage, and there is no known way to eliminate outages entirely. Since they are inevitable, clients and providers alike should always be prepared for them, for even the most mature cloud offering on the market can still suffer a service disruption of six hours or more.
Although it is practically impossible to prevent outages altogether, there are software tools that can help mitigate them. These tools fall into three categories: User Experience Management (UXM) tools, Synthetic Monitoring tools, and Performance Analytics tools.
When an outage is caused by an auxiliary internal service, that service should be automatically turned off and flagged. This requires a UXM tool that measures application-level response times for users currently engaged with their applications, and an outage analyzer that provides real-time visualizations and alerts. The analyzer also helps identify the problem and determine whether it is geographically localized or spread across a wider area. If it is localized, the affected region can be determined, the services specific to that region can be turned off, and the issue can be flagged so that the responsible party is informed. The issue can then be resolved by drawing on Application Performance Management (APM) data. Of the other two categories, Synthetic Monitoring tools cover global points of presence and can inform application managers of a problem even if no users are actively using the impacted service, while Performance Analytics tools employ big data analytics to determine whether a particular cloud service is down.
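The localized-versus-widespread decision can be illustrated with a minimal Python sketch. It is not taken from any particular UXM or outage-analyzer product: the function name `classify_outage`, the `(region, ok)` sample format, and the 50% error threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def classify_outage(samples, error_threshold=0.5):
    """Classify an outage from (region, ok) probe results.

    `samples` is a hypothetical feed of checks, one tuple per probe or
    user request. A region counts as affected when its share of failed
    checks reaches `error_threshold` (an assumed cut-off).
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for region, ok in samples:
        totals[region] += 1
        if not ok:
            failures[region] += 1

    affected = sorted(
        region
        for region, count in totals.items()
        if failures[region] / count >= error_threshold
    )

    if not affected:
        scope = "none"
    elif len(affected) == len(totals):
        # Every observed region is failing, so the problem is not localized.
        scope = "widespread"
    else:
        scope = "regional"
    return scope, affected


samples = [
    ("us-east-1", False),
    ("us-east-1", False),
    ("us-east-1", True),
    ("eu-west-1", True),
    ("eu-west-1", True),
]
scope, affected = classify_outage(samples)
print(scope, affected)
```

With a "regional" result, the services tied to the listed regions would be the candidates for being switched off and flagged. Note that when only one region is observed and it is failing, this sketch reports "widespread", because it has nothing to compare against.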
Another option for cloud service users is to spread their applications across different cloud service providers, which makes them less vulnerable to an outage at any single one. Netflix, for example, was one of the less affected AWS clients during the outage. It had foreseen the possibility of a disruption and had a contingency plan in place. As soon as the outage struck, it automatically migrated its workloads from the troubled US-East region to a healthy one. This was an eye-opener for companies that use cloud services for mission-critical applications, and it popularized the practice of architecting systems on the assumption that running services could fail at any time.
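The failover idea can be sketched in a few lines of Python. This is only an illustration of choosing a healthy region, not a description of how Netflix actually moves workloads; the function name `pick_region`, the health map, and the region names are hypothetical.

```python
def pick_region(preferred, health, fallbacks):
    """Return the first healthy region, trying the preferred one first.

    `health` maps a region name to True (healthy) or False (unhealthy).
    Regions missing from the map are treated as unhealthy, which is a
    conservative assumption for this sketch.
    """
    for region in [preferred, *fallbacks]:
        if health.get(region, False):
            return region
    raise RuntimeError("no healthy region available")


# Hypothetical health report during an outage in the preferred region.
health = {"us-east-1": False, "us-west-2": True, "eu-west-1": True}
target = pick_region("us-east-1", health, ["us-west-2", "eu-west-1"])
print(target)
```

In a real system the health map would come from monitoring data, and moving the workload involves far more than picking a name (data replication, DNS changes, capacity), but the selection step follows this shape.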
Even though distributing applications across cloud platforms is an appealing prospect, it is not yet very common because of the large investment involved, especially in R&D, operations, money, and legal contracts. Not surprisingly, though, a number of cloud brokers and cloud managers offer various cloud storage options for cloud clients. The best-case scenario for clients would be to hire such a broker or use a cloud management tool to abstract away the differences between cloud providers.
Apart from the methods discussed above, users have a few other options: opting for a redundant server, building redundancy into applications, and handling outages at the architectural level. Architecturally, an application’s features can be split into main and auxiliary features. An auxiliary feature can be linked to the application flow through a proxy, and the proxy can be designed to decide autonomously whether or not to call the service. The proxy also gives users the option to turn the service on or off manually. The proxy can further be standardized across many services, making it easier for management and APM solutions to turn features on and off automatically.
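One way to picture such a proxy is the Python sketch below. It uses a simple failure counter with a cooldown, in the style of a circuit breaker, which is an assumed technique rather than anything the article prescribes. The class name `FeatureProxy`, its parameters, and the `manual_override` attribute are all hypothetical.

```python
import time


class FeatureProxy:
    """Wraps an auxiliary service and decides whether to call it."""

    def __init__(self, service, fallback, max_failures=3, cooldown=30.0,
                 clock=time.monotonic):
        self._service = service
        self._fallback = fallback
        self._max_failures = max_failures
        self._cooldown = cooldown
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # time at which the feature was switched off
        # True forces the feature on, False forces it off, None is automatic.
        self.manual_override = None

    def is_enabled(self):
        if self.manual_override is not None:
            return self.manual_override
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self._cooldown:
            # Cooldown elapsed: re-enable the feature and reset the counter.
            self._opened_at = None
            self._failures = 0
            return True
        return False

    def call(self, *args, **kwargs):
        if not self.is_enabled():
            return self._fallback(*args, **kwargs)
        try:
            result = self._service(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                # Too many consecutive failures: switch the feature off.
                self._opened_at = self._clock()
            return self._fallback(*args, **kwargs)
        self._failures = 0
        return result


def recommendations(user_id):
    # Stand-in for an auxiliary service that is currently down.
    raise ConnectionError("recommendation service unavailable")


proxy = FeatureProxy(recommendations, fallback=lambda user_id: [])
print(proxy.call(42))  # the main flow continues with an empty list
```

Because every auxiliary feature goes through the same interface, a management or APM tool could set `manual_override` on many proxies at once, which is the standardization benefit described above.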
Across all the available options, one fact stands out: in every approach to protecting against cloud outages, APM is pivotal to supporting global and cross-cloud applications, thereby improving communication between components in different clouds.