Four Steps to Resilient Self-Healing Cloud Applications with AWS
Downtime and application outages cost companies money, reputation and customers. Today's cloud technologies such as Amazon Web Services (AWS) give you the power to truly build and operate resilient, self-healing cloud applications.
by Ranjani Krishnamurthy, CDW Expert
Building applications in AWS gives you the inherent benefits of cloud computing such as reliability, scalability, cost optimization and security powered by the technologies underpinning the AWS Well-Architected Framework. This helps you avoid unplanned downtime and the associated costs.
Cost of One Hour of Downtime
% of Companies Reporting | Cost of 1 Hour of Downtime
33% | $1M - $5M+
80% | $300,000+
However, the key to resilient applications is not only architecting applications that are scalable and reliable, but also ensuring that your application performance is continuously monitored and enhanced for resiliency throughout the application lifecycle. In the context of cloud-based systems and applications, it is necessary to focus on effective and efficient self-healing approaches that integrate fault and failure prediction, localization and resolution.
Four Steps to Building Self-Healing Applications
1. Focus on building a scalable architecture for your applications
• Scalable resources: autoscaling for EC2, RDS, ECS and EBS
• Scalable reach: load balancing (ELB) and DNS (Route 53)
• Scalable code: Lambda
2. Add intelligence to fault detection and application monitoring
• Automatic application performance monitoring (APM): collection policies to alert on KPIs on all resources
• Incorporate machine learning: anomaly detection, outlier detection and forecasting to derive meaningful metrics
• Visualization: create and continuously enhance visibility and actions
3. Automate issue remediation and resolution
• Create multiple action policies, events and automated escalation policies based on failures.
• Extend and automate complex remediation workflows per application/environment.
4. Increase reliability with chaos experiments
• Perform experiments in controlled environments to inject failures and learn how to prevent them.
• Introduce issues such as application latency, improper retries, and faulty error handling or fallback.
• Learn from failures and incorporate learnings into automated detection and remediation policies.
Focus on Building a Scalable Application Architecture
AWS services allow for complete on-demand resource pooling and elasticity at all tiers of your application. Design your application to be stateless and to avoid single points of failure. While designing for scalability, there are a few core tenets to keep in mind:
Build with scalable resources (e.g., ensuring autoscaling for Amazon EC2 instances). Also consider using fully managed AWS services such as Amazon Relational Database Service, Amazon DynamoDB and Amazon Elastic Container Service to further enhance reliability.
Ensure that your application can be reached at any time from anywhere by leveraging services such as Amazon CloudFront, Elastic Load Balancing, Amazon Route 53 for DNS and AWS Direct Connect.
Build your applications with an API-first approach, using serverless services such as AWS Lambda and Amazon API Gateway.
Automate provisioning, builds and testing cycles.
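As a minimal sketch of the "scalable resources" tenet, the Python below builds the arguments for a target-tracking scaling policy on an EC2 Auto Scaling group and, optionally, sends them through boto3's put_scaling_policy call. The group name "web-tier-asg", the 60% CPU target and the helper function names are assumptions made for illustration; they do not come from a specific deployment.

```python
def build_cpu_target_tracking_policy(group_name: str,
                                     target_cpu_percent: float = 50.0) -> dict:
    """Return keyword arguments for the EC2 Auto Scaling PutScalingPolicy API.

    The policy asks Auto Scaling to add or remove instances so that the
    group's average CPU utilization stays near target_cpu_percent.
    """
    if not 0 < target_cpu_percent <= 100:
        raise ValueError("target_cpu_percent must be in the range (0, 100]")
    return {
        "AutoScalingGroupName": group_name,
        "PolicyName": f"{group_name}-cpu-target-tracking",  # hypothetical naming scheme
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": float(target_cpu_percent),
        },
    }


def apply_policy(policy_kwargs: dict) -> dict:
    """Send the policy to AWS. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("autoscaling")
    return client.put_scaling_policy(**policy_kwargs)


if __name__ == "__main__":
    # "web-tier-asg" is a placeholder; substitute your own Auto Scaling group.
    policy = build_cpu_target_tracking_policy("web-tier-asg", 60)
    print(policy["PolicyName"])
```

Keeping the policy definition separate from the API call makes it easy to review or unit test the configuration before anything is changed in an AWS account.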
Add Intelligence to Fault Detection and Application Monitoring
Monitoring your application and infrastructure is key. Amazon CloudWatch enables you to report on the overall health of your application and resources and to create application monitoring policies defining the data that must be collected, synthesized and evaluated to generate smart events. Focus on:
Data Collection
Automatically collect data around common KPIs like CPU, memory, network latency and storage capacity.
Use a combination of synchronous and asynchronous collection methods to collect data.
Visualization
Increased visibility of a failure greatly increases the speed of resolution.
Add intelligence to your monitoring through machine learning methods, proactively creating alerts based on the health of your application and resources in addition to threshold-based alerts.
Machine learning methods enable anomaly detection, outlier detection and error forecasting, which can be leveraged to increase the current and future stability of your application. CDW’s AWS Managed Services incorporate these methods to proactively detect and correct faults, continuously tuning the collection policies.
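To make the idea of anomaly detection concrete, here is a deliberately simple statistical stand-in: a trailing-window z-score check over a metric series. It is a sketch only, not the method used by any AWS or CDW service; the function name, window size, threshold and sample CPU values are all illustrative assumptions.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(values, window=5, z_threshold=3.0):
    """Return the indexes of values that deviate from the mean of the
    preceding `window` samples by more than `z_threshold` standard deviations.

    Samples are only evaluated once a full window of history exists, and a
    perfectly flat window (zero spread) never flags anything.
    """
    history = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(values):
        if len(history) == window:
            baseline = mean(history)
            spread = pstdev(history)
            if spread > 0 and abs(value - baseline) / spread > z_threshold:
                anomalies.append(index)
        history.append(value)  # the newest sample joins the trailing window
    return anomalies


if __name__ == "__main__":
    # Hypothetical CPU utilization samples (percent) with one spike.
    cpu = [40, 42, 41, 39, 40, 41, 95, 42, 40]
    print(detect_anomalies(cpu))  # prints [6]
```

Unlike a fixed threshold, this check adapts to the recent baseline of the metric. Note that a flagged sample still enters the window, so a production approach would usually handle outliers more robustly.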
Automate Issue Remediation and Resolution
To minimize the impact of failures and faults, a continuous effort to automate remediation is vital. There are a few things to keep top of mind:
Use automated actions such as restarts, notifications, autoscaling, provisioning and escalation for remediation.
Automate as many complex actions or remediation workflows as you can with a continuously expanding set of triggers.
Configure a built-in escalation procedure to ensure that service levels are met.
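The points above can be sketched as an ordered remediation policy: try each automated action in turn and escalate only if none of them resolves the fault. The class names, step names and the escalation callback below are hypothetical; in practice the actions would wrap real operations such as restarting a service or scaling out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RemediationStep:
    name: str
    action: Callable[[], bool]  # returns True when the fault is resolved


@dataclass
class RemediationPolicy:
    steps: List[RemediationStep]
    escalate: Callable[[str], None]  # e.g. page the on-call engineer
    log: List[str] = field(default_factory=list)

    def run(self, fault: str) -> bool:
        """Run steps in order; return True if one resolves the fault,
        otherwise escalate and return False."""
        for step in self.steps:
            try:
                resolved = step.action()
            except Exception as exc:  # a failing action must not stop the workflow
                self.log.append(f"{step.name}: error ({exc})")
                continue
            outcome = "resolved" if resolved else "not resolved"
            self.log.append(f"{step.name}: {outcome}")
            if resolved:
                return True
        self.escalate(fault)
        self.log.append("escalated")
        return False


if __name__ == "__main__":
    pages = []
    policy = RemediationPolicy(
        steps=[
            RemediationStep("restart", lambda: False),   # stand-in actions
            RemediationStep("scale_out", lambda: True),
        ],
        escalate=pages.append,
    )
    print(policy.run("high latency"), policy.log)
```

Recording each outcome in a log gives the escalation target the context it needs, and new triggers or actions can be added as further steps without changing the workflow.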
Increase Reliability with Chaos Experiments
Chaos engineering methods allow you to use controlled experiments to inject failures into environments, helping you learn from and prevent failures. Several tools, such as Chaos Toolkit and AWS Lambda Layers, allow you to create what-if experiments per application, per service type and against roles, security and governance policies. Experimentation may include introduction of issues such as:
- Application latency
- Artificial delays and API error handling failures
- Killing of instances, Availability Zones (AZs) and Regions
Experimenting with injected failures within a limited blast radius in a dev/test environment lets you build remediation automation for circumstances that you could otherwise observe only during a drastic failure of your application.
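As a rough illustration of injecting latency and errors in a controlled way, the sketch below wraps a function with a decorator that adds an artificial delay and raises a simulated failure at a configurable rate, plus a small retry helper to exercise error handling. It is not the API of Chaos Toolkit or any AWS tool; every name here (inject_chaos, InjectedFailure, call_with_retries, fetch_profile) is hypothetical.

```python
import random
import time
from functools import wraps


class InjectedFailure(RuntimeError):
    """Raised by the chaos wrapper to simulate a dependency failure."""


def inject_chaos(latency_seconds=0.0, failure_rate=0.0, enabled=True, rng=None):
    """Decorator that adds a delay and fails a fraction of calls.

    Keep `enabled` off outside dev/test to limit the blast radius.
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if enabled:
                if latency_seconds > 0:
                    time.sleep(latency_seconds)  # artificial delay
                if rng.random() < failure_rate:
                    raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def call_with_retries(func, attempts=3):
    """Call func, retrying only on injected failures; re-raise the last one."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFailure as exc:
            last_error = exc
    raise last_error


if __name__ == "__main__":
    @inject_chaos(latency_seconds=0.05, failure_rate=0.3, rng=random.Random(7))
    def fetch_profile():
        return {"user": "demo"}  # stand-in for a real downstream call

    try:
        print(call_with_retries(fetch_profile))
    except InjectedFailure as exc:
        print(f"retries exhausted: {exc}")
```

Passing a seeded random generator makes an experiment repeatable, which helps when turning an observed failure into an automated detection or remediation rule.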
Whether you are new to AWS or well on your journey, CDW offers a full range of professional and managed AWS services to support you. Please reach out to your CDW account manager or call 800.800.4239 and say that you would like to discuss your AWS architecture. They will arrange a meeting with one of our AWS-certified solutions architects, who will help you define next steps for your AWS initiatives.