How Custom Ink implements Chaos Engineering and how you can do it too!
The WebOps team at Custom Ink works to maintain the highest availability possible for our internal and external customers. Being always available is a goal that many companies strive for, and at Custom Ink, we take steps to make sure that we are as available as we can be to stay ahead in this competitive marketplace.
One of the ways we ensure we are highly available is by practicing our own version of infrastructure chaos at Custom Ink. Many of you may be familiar with Netflix's Chaos Monkey. It's impressive and is probably the industry benchmark for this sort of testing. But we aren't Netflix, and there's a good chance you aren't either.
(If you are, hi...I love the "San Junipero" episode from Black Mirror.)
We dedicate a couple of days of scheduled chaos every year to help identify problems that could arise if a real issue hits our AWS infrastructure.
One day is strictly for "frontend" applications and the other is for "backend" applications. We notify everyone at the company that testing is happening so there is a minimum of panic when alarms inevitably go off. We shut off production servers and portions of managed services, such as ElastiCache cluster nodes, one availability zone at a time, for at least an hour. During our testing, we look for single points of failure, poorly balanced applications, and other miscellaneous items that could cause real production issues down the line.
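To make the "one availability zone at a time" idea concrete, here is a minimal Python sketch of how you might plan a round before touching anything. It is not our actual tooling. The inventory, instance IDs, application names, and both helper functions are hypothetical, and it only works on a list you supply rather than calling AWS. It groups instances by zone and flags applications that would have nothing left running if that zone went away, which is exactly the kind of single point of failure we look for.

```python
from collections import defaultdict

# Hypothetical inventory: (instance_id, application, availability_zone).
# In practice you would export this from your cloud provider or CMDB.
INVENTORY = [
    ("i-web-1", "storefront", "us-east-1a"),
    ("i-web-2", "storefront", "us-east-1b"),
    ("i-web-3", "storefront", "us-east-1c"),
    ("i-api-1", "orders-api", "us-east-1a"),
    ("i-api-2", "orders-api", "us-east-1a"),
    ("i-rpt-1", "reporting", "us-east-1b"),
]


def plan_by_zone(inventory):
    """Group instance IDs by availability zone, one test round per zone."""
    plan = defaultdict(list)
    for instance_id, _app, zone in inventory:
        plan[zone].append(instance_id)
    # Sort zones so the rounds run in a predictable order.
    return dict(sorted(plan.items()))


def apps_at_risk(inventory, zone):
    """Return apps that would have no instances left if `zone` went down."""
    survivors = {app for _id, app, z in inventory if z != zone}
    affected = {app for _id, app, z in inventory if z == zone}
    return sorted(affected - survivors)


if __name__ == "__main__":
    for zone, instance_ids in plan_by_zone(INVENTORY).items():
        at_risk = apps_at_risk(INVENTORY, zone)
        print(f"{zone}: stop {instance_ids}; apps with no other zone: {at_risk}")
```

Running a dry plan like this first gives you a short list of applications to watch closely during each round, and a starting point for the notes you take along the way.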
Our development teams are aware of this process and hold off on deployments for a few hours so that we can test thoroughly. Make sure you document EVERYTHING...yes, everything! Keep the notes in a shared folder where your operations team members can collaborate and add things for the next year of Chaos. Hold a post-mortem once you’re done, and create action items that encourage each member of the team to take ownership of their applications.
Testing chaos this way isn't new and shiny, but it is totally doable. If your company has 5 or 500 servers, you can do this! Don't assume because you aren't at Netflix's scale that chaos testing is out of your reach. You can empower your engineers to make sure your site is always on. After all, that's the goal, right?
If you're familiar with more automated testing in production, please leave a comment. I'd love to hear from you!
This is currently a manual process, but we still do it, and every year we find issues that we can either remediate quickly or acknowledge aren't fixable (cost considerations, licensing issues, etc.), in which case we accept the risk if the host goes down. This helps down the line: you learn which applications are more fragile and how you can make existing applications better, and you can explore technologies that will help minimize fallout and improve your redundancy. For example, you may leverage auto-scaling groups, replace applications with Lambda functions, and pursue other technologies that your cloud provider makes available to improve resiliency.
The next step at Custom Ink is to automate our chaos. We are currently looking into Chaos Monkey and plan to make greater use of auto-scaling for our applications.
Katherine Cisneros is an Operations Engineer at Custom Ink and has been an Inker since March 2015.
Interested in breaking things? We’re hiring! Visit us at customink.com/jobs