So I’ve done a few posts on how to increase the availability and resiliency for your application, and I’ve come to a conclusion over the past few months. This is a really big topic that seems to be top of mind for a lot of people. I’ve done previous posts like “Keeping the lights on!” and “RT? – Making Sense of Availability?” which talk about what it means to go down this path and how to architect your solutions for availability.
But another question that comes with that is: what types of things should I do to implement stronger availability and resiliency in my applications? How do I upgrade a legacy application for greater resiliency? What can I keep in mind as a developer to make this easier?
So I wanted to compile a list of things to look for in your application: things that, if changed, would increase your availability and resiliency, and things to avoid if you want to improve both.
So let’s start with the two terms I keep using, because I find that defining them sets the right mindset for improving both, and that only happens if we are all talking about the same thing.
- Availability – The ability of your application to continue operating its critical functionality, even during a major outage or downtime situation. In other words, the ability to continue to offer service with minimal user impact.
- Resiliency – The ability of your application to continue processing current work even in the event of a major or transient fault. In other words, finishing the work that is currently in progress.
So looking at this further, the question becomes: what kinds of things should I avoid, or remove from my applications, to improve my position moving forward?
Item #1 – Stateful Services
Generally speaking, this is a key element in removing issues with availability and resiliency. It can be a hotly debated issue, but here’s where I come down on it. If a service has state (either in memory or otherwise), failing over to somewhere else becomes significantly more difficult. I now must replicate that state, and if that’s done in memory, that becomes a LOT harder. If it’s in a separate store, like SQL or Redis, it becomes easier, but it also adds complexity, which can make that form of availability harder. This is especially true as you add “9”s to your SLA. So generally speaking, if you can avoid having application components that rely on state, it’s for the best.
Additionally, stateful services cause other issues in the cloud, including limiting the ability to scale out as demand increases. The perfect example of this is “sticky sessions,” which means that once you are routed to a server, we keep sending you to the same server. This is the antithesis of scaling out, and should be avoided at all costs.
If you are dealing with a legacy application, and removing state is not feasible, then at a minimum you need to make sure that state is managed outside of memory. For example, if you can’t remove session state, move it to SQL and replicate it.
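To make that concrete, here is a minimal sketch in Python of what moving session state out of the web process could look like. Everything in it is an assumption for illustration: the `SessionStore` interface, the `InMemoryStandInStore` class (a stand-in for a real external store such as SQL or Redis), and the `WebServer` class are hypothetical names, not part of any real framework.

```python
import json
from typing import Dict, Optional, Protocol


class SessionStore(Protocol):
    """Hypothetical interface: anything that persists sessions outside the web process."""

    def load(self, session_id: str) -> Optional[dict]: ...

    def save(self, session_id: str, data: dict) -> None: ...


class InMemoryStandInStore:
    """Stand-in for an external store such as SQL or Redis (illustration only)."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}

    def load(self, session_id: str) -> Optional[dict]:
        raw = self._rows.get(session_id)
        return json.loads(raw) if raw is not None else None

    def save(self, session_id: str, data: dict) -> None:
        # Serialize so nothing depends on objects living inside one process.
        self._rows[session_id] = json.dumps(data)


class WebServer:
    """A web server instance that keeps no session state of its own."""

    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self.store = store

    def add_to_cart(self, session_id: str, item: str) -> list:
        session = self.store.load(session_id) or {"cart": []}
        session["cart"].append(item)
        self.store.save(session_id, session)
        return session["cart"]


shared_store = InMemoryStandInStore()
server_a = WebServer("a", shared_store)
server_b = WebServer("b", shared_store)

server_a.add_to_cart("session-1", "book")
# A different server picks up the same session, so no sticky routing is needed.
cart = server_b.add_to_cart("session-1", "pen")
print(cart)  # ['book', 'pen']
```

Because neither server holds the cart itself, either one can be taken down without losing the session, which is the property you are after.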
Item #2 – Tight Couplings
This one points to both of the key elements that I outlined above. When you have tight coupling between application components, you create something that can ultimately fail, and that prevents you from building a solution that scales well.
Let’s take a common example. Say you have an API tier on your application, and that API is built into the same web project as your UI front end. That API then talks directly to the database.
This is a pretty common legacy pattern. The problem it creates is that the load on your web application and the load on the backend API are very tightly coupled, so a failure in one means a failure in the other.
Now let’s take this a step further and say that you expose your API to the outside world (following security practices) to let your application be extensible. Sounds all good, right?
Except when you look deeper, by having all your application elements talking directly to each other, you have now created a scenario where cascading failures can completely destroy your application.
For example, one of your customers decides to leverage your API pretty hard, pulling a full dump of their data every 30 seconds, or you sign up a lot of customers who all decide to hit your API. It leads to the following effects:
- The increased demand on the API causes memory and CPU consumption on your web tier to go up.
- This causes performance issues with your application’s ability to load pages.
- This causes intermittent errors, while transactions against the API drive up demand on SQL. Increased demand on SQL causes your application to experience resource deadlocks.
- Those resource deadlocks cause further issues with user experience as the application fails.
Now you are probably thinking, “Yes, Kevin, but I can just enable autoscaling in the cloud and it solves all those issues.” To which my response is: and you get an uncontrolled inflation of your bill to go with it. So clearly your CFO is OK with uncontrolled costs and inflation to offset a bad practice.
One way we can resolve this awfulness is to split the API into a separate compute tier. By doing so, we can manage its compute separately without having to scale wildly to offset the issue. I then have separate options for allowing each part of my application to scale.
Additionally, I can implement queues as a load-leveling practice, which lets my application scale only when queue depth grows beyond what a reasonable response time allows. I can also throttle requests coming from the API or prioritize messages coming from the application. I can then replicate the queue messages to provide greater resiliency.
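Here is a rough sketch in Python of how those three ideas (buffering, throttling, and prioritizing) could fit together. It uses the standard library’s in-process `queue.PriorityQueue` as a stand-in for a real message queue, and the `LoadLevelingQueue` class, the priority constants, and the depth thresholds are all hypothetical choices for illustration.

```python
import queue
from dataclasses import dataclass, field
from typing import List

# Hypothetical priorities: the lower number is served first.
PRIORITY_APP = 0
PRIORITY_EXTERNAL_API = 1


@dataclass(order=True)
class WorkItem:
    priority: int
    sequence: int  # keeps first-in, first-out order within a priority
    payload: str = field(compare=False)


class LoadLevelingQueue:
    def __init__(self, max_depth: int, scale_out_depth: int) -> None:
        self._queue = queue.PriorityQueue(maxsize=max_depth)
        self._sequence = 0
        self._scale_out_depth = scale_out_depth

    def submit(self, payload: str, priority: int) -> bool:
        """Return False (throttled) instead of blocking when the buffer is full."""
        self._sequence += 1
        try:
            self._queue.put_nowait(WorkItem(priority, self._sequence, payload))
            return True
        except queue.Full:
            return False

    def should_scale_out(self) -> bool:
        """Only ask for more workers once the backlog passes the threshold."""
        return self._queue.qsize() > self._scale_out_depth

    def drain(self, batch_size: int) -> List[str]:
        """Take up to batch_size payloads, highest priority first."""
        items: List[str] = []
        while len(items) < batch_size and not self._queue.empty():
            items.append(self._queue.get_nowait().payload)
        return items


buffer = LoadLevelingQueue(max_depth=3, scale_out_depth=2)
accepted = [
    buffer.submit("api: dump #1", PRIORITY_EXTERNAL_API),
    buffer.submit("api: dump #2", PRIORITY_EXTERNAL_API),
    buffer.submit("app: save order", PRIORITY_APP),
    buffer.submit("api: dump #3", PRIORITY_EXTERNAL_API),  # buffer is full
]
scale_out = buffer.should_scale_out()
processed = buffer.drain(batch_size=2)
print(accepted, scale_out, processed)
```

The design choice here is that the noisy API customer gets throttled at the queue rather than dragging down page loads, and the application’s own work jumps ahead of the data dumps.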
Item #3 – Enabling Scale out
Now I know, I just made it sound like scaling out is awful, but the key part to this is “controlled.” What I mean here is that by making your services stateless, and implementing practices to decouple them, you create scenarios where you can run one or more copies of a service, which enables all kinds of benefits from a resiliency and availability perspective. It changes your services from pets to cattle: you no longer care if one is brought down, because another takes its place. A good way of thinking about it is that it’s sort of like a hydra.
Item #4 – Move settings out of each piece of an application
The more tightly your settings and application code are connected, the harder it is to make changes on the fly. If your code and configuration are tightly coupled, and a configuration change requires a deployment, then something as simple as changing an endpoint becomes a difficult thing to do. So the best thing you can do is start moving those configuration settings out of your application. No matter how you look at it, this is an important thing to do, for reasons relating to:
- Change Management
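As a small illustration of moving settings out of the code, here is a sketch in Python that reads an endpoint from the environment instead of hard-coding it. The setting names (`ORDERS_API_URL`, `REQUEST_TIMEOUT_SECONDS`), the `Settings` class, and the example URL are all made up for this example. The same idea applies to whatever external configuration store you use.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    orders_api_url: str
    request_timeout_seconds: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from the environment rather than from values baked into code."""
    return Settings(
        # Required: a missing endpoint raises KeyError at startup, not mid-request.
        orders_api_url=env["ORDERS_API_URL"],
        # Optional, with a default chosen only for this example.
        request_timeout_seconds=float(env.get("REQUEST_TIMEOUT_SECONDS", "5")),
    )


# Passing a dict here stands in for the real process environment.
settings = load_settings({"ORDERS_API_URL": "https://orders.example.internal"})
print(settings.orders_api_url, settings.request_timeout_seconds)
```

Changing the endpoint now means changing a value outside the application, not rebuilding and redeploying the code.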
Item #5 – Build in automated deployment pipeline
A lot of the time, the key to high availability comes down to automation, especially as you reach for more 9’s. The simple fact is that seconds count.
But more than that, automated deployments help to manage configuration drift. The more configuration drift you have, the harder it is to maintain a secondary region, because you have to make sure that one region doesn’t have things the other does not. This is eliminated by forcing everything to go through the automated deployment pipeline. If every change must be scripted and automated, it is almost impossible for configuration drift to happen in your environments.
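If you want a feel for what drift between regions looks like, here is a tiny sketch in Python that compares two sets of settings. The `find_drift` function and the sample settings are hypothetical, and a real check would read the values from your actual environments rather than from hand-written dictionaries.

```python
from typing import Dict, Mapping, Optional, Tuple


def find_drift(
    primary: Mapping[str, str], secondary: Mapping[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return every setting whose value differs between the two regions."""
    drift = {}
    for key in sorted(set(primary) | set(secondary)):
        left, right = primary.get(key), secondary.get(key)
        if left != right:
            # None means the setting is missing from that region entirely.
            drift[key] = (left, right)
    return drift


# Made-up sample data: someone patched the primary region by hand.
primary_region = {"app_version": "2.4.1", "tls_min_version": "1.2", "feature_x": "on"}
secondary_region = {"app_version": "2.4.0", "tls_min_version": "1.2"}

drift = find_drift(primary_region, secondary_region)
print(drift)
```

When everything goes through the pipeline, a check like this should come back empty, which makes it a handy gate before you rely on a secondary region.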
Item #6 – Monitoring, Monitoring, Monitoring
Another element of high availability and resiliency is monitoring. If you had asked me years ago about the #1 question most developers treat as an afterthought, it was “How do I secure this?” And while that is a question a lot of developers still somehow treat as an afterthought, the bigger one now is “How do I monitor and know this is working?” Given the rise of microservices and serverless computing, we really need to be able to monitor every piece of code we deploy. So we need hooks in anything new we build to answer that question.
This could be as simple as logging custom telemetry to Application Insights, or logging incoming and outgoing requests, exceptions, etc. But we can’t be sure something is running without implementing these metrics.
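As a sketch of what such a hook could look like, here is a small Python decorator that logs the start, outcome, duration, and any exception for an operation. It uses the standard library `logging` module as a stand-in for whatever telemetry backend you actually send to. The `monitored` decorator and the `send_notification` function are hypothetical names for this example.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("telemetry")


def monitored(operation_name: str):
    """Log start, success or failure, and duration for the wrapped function."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = time.perf_counter()
            logger.info("start operation=%s", operation_name)
            try:
                result = func(*args, **kwargs)
            except Exception:
                elapsed_ms = (time.perf_counter() - started) * 1000
                # logger.exception records the stack trace along with the message.
                logger.exception(
                    "failed operation=%s duration_ms=%.1f", operation_name, elapsed_ms
                )
                raise
            elapsed_ms = (time.perf_counter() - started) * 1000
            logger.info("ok operation=%s duration_ms=%.1f", operation_name, elapsed_ms)
            return result

        return wrapper

    return decorator


@monitored("send_notification")
def send_notification(recipient: str) -> str:
    if not recipient:
        raise ValueError("recipient is required")
    return f"sent to {recipient}"


print(send_notification("ops@example.com"))
```

The exception is logged and then re-raised, so adding the hook does not change how the caller sees failures.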
Item #7 – Control Configuration
This one, I’m building upon comments above. The biggest mistake that I see people make with these kinds of implementations is that they don’t manage how configuration changes are made to an environment. Ultimately this comes down to a “pets” vs “cattle” mentality. I had a boss once in my career who had a sign above his office that said “Servers are cattle, not pets…sometimes you have to make hamburgers.”
And as funny as the above statement is, there is an element of truth to it. If you allow configuration changes and fixes to be applied directly to an environment, you create a situation where it is impossible to rely on automation with any degree of trust. And it makes monitoring and every other element of a truly highly available or resilient architecture completely irrelevant.
So the best thing you can do is leverage the automated pipeline: if any change needs to be made, it must be pushed through the pipeline. Ideally, remove people’s access to production for anything beyond read access to metrics and logging.
Item #8 – Remove “uniqueness” of environment
And like above, we need to make sure everything about our environments is repeatable. In theory I should be able to blow an entire environment away and, with the click of a button, deploy a new one. This is only possible by scripting everything. I’m a huge fan of Terraform to help resolve this problem, but bash scripts, PowerShell, CLI, pick your poison.
The more you can remove anything unique about an environment, the easier it is to replicate it and create, at a minimum, an active / passive environment.
Item #9 – Start implementing availability patterns
If you are starting down the road of enhancing the resiliency of your applications, there are several patterns you should consider as you build out new services, because they help create the type of resiliency you are building towards. Those patterns include:
- Health Endpoint Monitoring – Implementing functional checks in an application that external tools can access through exposed endpoints to verify it is working.
- Queue-Based Load Leveling – Leveraging queues that act as a buffer, putting a layer of abstraction between incoming requests and the services that process them so your application handles them in a more resilient manner.
- Throttling – This pattern helps with managing resource consumption so that you can meet system demand while controlling consumption.
- Circuit Breaker – This pattern is extremely valuable in my experience. Your service should be smart enough to stop calling a downstream service that is impacted, back off, and only retry incrementally until it recovers.
- Bulkhead – This pattern leverages separation and a focus on fault tolerance to ensure that one service being down does not take the whole application down.
- Compensating Transaction – If you are using a bulkhead, or any kind of fault tolerance, or have separation of concerns, it’s important that you be able to roll a transaction back to its original state.
- Retry – The most basic pattern to implement and essential to build transient fault tolerance.
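To show how two of these patterns work together, here is a minimal Python sketch of Retry with exponential back-off wrapped around a simple Circuit Breaker. It is an illustration under assumptions, not a production implementation: the `CircuitBreaker` class, the `retry` helper, the thresholds, and the `flaky_service` function are all hypothetical, and a real service would likely use an existing resiliency library.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when calls are rejected to protect a failing dependency."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int,
        reset_after_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_after_seconds = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_seconds:
                raise CircuitOpenError("downstream service is marked unhealthy")
            # Cool-down elapsed: allow one trial call; one more failure reopens.
            self._opened_at = None
            self._failures = self._failure_threshold - 1
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result


def retry(
    func: Callable[[], T],
    attempts: int,
    base_delay_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry with exponential back-off; re-raise the last error when exhausted."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # do not keep hammering an open circuit
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay_seconds * (2 ** attempt))
    raise ValueError("attempts must be at least 1")


calls = {"count": 0}


def flaky_service() -> str:
    """Fails twice with a transient fault, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient fault")
    return "ok"


breaker = CircuitBreaker(failure_threshold=5, reset_after_seconds=30.0)
delays: List[float] = []  # record the back-off delays instead of sleeping
result = retry(
    lambda: breaker.call(flaky_service),
    attempts=4,
    base_delay_seconds=0.1,
    sleep=delays.append,
)
print(result, delays)  # ok [0.1, 0.2]
```

Retry handles the transient faults, while the breaker stops the retries from piling onto a dependency that is genuinely down.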
Item #10 – Remember this is an evolving process
As described earlier in this post, if you are looking to build out more cloud-based functionality, and in turn increase the resiliency of your applications, the best advice I can give is to remember that this is an iterative process, and to look for opportunities to update your application and increase its resiliency as you go.
For example, let’s say I have to make changes to an API that sends notifications. If I’m going to make those updates, maybe I can implement queues and logging, and make some changes to break that out into a microservice to increase resiliency. As you do this, you will find that your application’s position will improve.