Keeping the lights on! – Architecting for availability?

Hello all, it’s been a while since I did a blog post outside of the weekly updates, but I wanted to write one about a conversation I’ve been having a lot lately, and one that seems to be largely universal: high availability. More and more, software is becoming a critical part of every aspect of our lives. To that end, as developers and engineers we see the following scenarios becoming a constant reality:

  • For end customer software, not having access to an app or service for an extended timeframe can be the final nail in the coffin for a lot of users. Their tolerance for downtime continues to drop. If you don’t believe me, research YouTube’s metrics on how long someone will wait for a video to load before leaving.
  • For enterprises, organizations are becoming more and more reliant on software to function at the most basic level, meaning that outages or downtime windows have an even greater impact on their business, causing more parts of the organization to have to function at a diminished capacity or not at all during an outage.

The end result of these perceptions / realities is that the demands put on software solutions for maintaining availability keep going higher and higher. It becomes important to architect and plan for high availability from the start, because if you don’t, it can be very expensive and difficult to retrofit your applications to meet these demands.

This is a huge topic, and one that I’m not going to be able to cover in one blog post, but I’m hoping that we can identify ways to help if you are being tasked with meeting these demands.

Defining SLA

So the first part of this conversation, in my experience, always starts the same way: “What’s our SLA?” So let’s talk through what an SLA is. SLA stands for Service Level Agreement, and it is a legal agreement about the level of service you are required to provide.

The key part of that is “legal agreement”. This is not strictly a software function or engineering concept, but a business agreement, in the sense that if an SLA is not met, the organization has a financial obligation to compensate the customer (in an enterprise setting).

Be Reasonable…

Let’s not get crazy!

So the most common mistake I hear when someone starts down this road is “we need a 100% SLA”, which is a bad place to start this process. Realistically this is almost impossible; the idea that you will never have an outage is extreme. To get that level of resiliency you can expect to pay for it, and it’s easy to get upside down on your costs by starting out here. We really need to be realistic about the ask.

Let’s walk through an example. Let’s say you have software that provides grant processing for a municipality, and that grant reviews are done Monday to Friday during business hours (8am-6pm). If your customer says “We need a 100% SLA”, I would make the counter argument of “Do you really?” If the system is down from 1-2am on a Saturday, does that really affect you and the nature of the business? Or is this just a matter of needing the solution to be up during those core business operating hours?

Conversely, let’s go the other way and say that you are providing a solution for emergency service communication during a natural disaster. Would your customer be ok with a 5-minute downtime at 2am in the middle of a hurricane? Probably not. So tolerance should be measured in terms of actual impact on the end user and their ability to function.

High availability is like insurance: I can get add-ons to my policy for everything that could ever happen, but that means that I will likely be paying for things I don’t need. I can get volcano insurance in Pennsylvania, but the odds of needing it are so low as to make it ridiculous.

So what we should be doing is finding a happy balance between what we can realistically do by following recommended processes, and weighing that against the business calculation and the cost.

Let me give you a high level example. Let’s say I deploy my production environment to one region, and I’ve calculated the composite SLA (more on this later) to be 99.9% for that region. That means that right now I am telling my customers to expect about 43.2 minutes of downtime a month (0.1% of a 30-day month).

But if I stood up a secondary region, and built out a lot of automation around failover and monitoring (let’s say 80 hours of work), I could raise that SLA from 99.9% to 99.99%, which would mean a downtime of 4.32 minutes a month.

Now what I need to weigh is the following:

  • 80 hours worth of labor costs
  • opportunity cost of not using that labor resource on new features
  • doubling my environment costs (2 active regions)
  • Potential advantage by supporting a higher SLA.

And I look at that and say, I’m saving 38.88 minutes of downtime a month in the process. So the question is: does that help my business and make sense from a financial position? Or am I “ok” having only 1 environment up, rolling the dice, and taking the financial hit of paying out if our downtime exceeds what the 99.99% allows?
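The downtime figures in this example can be reproduced with a few lines of Python. This is just a sketch of the arithmetic, assuming a 30-day month; the function name is mine and not part of any library.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes, assuming a 30-day month


def allowed_downtime_minutes(sla_percent: float) -> float:
    """Return the monthly downtime (in minutes) that an SLA percentage permits."""
    unavailability = 1 - sla_percent / 100
    return MINUTES_PER_MONTH * unavailability


single_region = allowed_downtime_minutes(99.9)
two_regions = allowed_downtime_minutes(99.99)

print(f"99.9%  -> {single_region:.2f} minutes/month")
print(f"99.99% -> {two_regions:.2f} minutes/month")
print(f"saved  -> {single_region - two_regions:.2f} minutes/month")
```

Plugging in other targets (99.95%, 99.999%) is a quick way to see what each extra "nine" actually buys before pricing out the work.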

I can’t say in the above discussion what the right answer is, because ultimately it depends on the type of business and resiliency of the application. You might be comfortable with that, you might not.

My point is that at the end of the day this is both an engineering problem and a business problem, and likely the right answer is somewhere in the middle.

Now to be clear, other times, especially in enterprise software, the customer may require a certain SLA, and at that point you might have to show that you meet that SLA by having specific redundancies in place. I’ll talk about this more in our next section.

Calculating a composite SLA

Another common question is “How do I calculate the SLA of my service?” This is more straightforward than people realize. Let’s take the following example:

Note: You can find all of Azure’s SLAs here.

Service        SLA
App Service    99.95%
Azure SQL      99.99%

So based on the table above, the composite SLA would be:

.9995 * .9999 = .9994 = 99.94%

So that would imply that your cloud provider is standing behind these services to have a monthly downtime of no more than:

730 (hours per month) * (1 – .9994) = 0.438 hours = 26.28 minutes

Now the above is an estimate, but that is around what we could expect our monthly downtime to be. The method doesn’t change as you add more services: you multiply in the SLA of each additional service the application depends on, so the composite number only gets lower.
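Here is the same calculation as a small Python sketch, using the two SLA values from the table. The helper name is my own, and the 730 hours per month is the same approximation used above.

```python
from math import prod

HOURS_PER_MONTH = 730  # same approximation used in the calculation above


def composite_sla(*slas: float) -> float:
    """Multiply the SLAs (as fractions) of services that must all be up."""
    return prod(slas)


platform = composite_sla(0.9995, 0.9999)  # App Service, Azure SQL
downtime_minutes = HOURS_PER_MONTH * (1 - platform) * 60

print(f"composite SLA: {platform:.2%}")
print(f"expected downtime: {downtime_minutes:.1f} minutes/month")

# Every additional dependency can only pull the composite number down.
with_third_service = composite_sla(0.9995, 0.9999, 0.9995)
print(f"with a third 99.95% service: {with_third_service:.2%}")
```

The code keeps the unrounded product, so the downtime it prints differs very slightly from the hand calculation, which rounds to .9994 first.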

Now it’s important to note that this is the platform SLA, not your SLA. I say that because, at the end of the day, this assumes that your application doesn’t have issues of its own that cause downtime, so that should be considered as well.

How do we improve our SLA, start with “what is down?”

Now for many cloud services, Microsoft and every other cloud provider give recommendations to enhance resiliency and improve your SLA. One way to do that is to leverage items like Availability Zones and multi-region deployments. This allows you to spread your application across multiple zones or regions, which makes the probability of a complete outage drop substantially.

Really the first step here is to do a failure mode analysis, and determination of critical functionality. And what I mean by that is we need to define what constitutes the system being “Down”. So let’s take for instance you have an eCommerce platform, something like NopCommerce, and you have the following use-cases:

  1. Browse the catalog
  2. Add items to shopping cart
  3. Purchase items
  4. Publish blogs
  5. Send out notifications of deals / sales
  6. Process Orders

Now based on the above, we could identify 1, 2, 3, and 6 as mission critical: if we can’t allow our customers to shop, buy, and receive their products, that means that we are out of business. If we can’t publish a blog when we want to, or if a sale notice goes out a little late, it’s not ideal, but it’s not the end of the world. And let’s say that we have Azure Functions sending the notifications, and the blogs and promotions are managed by Cosmos DB.

So now based on that, we need to examine our architecture and identify what components are required to maintain the four key use cases we identified. Notice I left off the elements that are not part of our key functionality for our SLA.
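One way to make this step concrete is to write the use cases down as data and let the code tell you which components fall inside the SLA. This is only a sketch: the use-case-to-component mapping below is my assumption for illustration (shopping, buying, and order processing run through the web tier and database; notifications and blogs use Azure Functions and Cosmos DB), so substitute your real dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    name: str
    critical: bool
    components: frozenset[str]


# Hypothetical dependency mapping, for illustration only.
WEB_TIER = frozenset({"Application Gateway", "App Service", "Azure SQL"})

USE_CASES = [
    UseCase("Browse the catalog", True, WEB_TIER),
    UseCase("Add items to shopping cart", True, WEB_TIER),
    UseCase("Purchase items", True, WEB_TIER),
    UseCase("Publish blogs", False, WEB_TIER | {"Cosmos DB"}),
    UseCase("Send out notifications of deals / sales", False,
            frozenset({"Azure Functions", "Cosmos DB"})),
    UseCase("Process orders", True, WEB_TIER),
]


def sla_components(use_cases):
    """Collect every component that at least one critical use case depends on."""
    needed = set()
    for use_case in use_cases:
        if use_case.critical:
            needed |= use_case.components
    return needed


print(sorted(sla_components(USE_CASES)))
```

Anything that does not show up in that output (here, Azure Functions and Cosmos DB) can be left out of the SLA math, which keeps the composite number from being dragged down by features you can live without for a while.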

Let’s say we have the proposed architecture:

Now based on the above, I can calculate our primary region SLA to be:

Service               SLA
Application Gateway   99.95%
App Service           99.95%
Azure SQL             99.99%
Total SLA             99.89%

So as a result of the above, we need to examine which elements of our solution are critical to meeting our uptime SLA, and then do a failure analysis. Based on the above use-cases, we can assume that Traffic Manager, Application Gateway, App Service, and Azure SQL are essential to meeting our SLA. For the sake of this example, let’s say that the caching layer follows industry recommendations and is used only for speed of access; if it is not available, the application will just reach out to the database.

So how do we calculate the composite SLA for the two regions? We do that with the following math:

We basically have to figure out the probability of both regions being offline at the same time. Assuming the regions fail independently, we take the region “unavailability” of 0.11% (100% – 99.89%) and multiply it by itself:

0.11% * 0.11% = 0.0011 * 0.0011 = 0.00000121 = 0.000121%

Convert it back to availability:

100% – 0.000121% = 99.999879%

Now we take that multiplied by the Traffic Manager SLA (99.99%):

.99999879 * .9999 = .9999 (rounded) = 99.99%
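The two-region math can be sketched in Python as well. This assumes the regions are identical and fail independently of each other, which is a simplification; the function names are mine.

```python
from math import prod


def composite_sla(*slas: float) -> float:
    """Availability of a chain of services that must all be up."""
    return prod(slas)


def redundant_sla(region_sla: float, regions: int = 2) -> float:
    """Availability when any one of N independent, identical regions is enough."""
    return 1 - (1 - region_sla) ** regions


# Application Gateway, App Service, Azure SQL in a single region
region = composite_sla(0.9995, 0.9995, 0.9999)

# At least one of two regions is up
either_region = redundant_sla(region, regions=2)

# Traffic Manager (99.99%) sits in front of both regions
overall = composite_sla(either_region, 0.9999)

print(f"single region:        {region:.4%}")
print(f"either of 2 regions:  {either_region:.4%}")
print(f"with Traffic Manager: {overall:.4%}")
```

Notice that once you have two regions, the router in front of them becomes the limiting factor: the overall number can never be better than the Traffic Manager SLA.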

Failure Mode Analysis:

A failure analysis means that we pick apart each element of the infrastructure and identify the following:

  • What potential failures could occur?
  • What are the different “modes” or “states” that this component can be in?
  • How likely is a failure of this component?
  • What is the impact of each failure “mode” or “state” on the application?

After examining the above, you need to look at each of the “modes” or “states” and identify the following:

  • How will you respond and recover?
  • How will you monitor for this situation, before, during, and after?

So let’s take an example, because to me that always helps. Take Azure SQL Database from the above solution. If I were to do a failure mode analysis, I would find the following:

  • The database is offline in the following situations:
    • The database is offline due to a platform issue
    • The database is shut down
    • The database is deleted
  • The database is in a degraded state in the following situations:
    • The database is performing slowly due to high website demand
    • The database is running slowly due to bad query optimization
    • The database is experiencing deadlocks

Now this is by no means an exhaustive list, but it hits the high points for our ecommerce site. For each of those states, I need to identify what to do, so the question is how do we respond and recover. In the case of the database, the most common recommendations are to use a standard tier and to use active geo-replication.

So for “How do we respond and recover?” I would say we set up active geo-replication of our production database to a secondary region. In the event the database is “offline”, we fail over to the secondary region and leverage Traffic Manager to route to the backup site. We would see some data loss during the failover, but for this exercise, let’s say that is manageable.

The next question is the most important, how do we monitor for this? The answer is we could do this a couple of ways:

  • Set up alerts via Azure Monitor around specific metrics.
  • Set up alerts in Application Insights for dependency failures on database calls.
  • Build a page within our application that Traffic Manager can probe to identify when the database is unreachable and trigger failover.
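To give a feel for that last option, here is a minimal sketch of a health page using only the Python standard library. The `/health` path, the port, and the `check_database` placeholder are all assumptions for illustration; in a real application you would wire this into your web framework and run a cheap query against the database. By default a probe like Traffic Manager’s treats a 200 response as healthy, so the page returns 503 when the database check fails.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database() -> bool:
    """Hypothetical placeholder: replace with a cheap query such as SELECT 1."""
    return True


def health_response(db_ok: bool) -> tuple[int, bytes]:
    """Map dependency health to the HTTP status the probe will see."""
    if db_ok:
        return 200, b"healthy"
    return 503, b"database unreachable"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":  # assumed probe path
            self.send_error(404)
            return
        try:
            db_ok = check_database()
        except Exception:
            # Any failure while checking counts as unhealthy.
            db_ok = False
        status, body = health_response(db_ok)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Start the health endpoint; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Keeping the status decision in `health_response` makes it easy to unit test without starting a server.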

The next mode was “degraded”, and if we examine that, the response is to increase the performance tier of the database to handle the increased demand, or to do more in-depth analysis of the database’s performance. Again, the monitoring would be similar: setting up alerts around these conditions to make the appropriate staff aware.
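If it helps, the analysis for the database can be captured as a small data structure so that nothing gets skipped. This is a sketch of one possible layout, not a standard format; the class and field names are my own, and the entries just restate the example above.

```python
from dataclasses import dataclass, field


@dataclass
class FailureMode:
    component: str
    state: str  # "offline" or "degraded"
    causes: list[str]
    response: str
    monitoring: list[str] = field(default_factory=list)


AZURE_SQL_FMA = [
    FailureMode(
        component="Azure SQL Database",
        state="offline",
        causes=["platform issue", "database shut down", "database deleted"],
        response="fail over to the geo-replicated secondary region "
                 "and route traffic to the backup site",
        monitoring=[
            "Azure Monitor metric alerts",
            "Application Insights dependency failure alerts",
            "health page probed by Traffic Manager",
        ],
    ),
    FailureMode(
        component="Azure SQL Database",
        state="degraded",
        causes=["high website demand", "bad query optimization", "deadlocks"],
        response="increase the performance tier or analyze database performance",
        monitoring=["alerts on performance conditions"],
    ),
]


def unmonitored(modes: list[FailureMode]) -> list[FailureMode]:
    """Return failure modes that nobody would be alerted about."""
    return [mode for mode in modes if not mode.monitoring]


for mode in AZURE_SQL_FMA:
    print(f"{mode.component} [{mode.state}]: {mode.response}")
print("unmonitored modes:", len(unmonitored(AZURE_SQL_FMA)))
```

Repeating this for each component in the architecture gives you a checklist where every state has both a response and a way of finding out it happened.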

So all kidding aside, this is a huge topic, and one I want to boil down further in terms of how best to implement these solutions. This post didn’t begin to discuss the differences between RTO / RPO, or how you ensure resiliency through transient fault tolerance or distributed architectures, and that’s just scratching the surface, so more to come.

Weekly Links – 11/4

Hello all, so I goofed and we missed last week’s post. I was actually at the International Association of Chiefs of Police conference, which is always a whirlwind and crazy experience.

So that being said, down to business.

Development:

  • Build Great Xamarin Apps with App Center : As you probably know by now, I’m a big DevOps fan, and firm believer in its value. So this is a great description on how to use App Center to implement DevOps on your Xamarin Mobile Apps.

Cloud:

Video / Audio:

Fun Stuff:

So I’m a big fan of the Witcher books. Honestly, for as much as I enjoy games like Dungeons and Dragons, I’m usually not a fan of fantasy fiction. In general I find most of the genre to be overly slow in its storytelling, and I get bored. But one of the exceptions to that is the Witcher, which tells engaging and dynamic stories really quickly. Geralt is a great character who lives in a really fascinating world. So when I heard Netflix was adapting it, I was hopeful but concerned. Well the trailer dropped, and I’m really excited, this looks great.

Weekly links – 10/14

So this past week, I spent every free moment working on a shed in my backyard, and like any construction project it’s had a slew of delays. But we are powering through:

See the source image

Down to business…

Development:

Cloud:

Audio / Video:

Fun Stuff:

So as always I’m a big comic fan, and I’ve said before I’m a fan of the CW Arrowverse. For as much as DC movies are terrible, their TV shows are quite excellent. And the standout last year was Supergirl, it really tapped into what makes for the best Superman / Supergirl stories. The best stories are all based around problems that they can’t “super power their way out of”. Last season tackled real topics like trust of the media, xenophobia, racism, and others. This season is already moving towards tackling technology and its ability to change the way that we view reality and connect with each other.

Weekly Links – 10/7

So this week was a little crazy with family obligations, work travel, busy work schedules, etc. But overall it wasn’t a bad kind of week. We are officially into full on October. Right now I’m busy prepping for our monthly game today, which should be a lot of fun.

Now down to the business…

Development:

Cloud:

Audio / Video:

Fun Stuff:

I’m a big comic fan, and Greg Rucka is one of my favorite writers. Stumptown was one of his pet projects, and really the book plays out like a more real-world Jessica Jones. And I gotta say, I watched the pilot and was really impressed.

Weekly Links – 9/30

So this week was a lot of travel in the middle of the week, thanks to delays it took me 12 hours to get from Atlanta to Pennsylvania, which is absolute insanity. But I made it, and all is good at the end of the day.

The important part is that it’s almost Halloween, and my kids have literally started a full on countdown to Halloween. My daughter is obsessed: we kicked off the week with a visit to Spirit Halloween to see the “fun scary stuff”, and she has started playing the Nightmare Before Christmas on repeat.

So down to business:

Development:

So as I mentioned, this week was .NET Conf, and we got a bunch of cool announcements:

  • Announcing .NET Core 3.0 : Pretty exciting to see the new version of .NET Core finally get a GA release. Really excited to start building with it.
  • Xamarin Announcements at .NET Conf : So excited about this, Hot Reload, that’s amazing. Compiling in mobile is a very time consuming process and this is a huge timesaver.
  • ML.NET updates : A bunch of new updates that help with use of ML.NET to bring new functionality to your app. Specifically feature engineering is pretty huge, and removes what I saw as a major limitation.
  • Free .NET, C#, and ASP.NET Training : Who doesn’t love free training!

Cloud:

  • Azure Sentinel GA: Very exciting offering in the security space. A SIEM in the cloud is amazing.
  • How to Develop your service health alerting strategy : One of the most critical parts of running a cloud application is your monitoring and alerting strategy as this defines how much information you have for debugging and ensuring your service is running.

Audio / Video:

Fun Stuff:

And finally the fun stuff, being a comic fan it should surprise no one that I’m a fan of the CW Arrowverse. I was skeptical in the beginning but Arrow, Flash, and Supergirl all turned out to be really amazing shows, and every year their crossover is something I look forward to. This year they are adapting the “Crisis on Infinite Earths” storyline, and we will see a battle that affects the entire multiverse. And I give the producers credit that they really reached out and are pulling in every DC property, current or past, including the TV show that started it all, Smallville. So I’m very excited to see that Tom Welling and Erica Durance will be coming back as Clark Kent and Lois Lane in the crossover. Read more here.

Weekly Links – 9/23

So I know I’m a little late this week, but here they are. I was away at a conference in sunny Las Vegas for the week, and it was quite the week.

But anyway, down to business:

Development:

  • Cascadia Code is live: Normally don’t care about a font, but this is pretty cool because of its support of ligatures. Makes code much easier to read which is pretty awesome.
  • .NET Conf: Really cool virtual conference with more materials and announcements. Next week should have a lot of new announcements.

Cloud:

Audio / Video:

Fun Stuff:

As always, to live up to our name, here’s a nerd topic for the links. I’m a big Batman fan, always have been. I’m pretty sure my kids knew who Batman was long before they knew Big Bird. With that, I’ve been enjoying the current comic run with Tom King as the writer. It is coming to an end, and they announced the new writer, James Tynion IV, who is a great writer who wrote Batman Eternal, so I’m very excited. To read more, look here.

Weekly Links – 9/16

Hello all, it’s another week. Been really busy as we are approaching the end of a quarter. Been doing a lot of code work this week on a project that has been delayed for too long. So really enjoying that.

Now here we go with the business:

Development:

Cloud:

Audio / Video:

Fun Stuff:

So as I said last time I’m a bit of a gamer, and as this comic points to, I’m a well documented nerd. Lately I’ve found myself getting pulled back into tabletop gaming, specifically Dungeons and Dragons, and have a good game going (we play monthly, and so far we’ve done about 8 sessions). So as I get deeper into this, new things are announced all the time, and the new one is Unearthed Arcana, which is basically “Beta” content for the game for players to use.

The latest are two new subclasses: the Aberrant Mind Sorcerer and the Lurker in the Deep Warlock.

Weekly Links – 9/9

Welcome back everyone, for another weekly links post. The important note here is it’s fall, which means kids are in school, leaves are going to start turning and ….

See the source image

So down to business:

Development:

Cloud:

Audio / Video:

Fun Stuff:

I’m a bit of a gamer, he said to the surprise of no one. And it’s official that on September 5th, the new Gears of War 5 became available for its Early Access period. The worldwide release is September 10th. Very awesome. I always enjoyed Gears of War because it is one seriously intense game.

Here’s the article. Warning, Mature audiences.

Weekly Links – 9/3

Hello All, I’m a little late getting this out the door, but a fully sick family, complete with a side-helping of pneumonia for me didn’t help this weekend. But here it is.

Development:

Cloud:

Video and Audio:

Weekly Links – 8/26

End of summer, school is back in session. I’m just going to leave this here:

See the source image

But here are the latest for this week:

Development:

Cloud:

Videos / Audio: