
A Few Gotchas for Resiliency

So I’ve been doing a lot of work over the past few months around availability, SLAs, and Chaos Engineering, and in that time I’ve come across a few “gotchas” that are important to sidestep if you want to build stronger availability into your solutions. This is by no means meant to be an exhaustive list, but it’s at least a few tips from my experience that can help if you are starting down this road:

Gotcha #1 – Pulling too many services into the calculation

As I’ve discussed in many other posts, the first step in doing a resiliency review of your application is to figure out which functions are absolutely essential and governed by the SLA. The simple fact is that an SLA is a business decision and agreement, so just like any good contract, the first step is to figure out the scope of what is covered.

But let’s boil this down to simple math; SLA calculations are described in depth here.

Take the following as an example:

Service        SLA
App Service    99.95%
Azure Redis    99.9%
Azure SQL      99.99%

Based on the above, the calculation for the SLA gives us the following:

.9995 * .999 * .9999 = .9984 = 99.84%

Now, if I look at the above (more on the specifics in Gotcha #2), and I remove Redis to lower the number of services involved, the calculation changes to the following:

.9995 * .9999 = .9994 = 99.94%

Notice how removing an item from the calculation causes the composite SLA to increase. Part of the reason here is that I removed something with a much lower SLA, but every item in the calculation impacts the final number, so wherever possible we should make sure we have scoped our calculations to only the services that actually support the SLA. If it helps to see that math as code, a quick sketch is below.
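
Here’s a minimal sketch of the composite SLA math above; the values come from the table, and the class itself is just illustrative:

using System;
using System.Linq;

class CompositeSla
{
    static void Main()
    {
        // SLAs from the table above, expressed as fractions of uptime.
        double[] withRedis = { 0.9995, 0.999, 0.9999 };   // App Service, Azure Redis, Azure SQL
        double[] withoutRedis = { 0.9995, 0.9999 };       // App Service, Azure SQL

        // The composite SLA is simply the product of the individual SLAs.
        double composite1 = withRedis.Aggregate(1.0, (acc, sla) => acc * sla);
        double composite2 = withoutRedis.Aggregate(1.0, (acc, sla) => acc * sla);

        Console.WriteLine($"With Redis:    {composite1:P2}");  // ~99.84%
        Console.WriteLine($"Without Redis: {composite2:P2}");  // ~99.94%
    }
}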

Gotcha #2 – Using a caching layer incorrectly

Caching tiers are an essential part of any solution. When Redis was first created, caching tiers were seen as something you would implement only if you had aggressive performance requirements. But these days the demands on software solutions are so high that I would argue all major solutions have a caching tier of some kind.

Now, to slightly contradict myself: those caching tiers, while important to the performance of an application, should not be required as part of your SLA or availability calculation, if implemented correctly.

What I mean by that is caching tiers are meant to be transient; they can be dropped at any time, and the application should be able to function without them rather than relying on them as a persistence store. The most common case I see that violates this recommendation starts like the following:

  • User takes an action that requests data.
  • Application reaches down to data store to retrieve data.
  • Application puts data in Redis cache.
  • Application returns requested data.

The above has no issues at all; that’s what Redis is for. The problem is when the next part looks like this:

  • User takes an action that requests data.
  • Application pulls data from Redis and returns.
  • If data is not available, application errors out.

Given the ephemeral nature of caches, and the fact that these caches can be very hard to replicate, your application should be smart enough that if the data isn’t in Redis, it will go get it from the data store.

By implementing this pattern, and configuring your application to use its cache only as a performance optimization, you can effectively remove the Redis cache from the SLA calculation. A rough sketch of what that looks like in code is below.
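
Here’s a minimal cache-aside sketch assuming StackExchange.Redis; the ProductReader class and the LoadFromSqlAsync helper are hypothetical stand-ins for your real data access code:

using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public class ProductReader
{
    private readonly IDatabase _cache;

    public ProductReader(IConnectionMultiplexer redis) => _cache = redis.GetDatabase();

    public async Task<string> GetProductJsonAsync(string id)
    {
        var key = $"product:{id}";

        // 1. Try the cache first, but treat any cache failure as a miss, never as an error.
        try
        {
            var cached = await _cache.StringGetAsync(key);
            if (!cached.IsNullOrEmpty)
            {
                return cached;
            }
        }
        catch (Exception)
        {
            // Cache unavailable; fall through to the data store.
        }

        // 2. Fall back to the system of record.
        var json = await LoadFromSqlAsync(id);

        // 3. Best-effort re-populate with an expiry; never let a cache write failure bubble up.
        try
        {
            await _cache.StringSetAsync(key, json, TimeSpan.FromMinutes(5));
        }
        catch (Exception)
        {
            // Ignore; the cache is only a performance optimization.
        }

        return json;
    }

    // Hypothetical placeholder for the real database query.
    private Task<string> LoadFromSqlAsync(string id) =>
        Task.FromResult($"{{ \"id\": \"{id}\" }}");
}

Because every cache failure degrades to a database read, Redis being down hurts performance but not availability, which is exactly what lets you leave it out of the SLA math.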

Gotcha #3 – Using the right event store

Now, the other way I’ve seen Redis or caching tiers misused is as an event data store. A practice I’ve seen done over and over again is to leverage Redis to store JSON objects as part of an event store because of the performance benefits. There are more appropriate technologies that can support this pattern better and manage costs while benefiting your solutions:

  • Cosmos DB: Cosmos is designed exactly for this purpose, providing high performance and high availability for your applications, and it allows you to configure the appropriate write strategy (see the sketch after this list).
  • Blob Storage: Again, Blob Storage can be used as an event store by writing objects to blobs; although not my first choice, it is a viable option for managing costs.
  • Other database technologies: There are a myriad of potential options here, from MariaDB, PostgreSQL, MySQL, SQL Server, etc., all of which perform this operation better.
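
For the Cosmos DB option, here’s a minimal sketch assuming the Microsoft.Azure.Cosmos SDK; the database, container, and OrderEvent names are made-up examples:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Cosmos requires a lowercase "id" property on every item.
public record OrderEvent(string id, string streamId, string type, string payload);

public class EventWriter
{
    private readonly Container _events;

    public EventWriter(CosmosClient client) =>
        _events = client.GetContainer("eventsdb", "events");

    public Task AppendAsync(OrderEvent evt) =>
        // Writes are durable and partitioned by stream, unlike a volatile cache entry.
        _events.CreateItemAsync(evt, new PartitionKey(evt.streamId));
}

Compared to stuffing JSON into Redis, you get durable, replicated storage, and you can tune consistency and throughput to match the workload.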

Gotcha #4 – Mismanaged Configuration

I did a whole post on this, but the idea that configuration cannot be changed without causing an application event is always a concern. You should be able to change an application’s endpoint without any major hiccups in its operation.


Fast and Furious: Configuration Drift

Unlike in the movie Tokyo Drift, the philosophy of “you’re not in control until you’re out of control” is pretty much the worst one you can adopt when delivering software.


Don’t get me wrong, I love the movie. But configuration drift is the kind of thing that can cripple an organization; it can be the poison pill that ruins your ability to support high availability for any solution and increases your operational costs exponentially.

What is configuration drift?

Configuration drift is the problem that occurs when manual changes are allowed in an environment, causing environments to diverge in ways that are undocumented.

Stop me if you’ve heard this one before:

  • You deploy code to a dev environment, and everything works fine.
  • You run a battery of tests, automated or otherwise to ensure everything works.
  • You deploy that same code to a test environment.
  • You run a battery of tests, and some fail; you make some changes to the environment and things work just as you would expect.
  • You deploy to production, expecting it to all go fine, and start seeing errors and issues and end up losing hours to debugging weird issues.
  • During which you find a bunch of environment issues, and you fix each of them. You get things stable and are finally through everything.

Now honestly, that should sound pretty familiar; we’ve all lived it if I’m being honest. The problem is that this kind of situation causes configuration drift. What I mean by configuration drift is the situation where there is “drift” in the configuration of the environments, such that they have differences that can cause additional problems.

If you look at the above, you will see a pattern of behavior that leads to bigger issues. One of the biggest is that the problem actually starts in the lower environments, where there are clearly configuration issues that are just “fixed” for the sake of convenience.

What kind of problems does Configuration Drift create?

Ultimately by allowing configuration drift to happen, you are undermining your ability to make processes truly repeatable. You essentially create a situation where certain environments are “golden.”

So this creates a situation where each environment, or even each virtual machine, can’t be trusted to run the pieces of the application.

This problem, gets even worse when you consider multi-region deployments as part of each environment. You now have to manage changes across the entire environment, not just one region.

This can cause a lot of problems:

  • Inconsistent service monitoring
  • Increased difficulty debugging
  • Insufficient testing of changes
  • Increased pressure on deployments
  • Eroding user confidence

How does this impact availability?

When you have configuration drift, it undermines the ability to deploy reliably to multiple regions, which means you can’t trust your failover, and you can’t create new environments as needed.

The most important thing to keep in mind is that the core concept behind everything here is that “Services are cattle, not pets…sometimes you have to make hamburgers.”

What can we do to fix it?

So given the above, how do we fix it? There are some things you can do that are process based, and others that are tool based. In my experience, it starts with “admitting there is a problem.” Deadlines will always be aggressive, and demands for releases will always be greater, but you have to take a step back and say: “if we have to change something, it has to be done by script.”

By forcing all changes to go through the pipeline, we make sure everyone is aware of them and that the changes are applied the same way every time. It takes discipline to force yourself to do that, but it changes the above flow in the following ways:

  • You deploy code to a dev environment, and everything works fine.
  • You run a battery of tests, automated or otherwise to ensure everything works.
  • You deploy that same code to a test environment.
  • You run a battery of tests, and some fail; you make some changes to the scripts to get things working.
  • Changing those scripts automatically triggers a redeploy to dev, and automated tests are run on the dev environment.
  • The new scripts are run on test and everything works properly.
  • You deploy to production, and everything goes fine.

So ultimately you need to focus on simplifying and automating the changes to your environments, and there are some tools you can look at to help here.

  • Implement CI/CD, and limit access to environments as you move up the stack.
  • Require changes to be scripted as you push to test, stage, or preprod environments.
  • Leverage tools like Chef, Ansible, PowerShell, etc. to script the actions that have to be taken during deployments.
  • Leverage infrastructure as code, via tools like Terraform, to ensure that your environments are the same every time.

By taking some of the above steps you can make sure that things stay consistent, ultimately limiting access to production to “machines” only.


So ultimately, the summary of this article is that I wanted to call attention to this issue as one that I see plague lots of organizations.

Configuring SQL for High Availability in Terraform

Hello all, a short post this week, but as we talk about availability, chaos engineering, and the like, one of the most common data elements I see out there is SQL, and specifically Azure SQL. SQL is a prevalent and common data store; it’s everywhere you look.

Given that, many shops are implementing infrastructure-as-code to manage configuration drift and provide increased resiliency for their applications, which is definitely a great way to do that. The one thing a couple of people have asked me about that isn’t clear is: how can I configure geo-replication in Terraform?

This is actually built into the Terraform azurerm provider, and can be done with the following code:

provider "azurerm" {
    subscription_id = ""
    features {
    key_vault {
      purge_soft_delete_on_destroy = true
    }
  }
}

data "azurerm_client_config" "current" {}

resource "azurerm_resource_group" "rg" {
  name     = "sqlpoc"
  location = "{region}"
}

resource "azurerm_sql_server" "primary" {
  name                         = "kmack-sql-primary"
  resource_group_name          = azurerm_resource_group.rg.name
  location                     = azurerm_resource_group.rg.location
  version                      = "12.0"
  administrator_login          = "sqladmin"
  administrator_login_password = "{password}"
}

resource "azurerm_sql_server" "secondary" {
  name                         = "kmack-sql-secondary"
  resource_group_name          = azurerm_resource_group.rg.name
  location                     = "usgovarizona"
  version                      = "12.0"
  administrator_login          = "sqladmin"
  administrator_login_password = "{password}"
}

resource "azurerm_sql_database" "db1" {
  name                = "kmackdb1"
  resource_group_name = azurerm_sql_server.primary.resource_group_name
  location            = azurerm_sql_server.primary.location
  server_name         = azurerm_sql_server.primary.name
}

resource "azurerm_sql_failover_group" "example" {
  name                = "sqlpoc-failover-group"
  resource_group_name = azurerm_sql_server.primary.resource_group_name
  server_name         = azurerm_sql_server.primary.name
  databases           = [azurerm_sql_database.db1.id]
  partner_servers {
    id = azurerm_sql_server.secondary.id
  }

  read_write_endpoint_failover_policy {
    mode          = "Automatic"
    grace_minutes = 60
  }
}

Now, the above Terraform will deploy two database servers with geo-replication configured. The key part is the following:

resource "azurerm_sql_failover_group" "example" {
  name                = "sqlpoc-failover-group"
  resource_group_name = azurerm_sql_server.primary.resource_group_name
  server_name         = azurerm_sql_server.primary.name
  databases           = [azurerm_sql_database.db1.id]
  partner_servers {
    id = azurerm_sql_server.secondary.id
  }

  read_write_endpoint_failover_policy {
    mode          = "Automatic"
    grace_minutes = 60
  }
}

The important elements are “server_name” and “partner_servers”; these make the connection between the primary server and where the data is being replicated. The “read_write_endpoint_failover_policy” then sets up the failover policy.

Configuration can be a big stumbling block when it comes to availability.

So let’s face it, when we build projects, we make trade-offs. And many times those trade-offs come in the form of time and effort. We would all build the most perfect software ever… if time and budget were never a concern.

So along those lines, one thing that I find gets glossed over quickly, especially with Kubernetes and microservices, is configuration.

Configuration is something where you are likely looking at this and saying, “That’s the most ridiculous thing I’ve ever heard.” We put our configuration in a YAML file, or a web.config, and manage those values through our build pipelines. And while that might seem like a great practice, in my experience it can cause a lot more headaches in the long run than you’re probably expecting.

The problem with storing configuration in YAML files or web.configs is that they create an illusion of being able to change these settings on the fly, an illusion that can cause significant headaches when you start reaching for higher availability.

The problems these configuration files can cause include the following:

Changing these files is a deployment activity

If you need to change a value for these applications, it requires changing a configuration file, and changes to configuration files are usually tightly coupled to some kind of restart process. Take App Service as a primary example: if you store your configuration in a web.config and you make a change to that file, App Service will automatically trigger a restart, which causes a downtime event for you and/or your customers.

This is even more difficult in a Kubernetes cluster: if you store configuration in a YAML file, changing it requires the deployment agent to change the cluster. That makes it very hard to change these values in response to a change in application behavior.

For example, say you wanted to change your SQL database connection if performance degrades below a certain point. That is a lot harder to do when you are referencing a connection string in a config file on pods that are deployed across a cluster.

Storing Sensitive Configuration is a problem

Let’s face it, people make mistakes. And one of the biggest problems I’ve seen come up several times is the following statement: “We store normal configuration in a YAML file, and then sensitive configuration in a key vault.”

The problem here is that “sensitive” means different things to different people, so the odds of something being misclassified are high. It’s much easier to manage if you tell your team to treat all settings as sensitive; it makes management a lot easier and limits you to a single store.

So what do we do…

The best way I’ve found to mitigate these issues is to use an outside service, like Key Vault or Azure App Configuration, to store your configuration settings.

But that’s just step 1. Step 2 is, on startup, to cache the configuration settings for each microservice in memory in the container, and to make sure you configure that cache to expire after a set amount of time.

This provides an option whereby your microservices start up after deployment, reach out to a secure store, and cache the configuration settings in memory.

This also gains us several benefits that mitigate the problems above.

  • Allow for changing configuration settings on the fly: For example, if I wanted to change a connection string over to a read replica, that can be done by simply updating the configuration store and allowing the application to move over as the cache expires. Or, if you want even further control, you could build in a web hook that forces it to dump the configuration and re-pull it.
  • By treating all configuration as sensitive, you ensure there are no accidental leaks. This also ensures that you can manage these keys at deployment time and never have them seen by human eyes.

So this is all great, but what does this actually look like from an architecture standpoint?

For AKS, it’s a fairly easy implementation: create a sidecar for retrieving configuration, and then deploy that sidecar with any pod that is deployed.

Given this, it’s easy to see how you would implement a separate sidecar to handle this configuration. Each service within the pod is completely oblivious to how it gets its configuration; it calls a microservice to get it.

I personally favor the sidecar implementation here, because it allows you to easily bundle this with your other containers and minimizes latency and excessive network communication.

Latency will be low because it’s local to every pod, and if you ever decide to change your configuration store, it’s easy to do.

Let’s take a sample here using Azure Key Vault. If you look at the following code sample, you can see how this configuration could be managed.

Here’s some sample code, which could easily be wrapped in a container, for pulling your configuration from Key Vault:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.KeyVault;
using Microsoft.IdentityModel.Clients.ActiveDirectory;

// IConfigurationProvider and IKeyVaultConfigurationSettings are interfaces defined elsewhere
// in the solution; the settings object supplies the service principal and vault URL.
public class KeyVaultConfigurationProvider : IConfigurationProvider
{
    // Defaults pulled from environment variables; the injected settings below take precedence.
    private string _clientId = Environment.GetEnvironmentVariable("clientId");
    private string _clientSecret = Environment.GetEnvironmentVariable("clientSecret");
    private string _kvUrl = Environment.GetEnvironmentVariable("kvUrl");

    public KeyVaultConfigurationProvider(IKeyVaultConfigurationSettings kvConfigurationSettings)
    {
        _clientId = kvConfigurationSettings.ClientID;
        _clientSecret = kvConfigurationSettings.ClientSecret;
        _kvUrl = kvConfigurationSettings.KeyVaultUrl;
    }

    public async Task<string> GetSetting(string key)
    {
        // Authenticate to Key Vault with a service principal (client id/secret) via ADAL.
        KeyVaultClient kvClient = new KeyVaultClient(async (authority, resource, scope) =>
        {
            var adCredential = new ClientCredential(_clientId, _clientSecret);
            var authenticationContext = new AuthenticationContext(authority, null);
            return (await authenticationContext.AcquireTokenAsync(resource, adCredential)).AccessToken;
        });

        // Secrets are addressed as {vaultUrl}/secrets/{secretName}.
        var path = $"{_kvUrl}/secrets/{key}";

        var ret = await kvClient.GetSecretAsync(path);

        return ret.Value;
    }
}

Now, the above code uses a single service principal to call Key Vault and pull configuration information. This could be modified to leverage pod-specific identities for even greater security and a cleaner implementation; a rough sketch of that variant is below.
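
For example, here’s a minimal sketch assuming the newer Azure.Security.KeyVault.Secrets and Azure.Identity packages, where DefaultAzureCredential picks up the pod’s managed/workload identity instead of a client id and secret (the class name is just illustrative):

using System;
using System.Threading.Tasks;
using Azure.Identity;
using Azure.Security.KeyVault.Secrets;

public class PodIdentityConfigurationProvider
{
    private readonly SecretClient _client;

    public PodIdentityConfigurationProvider(string kvUrl) =>
        // DefaultAzureCredential resolves the identity assigned to the pod at runtime,
        // so no client secret ever needs to be stored or injected.
        _client = new SecretClient(new Uri(kvUrl), new DefaultAzureCredential());

    public async Task<string> GetSetting(string key)
    {
        KeyVaultSecret secret = await _client.GetSecretAsync(key);
        return secret.Value;
    }
}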

The next step of the above implementation would be to leverage a cache for your configuration. This could be done piecemeal as needed or as a group. There are a lot of directions you could take this, but it will ultimately help you manage configuration more easily. A simple in-memory version might look like the sketch below.
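
As an illustration only, here’s a minimal sketch using Microsoft.Extensions.Caching.Memory to wrap the KeyVaultConfigurationProvider above; the 30-minute expiry is an arbitrary example value:

using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;

public class CachedConfigurationProvider
{
    private readonly KeyVaultConfigurationProvider _inner;
    private readonly IMemoryCache _cache = new MemoryCache(new MemoryCacheOptions());
    private readonly TimeSpan _ttl = TimeSpan.FromMinutes(30);

    public CachedConfigurationProvider(KeyVaultConfigurationProvider inner) => _inner = inner;

    public Task<string> GetSetting(string key) =>
        _cache.GetOrCreateAsync(key, entry =>
        {
            // Once the entry expires, the next request re-pulls the value from Key Vault,
            // which is what lets you change settings on the fly without a redeploy.
            entry.AbsoluteExpirationRelativeToNow = _ttl;
            return _inner.GetSetting(key);
        });
}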