Fast and Furious: Configuration Drift

Unlike in the movie Tokyo Drift, living by the phrase “you’re not in control, until you’re out of control” is pretty much the worst thing you can do when delivering software.

Don’t get me wrong, I love the movie. But configuration drift is the kind of thing that can cripple an organization. It can be the poison pill that ruins your ability to support high availability for any solution, and it can drive up your operating costs significantly.

What is configuration drift?

Configuration drift is the problem that occurs when manual changes are allowed in an environment, causing environments to change in ways that are undocumented.

Stop me if you heard this one before:

  • You deploy code to a dev environment, and everything works fine.
  • You run a battery of tests, automated or otherwise, to ensure everything works.
  • You deploy that same code to a test environment.
  • You run a battery of tests, and some fail. You make some changes to the environment, and things work just as you would expect.
  • You deploy to production, expecting it all to go fine, and instead start seeing errors and end up losing hours to debugging weird issues.
  • During that debugging you find a bunch of environment issues, and you fix each of them. You get things stable and are finally through everything.

Honestly, that should sound pretty familiar; we’ve all lived it. The problem is that this kind of situation causes configuration drift. By that I mean “drift” in the configuration of the environments, such that they end up with differences that can cause additional problems.

If you look at the above, you will see a pattern of behavior that leads to bigger issues. One of the biggest is that the problem actually starts in the lower environments, where there are clearly configuration issues that are just “fixed” by hand for the sake of convenience.
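
To make the idea of drift a little more concrete, here is a minimal sketch in Python of what a drift check might look like. It assumes each environment’s settings can be exported as a flat dictionary; the function name find_drift and the sample setting names are hypothetical and only there for illustration.

```python
from typing import Any, Dict, Tuple

# Marker used when a setting does not exist in one of the environments.
MISSING = "<missing>"


def find_drift(reference: Dict[str, Any], candidate: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return every setting whose value differs between two environments.

    The result maps setting name -> (reference value, candidate value).
    """
    drift = {}
    for key in sorted(set(reference) | set(candidate)):
        ref_value = reference.get(key, MISSING)
        cand_value = candidate.get(key, MISSING)
        if ref_value != cand_value:
            drift[key] = (ref_value, cand_value)
    return drift


# Hypothetical settings: someone "fixed" test by hand and never recorded it.
dev = {"tls_version": "1.2", "max_connections": 100}
test = {"tls_version": "1.2", "max_connections": 250, "debug_logging": True}

for name, (dev_value, test_value) in find_drift(dev, test).items():
    print(f"{name}: dev={dev_value!r} test={test_value!r}")
```

Run on a schedule against every environment, a check like this at least tells you that drift exists before production does.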

What kind of problems does Configuration Drift create?

Ultimately, by allowing configuration drift to happen, you are undermining your ability to make processes truly repeatable. You essentially create a situation where certain environments are “golden.”

This means that an environment, or even an individual virtual machine, can’t be trusted to run the pieces of the application.

This problem gets even worse when you consider multi-region deployments as part of each environment. You now have to manage changes across the entire environment, not just one region.

This can cause a lot of problems:

  • Inconsistent service monitoring
  • Increased difficulty debugging
  • Insufficient testing of changes
  • Increased pressure on deployments
  • Eroding user confidence

How does this impact availability?

When you have configuration drift, it undermines the ability to deploy reliably to multiple regions, which means you can’t trust your failover, and you can’t create new environments as needed.

The most important thing to keep in mind is the core concept behind everything here: “Services are cattle, not pets…sometimes you have to make hamburgers.”

What can we do to fix it?

So given the above, how do we fix it? Some of the things you can do are process-based, and others are tool-based. In my experience, it starts with “admitting there is a problem.” Deadlines will always be more aggressive, and demands for release will always be greater. But you have to take a step back and say: “If we have to change something, it has to be done by script.”

By forcing all changes to go through the pipeline, we can make sure everyone is aware of them, and make sure they are applied the same way every time. It takes discipline to hold yourself to that, but it will change the above flow in the following ways:

  • You deploy code to a dev environment, and everything works fine.
  • You run a battery of tests, automated or otherwise, to ensure everything works.
  • You deploy that same code to a test environment.
  • You run a battery of tests, and some fail. You make some changes to the scripts to get it working.
  • When you change those scripts, you automatically redeploy to dev, and the automated tests are run on the dev environment.
  • The new scripts are run on test and everything works properly.
  • You deploy to production, and everything goes fine.

So ultimately you need to focus on simplifying and automating the changes to the environments. Here are some practices and tools you can look at to help:

  • Implement CI/CD, and limit access to environments as you move up the stack.
  • Require changes to be scripted as you push to test, stage, or preprod environments.
  • Leverage tools like Chef, Ansible, PowerShell, etc. to script the actions that have to be taken during deployments.
  • Leverage infrastructure as code, via tools like Terraform, to ensure that your environments are the same every time.

By taking some of the above steps you can make sure that things are consistent, and ultimately limit access to production to “machines” only.

So ultimately, I wanted to call attention to this issue because it is one that I see plague lots of organizations.

Building a Reservoir of Good Will

So for a completely non-technology post, I thought I would pass along some general tips to live by. Let’s face it, the nature of work has changed a lot for everyone. Most of us are busier than we’ve ever been. Things are downright crazy for a lot of us, and people are working together to do more, and accomplish more, with less.

To that end, I’ve picked up a couple of tips to help you build a reservoir of good will with the people you work with in a corporate setting. It boils down to a few small gestures that can really help to encourage partnership and growth with others. These are practices that I lean on heavily, and I find they help me and others, so I wanted to share them.

Visibility Matters to everyone

The old adage is “No man is an island,” and that is more true now than ever. Odds are, if you are working on something, you are working with a team.

If you read any books on leadership, one of the first things you will see common to all of them is that your own individual accomplishment is the last thing you should focus on. Make sure your thinking is aligned with the idea that you succeed as a team and fail as a team. That being said, it’s important to make sure that if anyone does something exceptional, it gets noticed.

There are a couple of ways to do that.

  • Make sure to give feedback: by this I mean that you should take time and tell people “good job” or “you really stepped up, thanks.” It’s important that people know their efforts are being appreciated and having impact.
  • Make sure their manager knows: One thing I do is keep notes on the kinds of things a team member has accomplished and send an email to their boss / manager. I know some companies have tools for this, but I find that people appreciate a good email.

I promise you that for everyone you send that email about, three things will happen:

  • They will appreciate it.
  • Their manager will appreciate it.
  • You will feel better knowing you helped someone.

Timing matters

Every company I’ve ever known has performance reviews, and those times are important to everyone. They’re important to the employee, but they’re also important to the manager.

Do your best to find out when those times are, and send a note to their manager right before that happens. Great feedback right before a performance review helps everyone. So do the best you can to make sure it lands at the right time.

Be concise and direct

You can say things like “they are a great person,” but I promise you it will land better if you say “This was the problem, and here’s how Jon / Jane went out of their way to deliver.” That kind of thing makes all the difference because it is more specific.

And make sure you keep it short. Most managers are busy people and only have a couple of minutes, so get to the point and make your argument for why this person deserves praise.

In my experience this will help you build a reservoir of good will and help those around you want to continue to work towards your team goals.

Configuring SQL for High Availability in Terraform

Hello all, a short post this week. As we talk about availability, chaos engineering, etc., one of the most common data stores I see out there is SQL, and Azure SQL in particular. SQL is prevalent; it’s everywhere you look.

Given that, many shops are implementing infrastructure-as-code to manage configuration drift and provide increased resiliency for their applications, which is definitely a great way to do that. The one thing a couple of people have told me isn’t clear is how to configure geo-replication in Terraform.

This is actually built into the Terraform azurerm provider, and can be done with the following code:

provider "azurerm" {
  subscription_id = ""
  features {
    key_vault {
      purge_soft_delete_on_destroy = true
    }
  }
}

data "azurerm_client_config" "current" {}

resource "azurerm_resource_group" "rg" {
  name     = "sqlpoc"
  location = "{region}"
}

resource "azurerm_sql_server" "primary" {
  name                         = "kmack-sql-primary"
  resource_group_name          = azurerm_resource_group.rg.name
  location                     = azurerm_resource_group.rg.location
  version                      = "12.0"
  administrator_login          = "sqladmin"
  administrator_login_password = "{password}"
}

resource "azurerm_sql_server" "secondary" {
  name                         = "kmack-sql-secondary"
  resource_group_name          = azurerm_resource_group.rg.name
  location                     = "usgovarizona"
  version                      = "12.0"
  administrator_login          = "sqladmin"
  administrator_login_password = "{password}"
}

resource "azurerm_sql_database" "db1" {
  name                = "kmackdb1"
  resource_group_name = azurerm_sql_server.primary.resource_group_name
  location            = azurerm_sql_server.primary.location
  server_name         = azurerm_sql_server.primary.name
}

resource "azurerm_sql_failover_group" "example" {
  name                = "sqlpoc-failover-group"
  resource_group_name = azurerm_sql_server.primary.resource_group_name
  server_name         = azurerm_sql_server.primary.name
  databases           = [azurerm_sql_database.db1.id]
  partner_servers {
    id = azurerm_sql_server.secondary.id
  }

  read_write_endpoint_failover_policy {
    mode          = "Automatic"
    grace_minutes = 60
  }
}

Now the above Terraform will deploy two database servers with geo-replication configured. The key part is the following:

resource "azurerm_sql_failover_group" "example" {
  name                = "sqlpoc-failover-group"
  resource_group_name = azurerm_sql_server.primary.resource_group_name
  server_name         = azurerm_sql_server.primary.name
  databases           = [azurerm_sql_database.db1.id]
  partner_servers {
    id = azurerm_sql_server.secondary.id
  }

  read_write_endpoint_failover_policy {
    mode          = "Automatic"
    grace_minutes = 60
  }
}

The important elements are “server_name” and “partner_servers”, which make the connection to where the data is being replicated. The “read_write_endpoint_failover_policy” block then sets up the failover policy.

Embracing the Chaos

So I’ve done quite a few posts recently about resiliency. It’s a topic that is becoming more and more important to everyone as you build out solutions in the cloud.

The new buzzword that’s found its way onto the scene is chaos engineering. Really, this is the practice of building out solutions that are more resilient: solutions that can survive the faults and issues that arise, and ensure the best possible delivery to end customers. The simple fact is that software solutions are absolutely critical to every element of most operations, and having them go down can ultimately break a whole business.

At its core, Chaos engineering is about pessimism :). Things are going to fail.

Sort of like every other movement, like Agile and DevOps, chaos engineering embraces a reality. In this case that reality is that failures will happen and should be expected. The goal is that you assume there will be failures, and architect for resiliency.

So what does that actually mean? It means that you determine the strength of the application by running controlled experiments that inject faults into it and observing the impact. The intention is that the application grows stronger and is able to handle faults and issues while maintaining the highest resiliency possible.

How is this something new?

Now a lot of people will read the above, and say that “chaos engineering” is just the latest buzzword to cover something everyone’s doing. And there is an element of truth to that, but the details are what matter.

What I mean by that is that there is a defined approach to doing this, and doing it in a productive manner, much like Agile and DevOps. In my experience, some teams are probably already doing elements of this, but by putting a name and a methodology to it, we call attention to the practice for those who aren’t, and provide a guide of sorts for how to approach the problem.

There are several key elements that you should keep in mind as you find ways to grow your solution by going down this path.

  • Embrace the idea that failures happen.
  • Find ways to be proactive about failures.
  • Embrace monitoring and visibility

Sort of how Agile embraced the reality that “Requirements change,” and DevOps embraced that “All code must be deployed,” chaos engineering embraces that the application will experience failures. This is a fact. We need to assume that any dependency can break, or that components will fail or be unavailable. So what do we mean at a high level for each of these?

Embrace the idea…failure happens

The idea is that elements of your solution will fail, and we know this will happen. Servers go down, service interruptions occur, and to steal a quote from Batman Begins, “Sometimes things just go bad.”

I was in a situation once where an entire network connection was taken down by a squirrel.

So we should build our code and applications in such a way that embraces that failures will eventually occur and build resiliency into our applications to accommodate that. You can’t solve a problem, until you know there is one.

How do we do that at a code level? Really this comes down to looking at your application or microservice and doing a failure mode analysis: taking an objective look at your code and asking key questions:

  • What is required to run this code?
  • What kind of SLA is offered for that service?
  • What dependencies does the service call?
  • What happens if a dependency call fails?

That analysis will help to inform how you handle those faults.
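
The SLA question in particular can be made concrete with a bit of arithmetic. When a service needs all of its dependencies to be up, the best availability it can offer is roughly the product of their individual availabilities, assuming they fail independently. Here is a minimal sketch in Python; the dependency names and SLA figures are made up for illustration, not taken from any real service.

```python
from math import prod
from typing import Dict


def composite_availability(dependency_slas: Dict[str, float]) -> float:
    """Upper bound on availability when every dependency is required.

    Assumes the dependencies fail independently of each other.
    """
    return prod(dependency_slas.values())


def downtime_minutes_per_month(availability: float, days: int = 30) -> float:
    """Convert an availability fraction into expected downtime per month."""
    return (1.0 - availability) * days * 24 * 60


# Hypothetical dependencies and SLAs, for illustration only.
slas = {"web app": 0.9995, "sql database": 0.9999, "message queue": 0.999}

total = composite_availability(slas)
print(f"composite availability: {total:.4%}")
print(f"expected downtime: {downtime_minutes_per_month(total):.1f} minutes/month")
```

The takeaway is that the composite number is always lower than the weakest dependency, which is why every dependency call needs an answer to “what happens if this fails?”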

Find ways to be proactive about failure

In a lot of ways, this comes down to leveraging patterns and practices to ensure resiliency.

After you’ve done that failure mode analysis, you need to figure out what happens when those failures occur:

  • Can we implement patterns like circuit breaker, retry logic, and load leveling, using libraries like Polly?
  • Can we implement multi-zone, multi-region, cluster based solutions to lower the probability of a fault?
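
As one illustration of the patterns in the list above, here is a minimal circuit breaker sketch in Python. Polly provides this for .NET; the class below is not Polly’s API, just a hypothetical, stripped-down version of the pattern, with made-up names and default thresholds.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Stop calling a failing dependency for a while so it can recover."""

    def __init__(self, failure_threshold: int = 3, reset_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_seconds:
                # Still cooling down: fail fast instead of hitting the dependency.
                raise CircuitOpenError("dependency is marked unavailable")
            # Cool-down elapsed: allow one trial call (the "half-open" state).
            # A single failure from here re-opens the circuit.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # Any success closes the circuit again.
        return result
```

You would wrap each call to a dependency in breaker.call(...), so that once the dependency has failed repeatedly, callers get an immediate error rather than piling more load onto something that is already struggling.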

Also at this stage, you can start thinking about how you would classify a failure. Some failures are transient, others are more severe, and you want to make sure you respond appropriately to each.

For example, a momentary network outage is very different from a database being down for an extended period. So another key element to consider is how long the fault lasts.
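
Here is a rough sketch in Python of what that classification can look like in code: transient faults are retried with exponential backoff, and anything else is treated as severe and surfaces immediately. The TransientError type, the function name, and the default delays are all assumptions for the sake of the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """A fault that is expected to clear on its own, e.g. a brief network blip."""


def call_with_retry(operation: Callable[[], T], attempts: int = 4,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry transient faults with exponential backoff.

    Any other exception is treated as severe and propagates immediately,
    because retrying will not help.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # Out of retries: stop treating the fault as transient.
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    raise ValueError("attempts must be at least 1")
```

The sleep parameter is only there so the backoff can be swapped out in tests; in normal use you would leave it at the default.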

Embrace Monitoring and Visibility

Now based on the above, the next question is, how do I even know this is happening? With microservice architectures, applications are becoming more and more decentralized, which means there are more moving parts that require monitoring.

So for me, the best next step is to go over all the failures and identify how you will monitor and alert for those events, and what your mitigations are. Say, for example, you want to do a manual failover for your database. You need to determine how long a dependency service can return failures before you are notified to do the failover.

Or how long does something have to be down before an alert is sent? And how do you log these events so that your engineers have visibility into the behavior? Sending an alert after a threshold does no one any good if they can’t see when the behavior started to happen.
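
A small sketch of that last point, in Python: alert only once failures have lasted longer than a threshold, but carry the time the failures started along with the alert. The class and field names here are hypothetical, and a real system would feed this from its own health checks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    dependency: str
    started_at: float   # When the failures began, not when the alert fired.
    detected_at: float


class OutageTracker:
    """Alert only after a dependency has been failing for `threshold_seconds`."""

    def __init__(self, dependency: str, threshold_seconds: float) -> None:
        self.dependency = dependency
        self.threshold_seconds = threshold_seconds
        self._first_failure: Optional[float] = None
        self._alerted = False

    def record(self, succeeded: bool, now: float) -> Optional[Alert]:
        """Record one health-check result; return an Alert when one is due."""
        if succeeded:
            self._first_failure = None
            self._alerted = False
            return None
        if self._first_failure is None:
            self._first_failure = now  # Remember the start of the bad behavior.
        if not self._alerted and now - self._first_failure >= self.threshold_seconds:
            self._alerted = True
            return Alert(self.dependency, self._first_failure, now)
        return None
```

With started_at on the alert, the engineer who gets paged can go straight to the logs from the moment things went wrong, rather than from the moment the threshold was crossed.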

Personally I’m a fan of the concept here, as it calls out a very important practice that I find gets overlooked more often than not.

Weekly Links – 5/18

So this week will be light on links, mainly because next week won’t be. This week is Microsoft Build, and this year it is an entirely virtual conference, so enjoy. Next weekend is Memorial Day in the United States, which is the start of summer.

I hope everyone is still doing ok with the new normal, and finding ways to have fun and relax. For my family, we’ve taken to doing more gaming as a family, and it’s been an absolute blast. Additionally, my normal DnD group, which met monthly, is now meeting weekly via video conference. So far it’s led to some really amazing stories and fun, and it’s been something we all look forward to. If you’ve never tried TTRPGs, now is a great time to do so.

Down to business…

Fun Stuff

As I mentioned, one of the things my family and I have really gotten into is gaming, as something to keep things interesting in this crazy new world. One of our favorites has been Dungeon Mayhem. If you have kids, this is a super easy game to learn and play, and my 5-year-old and 7-year-old love it.

Terraform – Key Rotation Gotcha!

So I wanted to write about something that I discovered recently when investigating a question. The topic is key rotation, and how Terraform state is impacted.

So the question is this. If you have a key vault and you ask any security expert, the number one rule is that key rotation is absolutely essential. This is important because it helps limit the blast radius of an attack, and keeps the access keys changing in a way that makes them harder to compromise.

Now, those same experts will also tell you this should be done via automation, so that no human eye has ever seen that key, which is easy enough to accomplish. Microsoft has released documentation, here, that discusses how to rotate keys with Azure Automation.

Personally, I like this approach because it makes the key rotation process a part of normal system operation, and not something you as a DevOps engineer or developer have to keep track of. If you’re in the government space, it also makes it easy to report for compliance audits.

But I’ve been seeing a growing movement online of people who say to have Terraform generate your keys, and to rotate those keys using randomization in your Terraform scripts. The idea is that you can use random values in your Terraform script to generate the keys. I do like that idea, but overall I’m not a fan of this.

The reason is that it makes key rotation a deployment activity, and if your environment gets large enough that you start doing “scoped” deployments, it removes any rhyme or reason from your key generation. Rotation is based solely on when you run the scripts.

Now that does pose a problem with state at the end of the day. Let’s take the following code:

provider "azurerm" {
  subscription_id = "...subscription id..."
  features {
    key_vault {
      purge_soft_delete_on_destroy = true
    }
  }
}

data "azurerm_client_config" "current" {}

resource "azurerm_resource_group" "superheroes" {
  name     = "superheroes"
  location = "usgovvirginia"
}

resource "random_id" "server" {
  keepers = {
    ami_id = 1
  }

  byte_length = 8
}

resource "azurerm_key_vault" "superherovault" {
  name                        = "superheroidentities"
  location                    = azurerm_resource_group.superheroes.location
  resource_group_name         = azurerm_resource_group.superheroes.name
  enabled_for_disk_encryption = true
  tenant_id                   = data.azurerm_client_config.current.tenant_id
  soft_delete_enabled         = true
  purge_protection_enabled    = false

  sku_name = "standard"

  access_policy {
    tenant_id = data.azurerm_client_config.current.tenant_id
    object_id = data.azurerm_client_config.current.object_id

    key_permissions = [
      "create",
      "get",
    ]

    secret_permissions = [
      "set",
      "get",
      "delete",
    ]
  }

  tags = {
    environment = "Testing"
  }
}

resource "azurerm_key_vault_secret" "Batman" {
  name         = "batman"
  value        = "Bruce Wayne"
  key_vault_id = azurerm_key_vault.superherovault.id

  tags = {
    environment = "Production"
  }
}

Now based on the above, all good, right? When I execute my Terraform script, I will have a secret named “batman” with a value of “Bruce Wayne.”

The problem comes if I go to the Azure Portal and change that value. Say I change the value of “batman” to “dick grayson”, and then I rerun my terraform apply.

Terraform will want to reset that secret’s value back to “Bruce Wayne”. And we’ve broken our key rotation at this point… now what?

My thought on this is that it’s easy enough to wrap the “terraform apply” in a bash script that first runs a “terraform refresh” to re-pull the values from the cloud into your Terraform state. Keep in mind that refresh only updates state, so a value that is hard-coded in the script will still be planned as a change.

If you don’t like that option, there is another solution: use the lifecycle block within a resource, with “ignore_changes” set for the secret’s value, to tell Terraform to ignore updates. That prevents Terraform from overwriting the key vault when the keys have changed as part of the rotation.

Configuration can be a big stumbling block when it comes to availability.

So let’s face it, when we build projects, we make trade-offs. And many times those trade-offs come in the form of time and effort. We would all build the most perfect software ever… if time and budget were never a concern.

So along those lines, one thing that I find gets glossed over quickly, especially with Kubernetes and microservices, is configuration.

Configuration is something where you are likely looking at this and saying, “That’s the most ridiculous thing I’ve ever heard. We put our configuration in a YAML file, or a web.config, and manage those values through our build pipelines.” And while that might seem like a great practice, in my experience it can cause a lot more headaches in the long run than you’re probably expecting.

The problem with storing configuration in YAML files, or web.config files, is that they create an illusion of being able to change these settings on the fly. That illusion can actually cause significant headaches when you start reaching for higher availability.

The problems these configuration files can cause are the following:

Changing these files is a deployment activity

If you need to change a value for these applications, it requires changing a configuration file. Changes to configuration files are usually tightly coupled to a restart process. Take App Service as a primary example: if you store your configuration in a web.config and you make a change to that file, App Service will automatically trigger a restart, which will cause a downtime event for you and/or your customers.

This is even more difficult in a Kubernetes cluster: if you use a YAML file, changing a value requires the deployment agent to change the cluster. This makes it very hard to change these values in response to a change in application behavior.

For example, say you wanted to change your SQL database connection if performance degrades below a certain point. That is a lot harder to do when you are referencing a connection string in a config file on pods that are deployed across a cluster.

Storing Sensitive Configuration is a problem

Let’s face it, people make mistakes. One of the biggest problems I’ve seen come up several times starts with the following statement: “We store normal configuration in a YAML file, and then sensitive configuration in a key vault.”

The problem here is that “sensitive” means different things to different people, so the odds of something being misclassified are high. It’s much easier to manage if you tell your team to treat all settings as sensitive. It makes management a lot easier and limits you to a single store.

So what do we do…

The best way I’ve found to mitigate these issues is to use an outside service to store your configuration settings, such as Key Vault or Azure’s configuration management service.

But that’s just step 1. Step 2 is to cache the configuration settings for each microservice in memory in the container on startup, and make sure that you configure the cache to expire after a set amount of time.

This gives you a pattern whereby your microservices start up after deployment, reach out to a secure store, and cache the configuration settings in memory.

This also gains us several benefits that mitigate the problems above.

  • Allow for changing configuration settings on the fly: For example, if I wanted to change a connection string over to a read replica, that can be done by simply updating the configuration store, and allowing the application to move services over as they expire the cache. Or if you want even further control, you could build in a web hook that would force it to dump the configuration and re-pull it.
  • By treating all configuration as sensitive you ensure there are no accidental leaks. This also ensures that you can manage these keys at deployment time, and not have them ever be seen by human eyes.
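
To show the shape of the caching idea described above, here is a minimal sketch in Python. It assumes you already have some function that reads a single setting from your external store (Key Vault or otherwise); the class name, the five-minute default, and the invalidate method standing in for the web hook are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class CachedConfiguration:
    """Cache settings in memory and re-read them from the store after a TTL."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # Reads one setting from the external store.
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}  # key -> (value, expiry)

    def get(self, key: str) -> str:
        cached = self._cache.get(key)
        if cached is not None and self._clock() < cached[1]:
            return cached[0]
        value = self._fetch(key)     # Expired or never loaded: go to the store.
        self._cache[key] = (value, self._clock() + self._ttl)
        return value

    def invalidate(self) -> None:
        """Drop everything, e.g. from a web hook handler, to force a re-pull."""
        self._cache.clear()
```

With something like this in place, pointing a connection string at a read replica becomes an update to the store: each service picks up the new value as its cached copy expires, or immediately if you call invalidate.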

So this is all great, but what does this actually look like from an architecture standpoint?

For AKS, it’s a fairly easy implementation: create a sidecar for retrieving configuration, and then deploy that sidecar with any pod that is deployed.

Given this, it’s easy to see how you would implement a separate sidecar to handle this configuration. Each service within the pod is completely oblivious to how it gets its configuration; it just calls a microservice to get it.

I personally favor the sidecar implementation here, because it allows you to easily bundle this with your other containers and minimizes latency and excessive network communication.

Latency will be low because it’s local to every pod, and if you ever decide to change your configuration store, it’s easy to do.

Let’s take a sample here using Azure Key Vault.

Here’s some sample code that could easily be wrapped in a container to serve your configuration from Key Vault:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.KeyVault;
using Microsoft.IdentityModel.Clients.ActiveDirectory;

public class KeyVaultConfigurationProvider : IConfigurationProvider
{
    // Defaults come from environment variables; the constructor below
    // replaces them with the injected settings.
    private string _clientId = Environment.GetEnvironmentVariable("clientId");
    private string _clientSecret = Environment.GetEnvironmentVariable("clientSecret");
    private string _kvUrl = Environment.GetEnvironmentVariable("kvUrl");

    public KeyVaultConfigurationProvider(IKeyVaultConfigurationSettings kvConfigurationSettings)
    {
        _clientId = kvConfigurationSettings.ClientID;
        _clientSecret = kvConfigurationSettings.ClientSecret;
        _kvUrl = kvConfigurationSettings.KeyVaultUrl;
    }

    // Reads a single secret from Key Vault, authenticating as the service principal.
    public async Task<string> GetSetting(string key)
    {
        // The callback is invoked by the client whenever it needs an access token.
        KeyVaultClient kvClient = new KeyVaultClient(async (authority, resource, scope) =>
        {
            var adCredential = new ClientCredential(_clientId, _clientSecret);
            var authenticationContext = new AuthenticationContext(authority, null);
            return (await authenticationContext.AcquireTokenAsync(resource, adCredential)).AccessToken;
        });

        // Secret identifier: {vault url}/secrets/{secret name}
        var path = $"{this._kvUrl}/secrets/{key}";

        var ret = await kvClient.GetSecretAsync(path);

        return ret.Value;
    }
}

Now the above code uses a single service principal to call Key Vault and pull configuration information. This could be modified to leverage pod identities for even greater security and a cleaner implementation.

The next step of the above implementation would be to leverage a cache for your configuration. This could be done piecemeal as needed or in a group. There are a lot of directions you could take this, but it will ultimately help you to manage configuration more easily.

Weekly Links – 5/11

So while we all think we’re in Groundhog Day from lockdown, it’s important to find ways to get out of the house. The big one for us has been on weekends: every week my wife and I have been taking on a “culinary experiment” involving the smoker. The model I’ve got is a Weber Smokey Mountain.

Down to business…

Fun Stuff:

So given that we are all in lockdown, I’ve been making my way through Netflix like everyone else. I’m finally about halfway through Altered Carbon season 2, and I have to say I’m really enjoying it so far. Anthony Mackie does a great job as Tak, and Poe’s story arc really is fantastic. A stellar second season for this series.

Copying blobs between storage accounts / regions

So a common question I get is about copying blobs. If you are working with Azure Blob Storage, it’s sort of inevitable that you will need to do a data copy. Whether it’s for a migration, a re-architecture, or any number of other reasons, you will need to do a data copy.

Now this is something where I’ve seen all different ways of doing a data copy. I’m going to talk through those options here, and ultimately how best to execute a copy within Azure Blob Storage.

I want to start with the number 1 DO NOT DO option. That option is “build a utility to cycle through and copy blobs one by one.” This is the least desirable option for moving data for a couple of reasons:

  • Speed – This is going to be a single threaded, synchronous operation.
  • Complexity – This feels counter-intuitive, but the process of verifying that data copied correctly, building fault handling, etc. is not easy. It’s not something you want to take on when you don’t have to.
  • Chances of Failure – Long running processes are always problematic, always. As these processes can fail, and when they do they can be difficult to recover from. So you are opening yourself up to potential problems.
  • Cost – At the end of the day, you are creating a long running process that will need to have compute running 24/7 for an extended period. Compute in the cloud costs money, so this is an additional cost.

So the question is, if I shouldn’t build my own utility, how do we get this done? There are really two options that I’ve used successfully in the past:

  • AzCopy – This is the tried and true option. This utility provides an easy command line interface for kicking off copy jobs that can be run either synchronously or asynchronously. Even in synchronous mode, you will see higher throughput for the copy. This removes some of the issues from above, but not all.
  • Copy API – A newer option: the REST API exposes a copy operation. This provides the best possible throughput and saves you from having to create a VM, because the copy runs asynchronously inside Azure. The API is easy to use and documentation can be found here.

Ultimately, there are lots of ways and scenarios in which you can leverage these tools to copy data. The other question that usually comes up is: if I’m migrating a large volume of data, how do I do it with minimal downtime?

The way I’ve accomplished this is to break the data down as follows:

  • Sort the data by age oldest to newest.
  • Starting with the oldest blobs, break them down into the following chunks:
  • Move the first 50%
  • Move the next 30%
  • Move the next 10-15%
  • Take a downtime window to copy the last 5-10%

By doing so, you minimize your downtime window while maximizing how much of the copy happens in the background. This process only works if your newer data is accessed more often than your older data. When that holds, it is a good option for moving your blobs while minimizing downtime.
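The wave breakdown above can be sketched in a few lines of C#. This is only an illustration under some assumptions: the blob names and last-modified dates are already available as a list (for example from a container listing), the split is by blob count rather than by bytes, and MigrationWaves and Plan are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: splits blobs into copy waves, oldest first.
public static class MigrationWaves
{
    public static List<List<string>> Plan(
        IEnumerable<(string Name, DateTimeOffset LastModified)> blobs,
        params double[] fractions)
    {
        // Sort by age, oldest to newest, and keep only the names.
        var ordered = blobs.OrderBy(b => b.LastModified).Select(b => b.Name).ToList();

        var waves = new List<List<string>>();
        int start = 0;
        double cumulative = 0;

        for (int i = 0; i < fractions.Length; i++)
        {
            cumulative += fractions[i];

            // The last wave takes whatever remains, so no blob is dropped by rounding.
            int end = i == fractions.Length - 1
                ? ordered.Count
                : Math.Min(ordered.Count, (int)Math.Round(ordered.Count * cumulative));

            waves.Add(ordered.GetRange(start, end - start));
            start = end;
        }

        return waves;
    }
}
```

Calling MigrationWaves.Plan(blobs, 0.50, 0.30, 0.15, 0.05) returns four lists. The first three can be copied while the system is live, and the last one is what you copy during the downtime window.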

Azure Search SDK in Government


So I’ve been working on a demo project using Azure Search, and if you’ve followed this blog for a while, you know I do a lot of work that requires Azure Government. Recently I needed to implement a search that would be called via an Azure Function and take a latitude and longitude so it could search within a specific distance. So I started to build my Azure Function using the SDK, and what I ended up with looked a lot like this:

Key Data elements:

First, to be able to interact with my search service, I need to install the following NuGet package:

Microsoft.Azure.Search

Upon doing so, I found some pretty good documentation here for building the search client. So I built out a GeoSearchProvider class that looked like the following:

NOTE: I use a custom interface called IConfigurationProvider which encapsulates my configuration store. In most cases it’s Key Vault, but it can be a variety of other options.

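using System;
using System.Threading.Tasks;
using Microsoft.Azure.Search;
using Microsoft.Azure.Search.Models;
using Microsoft.Extensions.Logging;

public class GeoSearchProvider : IGeoSearchProvider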
    {
        IConfigurationProvider _configurationProvider;

        public GeoSearchProvider(IConfigurationProvider configurationProvider)
        {
            _configurationProvider = configurationProvider;
        }

        public async Task<DocumentSearchResult<SearchResultModel>> RunSearch(string text, string latitude, string longitude, string kmdistance, Microsoft.Extensions.Logging.ILogger log)
        {
            if (String.IsNullOrEmpty(kmdistance))
            {
                kmdistance = await _configurationProvider.GetSetting("SearchDefaultDistance");
            }

            var serviceName = await _configurationProvider.GetSetting("SearchServiceName");
            var serviceApiKey = await _configurationProvider.GetSetting("SearchServiceApiKey");
            var indexName = await _configurationProvider.GetSetting("SearchServiceIndex");

            SearchIndexClient indexClient = new SearchIndexClient(serviceName, indexName, new SearchCredentials(serviceApiKey));

            var parameters = new SearchParameters()
            {
                Select = new[] { "...{list of fields}..." },
                // Azure Search expects POINT(longitude latitude), so longitude is passed first.
                Filter = string.Format("geo.distance(location, geography'POINT({0} {1})') le {2}", longitude, latitude, kmdistance)
            };

            var logmessage = await _configurationProvider.GetSetting("SearchLogMessage");

            try
            {
                var results = await indexClient.Documents.SearchAsync<SearchResultModel>(text, parameters);

                log.LogInformation(string.Format(logmessage, text, latitude, longitude, kmdistance, results.Count.ToString()));

                return results;
            }
            catch (Exception ex)
            {
                // Log the exception (message and stack trace), then rethrow without resetting the stack trace.
                log.LogError(ex, "Search request failed");
                throw;
            }
        }
    }

The above code seems pretty straightforward, and it looked like it would run just fine to get back my search results. I even built in logic so that if I don’t give it a distance, it takes a default from the configuration store. Pretty slick.
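One detail in the filter is easy to get wrong: Azure Search writes points as POINT(longitude latitude), with longitude first. As a small sketch, a helper like the one below can make the order explicit and reject out-of-range values before the query is sent. GeoFilter and WithinKilometers are hypothetical names, and the helper takes numbers rather than the strings used in the listing above.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: builds a geo.distance filter with longitude first.
public static class GeoFilter
{
    public static string WithinKilometers(string field, double latitude, double longitude, double kilometers)
    {
        // Written as negated range checks so that NaN is rejected as well.
        if (!(latitude >= -90 && latitude <= 90))
            throw new ArgumentOutOfRangeException(nameof(latitude));
        if (!(longitude >= -180 && longitude <= 180))
            throw new ArgumentOutOfRangeException(nameof(longitude));
        if (!(kilometers >= 0))
            throw new ArgumentOutOfRangeException(nameof(kilometers));

        // Invariant culture keeps "." as the decimal separator regardless of server locale.
        return string.Format(
            CultureInfo.InvariantCulture,
            "geo.distance({0}, geography'POINT({1} {2})') le {3}",
            field, longitude, latitude, kilometers);
    }
}
```

For example, GeoFilter.WithinKilometers("location", 47.678581, -122.131577, 10) produces geo.distance(location, geography'POINT(-122.131577 47.678581)') le 10.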

But I pretty quickly ran into a problem: the error “Host Not found”.

I racked my brain on this for a while before I discovered the cause. By default, the Azure Search SDK talks to Azure Commercial, not Azure Government. After picking through the documentation I found this. There is a property called SearchDnsSuffix, which lets you set the suffix used for finding the search service. By default it is “search.windows.net”. I changed my code to the following:

public class GeoSearchProvider : IGeoSearchProvider
    {
        IConfigurationProvider _configurationProvider;

        public GeoSearchProvider(IConfigurationProvider configurationProvider)
        {
            _configurationProvider = configurationProvider;
        }

        public async Task<DocumentSearchResult<SearchResultModel>> RunSearch(string text, string latitude, string longitude, string kmdistance, Microsoft.Extensions.Logging.ILogger log)
        {
            if (String.IsNullOrEmpty(kmdistance))
            {
                kmdistance = await _configurationProvider.GetSetting("SearchDefaultDistance");
            }

            var serviceName = await _configurationProvider.GetSetting("SearchServiceName");
            var serviceApiKey = await _configurationProvider.GetSetting("SearchServiceApiKey");
            var indexName = await _configurationProvider.GetSetting("SearchServiceIndex");
            var dnsSuffix = await _configurationProvider.GetSetting("SearchSearchDnsSuffix");

            SearchIndexClient indexClient = new SearchIndexClient(serviceName, indexName, new SearchCredentials(serviceApiKey));
            indexClient.SearchDnsSuffix = dnsSuffix;

            var parameters = new SearchParameters()
            {
                Select = new[] { "...{list of fields}..." },
                // Azure Search expects POINT(longitude latitude), so longitude is passed first.
                Filter = string.Format("geo.distance(location, geography'POINT({0} {1})') le {2}", longitude, latitude, kmdistance)
            };

            //TODO - Define sorting based on distance

            var logmessage = await _configurationProvider.GetSetting("SearchLogMessage");

            try
            {
                var results = await indexClient.Documents.SearchAsync<SearchResultModel>(text, parameters);

                log.LogInformation(string.Format(logmessage, text, latitude, longitude, kmdistance, results.Count.ToString()));

                return results;
            }
            catch (Exception ex)
            {
                // Log the exception (message and stack trace), then rethrow without resetting the stack trace.
                log.LogError(ex, "Search request failed");
                throw;
            }
        }
    }

I then set the “SearchSearchDnsSuffix” setting to “search.azure.us” for Government, and it all immediately worked.
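To see why the suffix matters, it helps to think of the service host name as the service name plus the DNS suffix. The sketch below is not the SDK's code, just an illustration of that composition under the assumption that the host is built as {serviceName}.{dnsSuffix}. SearchEndpoints and the service name "mysearch" are made up for the example.

```csharp
using System;

// Hypothetical helper: shows how the service name and DNS suffix combine into a host name.
public static class SearchEndpoints
{
    // Default suffix for Azure Commercial, as noted above.
    public const string CommercialSuffix = "search.windows.net";

    // Suffix used for Azure Government in this post.
    public const string GovernmentSuffix = "search.azure.us";

    public static Uri Build(string serviceName, string dnsSuffix = CommercialSuffix)
    {
        if (string.IsNullOrWhiteSpace(serviceName))
            throw new ArgumentException("Service name is required.", nameof(serviceName));

        return new Uri($"https://{serviceName}.{dnsSuffix}/");
    }
}
```

With the default suffix, a Government service named "mysearch" would be looked up as mysearch.search.windows.net, which does not exist, and that lines up with a “Host Not found” error. Passing SearchEndpoints.GovernmentSuffix gives mysearch.search.azure.us instead.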