
TerraForm – Using the new Azure AD Provider


So by using Terraform you gain a lot of benefits, including being able to manage all parts of your infrastructure in HCL, which makes it much easier to keep everything under control.

A key part of that is not only being able to manage the resources you create, but also access to them, by creating and assigning service principals. In older versions of the azurerm provider this was possible using azurerm_azuread_application and related elements. I had previously done this in the Kubernetes template I have on GitHub.

Now, with version 2.0 of the azurerm provider there have been some pretty big changes, including removing all of the Azure AD elements and moving them to their own azuread provider, and the question becomes "How does that change my template?"

Below is an example of the old approach: it creates a service principal with a random password, and an access policy for a key vault.

resource "random_string" "kub-rs-pd-kv" {
  length = 32
  special = true
}

data "azurerm_subscription" "current" {
    subscription_id =  "${var.subscription_id}"
}
resource "azurerm_azuread_application" "kub-ad-app-kv1" {
  name = "${format("%s%s%s-KUB1", upper(var.environment_code), upper(var.deployment_code), upper(var.location_code))}"
  available_to_other_tenants = false
  oauth2_allow_implicit_flow = true
}

resource "azurerm_azuread_service_principal" "kub-ad-sp-kv1" {
  application_id = "${azurerm_azuread_application.kub-ad-app-kv1.application_id}"
}

resource "azurerm_azuread_service_principal_password" "kub-ad-spp-kv" {
  service_principal_id = "${azurerm_azuread_service_principal.kub-ad-sp-kv1.id}"
  value                = "${element(random_string.kub-rs-pd-kv.*.result, count.index)}"
  end_date             = "2020-01-01T01:02:03Z"
}

resource "azurerm_key_vault" "kub-kv" {
  name = "${var.environment_code}${var.deployment_code}${var.location_code}lkub-kv1"
  location = "${var.azure_location}"
  resource_group_name = "${azurerm_resource_group.management.name}"

  sku {
    name = "standard"
  }

  tenant_id = "${var.keyvault_tenantid}"

  access_policy {
    tenant_id = "${var.keyvault_tenantid}"
    object_id = "${azurerm_azuread_service_principal.kub-ad-sp-kv1.id}"

    key_permissions = [
      "get",
    ]

    secret_permissions = [
      "get",
    ]
  }
  access_policy {
    tenant_id = "${var.keyvault_tenantid}"
    object_id = "${azurerm_azuread_service_principal.kub-ad-sp-kv1.id}"

    key_permissions = [
      "create",
    ]

    secret_permissions = [
      "set",
    ]
  }

  depends_on = ["azurerm_role_assignment.kub-ad-sp-ra-kv1"]
}

Now, as I mentioned, with the change to the new provider this code looks a little different. Below is the updated form, which generates a service principal with a random password.

provider "azuread" {
  version = "=0.7.0"
}

resource "random_string" "cds-rs-pd-kv" {
  length = 32
  special = true
}

resource "azuread_application" "cds-ad-app-kv1" {
  name = format("%s-%s%s-cds1",var.project_name,var.deployment_code,var.environment_code)
  oauth2_allow_implicit_flow = true
}

resource "azuread_service_principal" "cds-ad-sp-kv1" {
  application_id = azuread_application.cds-ad-app-kv1.application_id
}

resource "azuread_service_principal_password" "cds-ad-spp-kv" {
  service_principal_id  = azuread_service_principal.cds-ad-sp-kv1.id
  value                = random_string.cds-rs-pd-kv.result
  end_date             = "2020-01-01T01:02:03Z"
}

Notice how much cleaner the code is: we no longer need the "${}" syntax for string interpolation, and ultimately the resources are much cleaner. So the next question is how do I connect this with my code to assign this service principal to a key vault access policy?

You can accomplish that with the following code, which is in a different file in the same directory:

resource "azurerm_resource_group" "cds-configuration-rg" {
    name = format("%s-Configuration",var.group_name)
    location = var.location 
}

data "azurerm_client_config" "current" {}

resource "azurerm_key_vault" "cds-kv" {
    name = format("%s-%s%s-kv",var.project_name,var.deployment_code,var.environment_code)
    location = var.location
    resource_group_name = azurerm_resource_group.cds-configuration-rg.name 
    enabled_for_disk_encryption = true
    tenant_id = data.azurerm_client_config.current.tenant_id
    soft_delete_enabled = true
    purge_protection_enabled = false
 
  sku_name = "standard"
 
  access_policy {
    tenant_id = data.azurerm_client_config.current.tenant_id
    object_id = data.azurerm_client_config.current.object_id
 
    key_permissions = [
      "create",
      "get",
    ]
 
    secret_permissions = [
      "set",
      "get",
      "delete",
    ]
  }

  access_policy {
    tenant_id = data.azurerm_client_config.current.tenant_id
    object_id = azuread_service_principal.cds-ad-sp-kv1.id

    key_permissions = [
      "get",
    ]

    secret_permissions = [
      "get",
    ]
  }
}

Notice that I am able to reference the “azuread_service_principal.cds-ad-sp-kv1.id” to access the newly created service principal without issue.

Good practices for starting with containers


So I really hate the saying "best practices," mainly because it creates a belief that there is only one right way to do things. But I wanted to put together a post around some ideas for strengthening your micro-service architectures.

As I've previously discussed, micro-service architectures are more complicated to implement but bring some huge benefits to your solution, including:

  • Independently deployable pieces, no more large scale deployments.
  • More focused testing efforts.
  • Using the right technology for each piece of your solution.
  • Increased resiliency from cluster-based deployments.

But for a lot of people, including myself, the hardest part of this process is figuring out how to structure a micro-service. How small should each piece be? How do they work together?

So here are some practices I’ve found helpful if you are starting to leverage this in your solutions.

One service = one job

One of the first questions is how small should my containers be. Is there such a thing as too small? A good rule of thumb to focus on is the idea of separation of concerns. If you take every use case and start to break it down to a single purpose, you'll find you get to a good micro-service design pretty quickly.

Let's look at an example. I recently worked on a solution with a colleague of mine that pulled data from an API and then extracted that information into a data model.

In the monolithic way of thinking, that would have been one API call: pass in the data and then cycle through and process it. But the problem was throughput. If I had pulled the 67 different regions, with 300+ records per region, and processed it all as one batch, it would have been a mess of one gigantic API call.

So instead, we had one function that cycled through the regions, and pulled them all to json files in blob storage, and then queued a message.

Then we had another function that, when a message is queued, takes that message, reads in the records for that region, and saves them to the database. This separate function is another micro-service.

Now there are several benefits to this approach, but chief among them: the second function can scale independently of the first, and I can respond to queued messages as they come in, using asynchronous processing.
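
As a rough sketch of what that second function might look like, here is a queue-triggered Azure Function in C#. This is illustrative only: the function name, queue name, and the IRecordRepository abstraction are placeholders I've made up, not the actual project code.

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

// Hypothetical abstraction over the staged blob files and the database.
public interface IRecordRepository
{
    Task<IEnumerable<object>> ReadStagedRecordsAsync(string regionId);
    Task SaveAsync(IEnumerable<object> records);
}

public class ProcessRegionFunction
{
    private readonly IRecordRepository _repository;

    public ProcessRegionFunction(IRecordRepository repository)
    {
        _repository = repository;
    }

    // Fires once per queued message, so this service scales independently of the
    // function that staged the region files and queued the messages.
    [FunctionName("ProcessRegion")]
    public async Task Run([QueueTrigger("region-records")] string regionId, ILogger log)
    {
        log.LogInformation("Processing records for region {region}", regionId);

        // Read the staged records for this region and save them to the database.
        var records = await _repository.ReadStagedRecordsAsync(regionId);
        await _repository.SaveAsync(records);
    }
}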

Three words… Domain driven design

For a great definition of Domain-Driven Design, see here. The idea is pretty straightforward: the structure of your software should mirror the business logic that is being implemented.

So for example, your micro-services should mirror what they are trying to do. Let's take the most straightforward example…e-commerce.

Say we have to track orders, and once an order is submitted the process looks like this:

  • Orders are submitted.
  • Inventory is verified.
  • Order Payment is processed.
  • Notification is sent to supplier for processing.
  • Confirmation is sent to the customer.
  • Order is fulfilled and shipped

Looking at the above, one way to implement this would be to do the following:

  • OrderService: Manage the orders from start to finish.
  • OrderRecorderService: Record order in tracking system, so you can track the order throughout the process.
  • OrderInventoryService: Takes the contents of the order and checks it against inventory.
  • OrderPaymentService: Processes the payment of the order.
  • OrderSupplierNotificationService: Interact with a 3rd party API to submit the order to the supplier.
  • OrderConfirmationService: Send an email confirming the order is received and being processed.
  • OrderStatusService: Continues to check the 3rd party API for the status of the order.

If you notice above, outside of the orchestration they match exactly the steps the business described. This provides a streamlined approach that makes it easy to make changes and easy for new team members to understand. More than likely, communication between these services is done via queues.

For example, let's say the company above wants to expand to accept Venmo as a payment method. Really that means you have to update the OrderPaymentService to be able to accept the new option and process the payment. Additionally, OrderPaymentService might itself be an orchestration service between different micro-services, one per payment method.

Make them independently deployable

This is the big one: if you really want to see the benefit of micro-services, they MUST be independently deployable. This means that if we look at the above example, I can deploy each of these separate services and make changes to one without having to do a full application deployment.

Take the scenario above: if I wanted to add a new payment method, I should be able to update the OrderPaymentService, check in those changes, and then deploy that from dev through production without having to deploy the entire application.

Now, the first time I heard that, it sounded ridiculous, but there are some things you can do to make it possible.

  • Each Service should have its own data store: If you make sure each service has its own data store, that makes it much easier to manage version changes. Even if you are going to leverage something like SQL Server, then make sure that the tables leveraged by each micro-service are used by that service, and that service only. This can be accomplished using schemas.
  • Put layers of abstraction between service communication: A great common example is queuing or eventing. If you have a message being passed through, then as long as the message contract doesn't change, there is no need to update the receiver.
  • If you are going to do direct API communication, use versioning: versioned endpoints allow micro-services to be deployed and changed without breaking other parts of the application (see the sketch below).
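
As a minimal sketch of that versioning idea (the controllers and payloads here are hypothetical, not from a real system), route-based versioning in ASP.NET Core lets v1 keep serving existing callers while v2 evolves:

using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("api/v1/orders")]
public class OrdersV1Controller : ControllerBase
{
    // Existing services keep calling /api/v1/orders and are unaffected by v2 changes.
    [HttpPost]
    public IActionResult Submit([FromBody] OrderV1 order) => Accepted();
}

[ApiController]
[Route("api/v2/orders")]
public class OrdersV2Controller : ControllerBase
{
    // New callers opt into /api/v2/orders, which can accept a different payload.
    [HttpPost]
    public IActionResult Submit([FromBody] OrderV2 order) => Accepted();
}

public class OrderV1 { public string ProductId { get; set; } public int Quantity { get; set; } }
public class OrderV2 : OrderV1 { public string PaymentMethod { get; set; } }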

Build with resiliency in mind

If you adopt this approach to micro-services, one of the biggest things you will notice quickly is that each micro-service becomes its own black box. And as such, I find it's good to build each of these components with resiliency in mind: things like leveraging Polly for retry, or circuit breaker patterns. These are great ways of making sure that your services remain resilient, and it has a cumulative effect on your application.

For example, take our OrderPaymentService above. We know that queue messages should be coming in with the order and payment details. We can take a microscope to this service and ask how it could fail; it's not hard to get to a list like this:

  • Message comes through in a bad format.
  • Payment service can’t be reached.
  • Payment is declined (for any one of a variety of reasons)
  • Service fails while waiting on payment to be processed.

Now for some of the above, it's just simple error handling, like checking the format of the message. We can also build logic to check if the payment service is available, and do an exponential retry until it is.

We might also implement a circuit breaker that says if we can't process payments after so many tries, the service switches to an unhealthy state and kicks off a notification workflow.
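
Here is a rough sketch of those two patterns using Polly (assuming Polly v7); the class name, payment endpoint, and thresholds are placeholders I picked for illustration, not the real service:

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;

public class PaymentGatewayClient
{
    private readonly HttpClient _httpClient = new HttpClient();

    // Exponential back-off retry: waits 2s, 4s, 8s, 16s, 32s between attempts.
    private readonly IAsyncPolicy _retry = Policy
        .Handle<HttpRequestException>()
        .WaitAndRetryAsync(5, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

    // Circuit breaker: after 3 consecutive failures, stop calling for 1 minute.
    private readonly AsyncCircuitBreakerPolicy _breaker = Policy
        .Handle<HttpRequestException>()
        .CircuitBreakerAsync(3, TimeSpan.FromMinutes(1));

    public Task<HttpResponseMessage> ProcessPaymentAsync(string paymentPayload)
    {
        // The breaker wraps the retry, so sustained failures eventually stop outbound calls.
        var policy = Policy.WrapAsync(_breaker, _retry);

        // The payment endpoint below is a placeholder.
        return policy.ExecuteAsync(() =>
            _httpClient.PostAsync("https://payments.example.com/charge",
                new StringContent(paymentPayload)));
    }

    // Exposed so a health endpoint can report Closed / Half-Open / Open (see the monitoring section below).
    public CircuitState BreakerState => _breaker.CircuitState;
}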

And in the final scenario, we could implement a state store that indicates the state of the payment being processed should a service fail and need to be picked up by another.

Consider monitoring early

This is the one that everyone forgets, but it dovetails nicely out of the previous one. It's important that there be a mechanism for tracking and monitoring the state of your micro-service. I find too often it's easy to say, "Oh, the service is running, so that means it's fine." That's like saying that just because the homepage loads, the full web application is working.

You should build into your micro-services the ability to track their health and enable a way of doing so for operations tools. Let’s face it, at the end of the day, all code will eventually be deployed, and all deployed code must be monitored.

So, for example, looking at the above: if I build a circuit breaker pattern into OrderPaymentService, every failure updates a status held in memory by the service that says it's unhealthy. I can then expose an HTTP endpoint that returns the status of that breaker.

  • Closed: Service is running fine and healthy
  • Half-Open: Service is experiencing some errors but still processing.
  • Open: Service is taken offline for being unhealthy.

I can then build out logic so that when the breaker gets to Half-Open, and especially Open, specific events occur.
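
A minimal sketch of that status endpoint, assuming ASP.NET Core and the hypothetical PaymentGatewayClient from the resiliency sketch above; it simply maps the breaker's state to something an operations tool can poll:

using Microsoft.AspNetCore.Mvc;
using Polly.CircuitBreaker;

[ApiController]
[Route("health")]
public class HealthController : ControllerBase
{
    // Hypothetical client from the resiliency sketch above, registered via dependency injection.
    private readonly PaymentGatewayClient _payments;

    public HealthController(PaymentGatewayClient payments)
    {
        _payments = payments;
    }

    [HttpGet]
    public IActionResult Get()
    {
        // Map the circuit breaker state to a response monitoring tools can alert on.
        return _payments.BreakerState switch
        {
            CircuitState.Closed => Ok("Healthy"),
            CircuitState.HalfOpen => Ok("Degraded, some payment calls are failing"),
            _ => StatusCode(503, "Unhealthy, payment processing is offline"),
        };
    }
}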

Start small, don’t boil the ocean

This one seems kind of ironic given the above. But if you are working on an existing application, you will never be able to convince management to let you junk it and start over. So what I have done in the past is take an application, and when it's time to make a change to a part of it, use the opportunity to re-architect that piece and make it more resilient. Deconstruct it and implement a micro-service approach to solving the problem.

Stateless over stateful

Honestly this is just a good practice to get used to: most container technologies, like Docker and Kubernetes, really favor the idea of elastic scale and the ability to start or kill a process at any time. This becomes a lot harder if you have to manage state within a container. If you must manage state, I would definitely recommend using an external store for that information.

Now I know not every one of these might fit your situation, but I've found that these practices make it much easier to transition to creating micro-services for your solutions and to see the benefits of doing so.

Configuring SQL for High Availability in Terraform


Hello all, a short post this week. As we talk about availability, chaos engineering, and the like, one of the most common data elements I see out there is SQL, and specifically Azure SQL. SQL is a prevalent and common data store; it's everywhere you look.

Given that, many shops are implementing infrastructure-as-code to manage configuration drift and provide increased resiliency for their applications, which is definitely a great way to do that. The one thing a couple of people have asked me about that isn't clear is how to configure geo-replication in Terraform.

This is actually built into the Terraform azurerm provider, and can be done with the following code:

provider "azurerm" {
    subscription_id = ""
    features {
    key_vault {
      purge_soft_delete_on_destroy = true
    }
  }
}

data "azurerm_client_config" "current" {}

resource "azurerm_resource_group" "rg" {
  name     = "sqlpoc"
  location = "{region}"
}

resource "azurerm_sql_server" "primary" {
  name                         = "kmack-sql-primary"
  resource_group_name          = azurerm_resource_group.rg.name
  location                     = azurerm_resource_group.rg.location
  version                      = "12.0"
  administrator_login          = "sqladmin"
  administrator_login_password = "{password}"
}

resource "azurerm_sql_server" "secondary" {
  name                         = "kmack-sql-secondary"
  resource_group_name          = azurerm_resource_group.rg.name
  location                     = "usgovarizona"
  version                      = "12.0"
  administrator_login          = "sqladmin"
  administrator_login_password = "{password}"
}

resource "azurerm_sql_database" "db1" {
  name                = "kmackdb1"
  resource_group_name = azurerm_sql_server.primary.resource_group_name
  location            = azurerm_sql_server.primary.location
  server_name         = azurerm_sql_server.primary.name
}

resource "azurerm_sql_failover_group" "example" {
  name                = "sqlpoc-failover-group"
  resource_group_name = azurerm_sql_server.primary.resource_group_name
  server_name         = azurerm_sql_server.primary.name
  databases           = [azurerm_sql_database.db1.id]
  partner_servers {
    id = azurerm_sql_server.secondary.id
  }

  read_write_endpoint_failover_policy {
    mode          = "Automatic"
    grace_minutes = 60
  }
}

Now the above TF will deploy two database servers with geo-replication configured. The key part is the following:

resource "azurerm_sql_failover_group" "example" {
  name                = "sqlpoc-failover-group"
  resource_group_name = azurerm_sql_server.primary.resource_group_name
  server_name         = azurerm_sql_server.primary.name
  databases           = [azurerm_sql_database.db1.id]
  partner_servers {
    id = azurerm_sql_server.secondary.id
  }

  read_write_endpoint_failover_policy {
    mode          = "Automatic"
    grace_minutes = 60
  }
}

The important elements are "server_name" and "partner_servers", which make the connection to where the data is being replicated. And then the "read_write_endpoint_failover_policy" block sets up the failover policy.

Terraform – Key Rotation Gotcha!


So I did want to write about something that I discovered recently when investigating a question: key rotation, and how Terraform state is impacted.

So the question is this: if you have a key vault and you ask any security expert, the number one rule is that key rotation is absolutely essential. This is important because it helps manage the blast radius of an attack and keeps the access keys changing in a way that makes them harder to compromise.

Now, those same experts will also tell you this should be done via automation, so that no human eye ever sees that key, which is easy enough to accomplish. The documentation released by Microsoft, here, discusses how to rotate keys with Azure Automation.

Personally, I like this approach because it makes the key rotation process part of normal system operation, and not something you as a DevOps engineer or developer have to keep track of. It also, if you're in the government space, makes it easier to report on for compliance audits.

But I've been seeing a growing movement online of people who say to have Terraform generate your keys, and to rotate those keys using randomization in your Terraform scripts. The idea is that you can use random values in your Terraform script to generate the keys. I like the automation aspect, but overall I'm not a fan of this.

The reason is that it makes key rotation a deployment activity, and if your environment gets large enough that you start doing "scoped" deployments, it removes any rhyme or reason from your key rotation; it becomes based solely on when you run the scripts.

Now that does pose a problem with state at the end of the day. Because let’s take the following code:

provider "azurerm" {
    subscription_id = "...subscription id..."
    features {
    key_vault {
      purge_soft_delete_on_destroy = true
    }
  }
}

data "azurerm_client_config" "current" {}

resource "azurerm_resource_group" "superheroes" {
  name     = "superheroes"
  location = "usgovvirginia"
}

resource "random_id" "server" {
  keepers = {
    ami_id = 1
  }

  byte_length = 8
}

resource "azurerm_key_vault" "superherovault" {
  name                        = "superheroidentities"
  location                    = azurerm_resource_group.superheroes.location
  resource_group_name         = azurerm_resource_group.superheroes.name
  enabled_for_disk_encryption = true
  tenant_id                   = data.azurerm_client_config.current.tenant_id
  soft_delete_enabled         = true
  purge_protection_enabled    = false

  sku_name = "standard"

  access_policy {
    tenant_id = data.azurerm_client_config.current.tenant_id
    object_id = data.azurerm_client_config.current.object_id

    key_permissions = [
      "create",
      "get",
    ]

    secret_permissions = [
      "set",
      "get",
      "delete",
    ]
  }

  tags = {
    environment = "Testing"
  }
}

resource "azurerm_key_vault_secret" "Batman" {
  name         = "batman"
  value        = "Bruce Wayne"
  key_vault_id = azurerm_key_vault.superherovault.id

  tags = {
    environment = "Production"
  }
}

Now based on the above, all good right? When I execute my TerraForm script, I will have a secret named “batman” with a value of “Bruce Wayne.”

But a problem appears if I go to the Azure Portal and change that value, say I change the value of "batman" to "dick grayson", and then rerun my terraform apply.

Terraform will want to reset that secret back to "Bruce Wayne", and we've broken our key rotation at this point…now what?

My thought on this is that it's easy enough to wrap the "terraform apply" in a bash script, and before you execute it run a "terraform refresh" to re-pull the values from the cloud into your Terraform state.

If you don't like that option, there is another solution: use the lifecycle block within a resource to tell Terraform to ignore updates to specific attributes, preventing it from overwriting a key vault secret whose value has changed as part of rotation. A sketch of that follows.
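
A minimal sketch of that lifecycle approach, applied to the secret from the example above; it tells Terraform to ignore drift on the value attribute only, so an externally rotated secret is not reset on the next apply:

resource "azurerm_key_vault_secret" "Batman" {
  name         = "batman"
  value        = "Bruce Wayne"
  key_vault_id = azurerm_key_vault.superherovault.id

  # Ignore changes to the value so a rotation done outside Terraform
  # is not overwritten by the next terraform apply.
  lifecycle {
    ignore_changes = [value]
  }

  tags = {
    environment = "Production"
  }
}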

Copying blobs between storage accounts / regions


So a common question I get is about copying blobs. If you are working with Azure Blob Storage, it's sort of inevitable that you will need to do a data copy at some point, whether that be for a migration, a re-architecture, or any number of other reasons.

Now this is something where I’ve seen all different versions of doing a data copy. And I’m going to talk through those options here, and ultimately how best to execute a copy within Azure Blob Storage.

I want to start with the number 1, DO NOT DO, option. That option is “build a utility to cycle through and copy blobs one by one.” This is the least desirable option for moving data for a couple of reasons:

  • Speed – This is going to be a single threaded, synchronous operation.
  • Complexity – This feels counter-intuitive, but the process of ensuring data copies, building fault handling, etc…is not easy. And not something you want to take on when you don’t have to.
  • Chances of Failure – Long running processes are always problematic, always. As these processes can fail, and when they do they can be difficult to recover from. So you are opening yourself up to potential problems.
  • Cost – At the end of the day, you are creating a long running process that will need to have compute running 24/7 for an extended period. Compute in the cloud costs money, so this is an additional cost.

So the question is, if I shouldn't build my own utility, how do we get this done? There are really two options that I've used successfully in the past:

  • AzCopy – This is the tried and true option. This utility provides an easy command line interface for kicking off copy jobs that can be run either in a synchronous or asynchronous method. Even in its synchronous option, you will see higher throughput for the copy. This removes some of the issues from above, but not all.
  • Copy API – A newer option: the REST API exposes a server-side copy operation. This provides the best possible throughput and prevents you from having to create a VM, since the copy runs asynchronously inside Azure. The API is easy to use and documentation can be found here; a short sketch using the .NET SDK follows this list.
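
As a rough sketch of the Copy API option using the Azure.Storage.Blobs .NET SDK (the connection strings, container, and blob names are placeholders; the source must be readable by the destination, for example via a SAS or public access):

using System.Threading.Tasks;
using Azure.Storage.Blobs;

public static class BlobCopyExample
{
    public static async Task CopyBlobAsync()
    {
        var sourceContainer = new BlobContainerClient(
            "<source-connection-string>", "source-container");
        var destinationContainer = new BlobContainerClient(
            "<destination-connection-string>", "destination-container");

        var sourceBlob = sourceContainer.GetBlobClient("my-file.json");
        var destinationBlob = destinationContainer.GetBlobClient("my-file.json");

        // Kick off a server-side copy; the data moves inside Azure, no VM in the middle.
        var copyOperation = await destinationBlob.StartCopyFromUriAsync(sourceBlob.Uri);

        // Optionally wait for the asynchronous copy to complete.
        await copyOperation.WaitForCompletionAsync();
    }
}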

Ultimately, there are lots of ways and scenarios in which you can leverage these tools to copy data. The other question that usually comes up is: if I'm migrating a large volume of data, how do I do it while minimizing downtime?

The way I’ve accomplished this, is to break your data down accordingly.

  • Sort the data by age oldest to newest.
  • Starting with the oldest blobs, break them down into the following chunks.
  • Move the first 50%
  • Move the next 30%
  • Move the next 10-15%
  • Take a downtime window to copy the last 5-10%

By doing so, you minimize your downtime window while maximizing the backend copy. The above process only works if your newer data is accessed more often than the old, but when that's the case it creates a good option for moving your blobs while minimizing downtime.

Azure Search SDK in Government


So I've been working on a demo project using Azure Search, and if you've followed this blog for a while you know I do a lot of work that requires Azure Government. Recently I needed to implement a search that would be called via an Azure Function and required passing latitude and longitude to facilitate searching within a specific distance. So I started to build my Azure Function using the SDK, and what I ended up with looked a lot like this:

Key Data elements:

First to be able to interact with my search service I need to install the following nuget package:

Microsoft.Azure.Search

And upon doing so, I found some pretty good documentation here for building the search client. So I built out a GeoSearchProvider class that looked like the following:

NOTE: I use a custom interface called IConfigurationProvider which encapsulates my configuration store; in most cases it's Key Vault, but it can be a variety of other options.

public class GeoSearchProvider : IGeoSearchProvider
    {
        IConfigurationProvider _configurationProvider;

        public GeoSearchProvider(IConfigurationProvider configurationProvider)
        {
            _configurationProvider = configurationProvider;
        }

        public async Task<DocumentSearchResult<SearchResultModel>> RunSearch(string text, string latitude, string longitude, string kmdistance, Microsoft.Extensions.Logging.ILogger log)
        {
            if (String.IsNullOrEmpty(kmdistance))
            {
                kmdistance = await _configurationProvider.GetSetting("SearchDefaultDistance");
            }

            var serviceName = await _configurationProvider.GetSetting("SearchServiceName");
            var serviceApiKey = await _configurationProvider.GetSetting("SearchServiceApiKey");
            var indexName = await _configurationProvider.GetSetting("SearchServiceIndex");

            SearchIndexClient indexClient = new SearchIndexClient(serviceName, indexName, new SearchCredentials(serviceApiKey));

            var parameters = new SearchParameters()
            {
                Select = new[] { "...{list of fields}..." },
                Filter = string.Format("geo.distance(location, geography'POINT({0} {1})') le {2}", latitude, longitude, kmdistance)
            };

            var logmessage = await _configurationProvider.GetSetting("SearchLogMessage");

            try
            {
                var results = await indexClient.Documents.SearchAsync<SearchResultModel>(text, parameters);

                log.LogInformation(string.Format(logmessage, text, latitude, longitude, kmdistance, results.Count.ToString()));

                return results;
            }
            catch (Exception ex)
            {
                log.LogError(ex.Message);
                log.LogError(ex.StackTrace);
                throw ex;
            }
        }
    }

The above code seems pretty straightforward and will run just fine to get back my search results. I even built in logic so that if I don't give it a distance, it will take a default from the configuration store. Pretty slick.

But I pretty quickly ran into a problem, and that error was "Host not found".

And I racked my brain on this for a while before I discovered the cause. By default, the Azure Search SDK talks to Azure Commercial, not Azure Government, and after picking through the documentation I found the answer: there is a property called SearchDnsSuffix, which allows you to specify the suffix used to find the search service. By default it is "search.windows.net". I changed my code to the following:

public class GeoSearchProvider : IGeoSearchProvider
    {
        IConfigurationProvider _configurationProvider;

        public GeoSearchProvider(IConfigurationProvider configurationProvider)
        {
            _configurationProvider = configurationProvider;
        }

        public async Task<DocumentSearchResult<SearchResultModel>> RunSearch(string text, string latitude, string longitude, string kmdistance, Microsoft.Extensions.Logging.ILogger log)
        {
            if (String.IsNullOrEmpty(kmdistance))
            {
                kmdistance = await _configurationProvider.GetSetting("SearchDefaultDistance");
            }

            var serviceName = await _configurationProvider.GetSetting("SearchServiceName");
            var serviceApiKey = await _configurationProvider.GetSetting("SearchServiceApiKey");
            var indexName = await _configurationProvider.GetSetting("SearchServiceIndex");
            var dnsSuffix = await _configurationProvider.GetSetting("SearchSearchDnsSuffix");

            SearchIndexClient indexClient = new SearchIndexClient(serviceName, indexName, new SearchCredentials(serviceApiKey));
            indexClient.SearchDnsSuffix = dnsSuffix;

            var parameters = new SearchParameters()
            {
                Select = new[] { "...{list of fields}..." },
                Filter = string.Format("geo.distance(location, geography'POINT({0} {1})') le {2}", latitude, longitude, kmdistance)
            };

            //TODO - Define sorting based on distance

            var logmessage = await _configurationProvider.GetSetting("SearchLogMessage");

            try
            {
                var results = await indexClient.Documents.SearchAsync<SearchResultModel>(text, parameters);

                log.LogInformation(string.Format(logmessage, text, latitude, longitude, kmdistance, results.Count.ToString()));

                return results;
            }
            catch (Exception ex)
            {
                log.LogError(ex.Message);
                log.LogError(ex.StackTrace);
                throw ex;
            }
        }
    }

And set the “SearchSearchDnsSuffix” to “search.azure.us” for government, and it all immediately worked.

Log Analytics – Disk Queries


So Log Analytics is a really powerful tool: the ability to ingest a wide variety of logs can help you build out robust monitoring to better support your application, and ultimately enables some robust dashboards.

Now I recently had to do some Log Analytics queries, specifically around disk statistics, to monitor all the disks on a given machine. And if you're like me, you don't write these queries often, so when you do it can be a process.

Now, a couple of things to note about Log Analytics queries, and KQL in particular. The biggest and most important is that order of operations matters. Unlike SQL, each clause is applied in sequence, so it is a lot closer to piping with "|" in Linux than to a "where" clause in SQL. You need to make sure you apply the clauses in the right order or you can make things a lot harder on yourself.

So anyway, here are some queries I think you’ll find helpful:

All Disk Statistics:

Perf 
| where ObjectName == "LogicalDisk"
| summarize Value = min(CounterValue) by Computer, InstanceName, CounterName
| sort by CounterName asc nulls last 
| sort by InstanceName asc nulls last 
| sort by Computer asc nulls last 

% Free space – Graph

Perf 
| where ObjectName == "LogicalDisk" and CounterName == "% Free Space" and InstanceName != "_Total" and Computer == "{computer name}"
| summarize FreeSpace = min(CounterValue) by InstanceName, Computer, TimeGenerated
| sort by FreeSpace asc nulls last 
| render timechart

Avg Disk sec / Read – graph

Perf 
| where ObjectName == "LogicalDisk" and CounterName == "Avg. Disk sec/Read" and InstanceName != "_Total" and Computer == "{computer name}"
| summarize AvgDiskReadPerSec = min(CounterValue) by InstanceName, Computer, TimeGenerated
| sort by AvgDiskReadPerSec asc nulls last 
| render timechart

Avg Disk sec / Write

Perf 
| where ObjectName == "LogicalDisk" and CounterName == "Avg. Disk sec/Write" and InstanceName != "_Total" and Computer == "{computer name}"
| summarize AvgDiskWritePerSec = min(CounterValue) by InstanceName, Computer, TimeGenerated
| sort by AvgDiskWritePerSec asc nulls last 
| render timechart

Current Disk Queue Length

Perf 
| where ObjectName == "LogicalDisk" and CounterName == "Current Disk Queue Length" and InstanceName != "_Total" and Computer == "{computer name}"
| summarize CurrentQueueLength = min(CounterValue) by InstanceName, Computer, TimeGenerated
| sort by CurrentQueueLength asc nulls last 
| render timechart

Disk Reads/sec – graph

Perf 
| where ObjectName == "LogicalDisk" and CounterName == "Disk Reads/sec" and InstanceName != "_Total" and Computer == "{computer name}"
| summarize DiskReadsPerSec = min(CounterValue) by InstanceName, Computer, TimeGenerated
| sort by DiskReadsPerSec asc nulls last 
| render timechart

Disk Transfers/sec – Graph

Perf 
| where ObjectName == "LogicalDisk" and CounterName == "Disk Transfers/sec" and InstanceName != "_Total" and Computer == "{computer name}"
| summarize DiskTransfersPerSec = min(CounterValue) by InstanceName, Computer, TimeGenerated
| sort by DiskTransfersPerSec asc nulls last 
| render timechart

Disk Writes/sec – Graph

Perf 
| where ObjectName == "LogicalDisk" and CounterName == "Disk Writes/sec" and InstanceName != "_Total" and Computer == "{computer name}"
| summarize DiskWritesPerSec = min(CounterValue) by InstanceName, Computer, TimeGenerated
| sort by DiskWritesPerSec asc nulls last 
| render timechart

Alert = % Free Space Warning

Perf 
| where ObjectName == "LogicalDisk" and CounterName == "% Free Space"
| summarize FreeSpace = min(CounterValue) by Computer, InstanceName
| where FreeSpace < 20
| sort by FreeSpace asc nulls last 
| render barchart kind=unstacked

Cloud Networking and Security

Now here’s a fun topic I wanted to share, as I’ve been looking more and more into this. When many people think of the cloud, in my experience the ideas of networking and security are what has changed so vastly compared to what they think of in a normal circumstance.

At its core, there is a mindset shift between the way on-prem data centers, and cloud based networking function. And its important to remember these fundamental differences or else you run into a variety of problems down the road. It’s easy to get overwhelmed to be honest, and I don’t mean for this to seem complete by any stretch of the imagination. But you have to start somewhere right.

The most important thing to remember is that some elements of security just don’t apply anymore, at least not in the traditional sense. And here are some of those concepts:

  • Perimeter Security is not what it used to be: This is the hardest thing for a lot of people to realize, but many still cling to the notion that the only way to secure a workload is to lock down every public endpoint, build a perimeter around your application, and call it a day. Do a search online for the number of companies that relied on perimeter security alone and how many times it blew up in their face. Security threats and attack vectors are always changing, and the idea that you can build a fence and call it good enough is just ridiculous.
  • Authentication / Authorization are the new IP address: Another situation I see all too commonly in the cloud is people clinging to IP whitelisting. IP whitelisting is no longer sufficient against the more sophisticated attackers, and to be honest, you're preventing yourself from taking advantage of cloud-based services that are more secure than what you could implement yourself. The idea of Zero Trust has been growing more and more: we assume that no caller is trusted without credentials, which ensures better security overall.

So what do we have to look at to start? I wanted to provide some ideas of potential areas to focus on when it comes to security for the cloud, and those options are below.

  • Here is a quickly consumable “Best Practices” for IaaS workloads for security.
  • Additionally there is a link to security documentation for azure, and this provides a lot of details on different topics and questions.

And here is a reference on the Microsoft Shared Responsibility model for Security.

  • Network Security Options:  Here is a list of options for network security.
  • Network / Application Security Groups:  NSGs are a great way of limiting the traffic within a virtual network.  Additionally in this space, Azure provides service tags, which allow you to reference groups of Azure services when creating rules, things like "AzureTrafficManager", "VirtualNetwork", "Sql", and "Storage".  There is also an option with Application Security Groups (ASGs), which enable you to configure NSGs based on the application architecture (see the sketch after this list).
  • Virtual Network Service Endpoints:  This provides an option to extend your virtual network private address space to Azure services without traveling the public internet.  So the intention here would be, I want my machines to access “KeyVault”, but I don’t want it to be accessible outside of the vNet.  This is important as it allows you to further lock down your networking and access.
  • Virtual Network Peering:  If your design includes two virtual networks and you want communication to occur across them, you will need to implement VNet peering to enable that traffic.
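
As a small sketch of the service tag idea mentioned in the NSG bullet above (the resource names and the NSG it attaches to are placeholders), a Terraform rule that only allows outbound SQL traffic might look like this:

resource "azurerm_network_security_rule" "allow-sql-outbound" {
  name                        = "allow-sql-outbound"
  priority                    = 100
  direction                   = "Outbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "1433"
  source_address_prefix       = "VirtualNetwork"
  # "Sql" is a service tag, so no IP whitelisting is required.
  destination_address_prefix  = "Sql"
  resource_group_name         = azurerm_resource_group.network.name
  network_security_group_name = azurerm_network_security_group.app.name
}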

Ultimately, as I mentioned above, Zero Trust security models are really the direction the industry is heading from a cyber security perspective. A great site that covers the idea of Zero Trust and all the considerations can be found here, as well as a great whitepaper here.

AI / Analytics Tools in Azure


So when it comes to Artificial Intelligence in Azure, there are a lot of tools and a lot of options and directions you can explore, and AI is a broad topic in itself. That being said, I wanted to share some resources to help, whether you are looking for demos to show the "art of the possible" or tools to get started if you are a data scientist or doing that kind of work.

Let’s start with some demos.  Here are links to some of the demos that I find particularly interesting about the capabilities provided by Azure in this space.

  • Video.AI : This site allows you to upload videos and run them through a variety of cognitive / media services to showcase the capabilities. 
  • JFK Files : This is one of my favorites, as it shows the capabilities of cognitive search with regard to searching large datasets and making for a good reusable interface for surfacing some of the findings of things like transcription. 
  • Coptivity : Here’s a link to the video for CopTivity and how the use of a modern interface is interesting to law enforcement. 

Now when it comes to offerings in this space, there are a lot and the list is always growing, but I wanted to cover some at a high level that can be investigated quickly.

Cognitive Services : These are Azure services that use APIs to provide AI capabilities to your applications without you having to build them yourself.  These include things like Custom Vision, Sentiment Analysis, and other capabilities.  Here's a video discussing it further.

Databricks : Databricks is a great technology for providing the compute required to run your Python and Spark based models, and doing so in a way that minimizes the management demands and requirements placed on your team.

Azure Machine Learning : Specifically this offering provides options to empower developers and data scientists to increase productivity.  Here’s a video giving the quick highlights of what Azure Machine Learning Studio is.  And a video on data labeling in ML Studio.  Here’s a video about using Azure Machine Learning Designer to democratize AI.  Here’s a video on using Azure Machine Learning DataSets. 

Data Studio : Along with tools like VS Code, which is a great IDE for doing Python and other work, we do provide a similar open source tool called Azure Data Studio, which can help with the data work your teams are doing.  Here’s a video on how to use Jupyter notebooks with it.  Additionally VSCode provides options to support this kind of work as well (video). 

Azure Cognitive Search:  As I mentioned above Search can be a great way to surface insights to your users, and here’s a video on using Cognitive Search. 

Azure Data Science VM: Finally, part of the battle of doing data science work is maintaining all the open-source tools and leveraging them to your benefit; the amount of time required for machine configuration is not insignificant.  Azure provides a VM option where you can create a VM preloaded with all the tools you need.  Azure has it set up for Windows 2016, Ubuntu, and CentOS, and there is even a version built around Geo AI with ArcGIS.  There is no additional charge for this: you pay for the underlying VM you are using, but Microsoft does not charge for the data science tools included on it.

I particularly love this diagram as it shows all the tools included:

Now again, this is only scratching the surface, but I think it's a powerful place to start to find out more. I have additional posts on this topic.

Reserved Instances and where everyone gets it wrong


So one of the most important things in cloud computing is cost management. I know, this is just the thing that we all went to school for and learned to code for…spreadsheets! We all wanted to do cost projections and figure out gross margin, right…right?

In all seriousness, cost management is an important part of building solutions in the cloud, because it ultimately goes to sustainability and the ability to provide the best features possible for your solutions. The simple fact is, no matter how you slice it, resources will always be a factor.

Reserved Instances are a common offering for every cloud provider. And honestly they are the best option to easily save money in the cloud and do so in a way that empowers you to grow as your solution does and save money along the way.

Now to that end, there are some serious misconceptions about Reserved instances, that I wanted to share. And these specifically relate to the Azure version of Reserved instances.

Misconception 1 – Reserved Instances are attached to a specific VM.

This is the biggest misconception. The question is "how do I apply the RI I just purchased to VM xyz?" The answer is "you don't": Reserved Instance pricing is actually a pre-purchase of compute hours for a specific SKU, so there is no process by which you attach the RI to a specific VM.

Let’s take an example to understand the implementation of this a little more:

  • I have 5x DS2v2's running, which cost me $170.82 each per month, for a total of $854.10. Now I've decided to do a 1-year RI, which brings about a 29% savings, taking my per-VM cost to $121.25 and the total to $606.23.
  • I go through the portal, and purchase 5x 1-Year Reserved Instances for DS2v2, to get this cost savings.

And that’s it, I’m done.

It really is that simple, at the end of the day. Now what's happening behind the scenes is that I have prepurchased 3,650 hours of compute time per month (5 VMs x 730 hours) at that lower price. So when my bill is calculated, the first 3,650 DS2v2 hours are at the lower price, and I don't need to worry about which VMs it's attached to; I just know that I'm paying less for the first 3,650 hours.

So the next logical question is, what happens if I have 6 VMs? The math works out like this.

  • The normal PAYG rate is $170.82 which comes to ~ $0.234 per hour.
  • I purchased 5x DS2v2’s at the lower rate ($121.25), which means the hourly rate is ~ $0.167 per hour.
  • I’ve got 6x DS2v2’s running currently within the scope of the RI. So that means that ultimately in 1 month (assuming 730 hours in the month), I am consuming 4,380 compute hours.

What that means is that this is how the pricing breaks out:

Number of Hours    With RI    PAYG Rate    Total Cost
3,650              $0.167                  $606.23
730                           $0.234       $170.82
Total                                      $777.05

So any overage above the RI hours is simply billed at the PAYG rate, which is exactly the behavior you are looking for.

But this also buys you a lot of flexibility: you can add VMs and delete VMs, and as long as the total hours for that SKU end up the same, it doesn't matter. This gives you a lot of power to get the maximum amount of savings without a lot of headaches.

Misconception 2 – We can't use RI because it costs money up front.

This is another misconception, because it is 100% not true. You can sign up for monthly payments on your RI, which removes all the risk for you ultimately. You can get the discount and pay the monthly amount without having to pay a large lump sum up front. Here’s a link on how to do so.

Now the most common question I get with this is "Kevin, how is this different than PAYG?" The answer is this: the amount is calculated the same as the upfront cost and then broken up into a monthly charge, with the cost of those compute hours divided evenly over the period (1-year or 3-year). Where the difference comes in is that RI hours are use it or lose it.

Take the following scenario:

  • I have 5x DS2v2, with a one year reservation, meaning I’m paying $121.25 a month for each of them. The total being $606.23 a month, spread out over 12 months.
  • If I delete 2 of those VMs, don't provision any more, and don't modify my reservation, my bill for the month will still be $606.23. It is use it or lose it: the hours do not roll over to the next month, and I would have paid $242.50 for nothing.

Now if I create new VMs, no problem, or if I exchange the RI, also not a problem. But it's important to know that I can get the benefit of paying monthly, and provided the reservations are managed properly, I'll have no problems and get the full benefit of the discount.

Also worth mentioning here: there is no difference in the discount whether you pay upfront or monthly; the discount is 100% the same.

Misconception 3 – I can’t change it after I buy.

This is definitely one of the most common misconceptions I see out there. You absolutely can swap / exchange reservation as your requirements and needs change. And this allows you to change the size of a VM or service to meet your needs, without losing money. Here’s a link on how to manage your reservations. And here’s a link on self-service exchanges and refunds.

There is a lot of detail in the links above, and it's pretty self-explanatory. Please review these policies, but the end story is that you are not locked in and committed to paying some huge amount of money if you change your mind.

Ultimately, that also means that you don't need to wait as long to gain the benefits of this program. It's definitely something you should take advantage of as soon as you can.

Misconception 4 – I can’t use RI because I use PaaS Services in my solution.

Another huge misconception: the RI program from Microsoft is changing all the time and new services are being added constantly. The services included with RI offerings at the time of this post are:

  • App Service
  • Azure Redis Cache
  • Cosmos
  • Database for MariaDB
  • Database for PostgreSQL
  • Databricks
  • Dedicated Hosts
  • Disk Storage
  • Storage
  • SQL Database
  • SQL Warehouse
  • Virtual Machines

That is a pretty broad group of services, and new ones are lighting up all the time.

Misconception 5 – RI isn’t worth it.

I never understood this one, given that I can pay monthly, exchange or get a refund, cover VMs and a whole bunch of other services…and usually get between 25-30% (1-year RI) or 40-50% (3-year RI) off my bill, just because I decide to. This is absolutely the first thing you should look at when you are looking to cut your cloud hosting costs.

Final Thoughts

I hope that clears up some concerns and questions about your Azure costs and how to manage your bill so that you can provide your solutions to your end customers in a cost-effective manner.