Azure Search SDK in Government

So I’ve been working on a demo project using Azure Search, and if you’ve followed this blog for a while, you know I do a lot of work that requires Azure Government. Recently I needed to implement a search that would be called via an Azure Function and required passing latitude and longitude to search within a specific distance. So I started to build my Azure Function using the SDK, and what I ended up with looked a lot like this:

Key Data elements:

First, to be able to interact with my search service, I needed to install the following NuGet package:

Microsoft.Azure.Search

And upon doing so, I found some pretty good documentation here for building the search client. So I built out a GeoSearchProvider class that looked like the following:

NOTE: I use a custom class called IConfigurationProvider which encapsulates my configuration store; in most cases it’s Key Vault, but it can be a variety of other options.

public class GeoSearchProvider : IGeoSearchProvider
    {
        IConfigurationProvider _configurationProvider;

        public GeoSearchProvider(IConfigurationProvider configurationProvider)
        {
            _configurationProvider = configurationProvider;
        }

        public async Task<DocumentSearchResult<SearchResultModel>> RunSearch(string text, string latitude, string longitude, string kmdistance, Microsoft.Extensions.Logging.ILogger log)
        {
            if (String.IsNullOrEmpty(kmdistance))
            {
                kmdistance = await _configurationProvider.GetSetting("SearchDefaultDistance");
            }

            var serviceName = await _configurationProvider.GetSetting("SearchServiceName");
            var serviceApiKey = await _configurationProvider.GetSetting("SearchServiceApiKey");
            var indexName = await _configurationProvider.GetSetting("SearchServiceIndex");

            SearchIndexClient indexClient = new SearchIndexClient(serviceName, indexName, new SearchCredentials(serviceApiKey));

            var parameters = new SearchParameters()
            {
                Select = new[] { "...{list of fields}..." },
                Filter = string.Format("geo.distance(location, geography'POINT({0} {1})') le {2}", latitude, longitude, kmdistance)
            };

            var logmessage = await _configurationProvider.GetSetting("SearchLogMessage");

            try
            {
                var results = await indexClient.Documents.SearchAsync<SearchResultModel>(text, parameters);

                log.LogInformation(string.Format(logmessage, text, latitude, longitude, kmdistance, results.Count.ToString()));

                return results;
            }
            catch (Exception ex)
            {
                log.LogError(ex.Message);
                log.LogError(ex.StackTrace);
                throw; // rethrow without resetting the stack trace
            }
        }
    }

The above code seems pretty straightforward and will run just fine to get back my search results. I even built in logic so that if I don’t give it a distance, it will take a default from the configuration store. Pretty slick.

And I pretty quickly ran into a problem, and that error was “Host Not found”.

I racked my brain on this for a while before I discovered the cause: by default, the Azure Search SDK talks to Azure Commercial, not Azure Government. After picking through the documentation, I found that there is a property called SearchDnsSuffix, which allows you to specify the suffix used to find the search service. By default it is “search.windows.net”. I changed my code to the following:

public class GeoSearchProvider : IGeoSearchProvider
    {
        IConfigurationProvider _configurationProvider;

        public GeoSearchProvider(IConfigurationProvider configurationProvider)
        {
            _configurationProvider = configurationProvider;
        }

        public async Task<DocumentSearchResult<SearchResultModel>> RunSearch(string text, string latitude, string longitude, string kmdistance, Microsoft.Extensions.Logging.ILogger log)
        {
            if (String.IsNullOrEmpty(kmdistance))
            {
                kmdistance = await _configurationProvider.GetSetting("SearchDefaultDistance");
            }

            var serviceName = await _configurationProvider.GetSetting("SearchServiceName");
            var serviceApiKey = await _configurationProvider.GetSetting("SearchServiceApiKey");
            var indexName = await _configurationProvider.GetSetting("SearchServiceIndex");
            var dnsSuffix = await _configurationProvider.GetSetting("SearchSearchDnsSuffix");

            SearchIndexClient indexClient = new SearchIndexClient(serviceName, indexName, new SearchCredentials(serviceApiKey));
            indexClient.SearchDnsSuffix = dnsSuffix;

            var parameters = new SearchParameters()
            {
                Select = new[] { "...{list of fields}..." },
                Filter = string.Format("geo.distance(location, geography'POINT({0} {1})') le {2}", latitude, longitude, kmdistance)
            };

            //TODO - Define sorting based on distance

            var logmessage = await _configurationProvider.GetSetting("SearchLogMessage");

            try
            {
                var results = await indexClient.Documents.SearchAsync<SearchResultModel>(text, parameters);

                log.LogInformation(string.Format(logmessage, text, latitude, longitude, kmdistance, results.Count.ToString()));

                return results;
            }
            catch (Exception ex)
            {
                log.LogError(ex.Message);
                log.LogError(ex.StackTrace);
                throw; // rethrow without resetting the stack trace
            }
        }
    }

I set the “SearchSearchDnsSuffix” setting to “search.azure.us” for Azure Government, and it all immediately worked.
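For completeness, here’s a hypothetical sketch of the Azure Function that would call this provider. The function name, query parameter names, and the dependency injection registration of IGeoSearchProvider are my own illustration (not from the original project), and it assumes the in-process Functions model with Microsoft.Azure.WebJobs:

using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;

public class GeoSearchFunction
{
    private readonly IGeoSearchProvider _searchProvider;

    // IGeoSearchProvider is assumed to be registered with the Functions DI container at startup.
    public GeoSearchFunction(IGeoSearchProvider searchProvider)
    {
        _searchProvider = searchProvider;
    }

    [FunctionName("GeoSearch")]
    public async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "get")] HttpRequest req,
        ILogger log)
    {
        // Pull the search text and coordinates off the query string and hand them to the provider.
        var results = await _searchProvider.RunSearch(
            req.Query["text"], req.Query["latitude"], req.Query["longitude"], req.Query["km"], log);

        return new OkObjectResult(results.Results);
    }
}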

Log Analytics – Disk Queries

So Log Analytics is a really powerful tool; the ability to ingest a wide variety of logs can help you build out some robust monitoring to better support your application. And this ultimately enables you to build out robust dashboards.

Now I recently had to do some Log Analytics queries, specifically around disk statistics, to monitor all the disks on a given machine. And if you’re like me, you don’t write these queries often, so when you do it can be a process.

Now there are a couple of things to note about Log Analytics queries, and KQL in particular. The biggest and most important is that order of operations matters. Unlike SQL, each clause is applied in sequence, which is a lot closer to piping commands with a | in Linux than to a “where” clause in SQL. You need to make sure you apply each clause in the right place, or you can make things a lot harder on yourself.
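As a quick illustration (using the same Perf table as the queries below), each stage only sees what the previous stage produced: a where before the summarize filters the raw rows, while a where after it can only filter the aggregated columns.

Perf 
| where ObjectName == "LogicalDisk" and CounterName == "% Free Space"   // filter the raw rows first
| summarize FreeSpace = min(CounterValue) by Computer, InstanceName     // then aggregate
| where FreeSpace < 20                                                  // finally filter the aggregated values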

So anyway, here are some queries I think you’ll find helpful:

All Disk Statistics:

Perf 
| where ObjectName == "LogicalDisk"
| summarize Value = min(CounterValue) by Computer, InstanceName, CounterName
| sort by Computer asc, InstanceName asc, CounterName asc // one sort with multiple keys; consecutive sort operators would each re-sort the entire result

% Free space – Graph

Perf 
| where ObjectName == "LogicalDisk" and CounterName == "% Free Space" and InstanceName != "_Total" and Computer == "" // note: == (not =); put the target computer name between the quotes
| summarize FreeSpace = min(CounterValue) by InstanceName, Computer, TimeGenerated
| sort by FreeSpace asc nulls last 
| render timechart

Avg Disk sec / Read – graph

Perf 
| where ObjectName == "LogicalDisk" and CounterName == "Avg. Disk sec/Read" and InstanceName != "_Total" and Computer == ""
| summarize AvgDiskReadPerSec = min(CounterValue) by InstanceName, Computer, TimeGenerated
| sort by AvgDiskReadPerSec asc nulls last 
| render timechart

Avg Disk sec / Write

Perf 
| where ObjectName == "LogicalDisk" and CounterName == "Avg. Disk sec/Write" and InstanceName != "_Total" and Computer == ""
| summarize AvgDiskWritePerSec = min(CounterValue) by InstanceName, Computer, TimeGenerated
| sort by AvgDiskWritePerSec asc nulls last 
| render timechart

Current Disk Queue Length

Perf 
| where ObjectName == "LogicalDisk" and CounterName == "Current Disk Queue Length" and InstanceName != "_Total" and Computer == ""
| summarize CurrentQueueLength = min(CounterValue) by InstanceName, Computer, TimeGenerated
| sort by CurrentQueueLength asc nulls last 
| render timechart

Disk Reads/sec – graph

Perf 
| where ObjectName == "LogicalDisk" and CounterName == "Disk Reads/sec" and InstanceName != "_Total" and Computer == ""
| summarize DiskReadsPerSec = min(CounterValue) by InstanceName, Computer, TimeGenerated
| sort by DiskReadsPerSec asc nulls last 
| render timechart

Disk Transfers/sec – Graph

Perf 
| where ObjectName == "LogicalDisk" and CounterName == "Disk Transfers/sec" and InstanceName != "_Total" and Computer == ""
| summarize DiskTransfersPerSec = min(CounterValue) by InstanceName, Computer, TimeGenerated
| sort by DiskTransfersPerSec asc nulls last 
| render timechart

Disk Writes/sec – Graph

Perf 
| where ObjectName == "LogicalDisk" and CounterName == "Disk Writes/sec" and InstanceName != "_Total" and Computer == ""
| summarize DiskWritesPerSec = min(CounterValue) by InstanceName, Computer, TimeGenerated
| sort by DiskWritesPerSec asc nulls last 
| render timechart

Alert = % Free Space Warning

Perf 
| where ObjectName == "LogicalDisk" and CounterName == "% Free Space"
| summarize FreeSpace = min(CounterValue) by Computer, InstanceName
| where FreeSpace < 20
| sort by FreeSpace asc nulls last 
| render barchart kind=unstacked

Cloud Networking and Security

Now here’s a fun topic I wanted to share, as I’ve been looking more and more into this. When many people think of the cloud, in my experience the ideas of networking and security are what have changed most dramatically compared to what they are used to in an on-prem environment.

At its core, there is a mindset shift between the way on-prem data centers and cloud-based networking function. And it’s important to remember these fundamental differences, or else you run into a variety of problems down the road. It’s easy to get overwhelmed, to be honest, and I don’t mean for this to seem complete by any stretch of the imagination. But you have to start somewhere, right?

The most important thing to remember is that some elements of security just don’t apply anymore, at least not in the traditional sense. And here are some of those concepts:

  • Perimeter Security is not what it used to be: This is the hardest thing for a lot of people to realize, but everyone still tries to cling to the notion that the only way to secure a workload is to lock down every public endpoint, build a perimeter around your application, and then call it a day. Do a search online for the number of companies who relied on perimeter-only security practices and how many times it blew up in their faces. Security threats and attack vectors are always changing, and the idea that you can build a fence and call it good enough is just ridiculous.
  • Authentication / Authorization are the new IP address: Another situation I see all too commonly with the cloud is people clinging to IP whitelisting. IP whitelisting is not sufficient against many of the more sophisticated attackers anymore. And to be honest, you’re preventing yourself from taking advantage of cloud-based services that are more secure than what you are capable of implementing yourself. The idea of Zero Trust has been growing more and more, and here we assume that no caller is trusted without credentials. This ensures better security overall.

So what do we have to look at to start? I wanted to provide some ideas of potential areas to focus on when it comes to security for the cloud, and those options are here.

  • Here is a quickly consumable set of “Best Practices” for securing IaaS workloads.
  • Additionally, there is a link to the security documentation for Azure, which provides a lot of detail on different topics and questions.

And here is a reference on the Microsoft Shared Responsibility model for Security.

  • Network Security Options:  Here is a list of options for network security.
  • Network / Application Security Groups:  NSGs are a great way of limiting the traffic within a virtual network.  Additionally in this space, Azure provides service tags, which allow you to reference Azure services (things like “AzureTrafficManager”, “VirtualNetwork”, “Sql”, “Storage”) when creating rules.  There is also an option with Application Security Groups (ASGs), which enable you to configure your NSG rules based on your application architecture (see the sketch after this list).
  • Virtual Network Service Endpoints:  This provides an option to extend your virtual network private address space to Azure services without traveling the public internet.  So the intention here would be, I want my machines to access “KeyVault”, but I don’t want it to be accessible outside of the vNet.  This is important as it allows you to further lock down your networking and access.
  • Virtual Network Peering:  If you are implementing multiple virtual networks and want communication to occur across them, you would need to implement VNet peering to enable that traffic.
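To make the service tag idea a little more concrete, here’s a minimal Terraform sketch of an NSG rule that allows outbound HTTPS to the “Storage” service tag instead of an IP range. The rule name and the referenced resource group / NSG are placeholders of my own, not from a specific environment:

resource "azurerm_network_security_rule" "allow_storage_outbound" {
  name                        = "allow-storage-outbound"
  priority                    = 200
  direction                   = "Outbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "443"
  source_address_prefix       = "VirtualNetwork" # service tag for the vnet itself
  destination_address_prefix  = "Storage"        # service tag in place of an IP range
  resource_group_name         = azurerm_resource_group.rg.name
  network_security_group_name = azurerm_network_security_group.nsg.name
}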

Ultimately, as I mentioned above, Zero Trust security models are really the direction the industry is heading from a cyber security perspective. A great site that covers the idea of Zero Trust and all of its considerations can be found here, as well as a great whitepaper here.

AI / Analytics Tools in Azure

So when it comes to Artificial Intelligence in Azure, there are a lot of tools and a lot of options and directions you can explore, and AI is a broad topic by itself. That being said, I wanted to share some resources to help, whether you are looking for demos to show the “art of the possible” or tools to get started if you are a data scientist or doing this kind of work.

Let’s start with some demos.  Here are links to some of the demos that I find particularly interesting about the capabilities provided by Azure in this space.

  • Video.AI : This site allows you to upload videos and run them through a variety of cognitive / media services to showcase the capabilities. 
  • JFK Files : This is one of my favorites, as it shows the capabilities of cognitive search with regard to searching large datasets and making for a good reusable interface for surfacing some of the findings of things like transcription. 
  • Coptivity : Here’s a link to the video for CopTivity and how the use of a modern interface is interesting to law enforcement. 

Now when it comes to offerings in this space, there are a lot and the list is always growing, but I wanted to cover some at a high level that can be investigated quickly.

Cognitive Services : These are Azure services that use APIs to provide AI capabilities to your applications without you having to build them yourself.  These include things like Custom Vision, Sentiment Analysis, and other capabilities.  Here’s a video discussing it further.

DataBricks : DataBricks is a great technology for providing the compute required to run your Python and Spark based models, and doing so in a way that minimizes the management demands and requirements placed on your application.

Azure Machine Learning : Specifically this offering provides options to empower developers and data scientists to increase productivity.  Here’s a video giving the quick highlights of what Azure Machine Learning Studio is.  And a video on data labeling in ML Studio.  Here’s a video about using Azure Machine Learning Designer to democratize AI.  Here’s a video on using Azure Machine Learning DataSets. 

Data Studio : Along with tools like VS Code, which is a great IDE for doing Python and other work, we do provide a similar open source tool called Azure Data Studio, which can help with the data work your teams are doing.  Here’s a video on how to use Jupyter notebooks with it.  Additionally VSCode provides options to support this kind of work as well (video). 

Azure Cognitive Search:  As I mentioned above Search can be a great way to surface insights to your users, and here’s a video on using Cognitive Search. 

Azure Data Science VM: Finally, part of the battle of doing data science work is maintaining all the open source tools and leveraging them to your benefit; the amount of time required for machine configuration is not insignificant.  Azure provides a VM option where you can create a VM preloaded with all the tools you need.  Azure has it set up for Windows Server 2016, Ubuntu, and CentOS, and there is even a version built around Geo AI with ArcGIS.  There is no additional charge for this; you pay for the underlying VM you are using, but Microsoft does not charge extra for the data science tools preinstalled on it.

I particularly love this diagram as it shows all the tools included:

Now again, this is only scratching the surface, but I think it’s a powerful place to start to find out more. I have additional posts on this topic.

Reserved Instances and where everyone gets it wrong

So one of the most important things in cloud computing is cost management. I know, this is just the thing that we all went to school for and learned to code for…spreadsheets! We all wanted to do cost projections and figure out gross margin, right…right?

In all seriousness, cost management is an important part of building solutions in the cloud, because it ultimately goes to the sustainability of your solutions and the ability to provide the best features possible. The simple fact is, no matter how you slice it, resources will always be a factor.

Reserved Instances are a common offering for every cloud provider. And honestly they are the best option to easily save money in the cloud and do so in a way that empowers you to grow as your solution does and save money along the way.

Now to that end, there are some serious misconceptions about Reserved instances, that I wanted to share. And these specifically relate to the Azure version of Reserved instances.

Misconception 1 – Reserved Instances are attached to a specific VM.

This is the biggest misconception. The question of “how do I apply the RI I just purchased to VM xyz?” comes up constantly. The answer is “you don’t.” Reserved Instance pricing is actually a pre-purchase of compute hours for a specific SKU, so there is no process by which you attach the RI to a specific VM.

Let’s take an example to understand the implementation of this a little more:

  • I have 5x DS2v2’s running, which are costing me $170.82 each for a total of $854.10. Now I’ve decided to do a 1-year RI, which brings about a 29% savings, bringing my per-VM cost to $121.25 and my total to $606.23.
  • I go through the portal, and purchase 5x 1-Year Reserved Instances for DS2v2, to get this cost savings.

And that’s it, I’m done.

It really is that simple, at the end of the day. What’s happening behind the scenes is that I have prepurchased 3,650 hours of compute time per month (5 VMs × 730 hours) at that lower price. So when my bill is calculated, the first 3,650 hours will be at the lower price, and I don’t need to worry about which VMs the RI is attached to. I just know that I’m paying less for the first 3,650 hours.

So the next logical question is, what happens if I have 6 VMs? The math works out like this.

  • The normal PAYG rate is $170.82 which comes to ~ $0.234 per hour.
  • I purchased 5x DS2v2’s at the lower rate ($121.25), which means the hourly rate is ~ $0.167 per hour.
  • I’ve got 6x DS2v2’s running currently within the scope of the RI. So that means that ultimately in 1 month (assuming 730 hours in the month), I am consuming 4,380 compute hours.

What that means is that this is how the pricing breaks out:

Number of Hours   With RI   PAYG Rate   Total Cost
3,650             $0.167                $606.23
730                         $0.234      $170.82
Total                                   $777.05

So what this means is any overage above the RI is just simply billed at the PAYG rate, which means you have the result you are looking for.
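If it helps to see that math spelled out, here’s a quick sketch (a small .NET 6+ top-level program) using the example prices from this post; because of rounding in the hourly rates, the total comes out within a few cents of the table above:

using System;

// Example prices from this post (not official pricing).
const double paygPerVmPerMonth = 170.82;  // DS2v2 pay-as-you-go, per month
const double riPerVmPerMonth = 121.25;    // DS2v2 with a 1-year RI, per month
const double hoursPerMonth = 730;

double riHourly = riPerVmPerMonth / hoursPerMonth;      // ~ $0.167
double paygHourly = paygPerVmPerMonth / hoursPerMonth;  // ~ $0.234

double reservedHours = 5 * hoursPerMonth;               // 5 RIs purchased = 3,650 hours
double consumedHours = 6 * hoursPerMonth;               // 6 VMs running   = 4,380 hours
double overageHours = consumedHours - reservedHours;    // 730 hours billed at PAYG

double total = (reservedHours * riHourly) + (overageHours * paygHourly);
Console.WriteLine($"Monthly total: {total:C}");          // ~ $777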

But this also buys you a lot of flexibility, you gain the ability to add VMs, and delete VMs and as long as at the end the hours are the same it doesn’t matter. This gives you a lot of power to get the maximum amount of savings without having to go through a lot of headaches.

Misconception 2 – We can’t use RI because it costs money up front.

This is another misconception, because it is 100% not true. You can sign up for monthly payments on your RI, which removes all the risk for you ultimately. You can get the discount and pay the monthly amount without having to pay a large lump sum up front. Here’s a link on how to do so.

Now the most common question I get with this is “Kevin, how is this different than PAYG?” The answer is this: the amount is calculated the same as the upfront cost and then broken up into a monthly charge. That charge for those compute hours is divided evenly over the period (1-year or 3-year). Where the difference comes in is that an RI is use-it-or-lose-it.

Take the following scenario:

  • I have 5x DS2v2, with a one year reservation, meaning I’m paying $121.25 a month for each of them. The total being $606.23 a month, spread out over 12 months.
  • If I delete 2 of those VMs, and don’t provision any more, and don’t modify my reservation, my bill for the month will be $606.23. It is use it or lose it, the hours do not roll over to the next month, and I would have paid $242.50 for nothing.

Now if I created new VMs, no problem, or if I exchanged the RI, also not a problem. But it’s important to know that I can get the benefit of paying monthly, and provided I make sure the reservations are managed properly, I’ll have no problems and get the full benefit of the discount.

Also worth mentioning here, there is no difference to the discount if you pay upfront, or monthly, the discount is 100% the same.

Misconception 3 – I can’t change it after I buy.

This is definitely one of the most common misconceptions I see out there. You absolutely can swap / exchange reservations as your requirements and needs change. And this allows you to change the size of a VM or service to meet your needs without losing money. Here’s a link on how to manage your reservations. And here’s a link on self-service exchanges and refunds.

There is a lot of detail in the links above, and it’s pretty self-explanatory. So please review these policies, but the end story is that you are not locked in and committed to paying some huge amount of money if you change your mind.

Ultimately, that also means that you don’t need to wait as long to gain the benefits of this program. It’s definitely something you should take advantage of as soon as you can.

Misconception 4 – I can’t use RI because I use PaaS Services in my solution.

Another huge misconception. The RI program from Microsoft is changing all the time, and new services are being added constantly. The services included with RI offerings at the time of this post are:

  • App Service
  • Azure Redis Cache
  • Cosmos
  • Database for MariaDB
  • Database for PostgreSQL
  • Databricks
  • Dedicated Hosts
  • Disk Storage
  • Storage
  • SQL Database
  • SQL Warehouse
  • Virtual Machines

Which is a pretty broad group of services and new services are lighting up every day.

Misconception 5 – RI isn’t worth it.

I never understood this, given that I can pay monthly, exchange or get a refund, cover VMs and a whole bunch of other services…and usually get between 25-30% (1-year RI) or 40-50% (3-year RI) off my bill, just because I decide to commit. This is absolutely the first thing you should look at when you are looking to cut your cloud hosting costs.

Final Thoughts

I hope that clears up some concerns and thoughts about your Azure costs and how to manage your bill to ensure that you can provide your solutions to your end customers in a cost-effective manner.

What are some things I can do to improve application availability?

So I’ve done a few posts on how to increase the availability and resiliency for your application, and I’ve come to a conclusion over the past few months. This is a really big topic that seems to be top of mind for a lot of people. I’ve done previous posts like “Keeping the lights on!” and “RT? – Making Sense of Availability?” which talk about what it means to go down this path and how to architect your solutions for availability.

But another question that comes with that, is what types of things should I do to implement stronger availability and resiliency in my applications? How do I upgrade a legacy application for greater resiliency? What can I do to keep this in mind as a developer to make this easier?

So I wanted to compile a list of things you should look for in your application that, if changed, would increase your availability / resiliency, and / or things to avoid if you want to improve both of these factors for your applications.

So let’s start with the two terms I continue to use, because I find they inform the mindset for improving both, and that only happens if we are all talking about the same thing.

  • Availability – The ability of your application to continue operating its critical functionality, even during a major outage or downtime situation; in other words, the ability to continue to offer service with minimal user impact.
  • Resiliency – The ability of your application to continue processing current work, even in the event of a major or transient fault; in other words, finishing the work that is currently in progress.

So looking at this further, the question becomes what kinds of things should I avoid, or remove from my applications to improve my position moving forward:

Item #1 – Stateful Services

Generally speaking this is a key element in removing issues with availability and resiliency, and it can be a hotly debated issue, but here’s where I come down on this. If a service has state (either in memory or otherwise), it means that failing over to somewhere else becomes significantly more difficult. I now must replicate that state, and if it’s held in memory, that becomes a LOT harder. If it’s in a separate store, like SQL or Redis, it becomes easier, but at the same time requires additional complexity, which can make that form of availability harder. This is especially true as you add “9”s to your SLA. So generally speaking, if you can avoid having application components that rely on state, it’s for the best.

Additionally, stateful services also cause other issues in the cloud, including limiting the ability to scale out as demand increases. The perfect example of this is “sticky sessions”, which mean that once you are routed to a server, you keep getting sent to the same server. This is the antithesis of scaling out and should be avoided at all costs.

If you are dealing with a legacy application, and removing state is not feasible, then at the minimum you would need to make sure that state is managed outside of memory. An example being if you can’t remove session, move it to SQL and replicate.

Item #2 – Tight Couplings

This one points to both of the key elements that I outlined above. When you have tight coupling between application components, you create something that fails as a unit and prevents you from building a solution that scales well.

Let’s take a common example, let’s say you have an API tier on your application, and that api is built into the same web project as your UI front end. That API then talks directly to the database.

This is a pretty common legacy pattern. The problem it creates is that the load on your web application and the load on the backend API are very tightly coupled, so a failure in one means a failure in the other.

Now let’s take this a step further and say that you expose your API to the outside world (following security practices) to let your application be extensible. Sounds all good, right?

Except when you look deeper, by having all your application elements talking directly to each other, you have now created a scenario where cascading failures can completely destroy your application.

For example, one of your customers decides to leverage your API pretty hard, pulling a full dump of their data every 30 seconds, or you sign up a lot of customers who all decide to hit your API. It leads to the following effects:

  1. The increase demand on the api causes memory and cpu consumption on your web tier to go up.
  2. This causes performance issues on your applications ability to load pages.
  3. The increased transactions against the API put higher demand on SQL, and that increased demand causes your application to experience resource deadlocks.
  4. Those resource deadlocks cause further issues with user experience as the application fails.

Now you are probably thinking, “yes Kevin, but I can just enable autoscaling in the cloud and it solves all those issues.” To which my response is: and an uncontrolled inflation of your bill to go with it. So clearly your CFO is OK with uncontrolled costs to offset a bad practice?

One way we can resolve this awfulness is to split the API out to a separate compute tier; by doing so, we can manage that compute separately without having to wildly scale everything to offset the issue. I then have separate options for allowing each part of my application to scale.

Additionally, I can implement queues as a load-leveling practice, which allows my application to scale only in scenarios where queue depth expands beyond a reasonable response time. I can also throttle requests coming from the API or prioritize messages coming from the application. I can then replicate the queue messages to provide greater resiliency.
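Here’s a minimal sketch of that queue-based load-leveling idea, assuming the Azure.Storage.Queues package; the queue name, class names, and connection string handling are placeholders of my own. The API drops work onto a queue and returns immediately, while a separate worker drains it at its own pace:

using System.Threading.Tasks;
using Azure.Storage.Queues;

// API side: enqueue the work and return, instead of doing it inline with the request.
public class NotificationEnqueuer
{
    private readonly QueueClient _queue;

    public NotificationEnqueuer(string connectionString)
    {
        // "notifications" is a placeholder queue name.
        _queue = new QueueClient(connectionString, "notifications");
    }

    public async Task EnqueueAsync(string payload)
    {
        await _queue.CreateIfNotExistsAsync();
        await _queue.SendMessageAsync(payload);
    }
}

// Worker side: drain the queue at a controlled rate and scale on queue depth, not on raw API traffic.
public class NotificationWorker
{
    private readonly QueueClient _queue;

    public NotificationWorker(string connectionString)
    {
        _queue = new QueueClient(connectionString, "notifications");
    }

    public async Task ProcessOnceAsync()
    {
        var messages = await _queue.ReceiveMessagesAsync(maxMessages: 10);
        foreach (var message in messages.Value)
        {
            // ... do the actual work for this message here ...
            await _queue.DeleteMessageAsync(message.MessageId, message.PopReceipt);
        }
    }
}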

Item #3 – Enabling Scale out

Now I know, I just made it sound like scaling out is awful, but the key word here is “controlled.” What I mean is that by making your services stateless and implementing practices to decouple them, you create scenarios where you can run one or more copies of a service, which enables all kinds of benefits from a resiliency and availability perspective. It changes your services from pets to cattle; you no longer care if one is brought down, because another takes its place. It’s sort of like a hydra, which is a good way of thinking about it.

Item #4 – Move settings out of each piece of an application

The more tightly your settings and application code are connected, the harder it is to make changes on the fly. If your configuration is baked into your code and requires a deployment to change, then something as simple as changing an endpoint becomes an increasingly difficult thing to do. So the best thing you can do is start moving those configuration settings out of your application. No matter how you look at it, this is an important thing to do, for reasons relating to:

  • Security
  • Maintainability
  • Automation
  • Change Management

Item #5 – Build in automated deployment pipeline

The key to high availability comes down to automation a lot of times, especially when you hit the higher levels of getting more 9’s. The simple fact is that seconds count.

But more than that, automated deployments help to manage configuration drift. The simple fact is that the more configuration drift you have, the harder it is to maintain a secondary region, because you have to make sure that one region doesn’t have things the other does not. This is eliminated by forcing everything to go through the automated deployment pipeline. If every change must be scripted and automated, it is almost impossible for configuration drift to creep into your environments.

Item #6 – Monitoring, Monitoring, Monitoring

Another element of high availability and resiliency is monitoring. If you had asked me years ago, the #1 question most developers treated as an afterthought was “How do I secure this?” And while that is a question a lot of developers still somehow treat as an afterthought, the bigger one now is “How do I monitor this and know it is working?” Given the rise of microservices and serverless computing, we really need to be able to monitor every piece of code we deploy. So we need hooks into anything new we build to answer that question.

This could be as simple as building in logging for custom telemetry into Application Insights, or logging incoming and outgoing requests, logging exceptions, etc. But we can’t make sure something is running without implementing these metrics.
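As a minimal sketch of that kind of custom telemetry, assuming the Microsoft.ApplicationInsights package and a TelemetryClient provided by dependency injection (the class, method, and event names here are made up for illustration):

using System;
using System.Collections.Generic;
using Microsoft.ApplicationInsights;

public class NotificationService
{
    private readonly TelemetryClient _telemetry;

    // In ASP.NET Core or Azure Functions, TelemetryClient is typically injected by the DI container.
    public NotificationService(TelemetryClient telemetry)
    {
        _telemetry = telemetry;
    }

    public void SendNotification(string userId)
    {
        try
        {
            // ... the actual work goes here ...

            // Custom event so dashboards and alerts can confirm this path is being exercised.
            _telemetry.TrackEvent("NotificationSent",
                new Dictionary<string, string> { { "userId", userId } });
        }
        catch (Exception ex)
        {
            // Exceptions become searchable telemetry rather than silent failures.
            _telemetry.TrackException(ex);
            throw;
        }
    }
}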

Item #7 – Control Configuration

This one builds upon the comments above. The biggest mistake that I see people make with regard to these kinds of implementations is that they don’t manage how configuration changes are made to an environment. Ultimately this leads back to a “pets” vs “cattle” mentality. I had a boss once in my career who had a sign above his office that said “Servers are cattle, not pets…sometimes you have to make hamburgers.”

And as funny as the above statement is, there is an element of truth to it. If you allow configuration changes and fixes to be applied directly to an environment, you create a situation where it is impossible to rely on automation with any degree of trust. And it makes monitoring and every other element of a truly highly available or resilient architecture completely irrelevant.

So the best thing you can do is leverage the automated pipeline: if any change needs to be made, it must be pushed through the pipeline, and ideally you remove people’s access to production for anything outside of read access for metrics and logging.

Item #8 – Remove “uniqueness” of environment

And like above, we need to make sure everything about our environments is repeatable. In theory I should be able to blow an entire environment away and, with the click of a button, deploy a new one. And this is only done through scripting everything. I’m a huge fan of Terraform to help resolve this problem, but Bash scripts, PowerShell, the CLI: pick your poison.

The more you can remove anything unique about an environment, the easier it is to replicate it and create at minimum an active / passive environment.

Item #9 – Start implementing availability patterns

If you are starting down this road of implementing more practices to enhance the resiliency of your applications, there are several practices you should consider that as you build out new services would help to create the type of resiliency you are building towards. Those patterns include:

  • Health Endpoint Monitoring – Implementing functional checks in an application to ensure that external tools can be leveraged to help.
  • Queue-Based Load Leveling – Leveraging queues that act as a buffer, putting a layer of abstraction between how requests arrive and how your application handles them, so incoming load is absorbed in a more resilient manner.
  • Throttling – This pattern helps with managing resource consumption so that you can meet system demand while controlling consumption.
  • Circuit Breaker – This pattern is extremely valuable in my experience. Your service should be smart enough to use an incremental retry and back off if a downstream service is impacted.
  • Bulk Head – This pattern leverages separation and a focus on fault tolerance to ensure that just because one service is down, the whole application is not.
  • Compensating Transaction – If you are using a bulkhead, or any kind of fault tolerance, or have separation of concerns, it’s important that you be able to roll a transaction back to its original state.
  • Retry – The most basic pattern to implement and essential for building transient fault tolerance (a minimal sketch follows below).
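Since retry is the most approachable of these patterns, here’s a minimal sketch of a retry with exponential back-off. The helper name and defaults are my own, and in practice a library such as Polly covers retry and circuit breaking for you:

using System;
using System.Threading.Tasks;

public static class TransientRetry
{
    // Retries an async operation with exponential back-off; the final attempt lets the exception bubble up.
    public static async Task<T> ExecuteAsync<T>(
        Func<Task<T>> operation,
        int maxAttempts = 4,
        int baseDelayMs = 200)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Wait 200ms, 400ms, 800ms, ... before the next attempt.
                var delay = TimeSpan.FromMilliseconds(baseDelayMs * Math.Pow(2, attempt - 1));
                await Task.Delay(delay);
            }
        }
    }
}

// Usage: wrap a call to a downstream service that may throw transient errors.
// var result = await TransientRetry.ExecuteAsync(() => httpClient.GetStringAsync(url));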

Item #10 – Remember this is an evolving process

As was described earlier in this post, the intention here is that if you are looking to build out more cloud-based functionality, and in turn increase the resiliency of your applications, the best advice I can give is to remember that this is an iterative process and to look for opportunities to update your application and increase resiliency as you go.

For example, let’s say I have to make changes to an API that sends notifications. If I’m going to make those updates, maybe I can implement queues and logging, and make some changes to break that out into a microservice to increase resiliency. As you do this, you will find that your application’s position will improve.

How to learn Terraform

So, as should surprise no one, I’ve been doing a lot of work with Terraform lately, and I’m a huge fan of it in general. I recently did a post talking about the basics of modules (which can be found here).

But one common question I’ve gotten a lot is how to go about learning Terraform. Where do I start? So I wanted to do a post gathering some education resources to help.

First, for the “what is Terraform” question: Terraform is an open source product created by HashiCorp which enables infrastructure-as-code and is specifically designed to be cloud vendor agnostic. If you want to learn the basics, I recommend this video I did with Steve Michelotti about Terraform and Azure Government:

But more than that, the question becomes how to go about learning Terraform. The first part is configuring your machine, and for that you can find a blog post I did here. There are some things you need to do to set up your environment for Terraform, and without any guidance it can be confusing.

But once you know what Terraform is, the question becomes: how do I learn how to use it?

Outside of these, what I recommend is using the module registry. One of the biggest strengths of Terraform is its public module registry, which allows you to see re-usable code written by others. I highly recommend this as a great way to see working code and play around with it. Here’s the public module registry, and below is a quick example of what pulling a module from it looks like.
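For example, here’s a sketch of consuming the Azure/network module from the public registry; the version pin and the input values are illustrative only, so check the module’s registry page for the exact variables of the version you use:

module "demo_network" {
  source  = "Azure/network/azurerm"
  version = "3.5.0" # illustrative pin; pick a real version from the registry page

  # Inputs below are illustrative; consult the module's documentation for the current variable names.
  resource_group_name = "registry-demo-rg"
  address_space       = "10.0.0.0/16"
  subnet_prefixes     = ["10.0.1.0/24"]
  subnet_names        = ["demo-subnet"]
}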

So that’s a list of some resources to get you started on learning Terraform. Obviously there are also classes from Pluralsight, Udemy, and Lynda; I’ve not leveraged those, but if you are a fan of structured class settings, they would be good places to start.

Microsoft Certifications – Explained

Hello all, education is something that I’ve always felt strongly about. I come from a family where most people have worked or do work in education at a variety of levels, and even in my own career I was a college professor for a while. That being said, a lot of people look to certifications as a great way to drive learning and validate it for your resume.

I personally like certifications as something that you can use and point to as a standard for your skills. Now that being said, the certifications are complex and lots of people have questions about what they mean. So I put this together to help people navigate the different options for certification on the Azure platform:

  • AZ-900 Azure Fundamentals – Understand cloud concepts (15-20%); Understand core Azure services (30-35%); Understand security, privacy, compliance, and trust (25-30%); Understand Azure pricing and support (25-30%)
  • AZ-103 Azure Administration – Manage Azure subscriptions and resources (15-20%); Implement and manage storage (15-20%); Deploy and manage virtual machines (VMs) (15-20%); Configure and manage virtual networks (30-35%); Manage identities (15-20%)
  • AZ-203 Developing Solutions for Azure – Develop Azure Infrastructure as a Service compute solution (10-15%); Develop Azure Platform as a Service compute solution (20-25%); Develop for Azure storage (15-20%); Implement Azure security (10-15%); Monitor, troubleshoot, and optimize solutions (10-15%); Connect to and consume Azure and third-party services (20-25%)
  • AZ-300 Azure Architect Technologies – Deploy and configure infrastructure (25-30%); Implement workloads and security (20-25%); Create and deploy apps (5-10%); Implement authentication and secure data (5-10%); Develop for the cloud and for Azure storage (20-25%)
  • AZ-301 Azure Architect Design – Determine workload requirements (10-15%); Design for identity and security (20-25%); Design a data platform solution (15-20%); Design a business continuity strategy (15-20%); Design for deployment, migration, and integration (10-15%); Design an infrastructure strategy (15-20%)
  • AZ-400 Azure DevOps Solutions – Design a DevOps strategy (20-25%); Implement DevOps development processes (20-25%); Implement continuous integration (10-15%); Implement continuous delivery (10-15%); Implement dependency management (5-10%); Implement application infrastructure (15-20%); Implement continuous feedback (10-15%)
  • AZ-500 Azure Security Technologies – Manage identity and access (20-25%); Implement platform protection (35-40%); Manage security operations (15-20%); Secure data and applications (30-35%)
  • AI-100 Designing and Implementing an Azure AI Solution – Analyze solution requirements (25-30%); Design AI solutions (40-45%); Implement and monitor AI solutions (25-30%)
  • DP-100 Designing and Implementing a Data Science Solution on Azure – Define and prepare the development environment (15-20%); Prepare data for modeling (25-30%); Perform feature engineering (15-20%); Develop models (40-45%)
  • DP-200 Implementing an Azure Data Solution – Implement data storage solutions (40-45%); Manage and develop data processing (25-30%); Monitor and optimize data solutions (30-35%)
  • DP-201 Designing an Azure Data Solution – Design Azure data storage solutions (40-45%); Design data processing solutions (25-30%); Design for data security and compliance (25-30%)

So if you are interested in getting certifications and in moving forward with these, the next question is usually: now I know what I have to learn, but what about the how? The good news is that there are a lot of free resources to help.

  • MS Learn : This is a great site that provides a lot of structured learning paths of different sizes that can assist in your learning these skills.
  • Channel 9 : A great video site on just about everything Microsoft which would help if you want to be walked through something.

Working With Modules in Terraform

I’ve done a bunch of posts on Terraform, and there seems to be a bigger and bigger demand for it. If you follow this blog at all, you know that I am a huge supporter of Terraform and the underlying idea of infrastructure-as-code, the value prop of which I think is essential to any organization that wants to leverage the cloud.

Now that being said, it won’t take long after you start working with TerraForm, before you stumble across the concept of Modules. And it also won’t take long before you see the value of those modules as well.

So the purpose of this post is to walk you through creating your first module and give you an idea of how doing this benefits you.

So what is a module? A module in Terraform is a way of creating smaller, re-usable components that can help to make management of your infrastructure significantly easier. So let’s take, for example, a basic Terraform template. The following will generate a single VM in a virtual network.

provider "azurerm" {
  subscription_id = "...."
}

resource "azurerm_resource_group" "rg" {
  name     = "SingleVM"
  location = "eastus"

  tags {
    environment = "Terraform Demo"
  }
}

resource "azurerm_virtual_network" "vnet" {
  name                = "singlevm-vnet"
  address_space       = ["10.0.0.0/16"]
  location            = "eastus"
  resource_group_name = "${azurerm_resource_group.rg.name}"

  tags {
    environment = "Terraform Demo"
  }
}

resource "azurerm_subnet" "vnet-subnet" {
  name                 = "default"
  resource_group_name  = "${azurerm_resource_group.rg.name}"
  virtual_network_name = "${azurerm_virtual_network.vnet.name}"
  address_prefix       = "10.0.2.0/24"
}

resource "azurerm_public_ip" "pip" {
  name                = "vm-pip"
  location            = "eastus"
  resource_group_name = "${azurerm_resource_group.rg.name}"
  allocation_method   = "Dynamic"

  tags {
    environment = "Terraform Demo"
  }
}

resource "azurerm_network_security_group" "nsg" {
  name                = "vm-nsg"
  location            = "eastus"
  resource_group_name = "${azurerm_resource_group.rg.name}"
}

resource "azurerm_network_security_rule" "ssh-access" {
  name                        = "ssh"
  priority                    = 100
  direction                   = "Outbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "*"
  source_address_prefix       = "*"
  destination_address_prefix  = "*"
  destination_port_range      = "22"
  resource_group_name         = "${azurerm_resource_group.rg.name}"
  network_security_group_name = "${azurerm_network_security_group.nsg.name}"
}

resource "azurerm_network_interface" "nic" {
  name                      = "vm-nic"
  location                  = "eastus"
  resource_group_name       = "${azurerm_resource_group.rg.name}"
  network_security_group_id = "${azurerm_network_security_group.nsg.id}"

  ip_configuration {
    name                          = "myNicConfiguration"
    subnet_id                     = "${azurerm_subnet.vnet-subnet.id}"
    private_ip_address_allocation = "dynamic"
    public_ip_address_id          = "${azurerm_public_ip.pip.id}"
  }

  tags {
    environment = "Terraform Demo"
  }
}

resource "random_id" "randomId" {
  keepers = {
    # Generate a new ID only when a new resource group is defined
    resource_group = "${azurerm_resource_group.rg.name}"
  }

  byte_length = 8
}

resource "azurerm_storage_account" "stgacct" {
  name                     = "diag${random_id.randomId.hex}"
  resource_group_name      = "${azurerm_resource_group.rg.name}"
  location                 = "eastus"
  account_replication_type = "LRS"
  account_tier             = "Standard"

  tags {
    environment = "Terraform Demo"
  }
}

resource "azurerm_virtual_machine" "vm" {
  name                  = "singlevm"
  location              = "eastus"
  resource_group_name   = "${azurerm_resource_group.rg.name}"
  network_interface_ids = ["${azurerm_network_interface.nic.id}"]
  vm_size               = "Standard_DS1_v2"

  storage_os_disk {
    name              = "singlevm_os_disk"
    caching           = "ReadWrite"
    create_option     = "FromImage"
    managed_disk_type = "Premium_LRS"
  }

  storage_image_reference {
    publisher = "Canonical"
    offer     = "UbuntuServer"
    sku       = "16.04.0-LTS"
    version   = "latest"
  }

  os_profile {
    computer_name  = "singlevm"
    admin_username = "uadmin"
  }

  os_profile_linux_config {
    disable_password_authentication = true

    ssh_keys {
      path     = "/home/uadmin/.ssh/authorized_keys"
      key_data = "{your ssh key here}"
    }
  }

  boot_diagnostics {
    enabled     = "true"
    storage_uri = "${azurerm_storage_account.stgacct.primary_blob_endpoint}"
  }

  tags {
    environment = "Terraform Demo"
  }
}

Now that Terraform script shouldn’t surprise anyone, but here’s the problem: what if I want to take that template and make it deploy 10 VMs instead of 1 in that virtual network?

Now I could take the public IP, NIC, and virtual machine resource blocks (roughly 70 lines of the template above) and do some copying and pasting for the other 9 VMs, which would add roughly 630 lines of code to my Terraform template. Then I’d have to manually make sure they are all configured the same, and add the lines of code for the load balancer, which would probably be another 20-30…

If this hasn’t made you cringe, I give up.

The better approach would be to implement a module, so the question is: how do we do that? We start with our folder structure; I would recommend the following:

  • Project Folder
    • Modules
      • Network
      • VirtualMachine
      • LoadBalancer
    • main.tf
    • terraform.tfvars
    • secrets.tfvars

Now the idea here being that we create a folder to contain all of our modules, and then a separate folder for each one. When I was learning about modules, this tripped me up: you can’t have the .tf files for your modules in the same directory as your root template, especially if they have any similarly named variables like “region”. If you put them in the same directory, you will get errors about duplicate variables.

Now once you have your folders, what do we put in each of them? The answer is this…main.tf. I do this because it makes it easy to reference and track the core module in my code. Being a developer and DevOps fan, I firmly believe in consistency.

So what does that look like? Below is the file I put in “Network\main.tf”:

variable "address_space" {
    type = string
    default = "10.0.0.0/16"
}

variable "default_subnet_cidr" {
    type = string 
    default = "10.0.2.0/24"
}

variable "location" {
    type = string
}

resource "azurerm_resource_group" "basic_rig_network_rg" {
    name = "vm-Network"
    location = var.location
}

resource "azurerm_virtual_network" "basic_rig_vnet" {
    name                = "basic-vnet"
    address_space       = [var.address_space]
    location            = azurerm_resource_group.basic_rig_network_rg.location
    resource_group_name = azurerm_resource_group.basic_rig_network_rg.name
}

resource "azurerm_subnet" "basic_rig_subnet" {
 name                 = "basic-vnet-subnet"
 resource_group_name  = azurerm_resource_group.basic_rig_network_rg.name
 virtual_network_name = azurerm_virtual_network.basic_rig_vnet.name
 address_prefix       = var.default_subnet_cidr
}

output "name" {
    value = "BackendNetwork"
}

output "subnet_instance_id" {
    value = azurerm_subnet.basic_rig_subnet.id
}

output "networkrg_name" {
    value = azurerm_resource_group.basic_rig_network_rg.name
}

Now there are a couple of key elements that I make use of here: you’ll notice that there is a variables section, the Terraform resources themselves, and an outputs section.

It’s important to remember that every Terraform module is self-contained; similar to how you scope parameters, you pass values into the module and then use them accordingly. And by defining the “output” values, I can pass things back to the main template.

Now the question becomes: what does it look like to implement it? When I go back to my root-level “main.tf”, I find I can now leverage the following:

module "network" {
  source = "./modules/network"

  address_space = var.address_space
  default_subnet_cidr = var.default_subnet_cidr
  location = var.location
}

A couple of key elements to note here: the “source” property points to the module folder that contains the main.tf, and then I am mapping variables at my environment level to the module’s variables. This allows me to control what gets passed into each instance of the module. So that covers how to get values into the module.

The next question is how you get values back out. In my root main.tf file, I would have code like the following:

network_subnet_id = module.network.subnet_instance_id

To reference it and interface with the underlying module, I would just reference module.network.___________ with the appropriate output variable name.
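To make that concrete, here’s a hypothetical sketch of those outputs feeding another module from the folder structure above; the VirtualMachine module and its variable names are made up for illustration:

module "virtual_machine" {
  source = "./modules/virtualmachine"

  # The subnet ID exposed by the network module's output becomes this module's input.
  subnet_id = module.network.subnet_instance_id
  location  = var.location
}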

Now I want to be clear this is probably the most simplistic module I can think of, but it illustrates how to hit the ground running and create new modules, or even use existing modules in your code.

For more information, here’s a link to the HashiCorp Learn site, and here is a link to the Terraform module registry, which is a collection of prebuilt modules that you can leverage in your code.

High Availability – a storage architecture

Hello all, so I’ve been doing a lot of work around availability in the cloud and how to build applications that are architected for resiliency. And one of the common questions that comes up is: how do I architect for resiliency around storage?

So the scenario is this, and it’s a common one: I need to be able to write new files to blob storage and read from my storage accounts, and I need it to be as resilient as possible.

So let’s start with the SLA. Currently, if you are running LRS storage, your SLA is 99.9%, which from a resiliency perspective isn’t ideal for a lot of applications. But if I use RA-GRS, my SLA goes up to 99.99%.

Now, I want to be clear about storage SLAs, this SLA says that I will be able to read data from blob storage, and that it will be available 99.99% of the time using RA-GRS.

For those who are new to blob storage, let’s talk about the different types of storage available:

  • Locally Redundant Storage (LRS) : This means that the 3 copies of the data you put in blob storage, are stored within the same zone.
  • Zone Redundant Storage (ZRS): This means that the 3 copies of the data you put in blob storage are stored across availability zones.
  • Geo Redundant Storage (GRS) : This means that the 3 copies of the data you put in blob storage, are stored across multiple regions, following azure region pairings.
  • Read Access Geo Redundant Storage (RA-GRS): This means that the 3 copies of the data you put in blob storage, are stored across multiple regions, following azure region pairings. But in this case you get a read access endpoint you can control.

So based on the above, the recommendation is that for the best availability you would use RA-GRS, which is a feature that is unique to Azure. RA-GRS enables you to have a secondary endpoint where you can get read-only access to the backup copies that are saved in the secondary region.

For more details, look here.

So based on that, if your storage account is called:

storagexyz.blob.core.windows.net

Your secondary read-access endpoint would be :

storagexyz-secondary.blob.core.windows.net

So the next question is, “That’s great Kevin, but I need to be able to write and read”, and I have an architecture pattern I recommend for that. And it is this:

The above architecture is oversimplified, but it focuses on the storage account configuration for higher availability. In the above architecture, we have a web application that is deployed behind Traffic Manager, with an instance in a primary region and an instance in a secondary region.

Additionally, we have an Azure SQL database that is geo-replicated into a backup region.

Let’s say for the sake of argument, with the above:

  • Region A => East US
  • Region B => West US

But for storage, we do the following: Storage Account A will be in East US, which means that it will automatically replicate to West US.

Storage Account B will be in West US, which means it replicates to East US.

So let’s look at the Region A side:

  • New Blobs are written to Storage Account A
  • Blobs are read based on database entries.
  • The application tries to read the blob from the storage account identified in the database; if that read fails, it uses the “-secondary” endpoint.

So for Region B side:

  • New Blobs are written to Storage Account B
  • Blobs are read based on database entries
  • The application tries to read the blob from the storage account identified in the database; if that read fails, it uses the “-secondary” endpoint.

So in our databases I would recommend the following fields for every blob saved:

  • Storage Account Name
  • Container Name
  • Blob Name

This allows me to easily switch to the “-secondary” endpoint when it is required.
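Here’s a minimal sketch of that read-with-fallback logic, assuming the Azure.Storage.Blobs package; the helper name is made up, the three parameters are the fields stored in the database above, and authentication is omitted (the URIs would need a SAS token or a credential passed to BlobClient in a real system). Newer SDK versions also expose a geo-redundant secondary option on the client options, which is worth checking before hand-rolling this:

using System;
using System.IO;
using System.Threading.Tasks;
using Azure.Storage.Blobs;

public static class ResilientBlobReader
{
    // Tries the primary endpoint first, then falls back to the RA-GRS "-secondary" endpoint
    // built from the account name stored in the database.
    public static async Task<Stream> ReadBlobAsync(string accountName, string containerName, string blobName)
    {
        var primary = new Uri($"https://{accountName}.blob.core.windows.net/{containerName}/{blobName}");
        var secondary = new Uri($"https://{accountName}-secondary.blob.core.windows.net/{containerName}/{blobName}");

        foreach (var uri in new[] { primary, secondary })
        {
            try
            {
                var client = new BlobClient(uri);
                var stream = new MemoryStream();
                await client.DownloadToAsync(stream);
                stream.Position = 0;
                return stream;
            }
            catch (Exception)
            {
                // Primary unreachable or blob not readable yet; in production you would narrow
                // this to storage/transient failures before trying the secondary endpoint.
            }
        }

        throw new InvalidOperationException($"Blob '{blobName}' could not be read from either endpoint.");
    }
}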

So based on the above, let’s play out a series of events:

  • We are writing blobs to Storage Account A. (1,2,3)
  • There is a failure, we fail over to Region B.
  • We start writing new blobs to Storage Account B. (4,5,6)
  • If we want to read Blob 1, we do so through the “-secondary” endpoint from Storage Account A.
  • The issue resolves
  • We read Blob 1-3 from Storage Account A (primary endpoint)
  • If we read Blob 4-6 it would be from the “-secondary” endpoint of Storage Account B

Now some would ask the question, “when do we migrate the blobs from B to A?” I would make the argument that you don’t. At the end of the day, storage accounts themselves cost nothing, and you would incur additional charges to move the data to the other account for no benefit. As long as you store the account, container, and blob name for each piece of data, you can always find the blobs, so I don’t see a benefit to merging.