Browsed by
Tag: practices

Embracing the Chaos

Embracing the Chaos

So I’ve done quite a few posts recently about resiliency. And it’s a topic that more and more is very important to everyone as you build out solutions in the cloud.

The new buzz word that’s found its way onto the scene is Chaos engineering. And really this is a practice of building out solutions that are more resilient. That can survive faults and issues that arise, and ensure the best possibly delivery of those solutions to end customers. The simple fact is that software solutions are absolutely critical to every element of most operations, and to have them go down can ultimately break down a whole business if this is not done properly.

At its core, Chaos engineering is about pessimism :). Things are going to fail.

Sort of like every other movement, like Agile and DevOps, Chaos Engineering embraces a reality. In this case that reality is that failures will happen, and should be expected. The goal being that you assume, that there will be failures and should architected to support resiliency.

So what does that actually mean, it means that you determine the strength of the application, by doing controlled experiments that are designed to inject faults into your applications and seeing the impact. The intention being that the application grows stronger and able to handle any faults and issues while maintaining the highest resiliency possible.

How this something new?

Now a lot of people will read the above, and say that “chaos engineering” is just the latest buzz word to cover something everyone’s doing. And there is an element of truth to that, but the details are what matters.

And what I mean by that, is that there is a defined approach to doing this and doing it in a productive manner. Much like agile, and devops. In my experience, some are probably doing elements of this, but by putting a name and methodology to it, we are calling attention to the practice for those who aren’t, and helping with a guide of sorts to how we approach the problem.

There are several key elements that you should keep in mind as you find ways to grow your solution by going down this path.

  • Embrace the idea that failures happen.
  • Find ways to be proactive about failures.
  • Embrace monitoring and visibility

Sort of how Agile embraced the reality that “Requirements change”, and DevOps embrace that “All Code must be deployed.” Chaos engineering embraces that the application will experience failures. This is a fact. We need to assume that any dependency can break, or that components will fail or be unavailable. So what do we mean at a high level for each of these:

Embrace the idea…failure happens

The idea being that elements of your solution will fail, and we know this will happen. Servers go down, service interruptions occur, and to steal a quote from Batman Begins, “Sometime things just go bad.”

I was in a situation once where an entire network connection was taken down by a Squirrel.

So we should build our code and applications in such a way that embraces that failures will eventually occur and build resiliency into our applications to accommodate that. You can’t solve a problem, until you know there is one.

How do we do that at a code level? Really this comes down to looking at your application, or micro service and doing a failure mode analysis. And a taking an objective look at your code and asking key questions:

  • What is required to run this code?
  • What kind of SLA is offered for that service?
  • What dependencies does the service call?
  • What happens if a dependency call fails?

That analysis will help to inform how you handle those faults.

Find ways to be proactive about failure

In a lot of ways, this results in leveraging tools such as patterns, and practices to ensure resiliency.

After you’ve done that failure mode analysis, you need to figure out what happens when those failures occur:

  • Can we implement patterns like circuit breaker, retry logic, load leveling, and libraries like Polly?
  • Can we implement multi-zone, multi-region, cluster based solutions to lower the probability of a fault?

Also at this stage, you can start thinking about how you would classify a failure. Some failures are transient, others are more severe. And you want to make sure you respond appropriately to each.

For example, a monitoring networking outage is very different from a database being down for an extended period. So another key element to consider is how long the fault lasts for.

Embrace Monitoring and Visibility

Now based on the above, the next question is, how do I even know this is happening? With micro service architectures, applications are becoming more and more decentralized means that there are more moving parts that require monitoring to support.

So for me, the best next step is to go over all the failures, and identify how you will monitor and alerts for those events, and what your mitigations are. Say for example you want to do manual failover for your database, you need to determine how long you return failures from a dependency service before it notifies you to do a failover.

Or how long does something have to be down before an alert is sent? And how do you log these so that your engineers will have visibility into the behavior. Sending an alert after a threshold does no one any good if they can’t see when the behavior started to happen.

Personally I’m a fan of the concept her as it calls out a very important practice that I find gets overlooked more often than not.

Distributed Computing and Architecture Patterns

Distributed Computing and Architecture Patterns

So lately I’ve been doing a lot of work on distributed programming, and specifically looking less at projects that are living on-premise, but need to be moved to cloud, and more with projects that are born in the cloud and how to optimize.  What I’m talking about here is applicable for the “lift-and-shift” type of project.

Ultimately the “cloud” is just like any other development projects, there are considerations that need to be handled as part of leveraging the environment to the best possible outcome.  So there are things you can do to help make your applications perform to their peak in the cloud.

In the traditional “Monolithic” approach to designing applications, we would work ourselves into a corner or more less.  And what I mean by that is we would build out applications to consume servers and predetermined resources, and that meant that if you wanted to take that application and sell it, traditionally you were looking at a large capital expense.  More than that also, if you wanted to increase scale, guess what…another capital expense, and this time with all the time required for a corporate purchase of that size.

Distributed Computing attempts to solve that problem, by enabling us to take that monolithic application and break it into the smallest parts we can.  And then making each of those parts independently scalable, to meet need.  So instead of one big app, we have a “web” of smaller pieces doing different jobs, and the total is more than the sum of its parts.

The value add here, is by leveraging smaller more isolated components, we can really focus on what does the best job.  For example, you might have a dotnet application, but if Python is the best fit for a microservice, why would you handicap yourself and not use the best tool for the job.  Microservices allow you to do that.

Let’s start with things you should keep in mind when building distributed applications within your application:

  • Loosely-Coupled Components:  For a distributed solution to truly work, all the pieces need to be loosely coupled.  And this takes the form of creating “buffers” between these services.  These “buffers” normally take the form of messaging between services, you could use a service bus, or even just a queue to communicate between services.  But the idea being that the services don’t know anything about each other, they just know that one adds an item to a queue, and the other removes the item from the queue.  This allows them to function independently and allows for the value add of creating ability to deploy these components separately.
  • Handle Communication Appropriately:  Given what we just talked about, how your application is a series of interconnected smaller apps, its important take some time and think about how you will pass information back-and-forth between applications.  Given that the application components are subject to change (platform, technology, end points, networking, etc).  You need to remember that you need an abstraction layer between the different micro services to make sure that they can be separated in a way to provide the best overall support for keeping these as separately deploy-able pieces.
  • Build with Monitoring in Mind:  Also, given that your application is really made of all these separate parts, its important to remember that for your application to work, every component must be functioning properly.  Just like an Olympic team can’t play unless every player is operating at the top of their game, your app can’t work if a component is unhealthy.  So when you set out to build micro service applications, make sure that you build and architecture your code with the logic in mind.
  • Build with Scale in Mind:  Given that your application is being built this way to encourage scaling, its important that you build your app in such as way that it can scale to meet the demands users are putting on the system.  Part of this comes down to making sure that your leveraging resources appropriately and not building systems that over (or under) consume the resources that you are using.
  • Build with Errors in mind:  Another item to consider is that your application is now a sum of “moving parts” and that being said, sometimes things can have errors or breakdowns that need to be handled.  These can be unplanned (exception or errors) or planned (upgrade of a service).  So your application should be able to respond to these “transient” faults, and not break down.  For example, one way is to leverage queues.  If component “A” is talking directly to component “B”, and “B” is in the middle of an upgrade, “A” might start throwing errors which will bubble up to the user, during the upgrade.  So now I need to notify users of downtime for the smallest change.  If I put a queue between them, then component “A” can continue to add elements to the queue, and while “B” is updating no errors occur.  When “B” finishes upgrading, it just starts pulling items off the queue as if nothing happened.  This is even less of a problem when you have scaling built into your app.

So the question is how do you do that?  There are a lot of ways to approach this particular problem, and ways to ensure that your app respects these.

I recommend the following steps:

  • Leverage a micro service approach:  There are a lot of articles out there (and some linked to below) that will talk you through how to build a micro service application, this can leverage a lot of technologies including “Service Fabric”, “Kubernetes”, “Docker Swarm”, and others to push your applications out with containers to support this approach.  You don’t have to use containers to do micro services, but they definitely help.
  • Always consider the best tool for the job:  One of the biggest benefits of micro services are the ability that you can leverage different stacks to solve different problems.  Don’t ignore this.
  • Leverage abstraction in communication between services:  As I mentioned above, this is paramount.  You must account for communication using a communication strategy, and a lot of times it helps to be consistent in how you approach this across your apps.  It will make your life simpler in the long run.
  • Make your services backward compatible:  As mentioned above, the benefits of abstraction are that I can push updates to individual components any time.  But to truly take advantage of this, I need to make my services backwards compatible.  Take my example above, imagine that service “A” writes to a queue, and then service “B” reads from that queue to do processing.  Now if Service “B” has scaled out to 10 nodes, and I try to update service “B”, I don’t want to shut them all down at once and take that part of the app offline, instead I want to do a rolling update.  So the idea being, that while service “B.1” is down, B.2-B.10 are still processing messages.  But in order to do that, I have to be careful how I change the signature of the queue and the changes to the database.  If i change the underlying database for service “B”, the database of service “B.2” has to be able to talk to it, even though its running old code.
  • Assume everything can change:  This is the best advice I can give, assume that anything can change around each Micro service that you build and by default you will be able to gracefully handle “transient faults” or “schema changes” without having to debug huge problems.
  • Leverage configuration management:  This is sort of a “1A” to the above entry, if you leverage configuration management, using services like redis, table storage, key vault or other platforms, you can make changes without requiring a redeploy of the application.  This makes your life much easier when you deploy a service and can change the configuration of other services.
  • There is no need to re-invent the wheel from an architectural standpoint, especially when you are learning.  If this is your first distributed project, lean on the known patterns and then take risks when you know more.

I hope this helps, as you start down this road.  What I will tell you, is that despite the changes I’m reviewing here.  Below are some addition links to help with this discussion:

Cloud Design Patterns :  This site provides you with detailed write-ups of some of the most common architecture patterns out there for Cloud.  These are especially helpful if you are currently more accustom to working an on-premise world, as they give a view of some of the considerations you should keep in mind.  I also like this because it pulls in concepts on things you might not be fully used to supporting (High availability vs disaster recovery, for example).

Architecture Center : Another great site, that contains just general architecture guidance for cloud or on-premise.  Its just a helpful site that lays out the pros / cons of common patterns so that you can design solutions appropriately to meet needs.

Architecting Distributed Applications:  A great online course that will walk you through what it means to build a truly distributed application, and this course is technology agnostic which is always a good thing.

Building Distributed Applications with Akka.net:  A great video with an overview of Akka.net which is a technology to help create applications using an Actor pattern.

Distributed Architecture with Microservices / Messaging:  A great video on Microservices which are the corner stone of distributed computing.

Rethinking Distributed Systems for Data Centers:  Another great article on how to build applications in a distributed world to accommodate a varying degree of scale.