
Errors and Punishment – The Art of Debugging

So recently I have been blessed enough to talk with several people who are new to the software development field, and to do some mentoring. And honestly, I'm the lucky one here, as there are few things better than meeting people who are new to this industry and engaging with their ideas. If it isn't something you do regularly, you should start.

But one of the things that has become very apparent to me is just how little time is spent actually teaching people how to debug. I saw this when I was teaching: there's a tendency in academia to show students how to code, and when they run into errors, to show them how to fix them. At its core that sounds like "Yes, Kevin, that's what teachers do…", but I would argue it is a fundamentally flawed approach. Reading error messages and fixing things that are broken is a pretty large part of being a developer, and by giving junior developers the answer, we are doing the proverbial "giving them a fish, rather than teaching them to fish."

To that end, I wanted to at least start a conversation about a mindset for debugging, and how to figure out what to do when you encounter an error. Obviously I can't cover everything, but I wanted to give some key tips on how to approach debugging when you have an error message.

Honestly, debugging is a lot like a police procedural, and it’s a good way to remember the steps, so hang with me through the metaphor.

Tip #1 – Start at the Scene of the Crime – The Error Message

Let's be honest: this sounds basic, but you would be surprised how often even senior devs skip it. Take the time to stop and really read the error message. What I mean by that is do the following:

  • What does the error message tell you?
  • Can you find where the error is occurring?
  • Is there a StackTrace?
  • What component or microservice is throwing the error?
  • What is the error type?

Looking at an error message is not just reading the words of the error; there are usually other clues that can help you solve the mystery. Things such as the exception type, or a stack trace that points to the exact line of code, are going to be critical.

Honestly, most people just read the words and then start making assumptions about where an error occurred. And this can be dangerous right out of the gate.
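To make that concrete, here is a hypothetical Python snippet and the traceback it produces (the file and function names are made up for illustration), along with the clues worth pulling out of it:

```python
# Hypothetical example: a small script that fails, and how to read its traceback.
def load_config(path):
    with open(path) as f:      # <-- the exact line the stack trace will point at
        return f.read()

load_config("settings.json")

# Traceback (most recent call last):
#   File "app.py", line 6, in <module>
#     load_config("settings.json")
#   File "app.py", line 3, in load_config
#     with open(path) as f:
# FileNotFoundError: [Errno 2] No such file or directory: 'settings.json'
#
# Clues to pull out:
#   * Error type:   FileNotFoundError (not a syntax error, not a network error)
#   * Exact line:   app.py, line 3, inside load_config
#   * The message:  it even names the missing file, 'settings.json'
```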

Tip #2 – Look for Witnesses – Digging through logs

Now, in my experience an error message is only one piece of the puzzle, and the next step is to look for more information. If you think about a police procedural on TV, they start at the crime scene, but what do they do next? Talk to witnesses!

Now, in terms of debugging we have the added benefit of being able to refer to logs. Most applications have some form of logging, even if it’s just outputting messages to a console window, and that information can be very valuable in determining an error message’s meaning.

Start looking for logs that were captured around the same time, specifically looking for:

  • What was occurring right before the error?
  • What data was being moved through the solution?
  • What was the request volume that the system was handling?
  • Were there any other errors around the same time?

Any information you can find in the logs is critical to identifying and fixing the issue.
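As a minimal sketch of that search (assuming plain-text log files that start each line with an ISO-style timestamp, which is an assumption about your setup), you can pull the window around the failure with a few lines of Python:

```python
from datetime import datetime, timedelta

# Hypothetical example: print every log line within 5 minutes of the error.
ERROR_TIME = datetime.fromisoformat("2024-03-01T14:22:07")
WINDOW = timedelta(minutes=5)

with open("app.log") as f:
    for line in f:
        # Assumes each line starts with a timestamp like "2024-03-01T14:20:01 ..."
        try:
            stamp = datetime.fromisoformat(line[:19])
        except ValueError:
            continue  # skip lines that don't start with a timestamp
        if abs(stamp - ERROR_TIME) <= WINDOW:
            print(line.rstrip())
```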

Tip #3 – Deal only in facts

Now this next one is absolutely critical, and all too commonly overlooked. Many developers will start making assumptions at this point, immediately announce "I know what it is," and start changing things. Resist this urge, no matter what.

Now, I'm not going to lie, some errors are easy, and with a little bit of searching it becomes really easy to see the cause and address it; if you are 100% sure, go ahead. But from the TV-procedural perspective, this is the difference between the rookie and the veteran. If you are new to this field, resist the urge to jump to an answer and deal only in facts.

What I mean by this is: don't let jumping to conclusions cloud the story you are building of what occurred and why.

Tip #4 – Keep a running log of findings and things you tried

This is something I started doing, and it pays dividends. Just like the cops in a police procedural open a case file as soon as they capture their original findings, you should too. Keep a running document, in Word or, as I do, in OneNote. I copy all of my findings into that document:

  • Error Messages
  • Relevant Logs
  • Configuration Information
  • Dates / times of the errors occurring
  • Links to documentation

Anything I find goes in, and I keep appending new information to the document as I find it.

Tip #5 – Look for changes

The other key piece of evidence most people overlook is the obvious question of "What changed?" Code is static and does not degrade on its own over time. If it was working before and isn't anymore, something changed. Look for what might have changed in the solution:

  • Was code updated?
  • Were packages or libraries updated?
  • Was a dependency updated?
  • Was there a hardware change?

All of this is valuable evidence for finding your cause.

Tip #6 – Check documentation

A good next step is to check any documentation, and what I mean by this is look to any reference material that could explain to you how the code is supposed to work. This can include the following:

  • Documentation on libraries and packages
  • ReadMe / GitHub issues / System Docs
  • Code Comments

Anything that helps you better understand how the code is supposed to behave is worth a look.

Tip #7 – Trust Nothing – Especially your own code

At this stage, again, people like to make assumptions, and I can't tell you the number of times I have done this personally: you stare at the code and say it doesn't make sense. "I know X, Y, and Z are correct, so why is it failing?" Only to find out one of your assumptions about X, Y, or Z was false. You need to throw all assumptions out the window and, if necessary, go and manually verify everything you can. This will help you identify the underlying problem in the end.

This is also the stage where I see the other common mistake: letting ego into the debugging. Many developers will look at the code they've built and trust it because they built it. But this bias is usually the most damaging to your investigation.

Similar to the running joke of "The husband always did it…", I recommend adopting the philosophy of "guilty until proven innocent" when it comes to any code you write. Assume that something in your code is broken, and until you can prove otherwise, don't start looking elsewhere. This will help in the long run.

Let me give an example. Say I am building code that hits an API; I write my code, it looks good to me, I run it, and I get back a 404 "not found" error. I've all too often seen devs who would then ping the API team to see if their service is down, or networking to see if something is blocking the traffic, all before even checking, "Did I get the endpoint right?"

Doing this makes you look foolish and wastes people's time. It's better to verify that your code is working properly first; that will empower you to have the conversation with networking like this:

You: “I think it’s a networking issue.”

Network Engineer: “Why do you think that?”

You: “I’ve done the following to rule out anything else…so I think it could be ________________”
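Before you ever get to that conversation, the sanity check itself takes about thirty seconds. Here's a minimal sketch using the requests library and a made-up endpoint (both the base URL and the path are placeholders):

```python
import requests

# Hypothetical endpoint -- print exactly what we are calling before blaming anyone else.
base_url = "https://api.example.com"
endpoint = "/v1/customers/12345"   # double-check this against the API docs

url = base_url + endpoint
print(f"Calling: {url}")

resp = requests.get(url, timeout=10)
print(resp.status_code)            # a 404 here usually means *my* path is wrong,
print(resp.text[:500])             # and the response body often says so explicitly
```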

Tip #8 – Try to reproduce in isolation / Don’t Make it a hatchet job!

If you get stuck at this point, a good trick I find is to try to reproduce the error in isolation. Especially when you are looking at a microservices architecture, there can be a lot of moving parts, and it can be helpful to try to recreate an error away from the existing code base by isolating components. This makes it easier to gather evidence, and, not unlike a police procedural where they re-enact the events of a theory, it can be a great way to isolate a problem.
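For instance, pull the exact input you found in the logs and run only the suspect piece against it. This is a made-up component and payload, just to show the shape of the exercise:

```python
import json

# Hypothetical: the exact payload that triggered the error, copied from the logs.
captured_payload = json.loads("""
{
  "orderId": "A-1021",
  "items": [],
  "total": null
}
""")

# Stand-in for the one suspect component (in real life you would import it from your codebase).
def calculate_order_total(order):
    # the bug: assumes 'total' is always a number
    return round(order["total"], 2)

# Run just that piece, with just that input -- no services, no queues, no network.
calculate_order_total(captured_payload)   # raises TypeError: the failure, reproduced in isolation
```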

The one thing to try really hard to avoid is taking a hatchet to the code. All too many times I've seen people fall into this pattern when trying to solve a problem:

  • I’m going to try this…
  • Run Code
  • Still Broken…
  • Change this…
  • Run Code
  • Still Broken…

You are actually making your life harder by not being methodical. Now, I'm not saying don't try things, but be more deliberate, and make sure you take time to record your thoughts and attempts in your running log. This is critical to keeping things logical and methodical and not spinning your wheels.

Tip #9 – When you find the answer, write it down

When you finally find the answer, there is a tendency to celebrate, push that commit, cut that PR, and be done. But really you're not doing yourself any favors if you stop there. I find it helpful to take the time to answer the following:

  • Do I fully understand why this occurred?
  • Can I document and explain this?
  • Am I convinced this is the best fix for this problem?

Really you want to make sure you have a full understanding and complete your running log by documenting the findings so that you can refer to them in the future.

Tip #10 – Make it easier and test in the future

The other thing that is largely overlooked, and skipped because of the "fix celebration," is the debrief on the issue. All too often we stop and assume that we are done because we made the fix. But really we should be looking at the following:

  • Is there an automated way I can test for this bug?
  • How will I monitor to make sure my fix worked?
  • Does this hot fix require further work down the line?
  • Does this fix introduce any technical debt?
  • What can I do to make this type of error easier to debug in the future?
  • What parts of the debug and testing cycle made it hard to identify this error?
  • What could I have done differently to make this go faster?
  • What does this experience teach me?

These kinds of questions are critical to ongoing success in your software development career and the health of your project longer term.
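On the first question, the answer is often a small regression test pinned to the exact input that broke. A hypothetical sketch (the function and payload are made up, and would normally be imported from your codebase):

```python
import unittest

# Stand-in for the code that was fixed; in real life this would be imported.
def calculate_order_total(order):
    total = order.get("total")
    return round(total, 2) if total is not None else 0.0

class TestOrderTotalRegression(unittest.TestCase):
    def test_missing_total_does_not_crash(self):
        # The exact payload shape that caused the original bug.
        payload = {"orderId": "A-1021", "items": [], "total": None}
        self.assertEqual(calculate_order_total(payload), 0.0)

if __name__ == "__main__":
    unittest.main()
```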

I hope you found these 10 tips helpful!

Embracing the Chaos

So I've done quite a few posts recently about resiliency. And it's a topic that is more and more important to everyone building solutions in the cloud.

The new buzzword that's found its way onto the scene is chaos engineering. Really, this is a practice of building solutions that are more resilient, that can survive the faults and issues that arise, and that ensure the best possible delivery of those solutions to end customers. The simple fact is that software solutions are absolutely critical to every element of most operations, and having them go down can ultimately break a whole business if this is not handled properly.

At its core, Chaos engineering is about pessimism :). Things are going to fail.

Sort of like every other movement, such as Agile and DevOps, chaos engineering embraces a reality. In this case, that reality is that failures will happen and should be expected. The goal is that you assume there will be failures and architect to support resiliency.

So what does that actually mean? It means that you determine the strength of the application by running controlled experiments designed to inject faults into your applications and observing the impact. The intention is that the application grows stronger and becomes able to handle faults and issues while maintaining the highest resiliency possible.
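As a toy illustration of such a controlled experiment (the names and failure rate are made up, and real chaos tooling is far more sophisticated), you can wrap a dependency call so it fails on purpose some fraction of the time, and then watch how the rest of the code copes:

```python
import random

# Hypothetical dependency call.
def get_recommendations(user_id):
    return ["itemA", "itemB"]

# Fault-injection wrapper: fail some percentage of calls on purpose.
def with_chaos(func, failure_rate=0.2):
    def wrapper(*args, **kwargs):
        if random.random() < failure_rate:
            raise ConnectionError("injected fault: dependency unavailable")
        return func(*args, **kwargs)
    return wrapper

get_recommendations = with_chaos(get_recommendations)

# The experiment: does the application degrade gracefully, or fall over?
for user_id in range(10):
    try:
        print(get_recommendations(user_id))
    except ConnectionError:
        print("fallback: showing generic recommendations")  # graceful degradation
```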

How is this something new?

Now a lot of people will read the above, and say that “chaos engineering” is just the latest buzz word to cover something everyone’s doing. And there is an element of truth to that, but the details are what matters.

And what I mean by that is that there is a defined approach to doing this, and doing it in a productive manner, much like Agile and DevOps. In my experience, some teams are probably doing elements of this already, but by putting a name and methodology to it, we call attention to the practice for those who aren't, and provide a guide of sorts for how to approach the problem.

There are several key elements that you should keep in mind as you find ways to grow your solution by going down this path.

  • Embrace the idea that failures happen.
  • Find ways to be proactive about failures.
  • Embrace monitoring and visibility

Sort of like how Agile embraced the reality that "requirements change," and DevOps embraced that "all code must be deployed," chaos engineering embraces that the application will experience failures. This is a fact. We need to assume that any dependency can break, and that components will fail or be unavailable. So what do we mean, at a high level, by each of these?

Embrace the idea…failure happens

The idea is that elements of your solution will fail, and we know this will happen. Servers go down, service interruptions occur, and, to steal a quote from Batman Begins, "Sometimes things just go bad."

I was once in a situation where an entire network connection was taken down by a squirrel.

So we should build our code and applications in a way that accepts that failures will eventually occur, and build resiliency into our applications to accommodate that. You can't solve a problem until you know there is one.

How do we do that at a code level? Really this comes down to looking at your application, or microservice, and doing a failure mode analysis: taking an objective look at your code and asking key questions:

  • What is required to run this code?
  • What kind of SLA is offered for that service?
  • What dependencies does the service call?
  • What happens if a dependency call fails?

That analysis will help to inform how you handle those faults.

Find ways to be proactive about failure

In a lot of ways, this comes down to leveraging patterns and practices to ensure resiliency.

After you’ve done that failure mode analysis, you need to figure out what happens when those failures occur:

  • Can we implement patterns like circuit breaker, retry logic, load leveling, and libraries like Polly?
  • Can we implement multi-zone, multi-region, cluster based solutions to lower the probability of a fault?

Also at this stage, you can start thinking about how you would classify a failure. Some failures are transient, others are more severe. And you want to make sure you respond appropriately to each.

For example, a momentary network outage is very different from a database being down for an extended period. So another key element to consider is how long the fault lasts.
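To make that concrete, here is a minimal retry-with-backoff sketch in Python (in the .NET world you would typically reach for a library like Polly rather than rolling your own); it retries only the faults you have classified as transient and fails fast on everything else:

```python
import time

# Faults we classify as transient (worth retrying); anything else fails fast.
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)

def call_with_retry(operation, attempts=3, base_delay=0.5):
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TRANSIENT_ERRORS:
            if attempt == attempts:
                raise                                        # out of retries: escalate
            time.sleep(base_delay * (2 ** (attempt - 1)))    # exponential backoff
        # anything not in TRANSIENT_ERRORS propagates immediately -- no retry

# Hypothetical flaky dependency, for illustration only.
calls = {"count": 0}
def flaky_dependency():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("transient blip")
    return "ok"

print(call_with_retry(flaky_dependency))   # retries twice, then succeeds
```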

Embrace Monitoring and Visibility

Now, based on the above, the next question is: how do I even know this is happening? With microservice architectures, applications are becoming more and more decentralized, which means there are more moving parts that require monitoring.

So for me, the best next step is to go over all the failures and identify how you will monitor and alert for those events, and what your mitigations are. Say, for example, you want to do a manual failover for your database; you need to determine how long a dependency service can return failures before you are notified to perform the failover.

Or how long does something have to be down before an alert is sent? And how do you log these events so that your engineers have visibility into the behavior? Sending an alert after a threshold does no one any good if they can't see when the behavior started to happen.
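Here is a minimal sketch of that idea (the threshold and the alert mechanism are placeholders; in practice this logic usually lives in your monitoring platform rather than hand-rolled code): every failure is logged so the start of the behavior is visible, but an alert only fires after a run of consecutive failures.

```python
import logging

logging.basicConfig(level=logging.INFO)

FAILURE_THRESHOLD = 5
consecutive_failures = 0

def send_alert(message):
    # Placeholder -- in reality this would page someone or call an alerting API.
    logging.error("ALERT: %s", message)

def record_health_check(success: bool):
    global consecutive_failures
    if success:
        consecutive_failures = 0
        return
    consecutive_failures += 1
    logging.warning("health check failed (%d in a row)", consecutive_failures)
    if consecutive_failures == FAILURE_THRESHOLD:
        send_alert(f"dependency failing for {FAILURE_THRESHOLD} consecutive checks")

# Simulated run: one success, then a string of failures.
for result in [True, False, False, False, False, False, False]:
    record_health_check(result)
```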

Personally, I'm a fan of the concept here, as it calls out a very important practice that I find gets overlooked more often than not.

Distributed Computing and Architecture Patterns

So lately I've been doing a lot of work on distributed programming, looking less at projects that live on-premises and need to be moved to the cloud, and more at projects that are born in the cloud and how to optimize them. That said, what I'm talking about here is also applicable to the "lift-and-shift" type of project.

Ultimately the "cloud" is just like any other development project: there are considerations that need to be handled to leverage the environment for the best possible outcome. So there are things you can do to help your applications perform at their peak in the cloud.

In the traditional "monolithic" approach to designing applications, we would more or less work ourselves into a corner. What I mean by that is we would build applications to consume servers and predetermined resources, and that meant that if you wanted to take that application and sell it, traditionally you were looking at a large capital expense. More than that, if you wanted to increase scale, guess what… another capital expense, and this time with all the time required for a corporate purchase of that size.

Distributed computing attempts to solve that problem by enabling us to take that monolithic application and break it into the smallest parts we can, and then making each of those parts independently scalable to meet demand. So instead of one big app, we have a "web" of smaller pieces doing different jobs, and the total is more than the sum of its parts.

The value add here is that by leveraging smaller, more isolated components, we can really focus on the best tool for each job. For example, you might have a dotnet application, but if Python is the best fit for a microservice, why would you handicap yourself and not use the best tool for the job? Microservices allow you to do that.

Let's start with things you should keep in mind when building distributed applications:

  • Loosely-Coupled Components: For a distributed solution to truly work, all the pieces need to be loosely coupled. This takes the form of creating "buffers" between services, and these "buffers" normally take the form of messaging: you could use a service bus, or even just a queue, to communicate between services. The idea is that the services don't know anything about each other; they just know that one adds an item to a queue and the other removes the item from the queue. This allows them to function independently and creates the ability to deploy these components separately.
  • Handle Communication Appropriately: Given what we just talked about, that your application is a series of interconnected smaller apps, it's important to take some time and think about how you will pass information back and forth between them. Given that the application components are subject to change (platform, technology, endpoints, networking, etc.), you need an abstraction layer between the different microservices to make sure they can be separated in a way that keeps them separately deployable.
  • Build with Monitoring in Mind: Given that your application is really made of all these separate parts, it's important to remember that for your application to work, every component must be functioning properly. Just like an Olympic team can't play unless every player is operating at the top of their game, your app can't work if a component is unhealthy. So when you set out to build microservice applications, make sure that you build and architect your code with monitoring in mind.
  • Build with Scale in Mind: Given that your application is being built this way to encourage scaling, it's important that you build your app in such a way that it can scale to meet the demands users put on the system. Part of this comes down to making sure you're leveraging resources appropriately and not building systems that over- (or under-) consume the resources you are using.
  • Build with Errors in Mind: Another item to consider is that your application is now a sum of "moving parts," and sometimes those parts will have errors or breakdowns that need to be handled. These can be unplanned (exceptions or errors) or planned (an upgrade of a service). Your application should be able to respond to these "transient" faults and not break down. For example, one way is to leverage queues. If component "A" is talking directly to component "B", and "B" is in the middle of an upgrade, "A" might start throwing errors that bubble up to the user during the upgrade, so now I need to notify users of downtime for the smallest change. If I put a queue between them, then component "A" can continue to add elements to the queue, and while "B" is updating, no errors occur. When "B" finishes upgrading, it just starts pulling items off the queue as if nothing happened. This is even less of a problem when you have scaling built into your app (see the sketch right after this list).
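Here is a minimal sketch of that queue-based decoupling using Python's standard library (a real system would use a service bus or message broker, but the shape is the same): the producer never talks to the consumer directly, so the consumer can be unavailable for a while without the producer ever noticing.

```python
import queue
import threading
import time

work_queue = queue.Queue()   # stand-in for a service bus / message queue

def component_a(n_messages=5):
    # Component "A" only knows how to put messages on the queue.
    for i in range(n_messages):
        work_queue.put({"orderId": i})
        print(f"A: queued order {i}")
        time.sleep(0.1)

def component_b():
    # Component "B" only knows how to take messages off the queue.
    # If B is down (e.g. mid-upgrade), messages simply wait; A is unaffected.
    time.sleep(0.3)   # simulate B being unavailable while A keeps producing
    while True:
        try:
            msg = work_queue.get(timeout=1)
        except queue.Empty:
            break
        print(f"B: processed order {msg['orderId']}")

threading.Thread(target=component_a).start()
component_b()
```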

So the question is, how do you do that? There are a lot of ways to approach this particular problem and to make sure your app respects these principles.

I recommend the following steps:

  • Leverage a microservice approach: There are a lot of articles out there (and some linked below) that will walk you through how to build a microservice application. This can leverage a lot of technologies, including Service Fabric, Kubernetes, Docker Swarm, and others, to push your applications out with containers to support this approach. You don't have to use containers to do microservices, but they definitely help.
  • Always consider the best tool for the job: One of the biggest benefits of microservices is the ability to leverage different stacks to solve different problems. Don't ignore this.
  • Leverage abstraction in communication between services: As I mentioned above, this is paramount. You must have a deliberate communication strategy, and a lot of times it helps to be consistent in how you approach this across your apps. It will make your life simpler in the long run.
  • Make your services backward compatible: As mentioned above, the benefit of abstraction is that I can push updates to individual components at any time. But to truly take advantage of this, I need to make my services backward compatible. Take my example above: imagine that service "A" writes to a queue, and then service "B" reads from that queue to do processing. Now if service "B" has scaled out to 10 nodes and I try to update it, I don't want to shut them all down at once and take that part of the app offline; instead I want to do a rolling update. The idea is that while service "B.1" is down, B.2-B.10 are still processing messages. But in order to do that, I have to be careful how I change the message signature on the queue and any changes to the database. If I change the underlying database for service "B", node "B.2" has to be able to talk to it, even though it's running old code (see the sketch after this list).
  • Assume everything can change: This is the best advice I can give: assume that anything can change around each microservice that you build, and by default you will be able to gracefully handle "transient faults" or "schema changes" without having to debug huge problems.
  • Leverage configuration management: This is sort of a "1A" to the above entry. If you leverage configuration management, using services like Redis, table storage, Key Vault, or other platforms, you can make changes without requiring a redeploy of the application. This makes your life much easier when you deploy a service and need to change the configuration of other services.
  • There is no need to re-invent the wheel from an architectural standpoint, especially when you are learning.  If this is your first distributed project, lean on the known patterns and then take risks when you know more.
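On the backward-compatibility point above, here is a minimal sketch of a tolerant consumer (the field names are made up): it handles both the old and the new message shape, so old and new nodes can run side by side during a rolling update.

```python
# Hypothetical message shapes: the old producer never sent "currency".
old_message = {"orderId": 7, "amount": 19.99}
new_message = {"orderId": 8, "amount": 19.99, "currency": "USD"}

def process_order(message):
    # Tolerate the old shape: default anything the old producer does not send.
    currency = message.get("currency", "USD")
    print(f"order {message['orderId']}: {message['amount']} {currency}")

# Either shape can be processed, so nodes on old and new code can coexist.
process_order(old_message)
process_order(new_message)
```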

I hope this helps as you start down this road. Below are some additional links to help with this discussion:

Cloud Design Patterns: This site provides detailed write-ups of some of the most common architecture patterns out there for the cloud. These are especially helpful if you are currently more accustomed to working in an on-premises world, as they give a view of some of the considerations you should keep in mind. I also like this because it pulls in concepts you might not be fully used to supporting (high availability vs. disaster recovery, for example).

Architecture Center: Another great site that contains general architecture guidance for cloud or on-premises. It's a helpful site that lays out the pros and cons of common patterns so that you can design solutions appropriately to meet needs.

Architecting Distributed Applications:  A great online course that will walk you through what it means to build a truly distributed application, and this course is technology agnostic which is always a good thing.

Building Distributed Applications with Akka.net:  A great video with an overview of Akka.net which is a technology to help create applications using an Actor pattern.

Distributed Architecture with Microservices / Messaging: A great video on microservices, which are the cornerstone of distributed computing.

Rethinking Distributed Systems for Data Centers:  Another great article on how to build applications in a distributed world to accommodate a varying degree of scale.