How to pull blobs out of Archive Storage

So if you’re building a modern application, you have plenty of options for storing data, whether that’s a traditional relational database (SQL Server, MySQL, etc.), a NoSQL store (MongoDB, Cosmos DB, etc.), or plain blob storage. Of these, blob storage is by far the cheapest, making it a very low-cost option for storing data long term.

The best way to get the most value out of blob storage, though, is to use the different access tiers to your benefit. With a tiering strategy for your data, you can pay significantly less to store it for the long term. You can find the pricing for Azure Blob Storage here.

Now, most people are hesitant to use the archive tier because the idea of having to wait for data to be rehydrated tends to scare them off. But it’s been my experience that most data used for business operations has a shelf life, and archiving that data is definitely a viable option, especially for data that is rarely accessed. I’d challenge anyone storing blobs to capture metrics on how often their older data is actually read. When you weigh the need to wait for retrieval against the cost savings of the archive tier, in my experience the balance leans heavily toward archiving.

How do you move data to archive storage?

When storing data in Azure Blob Storage, uploading a blob is fairly straightforward, and all it takes is setting the access tier to “Archive” to move the data into the archive tier.

The code below uploads a local file to blob storage and then moves it to the archive tier:

var accountClient = new BlobServiceClient(connectionString);

var containerClient = accountClient.GetBlobContainerClient(containerName);

// Get a reference to a blob
BlobClient blobClient = containerClient.GetBlobClient(blobName);

Console.WriteLine("Uploading to Blob storage as blob:\n\t {0}\n", blobClient.Uri);

// Open the local file and upload its data
using (FileStream uploadFileStream = File.OpenRead(localFilePath))
{
    blobClient.Upload(uploadFileStream, overwrite: true);
}

Console.WriteLine("Setting Blob to Archive");

// Move the blob to the archive tier
blobClient.SetAccessTier(AccessTier.Archive);
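
Depending on your SDK version, you may also be able to set the tier as part of the upload itself rather than in a second call. The following is only a minimal sketch, assuming the Azure.Storage.Blobs v12 package where BlobUploadOptions exposes an AccessTier property, and reusing the connectionString, containerName, blobName, and localFilePath variables from the snippet above:

// Sketch: upload straight into the Archive tier in one call
// (assumes Azure.Storage.Blobs v12, where BlobUploadOptions exposes AccessTier).
using System.IO;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

var blobClient = new BlobServiceClient(connectionString)
    .GetBlobContainerClient(containerName)
    .GetBlobClient(blobName);

using (FileStream stream = File.OpenRead(localFilePath))
{
    blobClient.Upload(stream, new BlobUploadOptions
    {
        AccessTier = AccessTier.Archive   // blob lands directly in Archive; no separate SetAccessTier call
    });
}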

How to re-hydrate a blob in archive storage?

There are two ways of re-hydrating blobs:

  1. Copy the archived blob to a new blob in the Hot or Cool tier (a copy-based sketch appears a little further below)
  2. Set the blob’s access tier to Hot or Cool

It really is that simple, and the second option can be done using the following code:

var accountClient = new BlobServiceClient(connectionString);

var containerClient = accountClient.GetBlobContainerClient(containerName);

// Get a reference to the archived blob
BlobClient blobClient = containerClient.GetBlobClient(blobName);

// Request rehydration by changing the tier
blobClient.SetAccessTier(AccessTier.Hot);

Setting the tier starts the rehydration process automatically. You then need to monitor the blob’s properties to see when rehydration has finished.
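
For completeness, the first option (rehydrating by copying the archived blob to a new blob in an online tier) can look roughly like the sketch below. The “-rehydrated” destination name and the RehydratePriority value are illustrative assumptions, and it relies on the v12 SDK’s BlobCopyFromUriOptions exposing AccessTier and RehydratePriority:

// Sketch: rehydrate by copying the archived blob to a new Hot-tier blob in the same container.
// The "-rehydrated" suffix is just an illustrative destination name.
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

var containerClient = new BlobServiceClient(connectionString)
    .GetBlobContainerClient(containerName);

BlobClient archivedBlob = containerClient.GetBlobClient(blobName);
BlobClient rehydratedBlob = containerClient.GetBlobClient(blobName + "-rehydrated");

rehydratedBlob.StartCopyFromUri(archivedBlob.Uri, new BlobCopyFromUriOptions
{
    AccessTier = AccessTier.Hot,                       // tier of the destination copy
    RehydratePriority = RehydratePriority.Standard     // or High for faster (and more expensive) rehydration
});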

Monitoring the re-hydration of a blob

One easy pattern for monitoring blobs as they are rehydrated is to use a storage queue and an Azure Function to check on the blob during the process. I did this by implementing the following.

For the message model, I used the following to track the hydration process:

public class BlobHydrateModel
{
    public string BlobName { get; set; }
    public string ContainerName { get; set; }
    public DateTime HydrateRequestDateTime { get; set; }
    public DateTime? HydratedFileDateTime { get; set; }
}

And then implemented the following code to handle the re-hydration process:

public class BlobRehydrationProvider
{
    private readonly string _connectionString;

    public BlobRehydrationProvider(string connectionString)
    {
        _connectionString = connectionString;
    }

    public void RehydrateBlob(string containerName, string blobName, string queueName)
    {
        var accountClient = new BlobServiceClient(_connectionString);
        var containerClient = accountClient.GetBlobContainerClient(containerName);

        // Get a reference to the archived blob and request rehydration
        BlobClient blobClient = containerClient.GetBlobClient(blobName);
        blobClient.SetAccessTier(AccessTier.Hot);

        // Record the request so the monitoring function knows what to check
        var model = new BlobHydrateModel
        {
            BlobName = blobName,
            ContainerName = containerName,
            HydrateRequestDateTime = DateTime.Now
        };

        // Queue a message (Base64-encoded so the queue-triggered function can read it)
        QueueClient queueClient = new QueueClient(_connectionString, queueName);
        var json = JsonConvert.SerializeObject(model);
        string message = Convert.ToBase64String(Encoding.UTF8.GetBytes(json));
        queueClient.SendMessage(message);
    }
}
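
Calling it is then just a matter of passing in the container, blob, and queue names. A minimal usage sketch, where the container and blob names are placeholders and the queue name matches the trigger used by the function further down:

// Placeholder values for illustration only
string connectionString = Environment.GetEnvironmentVariable("StorageConnectionString");

var provider = new BlobRehydrationProvider(connectionString);

// Kicks off rehydration and drops a tracking message on the monitoring queue
provider.RehydrateBlob("mycontainer", "archived-file.dat", "blobhydrationrequests");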

With the above code, setting the blob to Hot and queuing a message triggers an Azure Function, which then monitors the blob’s properties using the following:

[FunctionName("CheckBlobStatus")]
public static void Run([QueueTrigger("blobhydrationrequests", Connection = "StorageConnectionString")] string msg, ILogger log)
{
    var model = JsonConvert.DeserializeObject<BlobHydrateModel>(msg);

    var connectionString = Environment.GetEnvironmentVariable("StorageConnectionString");

    var accountClient = new BlobServiceClient(connectionString);
    var containerClient = accountClient.GetBlobContainerClient(model.ContainerName);
    BlobClient blobClient = containerClient.GetBlobClient(model.BlobName);

    log.LogInformation($"Checking Status of Blob: {model.BlobName} - Requested : {model.HydrateRequestDateTime}");

    var properties = blobClient.GetProperties();
    if (properties.Value.ArchiveStatus == "rehydrate-pending-to-hot")
    {
        // Not ready yet - requeue the message and check again in 5 minutes
        log.LogInformation($"File {model.BlobName} not hydrated yet, requeuing message");
        QueueClient queueClient = new QueueClient(connectionString, "blobhydrationrequests");
        string requeueMessage = Convert.ToBase64String(Encoding.UTF8.GetBytes(msg));
        queueClient.SendMessage(requeueMessage, visibilityTimeout: TimeSpan.FromMinutes(5));
    }
    else
    {
        log.LogInformation($"File {model.BlobName} hydrated successfully, sending response message.");
        // Trigger appropriate behavior
    }
}

By checking the ArchiveStatus property, we can tell when the blob has been rehydrated and then trigger the appropriate behavior to push that update back to your application.

Copying blobs between storage accounts / regions

A common question I get is about copying blobs. If you are working with Azure Blob Storage, it’s almost inevitable that at some point you will need to do a data copy, whether for a migration, a re-architecture, or any number of other reasons.

I’ve seen this done in all sorts of ways, so I’m going to talk through the options here and, ultimately, how best to execute a copy within Azure Blob Storage.

I want to start with the number one DO NOT DO option: building a utility that cycles through and copies blobs one by one. This is the least desirable way to move data, for a few reasons:

  • Speed – This will be a single-threaded, synchronous operation, so it will be slow.
  • Complexity – This feels counter-intuitive, but verifying that data copied correctly, building fault handling, and so on is not easy, and not something you want to take on when you don’t have to.
  • Chance of failure – Long-running processes are always problematic. They can fail, and when they do they can be difficult to recover from, so you are opening yourself up to potential problems.
  • Cost – At the end of the day, you are creating a long-running process that needs compute running 24/7 for an extended period. Compute in the cloud costs money, so this is an additional cost.

So the question is, if I shouldn’t build my own utility, how do we get this done? There are really two options that I’ve used successfully in the past:

  • AzCopy – This is the tried and true option. This utility provides an easy command line interface for kicking off copy jobs that can run either synchronously or asynchronously. Even in its synchronous mode, you will see higher throughput for the copy. This removes some of the issues above, but not all of them.
  • Copy API – A newer option: the REST API enables a server-side copy operation. This provides the best possible throughput and saves you from having to create a VM, since the copy runs asynchronously inside Azure. The API is easy to use and documentation can be found here; a rough sketch of using it from the .NET SDK follows this list.
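
To give a feel for the Copy API from the .NET side, here is a minimal sketch of a server-side copy between two storage accounts using the v12 SDK. The container and blob names (and the two connection string variables) are placeholders, and it assumes the source client is authorized with an account key so that GenerateSasUri can produce a read-only SAS for the source blob:

// Sketch: server-side copy of a blob between storage accounts using the Copy API.
// The data is copied inside Azure; it does not flow through the machine running this code.
using System;
using System.Threading;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;
using Azure.Storage.Sas;

var sourceAccount = new BlobServiceClient(sourceConnectionString);
var destinationAccount = new BlobServiceClient(destinationConnectionString);

// Placeholder container and blob names
BlobClient sourceBlob = sourceAccount.GetBlobContainerClient("source-container").GetBlobClient("myfile.dat");
BlobClient destinationBlob = destinationAccount.GetBlobContainerClient("dest-container").GetBlobClient("myfile.dat");

// The destination account needs read access to the source blob; a short-lived read-only SAS is one way
Uri sourceUri = sourceBlob.GenerateSasUri(BlobSasPermissions.Read, DateTimeOffset.UtcNow.AddHours(4));

// Kick off the asynchronous, server-side copy
destinationBlob.StartCopyFromUri(sourceUri);

// Poll the destination blob's properties until the copy completes
while (destinationBlob.GetProperties().Value.CopyStatus == CopyStatus.Pending)
{
    Thread.Sleep(TimeSpan.FromSeconds(5));
}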

Ultimately, there are lots of ways and scenarios in which you can leverage these tools to copy data. The other question I find usually comes up is: if I’m migrating a large volume of data, how do I do it while minimizing downtime?

The way I’ve accomplished this is to break the data down accordingly:

  • Sort the data by age oldest to newest.
  • Starting with the oldest blobs, break them down into the following chunks.
  • Move the first 50%
  • Move the next 30%
  • Move the next 10-15%
  • Take a downtime window to copy the last 5-10%

By doing so, you minimize your downtime window while maximizing the amount of data copied in the background. The above process only works if your newer data is accessed more often than your older data, but when that’s the case it’s a good way to move your blobs while keeping downtime to a minimum.