Azure Cloud Services stability via scheduled and auto-healing reboots

Why is rebooting Cloud Service Role instances important?

Azure Cloud Services is a classic Azure resource, originally introduced by Microsoft back in 2008. This technology was designed to support scalable web and worker role applications running on Windows. While Azure has been taking steps to move forward with its newer Azure VM Scale Sets, Classic Cloud Services remains a popular deployment choice for many legacy environments in Azure.

When running Cloud Services, Microsoft recommends at least 2 active instances in each role, so that if one instance is being rebooted or is malfunctioning, the other remains operational. With that redundancy in place, automatically rebooting instances becomes a vital strategy for healing memory leaks, connection leaks, and similar issues.

In this case study, we’ll discuss a few very simple approaches to keep your Azure Cloud Roles stable proactively and reactively. CloudMonix is a tool that helps to ensure stability for Azure Cloud Roles as well as for its successor Azure VM Scale Sets. We’ll also discuss methods to ensure that reboots do not cause issues or outages.

Proactive stability – DAILY SCHEDULED REBOOTS

Probably the simplest and most effective method to keep your applications stable is to reboot your cloud role instances on a regular basis. No matter how few memory leaks your application has, the longer your server runs without a reboot, the more issues and performance degradations creep in. Memory and disk fragmentation, poorly closed connections, obsolete data in a cache, large temporary folders, and of course memory leaks can cause your application’s performance to slowly degrade over time. Unfortunately, very few organizations reboot their Cloud Role instances on a regular basis, largely because it is not trivial to automate this without accidentally taking the whole role down.

CloudMonix has specific functionality that makes proactive daily rebooting of Cloud Role instances, one at a time, a simple checkbox click. Read below >

Reactive stability – REBOOTS ON DEMAND

Daily reboots are a great proactive measure for the stability of Azure cloud role instances.  But what happens when your application encounters critical issues throughout the day?  Severe memory leaks, queued up or “stuck” IIS requests, hung processes, etc. can all lead to major instability of the application at random times during the day.

CloudMonix allows for immediate and automatic recovery from such events via reactive reboots. Read below >

Gracefully handling reboots

Azure makes it relatively simple to keep web roles stable while individual instances are being rebooted. Some extra work may need to be done for worker roles. Read below for more information on how to handle instances rebooting.

How To Do Daily Reboots

CloudMonix offers a way to reboot instances one at a time, without impacting the overall stability of the role itself. The idea is to reboot every instance once per day, triggering the reboot of each instance at a different hour.

The “Daily Reboot” action is disabled by default. To enable it, go to the settings of the specific resource, open the “Actions” tab, and select the “Daily Reboot” action from the default list. Check the “Enabled” checkbox and adjust the parameters as applicable.

CloudMonix triggers the “Daily Reboot” action with a single-line expression:

CheckTimeUtc.Hour == (InstanceIndex % 24)

“CheckTimeUtc” is a variable that represents the current time in UTC. “InstanceIndex” is another variable that CloudMonix tracks when it evaluates the action against each Azure cloud role instance. The expression checks whether the remainder of dividing the instance index by 24 equals the current hour. This is a simple and elegant way to ensure every instance is restarted once per day, at the hour that matches its index modulo 24.
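To see how this expression spreads reboots across the day, the check can be modeled outside CloudMonix. The following is a minimal Python sketch, not CloudMonix code: `should_reboot` and `reboot_hours` are hypothetical names, and instance indexes are assumed to be zero-based integers.

```python
from datetime import datetime, timezone


def should_reboot(check_time_utc: datetime, instance_index: int) -> bool:
    """Mirror of the expression: CheckTimeUtc.Hour == (InstanceIndex % 24)."""
    return check_time_utc.hour == instance_index % 24


def reboot_hours(instance_count: int) -> dict[int, list[int]]:
    """Group instance indexes by the UTC hour at which the check is true."""
    schedule: dict[int, list[int]] = {}
    for index in range(instance_count):
        schedule.setdefault(index % 24, []).append(index)
    return schedule


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 3, 15, tzinfo=timezone.utc)
    print(should_reboot(now, 3))  # instance 3 matches hour 03
    print(should_reboot(now, 4))  # instance 4 waits until hour 04
    print(reboot_hours(4))
```

In this model, a role with up to 24 instances gets a distinct hour per instance. Indexes that differ by a multiple of 24 (for example 0 and 24) map to the same hour, so a larger role would see more than one instance match in that hour.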

CloudMonix - proactive reboots

How To Configure Reboots On Demand

CloudMonix lets you set up actions that reboot instances when free memory is low or IIS requests start piling up. By default, CloudMonix tracks available memory with the “MemoryFree” metric and queued-up requests with the “AspNetRequestsQueued” metric.

Setting up an action that reboots an instance when available memory stays below some threshold for a sustained amount of time is trivial and takes a few seconds. This action is built into the default CloudMonix profiles as “Low Ram Reboot” and is disabled by default. To activate it, select the “Low Ram Reboot” action under the “Actions” tab in the instance settings and check the “Enabled” checkbox. Configure additional parameters as described in the picture.
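The “below a threshold for a sustained amount of time” condition can be illustrated with a small model. This is a minimal Python sketch and not how CloudMonix implements the action: the function name, the sample format, and the threshold and window values in the usage lines are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def low_memory_sustained(samples, threshold_mb, window):
    """Return True when every sample in the trailing window is below threshold.

    samples: list of (timestamp, free_memory_mb) tuples, oldest first.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    cutoff = latest - window
    # Require history that covers the whole window, so a single low reading
    # shortly after startup does not trigger a reboot.
    if samples[0][0] > cutoff:
        return False
    recent = [free for ts, free in samples if ts >= cutoff]
    return all(free < threshold_mb for free in recent)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0)
    # Hypothetical readings: one per minute, 300 MB free throughout.
    readings = [(start + timedelta(minutes=i), 300) for i in range(16)]
    print(low_memory_sustained(readings, 500, timedelta(minutes=10)))
```

Requiring the whole window to be below the threshold, rather than a single reading, avoids rebooting on a brief dip that the application would have recovered from on its own.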

CloudMonix - proactive reboots

Gracefully Handling Reboots

Keeping things stable during a reboot may be a bit tricky. Keep in mind that Azure may reboot any and all of your instances a few times per month anyway, as part of its scheduled updates, so handling reboots is necessary regardless of the CloudMonix actions outlined above. Also keep in mind that Microsoft highly recommends that every Cloud Role has at least 2 instances running, so that one instance can be upgraded, rebooted, migrated, etc. while the other(s) handle the live load.

Azure Web Roles:

Most of the time such reboots are handled by the platform out of the box. The instance is taken out of the load balancer first but is not rebooted right away. This allows it to stop receiving new requests and to finish the web requests that are already in the queue. Rebooting Azure Web Roles is relatively painless if the number of in-flight web requests is small.

Azure Worker Roles and Azure Web Roles with slower response times:

In this case, Azure can be made to hold off the reboot until in-progress work is complete. This is done by overriding the “OnStop” method in the WorkerRole class and ensuring that work is completed before the method exits. Keep in mind that Azure waits for up to 5 minutes before it forces the reboot, so any outstanding work must be wrapped up quickly.
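The shape of that pattern (signal the work loop to stop, then block until the current item finishes or a deadline passes) can be modeled in a few lines. The sketch below is in Python purely to illustrate the control flow. It is not the .NET WorkerRole API, the `Worker` class and its method names are hypothetical, and the 300-second default simply mirrors the 5-minute limit mentioned above.

```python
import threading
import time


class Worker:
    """Processes queued items and supports a bounded, graceful stop."""

    def __init__(self, items):
        self._items = list(items)
        self._stop_requested = threading.Event()
        self._finished = threading.Event()
        self.processed = []

    def run(self):
        # Check the stop flag between items, never in the middle of one.
        while self._items and not self._stop_requested.is_set():
            item = self._items.pop(0)
            time.sleep(0.01)  # stand-in for real work
            self.processed.append(item)
        self._finished.set()

    def on_stop(self, timeout_seconds=300):
        """Ask run() to stop and wait for the current item to finish.

        Returns False if the deadline passed before run() exited.
        """
        self._stop_requested.set()
        return self._finished.wait(timeout=timeout_seconds)


if __name__ == "__main__":
    worker = Worker(range(1000))
    thread = threading.Thread(target=worker.run)
    thread.start()
    time.sleep(0.1)
    print(worker.on_stop(timeout_seconds=5))
    thread.join()
```

The key design choice is that the loop only checks the stop flag between units of work, so each unit should be short enough to finish well inside the shutdown deadline.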

A great article on properly handling the “OnStop” event in the WorkerRole class, written by Rick Anderson at Microsoft, is available here.