Dynamically Right-Sizing Your Cloud Infrastructure

The benefits of cloud service providers such as Amazon Web Services, Microsoft Azure, and Google Cloud Platform cannot be overstated. With the click of a button (or execution of a script), a user can create anything from a simple web site to hundreds of servers, each with hundreds of cores of CPU power. This allows development resources to be created and scaled on-demand, giving development teams access to bleeding-edge technologies without having to wait for a dedicated infrastructure team to design, request, and build the supporting infrastructure. However, as we will see, with great power comes great fiscal responsibility.

A common objection to cloud migrations, as well as a motivation behind rolling back cloud systems to traditional on-premises infrastructure, is the persistent cost involved. In an on-prem setup, machines are right-sized up front to account for bursts of resource demand; that is, machines are sized to handle some level of peak usage and sit idle when not needed. A high initial cost is paid, but after that investment, costs drop significantly as the servers largely run in maintenance mode. In this case, right-sized infrastructure is tied directly to the absolute peak system demands, regardless of how often those resources are actually required.

In the case of a cloud subscription, the up-front cost is a mere fraction of the up-front investment of physical infrastructure, but the costs persist for the life of the systems – many machines are billed with a formula along the lines of base machine cost * hours of operation. The recurring spend is arguably the most common cost-related objection to cloud providers – physical systems are a one-time expense (until the systems need to be upgraded). That’s not to say significant savings in cloud systems are impossible – cloud services today provide on-demand scaling of resources, allowing you to ensure your systems are not only right-sized for peak demands, but also scaled down when demands are lower. There are many tactics one can use to manage cloud costs effectively; this article dives into two possible ways to alter one or both of the above multipliers (base machine cost and hours of operation) as needed to reduce compute costs. Doing so requires insight into how, and perhaps more importantly when, your systems are being used.
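To make the billing formula above concrete, here is a minimal sketch in Python. The hourly rates and the 2-hour/22-hour split are hypothetical numbers chosen only for illustration, not real provider prices, and the function names are made up for this example.

```python
def compute_cost(hourly_rate: float, hours: float) -> float:
    """Cost of one machine: base machine cost * hours of operation."""
    return hourly_rate * hours


def blended_cost(segments) -> float:
    """Total cost over a list of (hourly_rate, hours) segments."""
    return sum(compute_cost(rate, hours) for rate, hours in segments)


# Hypothetical rates, for illustration only.
LARGE_RATE = 2.00  # cost per hour of the larger size
SMALL_RATE = 0.50  # cost per hour of the smaller size

# Option A: stay on the larger size all day.
always_large = compute_cost(LARGE_RATE, 24)

# Option B: use the larger size for 2 hours, the smaller size for the other 22.
right_sized = blended_cost([(LARGE_RATE, 2), (SMALL_RATE, 22)])

print(always_large, right_sized)
```

Lowering the rate for part of the day changes the first multiplier; turning machines off entirely would change the second.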

Example 1: Scaling Vertically
You are using Platform-as-a-Service (PaaS) managed services for SQL databases. One such database is used internally for handling batch processing operations. Investigation into the activity of the resource shows a heartbeat-like graph indicating extremely write-heavy activity for about one hour every day at 4 AM, corresponding to when a file import is processed. The rest of the time, usage never climbs above 10%. In the on-prem world, the server would be built to handle the spikes and sit idle the rest of the day; however, taking this same approach with managed resources would result in substantial unnecessary costs for the 23 hours per day during which 90% of the available resources are unused.

In this case, we have a very predictable pattern with respect to system demands and time of day and can take proactive action to supply system resources as needed. The actual implementation can vary based on cloud provider and personal preference; examples include automation accounts or functions in Azure clouds, lambdas in AWS, or an orchestrator machine running scheduled tasks in any environment.

The basic logical flow would read as follows:
At 3:50 AM (10 minutes before the daily spike), scale the database server up to the premium tier.

At 5:10 AM (10 minutes after the spike ends), verify that usage has dropped down to normal levels; if so, scale down to the standard tier.

If usage does not ramp up or drop down in the expected window, notify relevant teams.
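The flow above could be sketched roughly as follows. This is a provider-neutral illustration, not a definitive implementation: the function names and action strings are hypothetical, the 10% threshold is taken from the scenario's "normal" usage level, and the actual scaling and notification calls (which depend on your cloud provider) are left to whatever scheduler invokes these functions.

```python
from typing import List

# Percent; the scenario says usage normally never climbs above 10%.
NORMAL_USAGE_MAX = 10.0


def on_pre_spike(current_tier: str) -> List[str]:
    """Run at 3:50 AM: make sure the database is on the premium tier."""
    if current_tier != "premium":
        return ["scale_to:premium"]
    return []


def on_post_spike(current_tier: str, usage_now: float, peak_usage: float) -> List[str]:
    """Run at 5:10 AM: scale down if usage is back to normal, flag anything unexpected.

    usage_now  -- current usage percentage
    peak_usage -- highest usage percentage seen since the pre-spike step
    """
    actions: List[str] = []
    if peak_usage <= NORMAL_USAGE_MAX:
        # The expected spike never happened.
        actions.append("notify:usage did not ramp up")
    if usage_now > NORMAL_USAGE_MAX:
        # Still busy: stay on the current tier and tell someone.
        actions.append("notify:usage did not drop")
    elif current_tier == "premium":
        actions.append("scale_to:standard")
    return actions
```

Returning a list of actions rather than calling a cloud API directly keeps the decision logic easy to test; the caller translates each action into the provider-specific operation.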

Example 2: Scaling Horizontally
You are operating a set of load-balanced Infrastructure-as-a-Service (IaaS) Virtual Machines hosting a consumer-facing web application. During off-peak hours, two machines are enough to handle incoming traffic while allowing for failover if necessary. During the workday, traffic gradually ramps up, requiring the resources of one to three additional machines, variable by day and time of day. This graph has much more variance than the previous example, with some days experiencing higher load than others, but the load never arrives as an immediate spike.

In this scenario, we can be more reactive with respect to our resizing and make use of threshold-based metrics to tell us when we need to scale out or in. Again, there are a multitude of ways to accomplish this – for example, a cloud monitor that triggers a ‘scale out’ event when machines begin to see 60% CPU load, and a second monitor that sends the ‘scale in’ broadcast when load has dropped below 40% for more than five minutes.

The logical flow here would read as follows:
If total CPU exceeds 60%, add another machine to the set.

If the VM pool contains more than two machines and total usage drops below 40% for more than five minutes, remove one machine from the pool.

Report any anomalies to relevant teams.

As you can see, this example makes no assumptions about time of day and may upsize the pool for 3 hours beginning at 7AM on Monday but only need to scale out for 2 hours at 1:30PM on Wednesday.
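The threshold rules above might be sketched as a small state machine. This is only an illustration under stated assumptions: the names are hypothetical, the five-machine ceiling is inferred from "two machines plus one to three additional", and treating "still over 60% at the ceiling" as the anomaly to report is an assumption of this sketch rather than something the scenario specifies.

```python
from dataclasses import dataclass

SCALE_OUT_CPU = 60.0           # percent; add a machine above this
SCALE_IN_CPU = 40.0            # percent; consider removing a machine below this
SCALE_IN_HOLD_SECONDS = 5 * 60 # load must stay low for more than five minutes
MIN_MACHINES = 2               # off-peak baseline, keeps failover capacity
MAX_MACHINES = 5               # assumed ceiling: baseline plus three extra


@dataclass
class PoolState:
    machines: int = MIN_MACHINES
    seconds_below: int = 0  # how long CPU has been under SCALE_IN_CPU


def step(state: PoolState, cpu_percent: float, elapsed_seconds: int) -> str:
    """Process one metric sample; return 'scale_out', 'scale_in', 'notify', or 'none'."""
    if cpu_percent > SCALE_OUT_CPU:
        state.seconds_below = 0
        if state.machines < MAX_MACHINES:
            state.machines += 1
            return "scale_out"
        return "notify"  # already at the ceiling and still overloaded
    if cpu_percent < SCALE_IN_CPU:
        state.seconds_below += elapsed_seconds
        if state.seconds_below > SCALE_IN_HOLD_SECONDS and state.machines > MIN_MACHINES:
            state.machines -= 1
            state.seconds_below = 0  # restart the timer before removing another
            return "scale_in"
        return "none"
    # Between the two thresholds: hold steady and reset the low-load timer.
    state.seconds_below = 0
    return "none"
```

The gap between the 60% and 40% thresholds keeps the pool from flapping between sizes. A production setup would likely also want a cooldown after scaling out, so that one sustained burst does not add a machine on every sample.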

Complex Scaling Strategies in the Real World
In all likelihood, you will see benefit from a mix and match of these (and other) cost-management strategies in your cloud environment. Oftentimes a single piece of the solution architecture will serve as a candidate for multiple scaling strategies; for example, in an internal build system, one may seek to scale up an individual agent to allocate more power to resource-intensive jobs, while at the same time scaling out to create more build agents and allow parallelization of builds. This is the principle behind many build orchestration tools which provide the ability to spawn ephemeral build agents in response to demand. Similarly, a tax preparation service may need to create significantly more instances of their webserver farms around tax day to handle higher levels of internet traffic, while also vertically scaling to increase the processing power of any given database server.

Conclusion
Like development itself, a cloud migration is forever a work-in-progress. Simply lift-and-shifting your systems from physical hardware to managed solutions provides immense benefits in ease of use and reliability; however, best practices for things like right-sizing your systems have changed, and there are significant benefits to constantly evaluating your system architecture and looking for areas of improvement. By routinely monitoring how and when your systems are being most heavily used, you can then work to implement solutions to get the most out of your infrastructure as well as your budget.

About the Author:
Chris Gutmanis is an engineering consultant based out of the Milwaukee Development Center. He’s worked as a software and systems engineer and has been focusing lately on DevOps and cloud computing. Chris has a wide range of experience covering everything from startups to large financial services and health care companies. He lives in Milwaukee with his wife, two dogs and three cats and enjoys Brazilian jiu-jitsu, heavy metal music, making guacamole and trips to the dog park.
