Avoid Overspending with AWS Batch Using Cost Guardian Serverless Monitoring Architecture
Pay-as-you-go cloud resources are compelling but can be intimidating for budget-conscious research customers. Cloud cost uncertainty is a barrier to entry for most of them, so near real-time cost visibility is critical. This is especially true for research organizations, including nonprofits, K-12 school districts, and higher education institutions that are funded by grants and have limited budgets. Grants are difficult to write and win, so cost overruns are a significant risk for research performed in the cloud.
More and more research customers are taking advantage of managed compute services beyond Amazon Elastic Compute Cloud (Amazon EC2) for cloud experimentation (for example, AWS Batch, Amazon Elastic Container Service (Amazon ECS), and AWS Fargate). These services enable the rapid scalability that researchers need for high-performance computing (HPC) workloads. However, this scalability can be a double-edged sword when it comes to budget planning. Experimental research is inherently unpredictable, primarily because it is not steady state. HPC workloads are typically large-scale and sometimes long-running, making their costs harder to predict.
In this blog post, we present a new architecture for managing the costs of your cloud workloads. It leverages low-cost serverless technology capable of near real-time cost polling and terminates resources before they overspend. The solution is built for use with AWS Batch and currently supports the Fargate deployment model, and it is open source on GitHub for future extensibility. The architecture uses AWS Cloud Financial Management solutions as the primary cost analysis mechanism and complements them with the added functionality of near real-time visibility.
Updates and billing frequency
AWS offers a number of Cloud Financial Management solutions that provide cost transparency. This architecture builds on several of them:
- The AWS Cost and Usage Report (CUR) tracks your AWS usage and breaks down your costs by providing an estimate of the charges associated with your account. Each report contains line items for each AWS product, usage type, and operation used in your account. You can customize the AWS CUR to aggregate information by hour, day, or month. The AWS CUR publishes billing reports to an Amazon Simple Storage Service (Amazon S3) bucket you own.
- AWS Cost Explorer is a tool that helps you manage your AWS costs by providing detailed information about your bill line items. Cost Explorer visualizes daily, monthly, and forecasted expenses by combining an array of available filters. Filters let you break down costs by AWS service type, linked account, and tag.
- AWS Budgets lets you set custom budgets to track your costs, from the simplest to the most complex use cases. AWS Budgets sends email or Amazon SNS notifications when actual or forecasted costs exceed your budget threshold.
These tools are great for retrospective cost analysis, but they don't update in real time. They refresh up to three times a day, a granularity of roughly 8-12 hours. This means that if you set an alert for a limit you have already exceeded, the notification may be delayed by up to half a day until the next billing update. This lack of real-time cost data is a barrier to entry for grant-funded research customers looking to leverage AWS.
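The retrospective tools above can still be queried programmatically against a cost allocation tag. As a sketch (the tag key, tag value, and dates below are hypothetical, not from the published solution), the request parameters for boto3's Cost Explorer `get_cost_and_usage` call might be built like this:

```python
from datetime import date

def cost_explorer_request(tag_key, tag_value, start, end):
    """Build kwargs for boto3's ce.get_cost_and_usage, scoped to
    resources carrying a given cost allocation tag."""
    return {
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        # Restrict results to the tagged Batch/Fargate environment
        "Filter": {"Tags": {"Key": tag_key, "Values": [tag_value]}},
    }

req = cost_explorer_request("project", "batch-experiment",
                            date(2023, 1, 1), date(2023, 1, 31))
```

In a deployed Lambda function, `req` would be passed to `boto3.client("ce").get_cost_and_usage(**req)`; building the parameters separately keeps the query logic testable without AWS credentials.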
This solution focuses on AWS Batch on AWS Fargate use for research customers. The architecture relies on low-cost serverless technology to monitor and manage the long-running experimental computations found in research. Event-driven processes monitor managed compute spend and terminate compute before it goes over budget. The solution is meant to enhance the AWS cost tools with high-frequency polling built on serverless compute and storage options.
Additionally, this solution provides a model for similar high-frequency cost tracking of other services: the pricing variables can be swapped out via the AWS Price List API, and which resources are tracked can be changed via cost allocation tags.
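To make the "swappable pricing variables" idea concrete, here is a minimal sketch of estimating a Fargate task's cost from its vCPU and memory allocation. The per-hour rates below are the published us-east-1 Linux/x86 Fargate rates at the time of writing, hard-coded for illustration only; in a real deployment they would be fetched from the AWS Price List API:

```python
# Illustrative rates (us-east-1, Linux/x86); a deployed solution would
# look these up via the AWS Price List API instead of hard-coding them.
PRICE_PER_VCPU_HOUR = 0.04048
PRICE_PER_GB_HOUR = 0.004445

def job_cost(vcpus, memory_gb, hours):
    """Estimate the running cost of one Fargate task."""
    return (vcpus * PRICE_PER_VCPU_HOUR + memory_gb * PRICE_PER_GB_HOUR) * hours
```

Summing `job_cost` over all running Batch jobs gives the running cost tally that the polling loop compares against the budget.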
The following diagram illustrates the architecture of the solution.
The solution presents an AWS Batch environment on AWS Fargate where researchers submit compute jobs. The Batch Fargate environment has a cost allocation tag for AWS Budgets and Cost Explorer to track its cost.
The architecture starts by ingesting the AWS CUR, which tells us that the most recent billing data is available. The CUR is delivered to an S3 bucket, which triggers an AWS Lambda function we named Batch Expense Checker. Once triggered, this Lambda function performs two actions:
- Check allocated budget and current spending in AWS Budgets
- Run a custom AWS Cost Explorer query to verify budget spend
If both checks show spend below a configurable threshold, say 80%, Amazon Simple Notification Service (Amazon SNS) sends an email notification showing the amount spent and the amount remaining according to AWS Budgets and Cost Explorer.
If either check returns spend above the configurable threshold (for example, 80%), the architecture launches an AWS Step Functions workflow. Step Functions orchestrates stopping new Batch job submissions and maintains a running cost tally based on a configurable near real-time polling frequency. Users can set the polling frequency to be as short as a few seconds, but must set it to less than 6 hours. This ensures greater granularity than the existing AWS Cloud Financial Management solutions.
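The decision logic described in the last two paragraphs can be sketched as a small pure function (the return labels are ours, not names from the solution's code):

```python
def budget_action(spent, budget, threshold=0.80):
    """Decide what the architecture does after each billing update."""
    ratio = spent / budget
    if ratio >= 1.0:
        return "stop-jobs"       # end all remaining Batch jobs
    if ratio >= threshold:
        return "start-workflow"  # launch the Step Functions polling workflow
    return "notify"              # SNS email with spent and remaining amounts
```

Keeping the threshold a parameter is what makes the 80% figure configurable rather than baked in.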
The Step Functions workflow leverages Amazon DynamoDB tables as a data store for individual running Batch jobs and for aggregated job data. In parallel, Amazon EventBridge initiates an event-driven process that deletes completed Batch jobs from the DynamoDB table. Another Lambda function keeps a running cost count in the aggregated tasks DynamoDB table. Once the AWS Budgets threshold is reached, the Step Functions workflow queries the individual running Batch jobs from DynamoDB and stops them to avoid incurring additional costs.
AWS Step Functions and event-driven workflows
Here is a more detailed view of the architecture once the 80% threshold is crossed. The figure illustrates the Step Functions workflow with the event-driven parallel process.
The Step Functions workflow steps are as follows:
- After receiving the spend trigger (> 80%) from Batch Expense Checker, the state machine is executed.
- The workflow immediately calls Stop new batch jobs to stop adding new Batch jobs to the queue, because the budget threshold is almost reached.
- Save batch jobs records the running Batch tasks in the Running Batch Tasks DynamoDB table. This is done once; the table is then kept up to date via event-driven processes.
- The High Speed Query Batch Cost Checker is invoked continuously to query the Aggregated grouped task DynamoDB table and update the running cost of the existing Batch jobs.
- Finally, once 100% of the budget threshold is reached, Stopping batch job execution ends all remaining Batch jobs.
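The steps above can be sketched in Amazon States Language, expressed here as a Python dict. This is an illustrative outline only: the state names, the `$.spentRatio` field, the Lambda ARNs, and the 30-second wait are our assumptions, not the published solution's definitions:

```python
# Illustrative ASL sketch of the cost-guard workflow (all names hypothetical).
ARN = "arn:aws:lambda:us-east-1:123456789012:function:"

state_machine = {
    "StartAt": "StopNewBatchJobs",
    "States": {
        # Step 2: stop accepting new Batch job submissions
        "StopNewBatchJobs": {"Type": "Task", "Resource": ARN + "StopNewBatchJobs",
                             "Next": "SaveBatchJobs"},
        # Step 3: snapshot running tasks into DynamoDB
        "SaveBatchJobs": {"Type": "Task", "Resource": ARN + "SaveBatchJobs",
                          "Next": "CheckCost"},
        # Step 4: poll the aggregated-task table for the running tally
        "CheckCost": {"Type": "Task", "Resource": ARN + "BatchCostChecker",
                      "Next": "OverBudget"},
        "OverBudget": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.spentRatio",
                         "NumericGreaterThanEquals": 1.0,
                         "Next": "StopBatchJobs"}],
            "Default": "Wait",
        },
        # Configurable polling frequency (assumed 30 s here)
        "Wait": {"Type": "Wait", "Seconds": 30, "Next": "CheckCost"},
        # Step 5: at 100% of budget, end all remaining jobs
        "StopBatchJobs": {"Type": "Task", "Resource": ARN + "StopBatchJobs",
                          "End": True},
    },
}
```

The `CheckCost` → `OverBudget` → `Wait` loop is what implements the near real-time polling, with `Wait.Seconds` as the configurable frequency.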
In parallel, these event processes run in the background:
- If one of the running Batch jobs completes, EventBridge calls the Delete Task Lambda function to delete the completed Batch job from the Running Batch Tasks DynamoDB table.
- DynamoDB Streams captures all changes in the Running Batch Tasks DynamoDB table and calls Update Aggregate Batch Task to aggregate the vCPU and memory of all running batch jobs and update the Aggregated grouped task DynamoDB table.
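The aggregation step driven by DynamoDB Streams can be sketched as a fold over stream records. This is a simplified model: real stream images use DynamoDB's typed attribute format (for example, `{"N": "2"}`), and the attribute names here are assumptions:

```python
def apply_stream_records(aggregate, records):
    """Fold DynamoDB stream events for the Running Batch Tasks table
    into running vCPU/memory totals (plain-number sketch)."""
    for rec in records:
        if rec["eventName"] == "INSERT":
            img = rec["dynamodb"]["NewImage"]      # newly started job
            aggregate["vcpus"] += img["vcpus"]
            aggregate["memory_gb"] += img["memory_gb"]
        elif rec["eventName"] == "REMOVE":
            img = rec["dynamodb"]["OldImage"]      # completed, deleted job
            aggregate["vcpus"] -= img["vcpus"]
            aggregate["memory_gb"] -= img["memory_gb"]
    return aggregate
```

The resulting totals, priced per vCPU-hour and GB-hour, are what the aggregated-task table holds for the cost checker to poll.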
Before you begin, you must have the following prerequisites:
- An AWS account
- Have the AWS Cloud Development Kit (AWS CDK) and Python (3.6+) installed on a local machine or in a cloud IDE of your choice.
- Create an AWS CUR with the accompanying S3 bucket. Follow the instructions here. (NOTE: creation may take up to 24 hours).
- Set report path prefix to “reports”
- Set Granularity to “Hourly”
- Set report versioning to “create new report version”
Deploy the solution
To deploy the solution using the CDK, follow these steps to bootstrap your environment…