Spot Instances: A Risk-Adjusted Analysis for Finance Teams

Spot Instances are AWS’s most aggressive discount instrument: unused EC2 capacity sold at prices that typically run 70–90% below on-demand rates. They’re popular in the FinOps community, and the savings case is real. A workload that costs $10,000/month on-demand might cost $1,000–$3,000/month on Spot.

But Spot Instances are not a free discount. They come with a specific risk: AWS can interrupt your instance with two minutes’ notice when it needs the capacity back. For workloads that can absorb that interruption, Spot is excellent. For workloads that can’t, Spot is not an option at any discount level.

The analysis that most organizations skip is the risk-adjusted one: what is the actual cost of an interruption, and does the expected savings justify that expected cost?

Spot Instance suitability is determined by interruption cost, not discount depth. Workloads where an interruption is cheap to absorb (batch, ML training, CI) are ideal candidates. Production stateful workloads are not, at any discount level.

How Spot pricing works

AWS maintains a pool of unused EC2 capacity across instance types and availability zones. When demand for that capacity is low, it’s available at Spot prices. When demand rises, AWS can reclaim the capacity — it will terminate Spot Instances to make room for on-demand or reserved customers.

The interruption frequency varies significantly by instance type and location. Some instance pools have interruption rates below 5% per month. Others are above 20%. AWS publishes this data in the Spot Instance Advisor, which groups instance types in each Region into interruption-frequency bands ranging from below 5% to above 20%.

Spot prices also fluctuate, though they’re generally more stable than they were several years ago. The discount relative to on-demand is usually in the 70–90% range for Linux instances, somewhat lower for Windows.
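To see the discount for a specific instance type rather than rely on a general range, the Spot price history can be compared against the on-demand rate. The sketch below assumes records shaped like the `SpotPriceHistory` entries returned by the EC2 DescribeSpotPriceHistory API (for example via boto3's `describe_spot_price_history`). The function name `spot_discount` and all prices are hypothetical, chosen only for illustration.

```python
from statistics import mean


def spot_discount(on_demand_hourly, spot_records):
    """Average Spot discount per Availability Zone, as a fraction of on-demand."""
    by_zone = {}
    for record in spot_records:
        # The API reports SpotPrice as a string, so convert before averaging.
        by_zone.setdefault(record["AvailabilityZone"], []).append(
            float(record["SpotPrice"])
        )
    return {
        zone: round(1 - mean(prices) / on_demand_hourly, 3)
        for zone, prices in by_zone.items()
    }


# Hypothetical figures for illustration only, not real AWS prices.
records = [
    {"AvailabilityZone": "us-east-1a", "SpotPrice": "0.0300"},
    {"AvailabilityZone": "us-east-1a", "SpotPrice": "0.0340"},
    {"AvailabilityZone": "us-east-1b", "SpotPrice": "0.0250"},
]
discounts = spot_discount(on_demand_hourly=0.10, spot_records=records)
```

With these made-up inputs the two zones come out at 68% and 75% below on-demand, which shows why the same instance type can have a different savings case in different zones.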

Workloads that are appropriate for Spot

The key question is whether the workload can tolerate interruption gracefully. Categories that typically can:

Batch processing — Data pipelines, ETL jobs, report generation, image processing. These workloads are checkpointable: if an instance is interrupted, the work can be resumed from the last checkpoint on a new instance. The delay costs time, not data integrity.

CI/CD build workers — Continuous integration jobs are inherently retriable. If a build is interrupted, it restarts. The cost is build time, not production availability.

Stateless web tier capacity — Applications that run multiple identical instances can use Spot for a portion of their fleet. If a Spot instance is interrupted, load balancers route traffic to the remaining instances. This requires a mixed fleet strategy (some on-demand baseline, some Spot) and an application architecture that assumes stateless instances.

ML training jobs — Large training jobs on distributed compute are designed to checkpoint periodically. Interruptions are recoverable at the cost of some training time.

Dev and test environments — Non-production environments where availability expectations are loose. If a development environment goes down, engineers restart it. The inconvenience is acceptable at 80% discount.
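What "checkpointable" means in practice can be sketched in a few lines. The example below is a minimal illustration, not a production design: `run_batch`, the JSON checkpoint format, and the `stop_after` parameter (which simulates an interruption) are all assumptions made for the example. On real Spot capacity the checkpoint would need to live on storage that survives the instance, such as an object store or network volume.

```python
import json
import os
import tempfile


def run_batch(items, checkpoint_path, process, stop_after=None):
    """Process items in order, resuming from the last checkpoint if one exists."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]
    for index in range(start, len(items)):
        if stop_after is not None and index - start >= stop_after:
            return False  # simulated interruption; completed work is already saved
        process(items[index])
        # Write to a temp file and rename, so a kill mid-write cannot
        # leave a corrupt checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_index": index + 1}, f)
        os.replace(tmp_path, checkpoint_path)
    return True


done = []
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
items = list(range(10))
first_run = run_batch(items, path, done.append, stop_after=4)  # interrupted after 4 items
second_run = run_batch(items, path, done.append)               # replacement instance resumes
```

The second run picks up at item 5 rather than item 1, so the interruption costs only the restart delay.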

Workloads that are NOT appropriate for Spot

Single-instance production databases — A Spot interruption means downtime and potential data loss. Databases require stable, predictable infrastructure. Reserved Instances or on-demand are appropriate.

Real-time API endpoints — If your application depends on a single server processing real-time requests, a Spot interruption is an outage. Even with multi-instance architectures, the capacity reduction during an interruption creates risk for applications with tight availability SLAs.

Stateful workloads without checkpointing — Any job that can’t restart cleanly from a checkpoint will lose work on interruption. If the cost of re-running the lost work is significant, Spot economics change.

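Long-running jobs without interruption handling — A 12-hour Spot job that is interrupted after 11 hours with no checkpointing loses 11 hours of compute and must start over. At a 75% discount a single restart still costs less in compute than on-demand would have, but repeated interruptions and the cost of the delay can erase the savings.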
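To put numbers on a lost run, here is a small sketch of the compute billed when a job is interrupted once and then runs to completion. The function names, the single-interruption assumption, the 2-hour checkpoint interval, and the 75% discount are illustrative choices rather than AWS figures. Delay and engineering costs are deliberately left out.

```python
def billed_hours(job_hours, interrupted_at, checkpoint_every=None):
    """Instance-hours billed for a job that is interrupted once, then completes."""
    if checkpoint_every is None:
        saved = 0  # nothing survives the interruption
    else:
        # Work up to the most recent checkpoint survives.
        saved = (interrupted_at // checkpoint_every) * checkpoint_every
    return interrupted_at + (job_hours - saved)


def cost_vs_on_demand(hours_billed, job_hours, discount):
    """Spot compute cost as a fraction of an uninterrupted on-demand run."""
    return hours_billed * (1 - discount) / job_hours


no_checkpoint = billed_hours(12, 11)                         # 11 lost + 12 re-run
with_checkpoint = billed_hours(12, 11, checkpoint_every=2)   # resumes from hour 10

ratio_no_checkpoint = cost_vs_on_demand(no_checkpoint, 12, discount=0.75)
ratio_with_checkpoint = cost_vs_on_demand(with_checkpoint, 12, discount=0.75)
```

Under these assumptions the unprotected job bills 23 hours instead of 12, while the checkpointed one bills 13. In compute terms alone that is roughly 48% versus 27% of the on-demand cost, before counting the 11-hour delay.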

The risk-adjusted math

A clean Spot analysis has three inputs:

Expected savings — Current monthly cost for the workload × Spot discount rate. If you’re spending $8,000/month on-demand for a batch processing workload and Spot gives you 75% discount, expected savings are $6,000/month.

Expected interruption cost — (Interruption probability per month) × (Cost per interruption event). Interruption cost includes: engineering time to handle the interruption, compute cost of any re-run work, SLA penalties if applicable, and downstream delays if the workload feeds other systems.

For a batch job with checkpointing: interruption cost might be 20 minutes of re-run time + 10 minutes of engineer time. At typical rates, that’s $50–100 per event.

For a job without checkpointing: interruption cost might be hours of re-run plus downstream delays.

Net expected value = Expected savings − (Interruption probability × Cost per event)

If the instance type you’re targeting has a 10% monthly interruption rate and each interruption costs $200, the expected monthly cost of interruptions is $20 — trivial against $6,000 in expected savings. Spot is clearly the right choice.

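If each interruption costs $5,000 (SLA penalties, significant data loss, 8-hour remediation), the same 10% rate puts the expected monthly cost at $500. That is still well below $6,000 in savings for a single instance, but the interruption probability applies to each instance separately. For example, a workload spread across 20 Spot instances should expect about two interruptions a month, or roughly $10,000, and the case for Spot reverses.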
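The formula above is simple enough to put in a spreadsheet, but a short sketch makes the inputs explicit. The class and field names here are hypothetical. The `instances` field is an added assumption: it treats the published interruption rate as applying to each instance independently, so expected events per month scale with fleet size.

```python
from dataclasses import dataclass


@dataclass
class SpotCase:
    on_demand_monthly: float   # current on-demand spend for the workload
    discount: float            # Spot discount as a fraction, e.g. 0.75
    interruption_prob: float   # monthly interruption probability per instance
    cost_per_event: float      # fully loaded cost of one interruption
    instances: int = 1         # assumption: interruptions are independent per instance

    @property
    def expected_savings(self):
        return self.on_demand_monthly * self.discount

    @property
    def expected_interruption_cost(self):
        return self.interruption_prob * self.instances * self.cost_per_event

    @property
    def net_expected_value(self):
        return self.expected_savings - self.expected_interruption_cost

    @property
    def break_even_cost_per_event(self):
        """Cost per event at which Spot stops paying for itself."""
        return self.expected_savings / (self.interruption_prob * self.instances)


cheap = SpotCase(8000, 0.75, 0.10, cost_per_event=200)
costly = SpotCase(8000, 0.75, 0.10, cost_per_event=5000)
costly_fleet = SpotCase(8000, 0.75, 0.10, cost_per_event=5000, instances=20)
```

With the article's figures, `cheap.net_expected_value` is about $5,980 and `costly.net_expected_value` about $5,500. The hypothetical 20-instance fleet comes out about $4,000 negative. The break-even property is often the most useful number for a finance review, because it turns the question into "could one interruption plausibly cost this much?"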

Architectural patterns that make Spot viable

Organizations that use Spot successfully at scale typically implement a few patterns:

Mixed instance fleet — Instead of targeting a single Spot instance type, use multiple instance families with similar specifications. If one pool is interrupted or unavailable, the fleet shifts to others. Spot Fleet and EC2 Auto Scaling groups with multiple instance types make this manageable.

Capacity fall-through — Configure your provisioning to request Spot first, then fall through to on-demand if Spot capacity is unavailable. This preserves availability while still capturing Spot savings whenever capacity exists.

Spot interruption handlers — AWS sends a two-minute warning before interrupting a Spot instance. Applications instrumented to receive and act on this signal can checkpoint in-progress work, drain connections, and deregister from load balancers before the termination happens. This significantly reduces the blast radius of an interruption.

On-demand baseline + Spot burst — Keep a minimum on-demand capacity that handles baseline load. Use Spot for additional capacity during peak periods. If Spot is interrupted, you’re never below minimum capacity.
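The interruption-handler pattern above can be sketched as a small polling loop. On EC2, the interruption notice appears in the instance metadata service at `spot/instance-action`, which returns HTTP 404 until an interruption is scheduled. The function names, the 5-second polling interval, and the injected `fetch` and `on_interrupt` callbacks are assumptions made for this example, and the usage below substitutes a fake fetcher so the sketch runs off EC2.

```python
import json
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=1.0):
    """Return the instance-action JSON text, or None if no interruption is scheduled."""
    # IMDSv2: obtain a short-lived session token first.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        action_req = urllib.request.Request(
            f"{IMDS}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # no interruption notice yet
        raise


def watch(fetch=fetch_instance_action, on_interrupt=print, poll_seconds=5, max_polls=None):
    """Poll until an interruption notice appears, then run the shutdown callback."""
    polls = 0
    while max_polls is None or polls < max_polls:
        body = fetch()
        if body is not None:
            notice = json.loads(body)
            on_interrupt(notice)  # checkpoint, drain connections, deregister
            return notice
        polls += 1
        time.sleep(poll_seconds)
    return None


# Simulated run: two quiet polls, then a notice (hypothetical timestamp).
fake_responses = iter([None, None, '{"action": "terminate", "time": "2030-01-01T00:00:00Z"}'])
seen = []
result = watch(fetch=lambda: next(fake_responses), on_interrupt=seen.append, poll_seconds=0)
```

On a real Spot instance, `watch()` would run with its default fetcher in a background process, with `on_interrupt` wired to whatever checkpoint and drain logic the application already has. The two-minute window is short, so the callback should do only what fits comfortably inside it.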

What finance should ask before approving Spot usage

What is the interruption probability for the specific instance types and availability zones proposed?
Does the workload have checkpointing, or does an interruption require a full restart?
What is the downstream impact of an interruption — are there SLAs or dependent systems?
What is the mixed-fleet strategy, and what is the on-demand floor?
Is there Spot interruption handling instrumented in the application?

If engineering can answer these questions clearly, the savings case is likely sound. If the answer to any of them is “we haven’t thought about that,” the risk hasn’t been properly evaluated.

CostDefender surfaces your current on-demand spending by workload type, making it easy to identify Spot candidates and quantify the savings opportunity before engineering scopes the implementation.