Performance Testing: Load, Stress, and Soak

Contributor
Apr 28

"Is our system fast enough?" is a fuzzy question. "Can it handle 1000 concurrent users?" is a clearer one. "Will it survive a 10x spike?" is another. "Does it stay healthy over 48 hours?" is yet another. Each is a different test, with a different setup, finding different problems.

This guide gives a practical view of the main performance test types and when to use each.

The Three Main Types

Load testing: How does the system behave under expected load? Verifies that normal traffic produces acceptable response times.

Stress testing: What happens when the system is pushed beyond expected load? Reveals breaking points and behavior under overload.

Soak testing: Does the system remain stable over long periods? Surfaces memory leaks, resource exhaustion, and slow-accumulating problems.

There are others (spike, scalability, capacity), touched on later in this guide, but these three cover most needs.

Load Testing

The most common performance test. Run the system at expected production load and measure:

Response times (p50, p95, p99)
Error rates
Throughput (requests per second)
Resource utilization (CPU, memory, network)

A working load test:

Models realistic traffic patterns (not just one endpoint hit repeatedly)
Includes realistic data (not all the same record)
Includes realistic user behavior (think time between requests)
Runs for long enough to reach steady state (typically 15-30 minutes)
Asserts against specific thresholds

Example assertion: "p95 of /api/checkout under 800ms at 500 concurrent users."
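
An assertion like the one above can be checked in a few lines once the latencies are collected. This is a minimal sketch, assuming the raw latencies for /api/checkout are already available as a list of milliseconds. The function names and the sample data are hypothetical, and the nearest-rank percentile used here is only one of several common definitions, so a load tool may report slightly different values.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer arithmetic before the division keeps the rank exact for whole-number pct.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_p95(latencies_ms, limit_ms=800):
    """Return (passed, observed_p95) for a 'p95 under limit' threshold."""
    p95 = percentile(latencies_ms, 95)
    return p95 < limit_ms, p95


# Hypothetical run: 94 fast requests and 6 slow ones.
latencies = [120] * 94 + [950] * 6
passed, p95 = check_p95(latencies)
print(f"p95={p95}ms passed={passed}")
```

Six slow requests out of a hundred are enough to push p95 past the limit here, even though the median is unaffected.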

Stress Testing

Push past expected load until something breaks. The goal is understanding limits.

Variants:

Ramp test: gradually increase load until performance degrades
Spike test: sudden burst of high load
Sustained overload: maintain high load to see what fails first

Stress testing answers:

What's our actual capacity?
What fails first under overload? (Database? API gateway? Memory?)
How does the system recover after the load is removed?
Are our scaling triggers correctly set?
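
The three variants differ mainly in the shape of the load over time. As a rough sketch, each can be described as a list of (duration in seconds, target users) stages. The function names and numbers below are hypothetical and not tied to any particular tool, though most load tools accept a similar stage list.

```python
def ramp_profile(start_users, end_users, steps, step_seconds):
    """Ramp test: increase load in equal increments, holding each level."""
    if steps < 2:
        raise ValueError("a ramp needs at least two steps")
    increment = (end_users - start_users) / (steps - 1)
    return [(step_seconds, round(start_users + i * increment)) for i in range(steps)]


def spike_profile(baseline_users, spike_users, baseline_seconds, spike_seconds):
    """Spike test: baseline, sudden burst, then baseline again to observe recovery."""
    return [
        (baseline_seconds, baseline_users),
        (spike_seconds, spike_users),
        (baseline_seconds, baseline_users),
    ]


def sustained_overload_profile(expected_users, overload_users, warmup_seconds, hold_seconds):
    """Sustained overload: warm up at expected load, then hold above it."""
    return [(warmup_seconds, expected_users), (hold_seconds, overload_users)]


for duration, users in ramp_profile(100, 500, steps=5, step_seconds=60):
    print(f"hold {users} users for {duration}s")
```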

Soak Testing

Long-running tests at moderate load. Often 24-72 hours.

What soak tests find:

Memory leaks
Connection pool exhaustion
Disk filling up (logs, temp files)
Background job queue buildup
Database performance degradation as data grows
Cache problems (unbounded growth, stale entries)

Soak tests can be expensive because the infrastructure has to run for days. For most teams, run them periodically: monthly or before major releases.
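
One way to turn soak data into a finding is to fit a trend line to a resource metric and flag steady growth. The sketch below assumes memory usage was sampled as (hours elapsed, megabytes) pairs. The function names and the 1 MB per hour threshold are hypothetical, and a real analysis would usually discard the warm-up period first.

```python
def slope_per_hour(samples):
    """Least-squares slope of (hours_elapsed, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    denominator = sum((x - mean_x) ** 2 for x, _ in samples)
    if denominator == 0:
        raise ValueError("all samples share the same timestamp")
    return numerator / denominator


def looks_like_leak(samples, max_growth_mb_per_hour=1.0):
    """Flag sustained growth above an assumed threshold (hypothetical default)."""
    return slope_per_hour(samples) > max_growth_mb_per_hour


# Hypothetical 48-hour soak: memory climbs about 5 MB every hour.
memory = [(hour, 512 + 5 * hour) for hour in range(48)]
print(f"growth: {slope_per_hour(memory):.2f} MB/hour, leak suspected: {looks_like_leak(memory)}")
```

The same slope check applies to other slowly accumulating metrics such as disk usage or queue depth.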

Spike Testing

Sudden traffic burst to test elasticity.

Examples:

Marketing campaign launches
Black Friday rushes
Social media virality

What spike tests verify:

Auto-scaling triggers correctly
The system doesn't fail before scaling completes
Recovery is clean after the spike
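
"Recovery is clean" can be made measurable. A minimal sketch, assuming p95 latency was sampled at intervals after the spike ended and a pre-spike baseline is known. The function name, the 20% tolerance, and the sample values are hypothetical.

```python
def recovery_time(post_spike_samples, baseline_ms, tolerance=1.2):
    """Return the first timestamp after which p95 stays within tolerance of baseline.

    post_spike_samples: (seconds_after_spike, p95_ms) pairs in time order.
    Returns None if latency never settles within the observed window.
    """
    limit = baseline_ms * tolerance
    recovered_at = None
    for seconds_after, p95_ms in post_spike_samples:
        if p95_ms <= limit:
            if recovered_at is None:
                recovered_at = seconds_after
        else:
            # A relapse means the earlier dip was not a real recovery.
            recovered_at = None
    return recovered_at


samples = [(0, 900), (10, 600), (20, 230), (30, 400), (40, 220), (50, 210)]
print("recovered after", recovery_time(samples, baseline_ms=200), "seconds")
```

Resetting on a relapse matters: a system that dips back to normal and then degrades again has not recovered cleanly.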

Capacity / Scalability Testing

Determines maximum sustainable throughput.

Methodology:

Start at low load
Increase incrementally until error rate exceeds threshold
Note the throughput at that point
Note which resource saturated first, since that is where additional capacity should be added

The result informs capacity planning: "we can handle 1500 RPS on the current cluster; we'd need to add a node at 2000."
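
The stepping methodology above can be sketched as a simple loop. Here `measure_error_rate` is a hypothetical stand-in for running one real test stage at a given request rate and returning the observed error rate. The fake system and the 1% threshold are assumptions for illustration only.

```python
from typing import Callable, Optional


def find_capacity(
    measure_error_rate: Callable[[int], float],
    start_rps: int,
    step_rps: int,
    max_rps: int,
    max_error_rate: float = 0.01,
) -> Optional[int]:
    """Step load upward and return the highest RPS that stayed within the error threshold."""
    sustainable = None
    rps = start_rps
    while rps <= max_rps:
        if measure_error_rate(rps) > max_error_rate:
            break  # First level that exceeds the threshold: stop stepping.
        sustainable = rps
        rps += step_rps
    return sustainable


def fake_system(rps: int) -> float:
    """Hypothetical system: clean up to 1500 RPS, then drops the excess requests."""
    return 0.0 if rps <= 1500 else (rps - 1500) / rps


print("sustainable throughput:", find_capacity(fake_system, 250, 250, 3000), "RPS")
```

The step size sets the resolution of the answer, so a coarse first pass followed by a finer pass around the breaking point is a common refinement.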

Setting Targets

Performance tests need numeric targets. Without them, a test produces data but cannot pass or fail.

Targets come from:

SLOs (service level objectives) you've committed to
Customer expectations for the use case
Competitive benchmarks for similar products
Engineering intuition for what's "fast enough"

Examples:

p95 response time under 500ms for read endpoints
p99 response time under 1500ms for write endpoints
Error rate under 0.1% at expected load
System handles 2x expected peak load with degradation but no errors

Document targets explicitly. Performance is the second-biggest source of vague engineering disagreements (after "good user experience").
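
One way to document targets explicitly is to keep them as data next to the tests, so the same definitions drive the pass/fail decision. This sketch encodes the first three example targets. The metric keys and the shape of the results dictionary are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    name: str
    metric: str   # Key expected in the results dictionary (hypothetical naming).
    limit: float  # The observed value must be strictly under this.


TARGETS = (
    Target("read p95", "read_p95_ms", 500),
    Target("write p99", "write_p99_ms", 1500),
    Target("error rate", "error_rate", 0.001),  # 0.1%
)


def evaluate(results, targets=TARGETS):
    """Return a list of failure messages; an empty list means the run passed."""
    failures = []
    for target in targets:
        observed = results.get(target.metric)
        if observed is None:
            failures.append(f"{target.name}: no measurement for {target.metric}")
        elif observed >= target.limit:
            failures.append(f"{target.name}: {observed} is not under {target.limit}")
    return failures


run = {"read_p95_ms": 420, "write_p99_ms": 1600, "error_rate": 0.0005}
for failure in evaluate(run):
    print("FAIL", failure)
```

Treating a missing measurement as a failure is a deliberate choice in this sketch: a target that silently stops being measured is easy to miss otherwise.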

What to Test

You can't load test everything. Pick:

Critical user journeys. What customers do most.
Money paths. Checkout, billing, payment.
Read-heavy endpoints. Where most traffic concentrates.
Reportedly slow areas. Where customers have complained.
Recently changed code. Where regressions might hide.

A typical team has 5-15 endpoints worth dedicated performance testing.

Tools

Common open-source options:

k6: modern, scripted in JavaScript, good developer experience
JMeter: mature, GUI-driven, broad protocol support
Locust: Python-based, code-driven
Gatling: Scala-based, high-concurrency

Cloud options (Loader.io, Loadster, OctoPerf) reduce setup cost.

For most teams, k6 is a reasonable starting point. The exact tool matters less than the discipline of using it.

Environment

Performance tests need realistic environments.

Production: the most realistic, but disruptive. Use it carefully (during low-traffic windows, with kill switches).
Production-equivalent staging: ideal, with the same infrastructure shape and capacity.
Smaller staging: useful for comparative testing (regression between releases), but absolute numbers don't translate.

A common mistake: load testing a smaller environment and assuming the results scale linearly. They often don't.

When to Run

Cadence varies by team:

Continuously in CI: lightweight smoke load tests on every change (does this introduce a regression?)
Pre-release: full load and stress tests before major releases
Periodically: soak and capacity tests monthly or quarterly
Pre-launch: specific spike tests before known traffic events

The fastest-feedback layer (smoke load tests in CI) catches obvious regressions. Slower tests catch subtler issues.

Interpreting Results

Raw response times alone are misleading. Look at:

Percentile distributions. p50, p95, p99. Averages hide tail latency.
Throughput vs. latency curves. As load increases, where does latency start climbing?
Error rates. Is the system returning errors before slowing down?
Resource utilization. What's saturating? CPU, memory, network, database connections?

A successful performance test produces graphs and conclusions, not just numbers.
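
For the throughput vs. latency curve, a simple heuristic for "where does latency start climbing" is the first load level at which p95 exceeds some multiple of its low-load value. This is a sketch under that assumption. The function name, the 2x factor, and the curve data are hypothetical, and other knee definitions exist.

```python
def latency_knee(curve, factor=2.0):
    """Return the first load level where p95 exceeds factor x the lowest-load p95.

    curve: (requests_per_second, p95_ms) pairs, in any order.
    Returns None if latency never climbs past the limit.
    """
    points = sorted(curve)
    if not points:
        raise ValueError("empty curve")
    limit = points[0][1] * factor
    for rps, p95_ms in points:
        if p95_ms > limit:
            return rps
    return None


# Hypothetical ramp results: latency is flat, then climbs sharply.
curve = [(100, 80), (200, 85), (400, 90), (800, 120), (1200, 310), (1600, 1400)]
print("latency starts climbing at", latency_knee(curve), "RPS")
```

Pairing the knee with resource utilization at the same load level usually points to what is saturating.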

Common Performance Test Mistakes

Unrealistic traffic patterns. All requests hitting one endpoint with one user ID. Doesn't model real traffic; doesn't catch real bugs.

Insufficient ramp time. Jumping straight to full load (outside a deliberate spike test) misses what happens during scaling.

Ignoring think time. Real users pause between requests. Removing think time produces synthetic patterns no real user creates.

Testing only happy paths. Real traffic includes errors, retries, slow responses. Test those too.

No baseline. Performance tests without historical comparison produce absolute numbers without context. Track results over time.

Testing infrastructure, not application. Hitting the load balancer hard reveals how the load balancer scales, not whether your app does.
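
The think-time mistake is easy to quantify. In a closed-loop test, where each virtual user waits for a response, pauses, and then sends the next request, the steady-state request rate is roughly N / (R + Z): users divided by response time plus think time. The sketch below applies that relationship. The function name and the numbers are hypothetical, and it assumes response time stays constant, which stops being true once the system is overloaded.

```python
def offered_rps(virtual_users, response_time_s, think_time_s):
    """Approximate steady-state request rate of a closed-loop test: N / (R + Z)."""
    cycle_seconds = response_time_s + think_time_s
    if cycle_seconds <= 0:
        raise ValueError("response time plus think time must be positive")
    return virtual_users / cycle_seconds


users = 500
with_think = offered_rps(users, response_time_s=0.2, think_time_s=4.8)
without_think = offered_rps(users, response_time_s=0.2, think_time_s=0.0)
print(f"{users} users with think time:    {with_think:.0f} RPS")
print(f"{users} users without think time: {without_think:.0f} RPS")
```

Under these assumptions, the same 500 virtual users generate about 25 times more requests once think time is removed, so "500 concurrent users" means very different loads depending on the script.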

Performance Bugs to Look For

N+1 queries that don't surface in single-request testing
Memory growth proportional to request count
Cache misses that cascade
Lock contention under concurrency
Slow paths in third-party dependencies
Garbage collection pauses

These typically surface in load and soak tests, not unit tests.

Reporting

Performance test reports should include:

Configuration: what was tested, with what setup
Results: response times, throughput, errors, resource use
Comparison: against prior runs and against targets
Findings: bottlenecks identified
Recommendations: what to address before production

A good report is short. Long reports get filed; short reports get read.
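
The comparison against prior runs can be automated so the report only has to highlight what changed. A minimal sketch, assuming each run is summarized as a dictionary of metrics where higher values are worse. The metric names and the 10% tolerance are hypothetical.

```python
def compare_to_baseline(current, baseline, tolerance=0.10):
    """Return {metric: relative_change} for metrics that regressed beyond tolerance.

    Assumes higher is worse for every metric (latency, error rate).
    Metrics missing from the current run, or with a non-positive baseline, are skipped.
    """
    regressions = {}
    for metric, base_value in baseline.items():
        value = current.get(metric)
        if value is None or base_value <= 0:
            continue
        change = (value - base_value) / base_value
        if change > tolerance:
            regressions[metric] = change
    return regressions


baseline = {"p95_ms": 400, "p99_ms": 900, "error_rate": 0.001}
current = {"p95_ms": 410, "p99_ms": 1080, "error_rate": 0.001}
for metric, change in compare_to_baseline(current, baseline).items():
    print(f"{metric} regressed by {change:.0%}")
```

A tolerance is needed because run-to-run noise is normal; the right value depends on how stable the test environment is.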

Key Takeaway

Performance testing isn't one activity. Load tests verify expected traffic, stress tests find breaking points, soak tests reveal slow accumulation, spike tests verify elasticity. Set specific numeric targets. Test the critical journeys, not everything. Use realistic traffic patterns. Run lightweight tests in CI, heavier tests pre-release, soak tests periodically. Report results with comparisons, not just numbers. Performance testing is most valuable when it produces actionable findings, not just data.

ShiftQuality