Load vs Stress Testing

February 21, 2024

Overview

Load Testing and Stress Testing both fall under the umbrella of Performance Testing. Their primary objectives are:

Load Testing: tests the system under normal or expected load, up to the point where errors appear.
Stress Testing: increases the load on the system incrementally to find the point where errors appear, or the system's breaking point.

For more details, continue reading. The two overlap, and that overlap is discussed in the Conclusion.

Comparison

Load Testing

Reactively Scaled System

Direct Request / Response Services

Asynchronous Services with Persistent Intermediary

Stress Testing

Conclusion

Comparison

	Load Testing	Stress Testing
Objective	Determine system performance under normal or expected load conditions.	Determine the system's breaking point by incrementally increasing load.
Expected Outcome	Discover performance bottlenecks.	Discover points of failure and recovery capabilities.
Metrics to Measure	Latency and throughput before the error rate exceeds the SLA.	Latency, throughput, error rate, points of failure, and recovery time.
Cost Implications	Helps identify over-allocation of resources.	Can help minimize production downtime when an overload crashes the system.

Table template created by ChatGPT of OpenAI.

SLA: Service-Level Agreement. The term has a broad meaning, but in the context of latency, throughput, and error rate, it is the level of performance that is guaranteed (in writing or verbally) to customers or to other client services of this server.
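To make the metrics in the table concrete, here is a minimal sketch of how latency, throughput, and error rate might be summarized from one test run and compared against an SLA. The `Sample` type, the function names, and the SLA thresholds are all hypothetical, chosen only for illustration; the 95th percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    latency_ms: float  # time taken by one request
    ok: bool           # False if the request ended in an error


def summarize(samples, duration_s):
    """Return throughput (req/s), error rate, and p95 latency (ms) for one run."""
    if not samples or duration_s <= 0:
        raise ValueError("need at least one sample and a positive duration")
    latencies = sorted(s.latency_ms for s in samples)
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    rank = math.ceil(len(latencies) * 95 / 100)
    errors = sum(1 for s in samples if not s.ok)
    return {
        "throughput_rps": len(samples) / duration_s,
        "error_rate": errors / len(samples),
        "p95_latency_ms": latencies[rank - 1],
    }


def meets_sla(summary, max_error_rate, max_p95_ms):
    """Check a summary against hypothetical SLA thresholds."""
    return (summary["error_rate"] <= max_error_rate
            and summary["p95_latency_ms"] <= max_p95_ms)


# Synthetic data: 20 requests over 10 seconds, every tenth request fails.
samples = [Sample(latency_ms=10.0 * i, ok=(i % 10 != 0)) for i in range(1, 21)]
summary = summarize(samples, duration_s=10.0)
print(summary, meets_sla(summary, max_error_rate=0.1, max_p95_ms=200.0))
```

In a real test the samples would come from the load generator's logs rather than a synthetic list.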

Load Testing

The purpose of Load Testing is to determine either:

Whether the system works as expected at normal load.
The system's peak load before an error rate above the SLA is observed.
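The second goal can be sketched as a scan over test runs at increasing load. This is only an illustration: the function name, the result format, and the numbers are assumptions, and a real test would also look at latency rather than error rate alone.

```python
def peak_load_within_sla(results, max_error_rate):
    """Return the highest load that stays within the SLA error rate.

    results: (load_rps, error_rate) pairs, sorted by increasing load.
    Returns None if even the lowest load breaches the SLA.
    """
    peak = None
    for load_rps, error_rate in results:
        if error_rate > max_error_rate:
            break  # first breach: everything beyond this is past the peak
        peak = load_rps
    return peak


# Hypothetical measurements from five runs at doubling load.
results = [(100, 0.0), (200, 0.001), (400, 0.004), (800, 0.03), (1600, 0.2)]
peak = peak_load_within_sla(results, max_error_rate=0.01)
print(peak)
```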

Reactively Scaled System

Reactively Scaled Systems are systems that define a minimum and maximum number of replicas for their microservices. When idle, the system's services deploy the minimum number of replicas. However, this minimum is likely not enough to handle normal or peak loads.
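As an illustration of how a reactive scaler might pick a replica count between those bounds, the sketch below uses a proportional rule of the kind documented for the Kubernetes Horizontal Pod Autoscaler. The article does not name a specific autoscaler, so the choice of rule, the function name, and the metric values are assumptions.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    """Scale in proportion to how far the observed metric is from its target,
    then clamp the result to the configured minimum and maximum."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical service: 2 to 10 replicas, targeting 60% average CPU.
print(desired_replicas(2, 90, 60, min_replicas=2, max_replicas=10))   # moderate load
print(desired_replicas(2, 600, 60, min_replicas=2, max_replicas=10))  # spike, capped at max
print(desired_replicas(4, 6, 60, min_replicas=2, max_replicas=10))    # idle, held at min
```

The clamp at `max_replicas` matters for testing: once the maximum is reached, further load can only be absorbed by the existing replicas.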

Direct Request / Response Services

Testing reactively scaled systems can be a challenge, especially for Direct Request / Response Services. Even if retry mechanisms are implemented to give the downstream service (the service receiving the requests) a larger window of time to handle requests, these retry mechanisms exist at the application layer. Requests can be dropped at the network infrastructure layer, before they ever reach the application.
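For reference, an application-layer retry of the kind described above could look like the following sketch. It assumes the caller sees a failed request as a `ConnectionError`; the function name, attempt count, and delays are hypothetical. The `sleep` parameter is injectable so the example runs without actually waiting.

```python
import time


def call_with_retries(send, max_attempts=4, base_delay_s=0.1, sleep=time.sleep):
    """Call send(), retrying on ConnectionError with exponential backoff.

    The growing delay widens the window in which the downstream service
    can scale up, but it cannot guarantee that capacity arrives in time.
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay_s * 2 ** attempt)


# Simulated downstream service that rejects the first two requests.
attempts = {"count": 0}

def flaky_send():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise ConnectionError("downstream not ready")
    return "ok"

delays = []  # record the delays instead of sleeping
result = call_with_retries(flaky_send, sleep=delays.append)
print(result, delays)
```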

How this problem is mitigated depends on the traffic pattern of the system and on its deployment and scaling strategy. If traffic changes too sharply for the service's reactive scaling to keep up, then either:

Adjust the scaling parameters (to be more sensitive, or select parameters that are more relevant for your service).
Create a scaling schedule.
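A scaling schedule can be as simple as raising the minimum replica count ahead of known traffic peaks. The sketch below is one hypothetical way to express that; the time windows and replica counts are invented for illustration and do not come from any particular platform.

```python
def scheduled_min_replicas(hour, schedule, default_min):
    """Return the minimum replica count that applies at the given hour (0-23).

    schedule: (start_hour, end_hour, min_replicas) windows, end exclusive.
    """
    for start_hour, end_hour, min_replicas in schedule:
        if start_hour <= hour < end_hour:
            return min_replicas
    return default_min


# Hypothetical peaks: a morning spike and a longer evening spike.
schedule = [(8, 10, 6), (17, 20, 8)]
for hour in (3, 9, 18):
    print(hour, scheduled_min_replicas(hour, schedule, default_min=2))
```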

Asynchronous Services with Persistent Intermediary

Testing Asynchronous Services with a Persistent Intermediary (e.g. a message broker or distributed storage) is less of a challenge than testing Direct Request / Response Services, because downstream services that consume from the Persistent Intermediary have time to scale up as messages accumulate. The accumulated backlog can still be a problem if SLAs need to be met. It can be mitigated with the same mechanisms as for Direct Request / Response Services: adjusted scaling parameters or a scaling schedule.
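The effect of messages accumulating while consumers scale up can be seen in a small simulation. This is a deliberately simplified model with constant arrival and processing rates; the function name and all numbers are assumptions.

```python
def simulate_backlog(arrival_rate, per_replica_rate, replicas_by_second):
    """Return the queue depth at the end of each second.

    arrival_rate: messages arriving per second.
    per_replica_rate: messages one consumer replica processes per second.
    replicas_by_second: number of consumer replicas running in each second.
    """
    backlog = 0
    depths = []
    for replicas in replicas_by_second:
        backlog = max(0, backlog + arrival_rate - replicas * per_replica_rate)
        depths.append(backlog)
    return depths


# Consumers scale from 1 to 4 replicas over a few seconds.
depths = simulate_backlog(arrival_rate=100, per_replica_rate=40,
                          replicas_by_second=[1, 1, 2, 3, 4, 4])
print(depths)  # [60, 120, 140, 120, 60, 0]
```

Nothing is lost in this model, but the peak depth and the time taken to drain it are what determine whether a processing-time SLA is met.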

Stress Testing

All the considerations above also apply to Stress Testing. Stress Testing takes a system to the point of failure and then observes its recovery. The exact definition of failure is open to interpretation: it could be the point where errors are observed, or the point where services begin to crash.
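The incremental ramp and the recovery measurement can be sketched as two small functions. Here `probe` stands in for running one test step and returning the observed error rate, and failure is defined as an error rate at or above a chosen threshold; both are assumptions, as are the simulated capacity and the health samples.

```python
def find_breaking_point(probe, start_rps, step_rps, max_rps, failure_error_rate):
    """Increase load step by step and return the first load at which the
    error rate reported by probe(load) reaches the failure threshold.
    Returns None if no failure is seen up to max_rps."""
    load = start_rps
    while load <= max_rps:
        if probe(load) >= failure_error_rate:
            return load
        load += step_rps
    return None


def recovery_time_s(health_samples, interval_s):
    """health_samples: health checks polled every interval_s seconds after
    the load is removed. Returns the seconds until the first healthy check."""
    for i, healthy in enumerate(health_samples):
        if healthy:
            return i * interval_s
    return None


# Simulated system that serves 500 req/s and rejects everything above that.
def simulated_probe(load_rps, capacity_rps=500):
    return max(0.0, (load_rps - capacity_rps) / load_rps)

breaking_point = find_breaking_point(simulated_probe, start_rps=100, step_rps=100,
                                     max_rps=2000, failure_error_rate=0.05)
recovery = recovery_time_s([False, False, True, True], interval_s=5)
print(breaking_point, recovery)
```

Changing `failure_error_rate` changes where the breaking point is reported, which reflects how much the result depends on the chosen definition of failure.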

Conclusion

Stress Testing typically takes the system one step beyond where Load Testing takes it. Exactly how far beyond is open to interpretation, so the two overlap. If you take the system up to the point where errors appear, the test could be categorized as Load Testing (because you are measuring the system's peak) or as Stress Testing (because you have pushed the system beyond what it can handle).

Diagrams to be added, as well as a deeper dive into scaling parameters.
