Chaos Test
What is a chaos test?
The idea comes from chaos engineering, which Netflix started around 2010 with its Chaos Monkey tool. Their approach was to restart or shut down AWS instances and observe how their services were affected. After seeing the impact of each chaos test, they made improvements so that the services stay up even when chaos happens.
A chaos test follows the same concept as a QA test. The difference is that a QA test focuses on software, while a chaos test focuses on infrastructure.
What I've done in my case
To give clear context, this table lists the tools involved.
| Tools | Function | Description |
|---|---|---|
| Locust | Chaos Test | Load-generation tool scripted in Python |
| K6 | Chaos Test | Load-generation tool scripted in JavaScript |
| Grafana Dashboard | Grafana Stack | Dashboard to visualize data |
| Grafana Mimir | Grafana Stack | Metrics data storage |
I was assigned to oversee the Grafana Mimir cluster so that it would perform well once released to all engineers. The result: Grafana Dashboard can now query Mimir at 100 rps (with a 1-hour query range). Getting there, of course, took many improvements.
Methodology
Query (Locust)
- Inspect the Dashboard.
- List the HTTP POST requests that the Grafana Dashboard sends.
- Create a Python script that issues the same queries as the dashboard.
- Run the script with Locust.
- Check the error rate.
- Monitor the resource usage of the Dashboard, query-frontend, and querier components.
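The query steps above can be sketched in plain Python. This is not the Locust script used in the tests. It is a minimal standard-library stand-in for the same idea: replay a dashboard request from several concurrent users, then read off the request rate and the error rate. The names `run_load`, `LoadResult`, and `make_http_sender` are assumptions made for illustration.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class LoadResult:
    total: int         # requests attempted
    errors: int        # requests that failed or raised
    duration_s: float  # wall-clock time of the whole run

    @property
    def rps(self) -> float:
        return self.total / self.duration_s if self.duration_s > 0 else 0.0

    @property
    def error_rate(self) -> float:
        return self.errors / self.total if self.total else 0.0


def make_http_sender(url: str, body: dict, headers: Optional[dict] = None,
                     timeout_s: float = 10.0) -> Callable[[], bool]:
    """Build a callable that POSTs `body` as JSON and reports success."""
    data = json.dumps(body).encode("utf-8")
    all_headers = {"Content-Type": "application/json", **(headers or {})}

    def send() -> bool:
        req = urllib.request.Request(url, data=data, headers=all_headers, method="POST")
        # urlopen raises on 4xx/5xx and on connection errors; run_load counts those as errors.
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300

    return send


def run_load(send_query: Callable[[], bool], users: int, requests_per_user: int) -> LoadResult:
    """Call `send_query` from `users` threads, `requests_per_user` times each."""

    def user_loop() -> Tuple[int, int]:
        total = errors = 0
        for _ in range(requests_per_user):
            try:
                ok = send_query()
            except Exception:
                ok = False
            total += 1
            errors += 0 if ok else 1
        return total, errors

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=users) as pool:
        futures = [pool.submit(user_loop) for _ in range(users)]
        results = [f.result() for f in futures]
    duration = time.perf_counter() - start
    return LoadResult(sum(t for t, _ in results), sum(e for _, e in results), duration)
```

To use it, copy the real URL, headers, and JSON body of a panel request from the browser's network tab (in recent Grafana versions these are typically POST requests to `/api/ds/query`), pass them to `make_http_sender`, and hand the returned callable to `run_load`. Locust adds what this sketch lacks: ramp-up control, per-endpoint statistics, and a live web UI.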
Ingestion (K6)
- Create a JavaScript script that writes random metrics to Grafana Mimir.
- Check the error rate.
- Monitor the resource usage of the distributor and ingester components.
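The "random metrics" step can be illustrated with a short sketch. It is written in Python rather than the JavaScript used with K6, and it only builds the random samples; it does not send them. The function name `random_samples`, the metric name `chaos_test_metric`, and the `series_id` label are assumptions made for illustration.

```python
import random
import time
from typing import Dict, List, Optional


def random_samples(n_series: int, metric: str = "chaos_test_metric",
                   seed: Optional[int] = None) -> List[Dict]:
    """Build one random sample per series, all sharing the current timestamp."""
    rng = random.Random(seed)  # a fixed seed makes the values reproducible
    now_ms = int(time.time() * 1000)
    return [
        {
            # Each distinct label set is a separate time series.
            "labels": {"__name__": metric, "series_id": str(i)},
            "value": rng.uniform(0, 100),
            "timestamp_ms": now_ms,
        }
        for i in range(n_series)
    ]
```

Actually writing these samples to Mimir means encoding them as a Prometheus remote-write request (a snappy-compressed protobuf message) and posting it to Mimir's `/api/v1/push` endpoint, which needs libraries beyond the standard library. The write rate in the scenarios is then the number of such samples sent per second.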
Scenario & Result
This is a simplified version of the test scenarios.
First scenario
Scenario:
- Run 8 queries continuously with 30 concurrent/s for 5 minutes against Grafana Dashboard and Mimir. The result is 15 rps and a 50% error rate.
- Send random remote-write metrics to Grafana Mimir at 1000 writes/s.
Results:
- The Mimir Querier component collapsed and the Dashboard bottlenecked at 15 rps, even when the load was raised above 30 concurrent/s.
- The Mimir Ingester component held up fine, but will need improvement if the write rate increases.
Learning:
- Change the Dashboard database from SQLite to PostgreSQL; SQLite caps query throughput at around 15 rps.
- Increase the Mimir Querier component's CPU allocation from 500 MHz to 2 GHz and enable autoscaling.
Second scenario
Scenario:
- Run 8 queries continuously with 150 concurrent/s for 5 minutes against Grafana Dashboard and Mimir. The result is 100 rps and a 0% error rate.
- Send random remote-write metrics to Grafana Mimir at 2000 writes/s.
Results:
- The Mimir Querier component and the Dashboard, which collapsed in the first scenario, now handle 100 rps.
- The Mimir Ingester component performs better.
Learning:
- The Grafana Stack needs substantial resources to perform well.
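The concurrency and throughput figures from the two scenarios can be related with Little's law (concurrency = throughput × time per request cycle). The sketch below is a back-of-the-envelope estimate, not a measured result: it assumes "concurrent/s" means that many users in flight at once in a closed loop, and the function name `implied_cycle_time_s` is made up for illustration.

```python
def implied_cycle_time_s(concurrency: float, throughput_rps: float) -> float:
    """Little's law: average seconds per request cycle (response time plus any wait)."""
    if throughput_rps <= 0:
        raise ValueError("throughput_rps must be positive")
    return concurrency / throughput_rps


# First scenario: 30 concurrent users, 15 rps (with a 50% error rate).
first = implied_cycle_time_s(30, 15)
# Second scenario: 150 concurrent users, 100 rps (with a 0% error rate).
second = implied_cycle_time_s(150, 100)
print(first, second)  # 2.0 1.5
```

Under that assumption, each user completed a request cycle roughly every 2 seconds in the first scenario and every 1.5 seconds in the second, even though the second scenario carried five times the concurrency. The first figure also includes failed requests, so it should not be read as the latency of successful queries.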
Summary
With chaos tests, we can identify the capability of the Grafana Stack and learn how to make it perform well in any infrastructure environment.