Chaos Test

Muhammad Iqbal, Chaos Engineering

What is a chaos test?

It comes from chaos engineering, a practice Netflix started around 2009. Their approach was to restart or shut down AWS instances and observe how their services were affected. After seeing the impact of each chaos test, they made improvements so that the services stayed up even when chaos happened.

A chaos test follows the same concept as a QA test. The difference is that a QA test focuses on software, while a chaos test focuses on infrastructure.

What I've done in my case

To give clear context, this table lists the tools I used.

Tools              Function       Description
Locust             Chaos Test     Chaos test tool using Python
K6                 Chaos Test     Chaos test tool using JavaScript
Grafana Dashboard  Grafana Stack  Dashboard to visualize data
Grafana Mimir      Grafana Stack  Metrics data storage

I was assigned to oversee a Grafana Mimir cluster so that it would be performant when released to all engineers. The result: Grafana Dashboard can query Mimir at 100 rps (with a 1-hour query range). Of course, getting there took many improvements.

Methodology

Query (Locust)

  1. Inspect the dashboard.
  2. List the HTTP POST requests that the Grafana Dashboard sends.
  3. Write a Python script that sends the same queries as the dashboard.
  4. Run the script with Locust.
  5. Check the error rate.
  6. Monitor the resource usage of the dashboard, query-frontend, and querier components.
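
Steps 2 to 4 above can be sketched roughly as follows. This is a minimal sketch, not the script used in these tests. It assumes the dashboard sends its panel queries through Grafana's `/api/ds/query` endpoint; the datasource UID, the PromQL expression, the step, and the `build_query_payload` helper are hypothetical placeholders to replace with values copied from the browser's network tab. The 1-hour window matches the query range mentioned earlier.

```python
import time

try:
    from locust import HttpUser, task, between
except ImportError:  # lets the payload helper be used without Locust installed
    HttpUser = None

# Hypothetical placeholders: copy the real values from the dashboard's requests.
DATASOURCE_UID = "mimir-uid"
PROMQL_EXPR = "sum(rate(http_requests_total[5m]))"
QUERY_RANGE_MS = 60 * 60 * 1000  # 1-hour query range


def build_query_payload(now_ms, expr=PROMQL_EXPR, range_ms=QUERY_RANGE_MS):
    """Build a request body shaped like a dashboard panel query."""
    return {
        "queries": [
            {
                "refId": "A",
                "datasource": {"uid": DATASOURCE_UID},
                "expr": expr,
                "intervalMs": 15000,    # assumed step, adjust to the panel
                "maxDataPoints": 1000,  # assumed, adjust to the panel
            }
        ],
        "from": str(now_ms - range_ms),  # epoch milliseconds as strings
        "to": str(now_ms),
    }


if HttpUser is not None:

    class DashboardQueryUser(HttpUser):
        """Simulated user that repeats the dashboard's query."""

        wait_time = between(1, 2)  # seconds each user waits between requests

        @task
        def query_panel(self):
            payload = build_query_payload(int(time.time() * 1000))
            # catch_response lets us mark non-200 answers as failures explicitly
            with self.client.post(
                "/api/ds/query",
                json=payload,
                name="panel query",
                catch_response=True,
            ) as response:
                if response.status_code != 200:
                    response.failure(f"status {response.status_code}")
```

Saved as `locustfile.py`, a sketch like this would be started with `locust -f locustfile.py --host <grafana-url>`. A real Grafana instance will most likely also need an authentication header on each request, which is left out here. The error rate in step 5 is then read from the failures Locust reports.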

Ingestion (K6)

  1. Write a JavaScript script that writes random metrics to Grafana Mimir.
  2. Check the error rate.
  3. Monitor the resource usage of the distributor and ingester components.
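
The K6 script itself is JavaScript and is not reproduced here. To illustrate only the idea behind step 1, this Python sketch generates a batch of random series of the kind a write test might produce before encoding and sending them to Mimir. The function name, the metric name, the `series_id` label, and the batch size are all hypothetical, and the encoding and sending steps are deliberately left out.

```python
import random
import time


def random_series(count, metric="chaos_test_metric", seed=None):
    """Return `count` samples, each with a unique label set and a random value."""
    rng = random.Random(seed)  # seeded generator makes a run repeatable
    now_ms = int(time.time() * 1000)
    return [
        {
            "labels": {"__name__": metric, "series_id": str(i)},
            "value": rng.random() * 100,  # random value in [0, 100)
            "timestamp_ms": now_ms,
        }
        for i in range(count)
    ]


batch = random_series(1000, seed=42)
print(len(batch), batch[0]["labels"])
```

Giving every sample a distinct `series_id` means each one counts as a separate active series, so the series count is the main knob for how much load the distributor and ingester see.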

Scenario & Result

This is a simplified version of the test scenarios.

First scenario

Scenario:

Results:

Learning:

Second scenario

Scenario:

Results:

Learning:

Summary

With chaos testing, we can identify the capabilities of the Grafana Stack and learn how to make it performant in any infrastructure environment.