This document describes the architecture of async-service-chaos_kitty, which follows the same pattern as async-service-supervisor.
ChaosKitty is a chaos monkey system for testing service resilience. It uses a client-server architecture where workers connect to a central chaos server, and various chaos operations are unleashed on the connected workers.
The main chaos server that:
- Accepts connections from workers (victims)
- Manages chaos operations
- Coordinates between chaos operations and connected victims
- Tracks all connected victims via controllers
A worker process that:
- Connects to the chaos server
- Registers itself as a victim
- Exposes victim controller methods for chaos operations
- Runs the main application logic
Base client class for connecting to the chaos server. Extended by Worker.
Server-side controller that:
- Manages victim registration
- Provides access to victim proxies
- Handles status queries
- Tracks victim metadata (ID, process ID, connection)
Client-side controller that:
- Exposes methods that can be invoked by chaos operations
- Implements chaos actions: delay, raise_error, allocate_memory, cpu_spin, trigger_gc
- Logs chaos events
All chaos operations follow the same pattern:
register(chaos_controller): Called when a new victim connectsremove(chaos_controller): Called when a victim disconnectsstatus(): Returns current statusrun(): Starts the chaos operation loop
Causes random delays and blocking operations.
Parameters:
interval: How often to check for chaos opportunitiesprobability: Chance of causing chaos (0.0-1.0)min_delay: Minimum delay durationmax_delay: Maximum delay duration
Effect: Calls victim.delay(duration:) on random victims
Randomly terminates victim processes.
Parameters:
interval: How often to check for chaos opportunitiesprobability: Chance of causing chaos (0.0-1.0)signal: Signal to send to process
Effect: Sends signal to victim's process ID
Creates random memory spikes.
Parameters:
interval: How often to check for chaos opportunitiesprobability: Chance of causing chaos (0.0-1.0)min_size_mb: Minimum memory allocationmax_size_mb: Maximum memory allocationhold_duration: How long to hold the allocation
Effect: Calls victim.allocate_memory(size_mb:, hold_duration:)
Generates random CPU spikes.
Parameters:
interval: How often to check for chaos opportunitiesprobability: Chance of causing chaos (0.0-1.0)min_duration: Minimum CPU spin durationmax_duration: Maximum CPU spin duration
Effect: Calls victim.cpu_spin(duration:)
Raises random exceptions.
Parameters:
interval: How often to check for chaos opportunitiesprobability: Chance of causing chaos (0.0-1.0)messages: Array of possible error messages
Effect: Calls victim.raise_error(message:)
-
Worker Startup:
- Worker creates a connection to chaos server
- Worker creates VictimController and binds it
- Worker calls
chaos.register(victim_proxy, process_id:) - Server allocates ID and calls
chaos_operation.register()for each operation
-
Chaos Execution:
- Chaos operation runs in a loop at specified interval
- On each iteration, randomly selects a victim
- Checks probability to determine if chaos should occur
- Invokes remote method on victim via proxy
- Victim controller executes the chaos action
-
Worker Shutdown:
- Connection closes
- Server calls
chaos_operation.remove()for each operation - Controller is removed from tracking
- Uses Unix domain sockets for inter-process communication
- Default socket path:
chaos_kitty.ipc - Built on async-bus for RPC capabilities
- Supports multi-hop forwarding for proxy calls
- Built on Async framework for cooperative concurrency
- Each chaos operation runs in its own Async task
- Connection handling is concurrent
- Chaos operations execute independently
| Supervisor | ChaosKitty | Purpose |
|---|---|---|
| Monitor | Chaos Operation | Watches/affects workers |
| MemoryMonitor | Floop | Memory-related |
| ProcessMonitor | Scratch | Process-related |
| Worker | Worker | Connects to server |
| SupervisorController | ChaosController | Server-side RPC |
| WorkerController | VictimController | Client-side RPC |
| Monitors health | Causes chaos | Core function |
Both systems share the same architectural pattern but serve opposite purposes: one monitors and maintains health, the other intentionally causes problems to test resilience.