Error Handling
Build resilient pipelines that gracefully handle failures.
Retry
Retry operations that may fail transiently.
Basic Retry
type: retry
attempts: 3
child:
  ref: flaky-operation
Retries up to 3 times on any error.
Retry with Backoff
type: retry
attempts: 5
backoff: "100ms"
child:
  ref: rate-limited-api
Waits backoff duration between attempts, doubling each time (exponential backoff).
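To make the doubling concrete, here is a minimal sketch of the wait schedule that `attempts: 5` with `backoff: "100ms"` would produce. It assumes plain doubling with no jitter and no upper cap, and that nothing is waited after the final attempt; `backoffSchedule` is a hypothetical helper for illustration, not part of the library.

```go
package main

import (
	"fmt"
	"time"
)

// backoffSchedule returns the wait before each retry, assuming the delay
// starts at initial and doubles after every failed attempt. With n attempts
// there are n-1 waits, because nothing follows the final attempt.
func backoffSchedule(initial time.Duration, attempts int) []time.Duration {
	var waits []time.Duration
	delay := initial
	for i := 1; i < attempts; i++ {
		waits = append(waits, delay)
		delay *= 2
	}
	return waits
}

func main() {
	// Mirrors the example above: attempts: 5, backoff: "100ms".
	fmt.Println(backoffSchedule(100*time.Millisecond, 5)) // [100ms 200ms 400ms 800ms]
}
```

Under these assumptions, the five attempts spend about 1.5 seconds waiting in total, which is worth keeping in mind when a retry sits inside a timeout.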
When to Use
- Network calls that may timeout
- Rate-limited APIs that return temporary errors
- Database operations during high load
- Message queue operations
Configuration
| Field | Default | Description |
|---|---|---|
| attempts | 3 | Maximum retry attempts |
| backoff | (none) | Initial backoff duration |
Fallback
Provide alternative handling when the primary child fails.
Basic Fallback
type: fallback
children:
  - ref: primary-service
  - ref: backup-service
Tries primary first; if it errors, tries backup.
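The try-in-order behaviour can be sketched in a few lines of Go. This is an illustration of the pattern only: the `step` type, `fallback`, and the two sample children are hypothetical names, not the library's API, and joining the errors when every child fails is an assumption about what a caller might want.

```go
package main

import (
	"errors"
	"fmt"
)

// step is a hypothetical stand-in for a pipeline child.
type step func(input string) (string, error)

// fallback tries each child in order and returns the first success.
// If every child fails, it returns all of the errors joined together.
func fallback(children ...step) step {
	return func(input string) (string, error) {
		var errs []error
		for _, child := range children {
			out, err := child(input)
			if err == nil {
				return out, nil
			}
			errs = append(errs, err)
		}
		return "", errors.Join(errs...)
	}
}

// Sample children: a primary that always fails and a backup that succeeds.
func primaryDown(string) (string, error) { return "", errors.New("primary down") }
func backupOK(in string) (string, error) { return "backup:" + in, nil }

func main() {
	out, err := fallback(primaryDown, backupOK)("req")
	fmt.Println(out, err) // backup:req <nil>
}
```

Because `fallback` itself returns a `step`, one fallback can be passed as a child of another, which is the shape a cascading configuration describes.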
Cascading Fallbacks
type: fallback
children:
  - ref: primary-cache
  - type: fallback
    children:
      - ref: secondary-cache
      - ref: database
When to Use
- Service redundancy
- Cache miss handling
- Feature degradation
- Default value provision
Timeout
Enforce time limits on operations.
Basic Timeout
type: timeout
duration: "5s"
child:
  ref: slow-operation
Cancels operation if not complete within duration.
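In Go, this kind of cancellation is usually expressed with a context deadline. The sketch below assumes the wrapped operation watches its context and stops when it is cancelled; `withTimeout`, `slowOperation`, and `hitDeadline` are hypothetical names used only for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// withTimeout runs op under a deadline. op is expected to return promptly
// once ctx is done; the deadline cannot interrupt code that ignores ctx.
func withTimeout(ctx context.Context, d time.Duration, op func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, d)
	defer cancel()
	return op(ctx)
}

// slowOperation is a hypothetical child that needs 200ms unless cancelled.
func slowOperation(ctx context.Context) error {
	select {
	case <-time.After(200 * time.Millisecond):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// hitDeadline reports whether slowOperation was cut off by the given limit.
func hitDeadline(limit time.Duration) bool {
	err := withTimeout(context.Background(), limit, slowOperation)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(hitDeadline(20 * time.Millisecond)) // true
	fmt.Println(hitDeadline(2 * time.Second))       // false
}
```

The practical consequence is that a timeout only helps if the child honours cancellation; an operation that never checks its context keeps running after the deadline has passed.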
Configuration
| Field | Default | Description |
|---|---|---|
| duration | 30s | Maximum execution time |
Duration Formats
duration: "100ms" # 100 milliseconds
duration: "5s" # 5 seconds
duration: "2m" # 2 minutes
duration: "1h30m" # 1.5 hours
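The formats above match the syntax accepted by Go's standard `time.ParseDuration`. Assuming the configuration follows that syntax (the examples suggest it, but this page does not state it), a quick way to check what a string means is to parse it directly. `mustParse` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// mustParse parses a duration string with Go's standard syntax and panics
// on malformed input, which keeps the example short.
func mustParse(s string) time.Duration {
	d, err := time.ParseDuration(s)
	if err != nil {
		panic(err)
	}
	return d
}

func main() {
	// Print each documented format as a number of seconds.
	for _, s := range []string{"100ms", "5s", "2m", "1h30m"} {
		fmt.Printf("%-6s = %v seconds\n", s, mustParse(s).Seconds())
	}
}
```

Note that in Go's syntax a bare number without a unit, such as `"30"`, is rejected, so the unit suffix is required.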
When to Use
- External API calls
- Long-running computations
- User-facing requests requiring responsiveness
- Preventing resource exhaustion
Circuit Breaker
Prevent cascading failures by stopping calls to failing services.
Basic Circuit Breaker
type: circuit-breaker
failure_threshold: 5
recovery_timeout: "60s"
child:
  ref: external-service
How It Works
- Closed - Normal operation, failures counted
- Open - After `failure_threshold` failures, requests fail immediately
- Half-Open - After `recovery_timeout`, one test request allowed
- Closed - If test succeeds, resume normal operation
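The state transitions above can be sketched as a small state machine. This is an illustration under stated assumptions, not the library's implementation: it counts consecutive failures and resets the count on success, it re-opens the breaker if the test request fails, and it is not safe for concurrent use. All names (`breaker`, `fakeClock`, `errOpen`, `errBoom`) are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	errOpen = errors.New("circuit open")
	errBoom = errors.New("boom") // stand-in for a downstream failure
)

// fakeClock lets the example move time forward without sleeping.
type fakeClock struct{ t time.Time }

func (c *fakeClock) now() time.Time          { return c.t }
func (c *fakeClock) advance(d time.Duration) { c.t = c.t.Add(d) }

// breaker is a single-goroutine sketch; a real one would need locking.
type breaker struct {
	threshold int           // plays the role of failure_threshold
	recovery  time.Duration // plays the role of recovery_timeout
	failures  int
	open      bool
	openedAt  time.Time
	now       func() time.Time
}

func newBreaker(threshold int, recovery time.Duration, now func() time.Time) *breaker {
	return &breaker{threshold: threshold, recovery: recovery, now: now}
}

func (b *breaker) call(op func() error) error {
	if b.open && b.now().Sub(b.openedAt) < b.recovery {
		return errOpen // Open: fail immediately without calling op.
	}
	// Closed, or Half-Open with this call acting as the test request.
	if err := op(); err != nil {
		b.failures++
		if b.open || b.failures >= b.threshold {
			b.open, b.openedAt = true, b.now() // open, or re-open after a failed test
		}
		return err
	}
	b.open, b.failures = false, 0 // success closes the breaker
	return nil
}

func main() {
	clock := &fakeClock{}
	b := newBreaker(2, time.Minute, clock.now)
	failing := func() error { return errBoom }
	healthy := func() error { return nil }

	fmt.Println(b.call(failing)) // boom
	fmt.Println(b.call(failing)) // boom (threshold reached, breaker opens)
	fmt.Println(b.call(healthy)) // circuit open
	clock.advance(time.Minute)
	fmt.Println(b.call(healthy)) // <nil> (test request succeeded, breaker closes)
}
```

The third call never reaches the service at all, which is the point of the pattern: a failing dependency stops receiving traffic until the recovery window has elapsed.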
Configuration
| Field | Default | Description |
|---|---|---|
| failure_threshold | 5 | Failures before opening |
| recovery_timeout | 60s | Time before testing recovery |
When to Use
- Calling external services
- Protecting against cascade failures
- Allowing degraded operation
- Preventing resource exhaustion from retrying failing calls
Rate Limiting
Control request throughput.
Basic Rate Limit
type: rate-limit
requests_per_second: 100.0
burst_size: 10
child:
  ref: rate-sensitive-api
Configuration
| Field | Default | Description |
|---|---|---|
| requests_per_second | 10.0 | Sustained rate limit |
| burst_size | 1 | Allowed burst above limit |
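A common way to combine a sustained rate with a burst allowance is a token bucket. This page does not say which algorithm the connector uses, so treat the following as a sketch of the general idea only: the bucket refills at `rate` tokens per second, holds at most `burst` tokens, and each request spends one. The names `bucket`, `newBucket`, `allow`, and `epoch` are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// epoch is an arbitrary fixed start time so the example is deterministic.
var epoch = time.Unix(0, 0)

// bucket is a token bucket: it refills at rate tokens per second and holds
// at most burst tokens. Each allowed request spends one token.
type bucket struct {
	rate   float64
	burst  float64
	tokens float64
	last   time.Time
}

func newBucket(rate float64, burst int, start time.Time) *bucket {
	return &bucket{rate: rate, burst: float64(burst), tokens: float64(burst), last: start}
}

// allow reports whether a request arriving at now may proceed.
func (b *bucket) allow(now time.Time) bool {
	// Refill for the time elapsed since the last call, capped at burst.
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	b := newBucket(10, 2, epoch) // 10 requests per second, burst of 2
	fmt.Println(b.allow(epoch), b.allow(epoch), b.allow(epoch)) // true true false
	fmt.Println(b.allow(epoch.Add(200 * time.Millisecond)))     // true
}
```

This sketch rejects a request when no token is available. A limiter that queues requests instead would wait until the next token arrives, which is why pairing a rate limit with a timeout can be useful.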
When to Use
- Calling APIs with rate limits
- Protecting downstream services
- Ensuring fair resource usage
- Preventing thundering herd
Combining Patterns
Resilient External Call
type: circuit-breaker
failure_threshold: 3
recovery_timeout: "30s"
child:
  type: timeout
  duration: "10s"
  child:
    type: retry
    attempts: 3
    backoff: "200ms"
    child:
      ref: external-api
Order matters:
- Circuit breaker - Fails fast if service is down
- Timeout - Limits total wait time
- Retry - Handles transient failures
Fallback with Resilience
type: fallback
children:
  - type: circuit-breaker
    failure_threshold: 5
    child:
      type: retry
      attempts: 2
      child:
        ref: primary-service
  - ref: backup-service
Primary has full resilience; backup is simpler.
Rate Limited with Timeout
type: timeout
duration: "30s"
child:
  type: rate-limit
  requests_per_second: 50.0
  burst_size: 5
  child:
    type: retry
    attempts: 3
    child:
      ref: rate-limited-api
Total request time is bounded by the outer timeout, even when the request is queued for rate limiting.
Error Types
Understand which errors trigger which behaviour:
| Connector | Triggers On |
|---|---|
| Retry | Any error returned |
| Fallback | Any error from primary |
| Timeout | Context deadline exceeded |
| Circuit Breaker | Any error (tracks count) |
Best Practices
1. Set Realistic Timeouts
# Too short - may fail healthy requests
duration: "100ms"
# Too long - poor user experience
duration: "5m"
# Right - based on actual service SLA
duration: "5s"
2. Tune Circuit Breaker Thresholds
# Too sensitive - opens on minor issues
failure_threshold: 1
# Too lenient - doesn't protect
failure_threshold: 100
# Balanced - opens after pattern emerges
failure_threshold: 5
recovery_timeout: "30s"
3. Use Backoff for Rate Limits
type: retry
attempts: 5
backoff: "1s" # Gives rate limit time to reset
child:
  ref: rate-limited-api
4. Don't Retry Non-Retryable Errors
Some errors shouldn't be retried. Design processors to return wrapped errors:
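// nonRetryableError wraps an error that should not be retried.
type nonRetryableError struct {
    error
}

// Unwrap exposes the wrapped error to errors.Is and errors.As.
func (e nonRetryableError) Unwrap() error { return e.error }

// In processor
if isInvalidInput(err) {
    return data, nonRetryableError{err}
}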
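A wrapper type like this is only useful if something checks for it. The Error Types table lists Retry as triggering on any error, so whether the built-in connector honours such a wrapper is not stated here; the sketch below only shows how calling code could detect it with the standard `errors.As`. `isRetryable`, `errInvalidInput`, and `errTransient` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// nonRetryableError wraps an error that should not be retried.
type nonRetryableError struct {
	error
}

func (e nonRetryableError) Unwrap() error { return e.error }

// Sample errors used by the example.
var (
	errInvalidInput = errors.New("invalid input")
	errTransient    = errors.New("connection reset")
)

// isRetryable reports whether err is worth retrying. It returns false for
// nil and whenever a nonRetryableError appears anywhere in the error chain.
func isRetryable(err error) bool {
	var nre nonRetryableError
	return err != nil && !errors.As(err, &nre)
}

func main() {
	// Further wrapping with %w does not hide the marker from errors.As.
	permanent := fmt.Errorf("validate: %w", nonRetryableError{errInvalidInput})

	fmt.Println(isRetryable(permanent))                // false
	fmt.Println(isRetryable(errTransient))             // true
	fmt.Println(errors.Is(permanent, errInvalidInput)) // true
}
```

Because `Unwrap` is implemented, the original cause stays reachable through `errors.Is`, so marking an error as non-retryable does not cost any diagnostic detail.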
5. Monitor Error Rates
Use Capitan events to track failures:
capitan.Handle(flume.SchemaBuildFailed, logFailure)
// Also monitor your processor errors
Next Steps
- Testing - Test error scenarios
- Connector Types Reference - All options