zoobzio December 11, 2025

Error Handling

Build resilient pipelines that gracefully handle failures.

Retry

Retry operations that may fail transiently.

Basic Retry

type: retry
attempts: 3
child:
  ref: flaky-operation

Retries up to 3 times on any error.

Retry with Backoff

type: retry
attempts: 5
backoff: "100ms"
child:
  ref: rate-limited-api

Waits the backoff duration before the first retry, then doubles the wait after each further failure (exponential backoff).
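The doubling schedule can be sketched in Go as below. This is an illustration rather than the library's implementation: backoffDelays is a hypothetical helper, and it assumes that attempts counts the first call as well as the retries, with no cap or jitter applied to the wait.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelays returns the wait before each retry for a given number of
// total attempts. The first retry waits initial and each later retry waits
// twice as long as the previous one. With N attempts there are N-1 waits.
func backoffDelays(initial time.Duration, attempts int) []time.Duration {
	if attempts < 2 {
		return nil
	}
	delays := make([]time.Duration, 0, attempts-1)
	d := initial
	for i := 1; i < attempts; i++ {
		delays = append(delays, d)
		d *= 2
	}
	return delays
}

func main() {
	// Mirrors the config above: attempts: 5, backoff: "100ms".
	fmt.Println(backoffDelays(100*time.Millisecond, 5)) // [100ms 200ms 400ms 800ms]
}
```

Under those assumptions the example config waits at most 1.5s in total across its four retries.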

When to Use

  • Network calls that may timeout
  • Rate-limited APIs that return temporary errors
  • Database operations during high load
  • Message queue operations

Configuration

| Field    | Default | Description              |
|----------|---------|--------------------------|
| attempts | 3       | Maximum retry attempts   |
| backoff  | (none)  | Initial backoff duration |

Fallback

Provide alternative handling when the primary fails.

Basic Fallback

type: fallback
children:
  - ref: primary-service
  - ref: backup-service

Tries primary first; if it errors, tries backup.

Cascading Fallbacks

type: fallback
children:
  - ref: primary-cache
  - type: fallback
    children:
      - ref: secondary-cache
      - ref: database
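
Because the second child is itself a fallback, this config tries primary-cache, then secondary-cache, then database. A minimal Go sketch of that ordering follows. The source type and the firstSuccess and failing helpers are hypothetical, and returning the joined errors when every child fails is an assumption about behaviour the text does not specify.

```go
package main

import (
	"errors"
	"fmt"
)

// source is a hypothetical stand-in for a referenced processor.
type source func(key string) (string, error)

// firstSuccess tries each source in order and returns the first result that
// does not error. If every source fails, it returns all the errors joined.
func firstSuccess(key string, sources ...source) (string, error) {
	var errs []error
	for _, s := range sources {
		v, err := s(key)
		if err == nil {
			return v, nil
		}
		errs = append(errs, err)
	}
	return "", errors.Join(errs...)
}

// failing builds a source that always errors, to simulate an outage.
func failing(name string) source {
	return func(string) (string, error) {
		return "", errors.New(name + " unavailable")
	}
}

func main() {
	database := func(key string) (string, error) { return "value-for-" + key, nil }
	v, err := firstSuccess("user:1", failing("primary-cache"), failing("secondary-cache"), database)
	fmt.Println(v, err) // value-for-user:1 <nil>
}
```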

When to Use

  • Service redundancy
  • Cache miss handling
  • Feature degradation
  • Default value provision

Timeout

Enforce time limits on operations.

Basic Timeout

type: timeout
duration: "5s"
child:
  ref: slow-operation

Cancels the operation if it does not complete within duration.
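
In Go this kind of limit is usually enforced with a context deadline. The sketch below uses only the standard library; withTimeout and slowOperation are hypothetical names, and it assumes the child watches its context, since cancellation cannot interrupt code that ignores it.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// withTimeout runs op with a context that is cancelled once limit elapses.
// op must watch ctx.Done() for the cancellation to take effect.
func withTimeout(limit time.Duration, op func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), limit)
	defer cancel()
	return op(ctx)
}

// slowOperation is a hypothetical child that needs work to finish.
func slowOperation(work time.Duration) func(context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(work):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	fmt.Println(withTimeout(50*time.Millisecond, slowOperation(time.Second)))  // context deadline exceeded
	fmt.Println(withTimeout(time.Second, slowOperation(10*time.Millisecond))) // <nil>
}
```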

Configuration

| Field    | Default | Description            |
|----------|---------|------------------------|
| duration | 30s     | Maximum execution time |

Duration Formats

duration: "100ms"   # 100 milliseconds
duration: "5s"      # 5 seconds
duration: "2m"      # 2 minutes
duration: "1h30m"   # 1.5 hours
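
These strings follow the syntax accepted by Go's time.ParseDuration. Whether the schema parser calls that function is an assumption, but the standard library is a quick way to check a value before putting it in a config. parseAll is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// parseAll parses each duration string with the standard library and
// returns the results in order, stopping at the first invalid value.
func parseAll(values ...string) ([]time.Duration, error) {
	out := make([]time.Duration, 0, len(values))
	for _, v := range values {
		d, err := time.ParseDuration(v)
		if err != nil {
			return nil, fmt.Errorf("invalid duration %q: %w", v, err)
		}
		out = append(out, d)
	}
	return out, nil
}

func main() {
	ds, err := parseAll("100ms", "5s", "2m", "1h30m")
	fmt.Println(ds, err) // [100ms 5s 2m0s 1h30m0s] <nil>

	// A value with a space and a spelled-out unit is rejected.
	_, err = parseAll("5 seconds")
	fmt.Println(err != nil) // true
}
```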

When to Use

  • External API calls
  • Long-running computations
  • User-facing requests requiring responsiveness
  • Preventing resource exhaustion

Circuit Breaker

Prevent cascading failures by stopping calls to failing services.

Basic Circuit Breaker

type: circuit-breaker
failure_threshold: 5
recovery_timeout: "60s"
child:
  ref: external-service

How It Works

  1. Closed - Normal operation, failures counted
  2. Open - After failure_threshold failures, requests fail immediately
  3. Half-Open - After recovery_timeout, one test request allowed
  4. Closed - If test succeeds, resume normal operation; if it fails, the circuit opens again
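
The state machine above can be sketched as a small Go type. This is a single-goroutine illustration, not the library's code. The breaker type, errOpen, and errBoom are hypothetical, time is passed in as an offset so the example is deterministic, and two behaviours are assumptions: a success resets the failure count, and a failed test request reopens the circuit.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	errOpen = errors.New("circuit open") // returned while failing fast
	errBoom = errors.New("boom")         // hypothetical downstream failure
)

type breaker struct {
	threshold int           // failure_threshold
	recovery  time.Duration // recovery_timeout
	failures  int
	openedAt  time.Duration // offset at which the circuit last opened
	state     string        // "closed", "open", or "half-open"
}

func newBreaker(threshold int, recovery time.Duration) *breaker {
	return &breaker{threshold: threshold, recovery: recovery, state: "closed"}
}

// call runs op unless the circuit is open. at is the current time as an
// offset from an arbitrary start.
func (b *breaker) call(at time.Duration, op func() error) error {
	if b.state == "open" {
		if at-b.openedAt < b.recovery {
			return errOpen // fail fast without calling op
		}
		b.state = "half-open" // recovery_timeout elapsed: allow one test call
	}
	err := op()
	if err == nil {
		b.state, b.failures = "closed", 0
		return nil
	}
	b.failures++
	if b.state == "half-open" || b.failures >= b.threshold {
		b.state, b.openedAt = "open", at
	}
	return err
}

func main() {
	b := newBreaker(2, time.Minute)
	fail := func() error { return errBoom }
	ok := func() error { return nil }

	b.call(0, fail)
	b.call(0, fail)
	fmt.Println(b.state)                   // open
	fmt.Println(b.call(time.Second, ok))   // circuit open
	fmt.Println(b.call(2*time.Minute, ok)) // <nil>
	fmt.Println(b.state)                   // closed
}
```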

Configuration

| Field             | Default | Description                  |
|-------------------|---------|------------------------------|
| failure_threshold | 5       | Failures before opening      |
| recovery_timeout  | 60s     | Time before testing recovery |

When to Use

  • Calling external services
  • Protecting against cascade failures
  • Allowing degraded operation
  • Preventing resource exhaustion from retrying failing calls

Rate Limiting

Control request throughput.

Basic Rate Limit

type: rate-limit
requests_per_second: 100.0
burst_size: 10
child:
  ref: rate-sensitive-api

Configuration

| Field               | Default | Description               |
|---------------------|---------|---------------------------|
| requests_per_second | 10.0    | Sustained rate limit      |
| burst_size          | 1       | Allowed burst above limit |
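
One common way to implement a sustained rate with a burst allowance is a token bucket. The Go sketch below shows only the accounting and is an assumption about how the two fields interact: the hypothetical bucket type holds at most burst_size tokens and refills at requests_per_second. It reports allow or deny, whereas a real limiter may instead make the request wait.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// bucket is a hypothetical token bucket. It holds at most burst tokens and
// refills at rate tokens per second. Each allowed request spends one token.
type bucket struct {
	rate   float64       // requests_per_second
	burst  float64       // burst_size
	tokens float64       // tokens currently available
	last   time.Duration // offset of the previous call
}

func newBucket(rate float64, burst int) *bucket {
	return &bucket{rate: rate, burst: float64(burst), tokens: float64(burst)}
}

// allow reports whether a request arriving at offset at may proceed.
func (b *bucket) allow(at time.Duration) bool {
	elapsed := (at - b.last).Seconds()
	b.tokens = math.Min(b.burst, b.tokens+elapsed*b.rate)
	b.last = at
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	b := newBucket(100, 10) // requests_per_second: 100.0, burst_size: 10
	allowed := 0
	for i := 0; i < 15; i++ { // 15 requests arriving at the same instant
		if b.allow(0) {
			allowed++
		}
	}
	fmt.Println(allowed)                        // 10
	fmt.Println(b.allow(20 * time.Millisecond)) // true: tokens have refilled
}
```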

When to Use

  • Calling APIs with rate limits
  • Protecting downstream services
  • Ensuring fair resource usage
  • Preventing thundering herd

Combining Patterns

Resilient External Call

type: circuit-breaker
failure_threshold: 3
recovery_timeout: "30s"
child:
  type: timeout
  duration: "10s"
  child:
    type: retry
    attempts: 3
    backoff: "200ms"
    child:
      ref: external-api

Order matters:

  1. Circuit breaker - Fails fast if service is down
  2. Timeout - Limits total wait time across all retry attempts
  3. Retry - Handles transient failures
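
Because the timeout wraps the retry, its 10s budget covers every attempt plus the backoff waits, not each attempt separately. The arithmetic can be sketched in Go. worstCase is a hypothetical helper, and it assumes every attempt fails after the same perCall latency and that the backoff doubles between attempts.

```go
package main

import (
	"fmt"
	"time"
)

// worstCase estimates how long the timeout-wrapped retry can run when every
// attempt fails after perCall: attempts calls plus the doubling backoff waits
// between them, capped at timeout because the timeout wraps the whole retry.
func worstCase(timeout, perCall, backoff time.Duration, attempts int) time.Duration {
	total := time.Duration(0)
	wait := backoff
	for i := 0; i < attempts; i++ {
		total += perCall
		if i < attempts-1 { // no wait after the final attempt
			total += wait
			wait *= 2
		}
	}
	if total > timeout {
		return timeout
	}
	return total
}

func main() {
	// timeout 10s, retry attempts 3, backoff 200ms, as in the config above.
	fmt.Println(worstCase(10*time.Second, 2*time.Second, 200*time.Millisecond, 3)) // 6.6s
	fmt.Println(worstCase(10*time.Second, 4*time.Second, 200*time.Millisecond, 3)) // 10s
}
```

In the second case the attempts would need 12.6s, so the timeout cuts the third attempt short. If each attempt needs its own limit, the timeout has to sit inside the retry instead.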

Fallback with Resilience

type: fallback
children:
  - type: circuit-breaker
    failure_threshold: 5
    child:
      type: retry
      attempts: 2
      child:
        ref: primary-service
  - ref: backup-service

Primary has full resilience; backup is simpler.

Rate Limited with Timeout

type: timeout
duration: "30s"
child:
  type: rate-limit
  requests_per_second: 50.0
  burst_size: 5
  child:
    type: retry
    attempts: 3
    child:
      ref: rate-limited-api

Total request time is bounded, even when the request is queued for rate limiting.

Error Types

Understand which errors trigger which behaviour:

| Connector       | Triggers On               |
|-----------------|---------------------------|
| Retry           | Any error returned        |
| Fallback        | Any error from primary    |
| Timeout         | Context deadline exceeded |
| Circuit Breaker | Any error (tracks count)  |
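
When handling the final error, a timeout can be told apart from other failures with the standard library. This sketch assumes the timeout surfaces as Go's context.DeadlineExceeded, possibly wrapped. isTimeout and errWrappedTimeout are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errWrappedTimeout simulates a deadline error wrapped with extra context.
var errWrappedTimeout = fmt.Errorf("calling external-api: %w", context.DeadlineExceeded)

// isTimeout reports whether err, or any error it wraps, is a deadline expiry.
func isTimeout(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(isTimeout(errWrappedTimeout))     // true
	fmt.Println(isTimeout(errors.New("refused"))) // false
}
```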

Best Practices

1. Set Realistic Timeouts

# Too short - may fail healthy requests
duration: "100ms"

# Too long - poor user experience
duration: "5m"

# Right - based on actual service SLA
duration: "5s"

2. Tune Circuit Breaker Thresholds

# Too sensitive - opens on minor issues
failure_threshold: 1

# Too lenient - doesn't protect
failure_threshold: 100

# Balanced - opens after pattern emerges
failure_threshold: 5
recovery_timeout: "30s"

3. Use Backoff for Rate Limits

type: retry
attempts: 5
backoff: "1s"  # Gives rate limit time to reset
child:
  ref: rate-limited-api

4. Don't Retry Non-Retryable Errors

Some errors, such as invalid input, fail the same way on every attempt, so retrying them only adds latency. Design processors to return them wrapped in a distinct type:

type nonRetryableError struct {
    error
}

func (e nonRetryableError) Unwrap() error { return e.error }

// In processor
if isInvalidInput(err) {
    return data, nonRetryableError{err}
}
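
A caller can detect the wrapper with errors.As, as in the self-contained sketch below. shouldRetry and errInvalidInput are hypothetical, and whether the retry connector itself performs such a check is not stated here, so treat this as a pattern for your own code.

```go
package main

import (
	"errors"
	"fmt"
)

type nonRetryableError struct {
	error
}

func (e nonRetryableError) Unwrap() error { return e.error }

// errInvalidInput is a hypothetical permanent failure.
var errInvalidInput = errors.New("invalid input")

// shouldRetry reports whether err is worth retrying: it must be non-nil and
// must not have a nonRetryableError anywhere in its chain.
func shouldRetry(err error) bool {
	var nr nonRetryableError
	return err != nil && !errors.As(err, &nr)
}

func main() {
	err := fmt.Errorf("validate: %w", nonRetryableError{errInvalidInput})
	fmt.Println(shouldRetry(err))                   // false
	fmt.Println(shouldRetry(errors.New("timeout"))) // true
	// Unwrap keeps the original error reachable through the wrapper.
	fmt.Println(errors.Is(err, errInvalidInput)) // true
}
```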

5. Monitor Error Rates

Use Capitan events to track failures:

capitan.Handle(flume.SchemaBuildFailed, logFailure)
// Also monitor your processor errors

Next Steps