Error Handling

Build resilient pipelines that gracefully handle failures.

Retry

Retry operations that may fail transiently.

Basic Retry

type: retry
attempts: 3
child:
  ref: flaky-operation

Retries up to 3 times on any error.

Retry with Backoff

type: retry
attempts: 5
backoff: "100ms"
child:
  ref: rate-limited-api

Waits the `backoff` duration between attempts, doubling the wait after each failure (exponential backoff).
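
To make the doubling concrete, the sketch below computes the waits such a configuration would produce. It is an illustration only: `backoffSchedule` is a hypothetical helper, and it assumes that `attempts` counts every try including the first, that the first wait equals `backoff`, and that there is no cap or jitter. The actual connector may differ on any of these points.

```go
package main

import (
	"fmt"
	"time"
)

// backoffSchedule returns the wait before each retry, assuming the first
// wait equals the initial backoff and every later wait doubles.
// Hypothetical helper for illustration; not part of any library API.
func backoffSchedule(initial string, attempts int) ([]time.Duration, error) {
	wait, err := time.ParseDuration(initial)
	if err != nil {
		return nil, fmt.Errorf("parse backoff %q: %w", initial, err)
	}
	var waits []time.Duration
	// N attempts leave N-1 gaps between them.
	for i := 1; i < attempts; i++ {
		waits = append(waits, wait)
		wait *= 2
	}
	return waits, nil
}

func main() {
	waits, err := backoffSchedule("100ms", 5)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(waits) // [100ms 200ms 400ms 800ms]
}
```

Under these assumptions, `attempts: 5` with `backoff: "100ms"` spends at most 1.5 seconds waiting, on top of the time the attempts themselves take.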

When to Use

Network calls that may timeout
Rate-limited APIs that return temporary errors
Database operations during high load
Message queue operations

Configuration

Field	Default	Description
`attempts`	3	Maximum retry attempts
`backoff`	(none)	Initial backoff duration

Fallback

Provide an alternative when the primary operation fails.

Basic Fallback

type: fallback
children:
  - ref: primary-service
  - ref: backup-service

Tries the primary first; if it returns an error, tries the backup.
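
The try-in-order behaviour can be sketched in a few lines of Go. The `step` type and the `fallback`, `failing`, and `backup` functions are hypothetical names chosen for this example, and returning the last error when every child fails is an assumption, not documented behaviour.

```go
package main

import (
	"errors"
	"fmt"
)

// step is a hypothetical stand-in for a pipeline child.
type step func(input string) (string, error)

// fallback runs each child in order and returns the first success.
// If every child fails, it returns the error from the last child tried.
func fallback(children ...step) step {
	return func(input string) (string, error) {
		lastErr := errors.New("fallback: no children configured")
		for _, child := range children {
			out, err := child(input)
			if err == nil {
				return out, nil
			}
			lastErr = err
		}
		return "", lastErr
	}
}

// failing simulates an unavailable primary service.
func failing(input string) (string, error) {
	return "", errors.New("primary unavailable")
}

// backup simulates a healthy backup service.
func backup(input string) (string, error) {
	return "backup:" + input, nil
}

func main() {
	out, err := fallback(failing, backup)("req")
	fmt.Println(out, err) // backup:req <nil>
}
```

Because `fallback` returns another `step`, nesting one fallback inside another works the same way as nesting `type: fallback` nodes in the schema.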

Cascading Fallbacks

type: fallback
children:
  - ref: primary-cache
  - type: fallback
    children:
      - ref: secondary-cache
      - ref: database

Tries the primary cache, then the secondary cache, and queries the database only if both caches fail.

When to Use

Service redundancy
Cache miss handling
Feature degradation
Default value provision

Timeout

Enforce time limits on operations.

Basic Timeout

type: timeout
duration: "5s"
child:
  ref: slow-operation

Cancels the operation if it does not complete within `duration`.
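
In Go, this kind of cancellation is usually built on `context.WithTimeout`. The sketch below shows the general mechanism, not the connector's actual implementation; `withTimeout`, `slowOperation`, and `timedOut` are hypothetical names. Note that cancellation is cooperative: the operation must watch `ctx.Done()` for the deadline to take effect.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// withTimeout runs op with a context that is cancelled once limit elapses.
// Hypothetical helper; op must honor ctx for cancellation to take effect.
func withTimeout(limit time.Duration, op func(ctx context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), limit)
	defer cancel()
	return op(ctx)
}

// slowOperation simulates work that takes longer than the limit allows.
func slowOperation(ctx context.Context) error {
	select {
	case <-time.After(2 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// timedOut reports whether slowOperation is cut off by a 50ms limit.
func timedOut() bool {
	err := withTimeout(50*time.Millisecond, slowOperation)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("timed out:", timedOut()) // timed out: true
}
```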

Configuration

Field	Default	Description
`duration`	30s	Maximum execution time

Duration Formats

duration: "100ms"   # 100 milliseconds
duration: "5s"      # 5 seconds
duration: "2m"      # 2 minutes
duration: "1h30m"   # 1.5 hours
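
These strings match the syntax accepted by Go's standard `time.ParseDuration`. Assuming the schema uses that parser (the source does not say so explicitly), you can check a value before deploying it. `parseAll` is a hypothetical helper written for this example.

```go
package main

import (
	"fmt"
	"time"
)

// parseAll parses each duration string with Go's time.ParseDuration
// and stops at the first invalid value.
func parseAll(values ...string) ([]time.Duration, error) {
	out := make([]time.Duration, 0, len(values))
	for _, v := range values {
		d, err := time.ParseDuration(v)
		if err != nil {
			return nil, fmt.Errorf("invalid duration %q: %w", v, err)
		}
		out = append(out, d)
	}
	return out, nil
}

func main() {
	ds, err := parseAll("100ms", "5s", "2m", "1h30m")
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, d := range ds {
		fmt.Println(d) // 100ms, 5s, 2m0s, 1h30m0s
	}
}
```

If that assumption holds, units such as days (`"1d"`) are rejected, because `time.ParseDuration` stops at hours.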

When to Use

External API calls
Long-running computations
User-facing requests requiring responsiveness
Preventing resource exhaustion

Circuit Breaker

Prevent cascading failures by stopping calls to failing services.

Basic Circuit Breaker

type: circuit-breaker
failure_threshold: 5
recovery_timeout: "60s"
child:
  ref: external-service

How It Works

1. Closed - Normal operation; failures are counted
2. Open - After `failure_threshold` failures, requests fail immediately
3. Half-Open - After `recovery_timeout`, one test request is allowed
4. Closed - If the test request succeeds, normal operation resumes
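
The state transitions above can be sketched as a small state machine. This is a simplified, single-goroutine illustration rather than the connector's implementation: the `breaker` type and its fields are hypothetical, and it assumes that failures are counted consecutively (a success resets the count) and that a failed test request reopens the circuit. Neither detail is stated in the source.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errOpen = errors.New("circuit open")

// breaker is a minimal sketch of the closed / open / half-open cycle.
type breaker struct {
	threshold int
	recovery  time.Duration
	now       func() time.Time // injectable clock

	failures int
	open     bool
	openedAt time.Time
}

// state derives the current state from the failure record and the clock.
func (b *breaker) state() string {
	switch {
	case !b.open:
		return "closed"
	case b.now().Sub(b.openedAt) >= b.recovery:
		return "half-open"
	default:
		return "open"
	}
}

// call runs op unless the circuit is open, then updates the state.
func (b *breaker) call(op func() error) error {
	st := b.state()
	if st == "open" {
		return errOpen // fail fast without calling op
	}
	err := op()
	if err == nil {
		b.open, b.failures = false, 0 // success closes the circuit
		return nil
	}
	b.failures++
	if st == "half-open" || b.failures >= b.threshold {
		b.open, b.openedAt = true, b.now()
	}
	return err
}

// demo drives a breaker through one full cycle using a fake clock.
func demo() []string {
	clock := time.Unix(0, 0)
	b := &breaker{threshold: 2, recovery: time.Minute, now: func() time.Time { return clock }}
	fail := func() error { return errors.New("boom") }
	ok := func() error { return nil }

	seen := []string{b.state()}
	b.call(fail)
	b.call(fail)
	seen = append(seen, b.state())
	clock = clock.Add(time.Minute)
	seen = append(seen, b.state())
	b.call(ok)
	return append(seen, b.state())
}

func main() {
	fmt.Println(demo()) // [closed open half-open closed]
}
```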

Configuration

Field	Default	Description
`failure_threshold`	5	Failures before opening
`recovery_timeout`	60s	Time before testing recovery

When to Use

Calling external services
Protecting against cascade failures
Allowing degraded operation
Preventing resource exhaustion from retrying failing calls

Rate Limiting

Control request throughput.

Basic Rate Limit

type: rate-limit
requests_per_second: 100.0
burst_size: 10
child:
  ref: rate-sensitive-api

Configuration

Field	Default	Description
`requests_per_second`	10.0	Sustained rate limit
`burst_size`	1	Allowed burst above limit
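
One common way to combine a sustained rate with a burst allowance is a token bucket, sketched below. Treating `burst_size` as the bucket capacity is an assumption about how the two fields interact, and the `bucket` type is hypothetical. The sketch only reports whether a request may proceed; a real limiter might queue the request instead.

```go
package main

import (
	"fmt"
	"time"
)

// bucket is a token bucket: tokens refill at rate per second, up to burst.
type bucket struct {
	rate   float64 // tokens added per second (requests_per_second)
	burst  float64 // maximum stored tokens (burst_size)
	tokens float64
	last   time.Time
}

// newBucket starts with a full bucket so an initial burst is allowed.
func newBucket(rate float64, burst int, start time.Time) *bucket {
	return &bucket{rate: rate, burst: float64(burst), tokens: float64(burst), last: start}
}

// allow reports whether a request arriving at time now may proceed.
func (b *bucket) allow(now time.Time) bool {
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	b.last = now
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// demo sends 12 back-to-back requests, waits 20ms, then sends one more.
func demo() (immediate int, afterWait bool) {
	start := time.Unix(0, 0)
	b := newBucket(100, 10, start)
	for i := 0; i < 12; i++ {
		if b.allow(start) {
			immediate++
		}
	}
	return immediate, b.allow(start.Add(20 * time.Millisecond))
}

func main() {
	immediate, afterWait := demo()
	fmt.Println(immediate, afterWait) // 10 true
}
```

With `requests_per_second: 100.0` and `burst_size: 10`, this model lets 10 requests through at once, rejects the next two, and admits new requests again as tokens refill.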

When to Use

Calling APIs with rate limits
Protecting downstream services
Ensuring fair resource usage
Preventing thundering herd

Combining Patterns

Resilient External Call

type: circuit-breaker
failure_threshold: 3
recovery_timeout: "30s"
child:
  type: timeout
  duration: "10s"
  child:
    type: retry
    attempts: 3
    backoff: "200ms"
    child:
      ref: external-api

Order matters. From outermost to innermost:

1. Circuit breaker - Fails fast if the service is down
2. Timeout - Bounds the total time across all retry attempts
3. Retry - Handles transient failures
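
A quick calculation shows why the nesting order of timeout and retry matters. The sketch below compares a timeout wrapped around a retry (as in the example above) with the reverse nesting. `worstCase` is a hypothetical helper, and it assumes that `attempts` counts every try and that the backoff doubles after each failure.

```go
package main

import (
	"fmt"
	"time"
)

// worstCase estimates an upper bound on wall-clock time for a retry
// combined with a timeout, under the assumptions stated above.
func worstCase(timeout, backoff time.Duration, attempts int, timeoutOutside bool) time.Duration {
	if timeoutOutside {
		// One deadline covers every attempt and every backoff wait.
		return timeout
	}
	// Each attempt gets a fresh deadline, and backoff waits come on top.
	total := time.Duration(attempts) * timeout
	wait := backoff
	for i := 1; i < attempts; i++ {
		total += wait
		wait *= 2
	}
	return total
}

// demo compares both nestings for timeout=10s, backoff=200ms, attempts=3.
func demo() (outside, inside time.Duration) {
	outside = worstCase(10*time.Second, 200*time.Millisecond, 3, true)
	inside = worstCase(10*time.Second, 200*time.Millisecond, 3, false)
	return outside, inside
}

func main() {
	outside, inside := demo()
	fmt.Println("timeout around retry:", outside) // timeout around retry: 10s
	fmt.Println("retry around timeout:", inside)  // retry around timeout: 30.6s
}
```

Placing the timeout outside the retry caps the caller's wait at 10 seconds. Placing it inside gives each attempt its own 10 seconds, so the caller could wait roughly three times as long.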

Fallback with Resilience

type: fallback
children:
  - type: circuit-breaker
    failure_threshold: 5
    child:
      type: retry
      attempts: 2
      child:
        ref: primary-service
  - ref: backup-service

The primary is protected by a circuit breaker and retries; the backup is called directly.

Rate Limited with Timeout

type: timeout
duration: "30s"
child:
  type: rate-limit
  requests_per_second: 50.0
  burst_size: 5
  child:
    type: retry
    attempts: 3
    child:
      ref: rate-limited-api

The total request time stays bounded even when the request is queued by the rate limiter.

Error Types

Understand which errors trigger which behaviour:

Connector	Triggers On
Retry	Any error returned
Fallback	Any error from primary
Timeout	Context deadline exceeded
Circuit Breaker	Any error (tracks count)

Best Practices

1. Set Realistic Timeouts

# Too short - may fail healthy requests
duration: "100ms"

# Too long - poor user experience
duration: "5m"

# Right - based on actual service SLA
duration: "5s"

2. Tune Circuit Breaker Thresholds

# Too sensitive - opens on minor issues
failure_threshold: 1

# Too lenient - doesn't protect
failure_threshold: 100

# Balanced - opens after pattern emerges
failure_threshold: 5
recovery_timeout: "30s"

3. Use Backoff for Rate Limits

type: retry
attempts: 5
backoff: "1s"  # Gives rate limit time to reset
child:
  ref: rate-limited-api

4. Don't Retry Non-Retryable Errors

Some errors, such as invalid input, will fail again no matter how often they are retried. Design processors to wrap these in a distinct error type:

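// nonRetryableError marks an error that will not succeed on retry.
type nonRetryableError struct {
    error
}

// Unwrap exposes the underlying error to errors.Is and errors.As.
func (e nonRetryableError) Unwrap() error { return e.error }

// In processor
if isInvalidInput(err) {
    return data, nonRetryableError{err}
}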
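
The source does not say whether the built-in retry connector inspects this wrapper, so treat the following as a sketch of how calling code could honor it. The `retry` function here is a hypothetical loop written for illustration, not the library's connector. It uses `errors.As` so the marker is still found when the error has been wrapped again.

```go
package main

import (
	"errors"
	"fmt"
)

// nonRetryableError marks an error that will not succeed on retry.
type nonRetryableError struct{ error }

// Unwrap exposes the underlying error to errors.Is and errors.As.
func (e nonRetryableError) Unwrap() error { return e.error }

// retry runs op up to attempts times but stops early when the error
// chain contains a nonRetryableError. It returns the number of calls made.
func retry(attempts int, op func() error) (calls int, err error) {
	for calls < attempts {
		calls++
		err = op()
		if err == nil {
			return calls, nil
		}
		var fatal nonRetryableError
		if errors.As(err, &fatal) {
			return calls, err // retrying cannot help
		}
	}
	return calls, err
}

// demo counts calls for a transient error and for a non-retryable one.
func demo() (transientCalls, fatalCalls int) {
	transientCalls, _ = retry(3, func() error {
		return errors.New("connection reset")
	})
	fatalCalls, _ = retry(3, func() error {
		return fmt.Errorf("validate: %w", nonRetryableError{errors.New("invalid input")})
	})
	return transientCalls, fatalCalls
}

func main() {
	fmt.Println(demo()) // 3 1
}
```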

5. Monitor Error Rates

Use Capitan events to track failures:

capitan.Handle(flume.SchemaBuildFailed, logFailure)
// Also monitor your processor errors

Next Steps

Testing - Test error scenarios
Connector Types Reference - All options