Webserver OOM crash from unbounded OTEL span buffer on export failure #627

@lbedner

Description

Bug

The webserver container gets OOM-killed after running for several days. The root cause is the OpenTelemetry BatchSpanProcessor accumulating spans in memory when Logfire cloud exports fail or are slow.

Environment

  • Aegis Stack v0.6.7 with observability component
  • DigitalOcean droplet, 2GB RAM
  • Webserver container memory limit: 512MB (default)
  • Logfire enabled with logfire.instrument_fastapi()

Symptoms

  1. Webserver starts at ~300MB, grows steadily over days
  2. Logfire export retries appear in logs: Currently retrying 1 failed export(s) (899 bytes)
  3. After ~7 days, container hits 512MB memory limit → kernel OOM kill
  4. Docker still reports the container as "running" (the process restarted inside the cgroup but hung on stale connections)
  5. restart: unless-stopped never triggers because the container itself never exited
  6. Traefik drops the backend from routing → all requests return 404

Root Cause

The OTEL BatchSpanProcessor defaults are too permissive for constrained environments:

  • OTEL_BSP_MAX_QUEUE_SIZE = 2048 spans buffered in memory
  • No export timeout — failed batches retry indefinitely
  • Failed export batches stay in memory with no upper bound on retry buffer

When Logfire cloud is slow or unreachable (rate limiting, network blip), the retry buffer grows without limit until OOM.
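Conceptually, the fix is to cap the retry buffer so memory use plateaus instead of growing forever. A toy stdlib sketch (hypothetical names, not Logfire internals) contrasting unbounded and bounded buffering:

```python
from collections import deque

# Toy illustration: an unbounded retry list grows with every failed
# export, while a deque(maxlen=...) evicts the oldest batch instead.
unbounded_retries = []             # what an unlimited retry buffer does
bounded_retries = deque(maxlen=4)  # what a capped buffer does

for batch_id in range(10):         # simulate 10 consecutive failed exports
    unbounded_retries.append(batch_id)
    bounded_retries.append(batch_id)

print(len(unbounded_retries))  # 10 -- keeps growing until OOM
print(len(bounded_retries))    # 4  -- capped, oldest batches dropped
```

Dropping old spans loses telemetry during an outage, but that is the right trade-off for a 512MB container: observability data should never take down the service it observes.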

dmesg confirmation

oom-kill: Memory cgroup out of memory: Killed process 2771661 (python3)
total-vm:1039940kB, anon-rss:366944kB

Solution

Three changes needed:

1. Set OTEL buffer limits in config

Add these to app/core/config.py in the observability section so they're configurable via .env:

# OTEL span buffer limits (prevent unbounded memory growth on export failure)
OTEL_BSP_MAX_QUEUE_SIZE: int = 1024
OTEL_BSP_MAX_EXPORT_BATCH_SIZE: int = 256
OTEL_BSP_EXPORT_TIMEOUT: int = 10000  # ms
OTEL_BSP_SCHEDULE_DELAY: int = 5000   # ms

These are standard OTEL env vars — BatchSpanProcessor reads them from the environment automatically. No middleware changes needed since Pydantic settings bind to env vars with matching names.

2. Right-size default container memory limits

The template defaults over-provision memory (4.5GB total limits on a 2GB box):

Container    Current Default    Actual Usage    Recommended
webserver    512MB              ~300MB          768MB
worker-*     256MB-1GB          ~67MB           256MB
redis        512MB              ~5MB            128MB
traefik      256MB              ~20MB           128MB

Reservations should also be dropped from the defaults — they compound the over-provisioning.
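The recommended limits could be applied in the compose file roughly like this (a sketch; the service names and deploy-style resource syntax are assumed from the template, not confirmed by it):

```yaml
# docker-compose.yml fragment (illustrative)
services:
  webserver:
    deploy:
      resources:
        limits:
          memory: 768M
  redis:
    deploy:
      resources:
        limits:
          memory: 128M
  traefik:
    deploy:
      resources:
        limits:
          memory: 128M
```

Per the note above, no `reservations` block is set, since reservations compound the over-provisioning.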

3. Document swap recommendation

For droplets ≤ 4GB RAM, recommend adding swap as a safety net:

fallocate -l 1G /swapfile && chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
