Bug
The webserver container gets OOM-killed after running for several days. The root cause is the OpenTelemetry `BatchSpanProcessor` accumulating spans in memory when Logfire cloud exports fail or are slow.
Environment
- Aegis Stack v0.6.7 with observability component
- DigitalOcean droplet, 2GB RAM
- Webserver container memory limit: 512MB (default)
- Logfire enabled with `logfire.instrument_fastapi()`
Symptoms
- Webserver starts at ~300MB, grows steadily over days
- Logfire export retries appear in logs: `Currently retrying 1 failed export(s) (899 bytes)`
- After ~7 days, container hits 512MB memory limit → kernel OOM kill
- Docker sees container still "running" (process restarted inside cgroup but hung on stale connections)
- `restart: unless-stopped` doesn't trigger because the container never exited
- Traefik drops the backend from routing → all requests return 404
Root Cause
The OTEL `BatchSpanProcessor` defaults are too permissive for constrained environments:
- `OTEL_BSP_MAX_QUEUE_SIZE` = 2048 spans buffered in memory
- No export timeout — failed batches retry indefinitely
- Failed export batches stay in memory with no upper bound on retry buffer
When Logfire cloud is slow or unreachable (rate limiting, network blip), the retry buffer grows without limit until OOM.
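For context, the queue in the first bullet is a constructor parameter of the SDK's span processor. The sketch below is illustrative only: Logfire wires up the `BatchSpanProcessor` internally, so in this stack the `OTEL_BSP_*` env vars (see the fix below) are the practical way to change it. `ConsoleSpanExporter` is just a stand-in exporter, and the values shown are the SDK defaults that apply when no env vars are set.

```python
# Illustration of where the buffered spans live (OpenTelemetry Python SDK).
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

processor = BatchSpanProcessor(
    ConsoleSpanExporter(),       # stand-in exporter for this sketch
    max_queue_size=2048,         # OTEL_BSP_MAX_QUEUE_SIZE default: spans held in memory
    max_export_batch_size=512,   # OTEL_BSP_MAX_EXPORT_BATCH_SIZE default
    schedule_delay_millis=5000,  # OTEL_BSP_SCHEDULE_DELAY default
)
```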
`dmesg` confirmation:

```
oom-kill: Memory cgroup out of memory: Killed process 2771661 (python3)
total-vm:1039940kB, anon-rss:366944kB
```
Solution
Three changes needed:
1. Set OTEL buffer limits in config
Add these to `app/core/config.py` in the observability section so they're configurable via `.env`:
```python
# OTEL span buffer limits (prevent unbounded memory growth on export failure)
OTEL_BSP_MAX_QUEUE_SIZE: int = 1024
OTEL_BSP_MAX_EXPORT_BATCH_SIZE: int = 256
OTEL_BSP_EXPORT_TIMEOUT: int = 10000  # ms
OTEL_BSP_SCHEDULE_DELAY: int = 5000  # ms
```
These are standard OTEL env vars; `BatchSpanProcessor` reads them from the environment automatically. No middleware changes needed since Pydantic settings bind to env vars with matching names.
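Because the fields bind to env vars, operators can also tighten them per deployment in `.env` without touching code. A sketch (values simply mirror the new defaults above; the exact wiring depends on how the compose setup passes `.env` into the container environment):

```
# OTEL span buffer limits (per-environment override)
OTEL_BSP_MAX_QUEUE_SIZE=1024
OTEL_BSP_MAX_EXPORT_BATCH_SIZE=256
OTEL_BSP_EXPORT_TIMEOUT=10000
OTEL_BSP_SCHEDULE_DELAY=5000
```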
2. Right-size default container memory limits
The template defaults over-provision memory (4.5GB total limits on a 2GB box):
| Container | Current Default | Actual Usage | Recommended |
|-----------|-----------------|--------------|-------------|
| webserver | 512MB | ~300MB | 768MB |
| worker-* | 256MB-1GB | ~67MB | 256MB |
| redis | 512MB | ~5MB | 128MB |
| traefik | 256MB | ~20MB | 128MB |
Reservations should also be dropped from the defaults — they compound the over-provisioning.
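For illustration, the recommended numbers from the table translate to something like the following compose fragment. This is a sketch: the service names and the `deploy.resources` form are assumptions about the template, so adjust to match how the stack actually declares its limits.

```yaml
# docker-compose sketch: right-sized memory limits for a 2GB droplet
services:
  webserver:
    deploy:
      resources:
        limits:
          memory: 768M
  redis:
    deploy:
      resources:
        limits:
          memory: 128M
  traefik:
    deploy:
      resources:
        limits:
          memory: 128M
  # worker-* services: 256M each (names omitted here, see the table above)
```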
3. Document swap recommendation
For droplets ≤ 4GB RAM, recommend adding swap as a safety net:

```bash
fallocate -l 1G /swapfile && chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
```
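A quick check that the swap is actually active (not part of the fix itself):

```bash
swapon --show   # should list /swapfile at 1G
free -h         # the Swap row should show ~1.0Gi total
```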