Webserver OOM crash from unbounded OTEL span buffer on export failure #627

@lbedner

Description

Bug

The webserver container gets OOM-killed after running for several days. The root cause is the OpenTelemetry BatchSpanProcessor accumulating spans in memory when Logfire cloud exports fail or are slow.

Environment

  • Aegis Stack v0.6.7 with observability component
  • DigitalOcean droplet, 2GB RAM
  • Webserver container memory limit: 512MB (default)
  • Logfire enabled with logfire.instrument_fastapi()

Symptoms

  1. Webserver starts at ~300MB, grows steadily over days
  2. Logfire export retries appear in logs: Currently retrying 1 failed export(s) (899 bytes)
  3. After ~7 days, container hits 512MB memory limit → kernel OOM kill
  4. Docker still reports the container as "running" (the process restarted inside the cgroup but hung on stale connections)
  5. restart: unless-stopped never triggers because the container itself never exited
  6. Traefik drops the backend from routing → all requests return 404

Root Cause

The OTEL BatchSpanProcessor defaults are too permissive for constrained environments:

  • OTEL_BSP_MAX_QUEUE_SIZE = 2048 spans buffered in memory
  • No export timeout — failed batches retry indefinitely
  • Failed export batches stay in memory with no upper bound on retry buffer

When Logfire cloud is slow or unreachable (rate limiting, network blip), the retry buffer grows without limit until OOM.
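Conceptually, the fix is to cap the retry buffer so memory use plateaus instead of growing forever. A toy stdlib sketch (hypothetical names, not Logfire internals) contrasting unbounded and bounded buffering:

```python
from collections import deque

# Toy illustration: an unbounded retry list grows with every failed
# export, while a deque(maxlen=...) evicts the oldest batch instead.
unbounded_retries = []             # what an unlimited retry buffer does
bounded_retries = deque(maxlen=4)  # what a capped buffer does

for batch_id in range(10):         # simulate 10 consecutive failed exports
    unbounded_retries.append(batch_id)
    bounded_retries.append(batch_id)

print(len(unbounded_retries))  # 10 -- keeps growing until OOM
print(len(bounded_retries))    # 4  -- capped, oldest batches dropped
```

Dropping old spans loses telemetry during an outage, but that is the right trade-off for a 512MB container: observability data should never take down the service it observes.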

dmesg confirmation

oom-kill: Memory cgroup out of memory: Killed process 2771661 (python3)
total-vm:1039940kB, anon-rss:366944kB

Solution

Three changes needed:

1. Set OTEL buffer limits in config

Add these to app/core/config.py in the observability section so they're configurable via .env:

# OTEL span buffer limits (prevent unbounded memory growth on export failure)
OTEL_BSP_MAX_QUEUE_SIZE: int = 1024
OTEL_BSP_MAX_EXPORT_BATCH_SIZE: int = 256
OTEL_BSP_EXPORT_TIMEOUT: int = 10000  # ms
OTEL_BSP_SCHEDULE_DELAY: int = 5000   # ms

These are standard OTEL env vars — BatchSpanProcessor reads them from the environment automatically. No middleware changes needed since Pydantic settings bind to env vars with matching names.

2. Right-size default container memory limits

The template defaults over-provision memory (4.5GB total limits on a 2GB box):

Container    Current Default    Actual Usage    Recommended
webserver    512MB              ~300MB          768MB
worker-*     256MB-1GB          ~67MB           256MB
redis        512MB              ~5MB            128MB
traefik      256MB              ~20MB           128MB

Reservations should also be dropped from the defaults — they compound the over-provisioning.
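The recommended limits could be applied in the compose file roughly like this (a sketch; the service names and deploy-style resource syntax are assumed from the template, not confirmed by it):

```yaml
# docker-compose.yml fragment (illustrative)
services:
  webserver:
    deploy:
      resources:
        limits:
          memory: 768M
  redis:
    deploy:
      resources:
        limits:
          memory: 128M
  traefik:
    deploy:
      resources:
        limits:
          memory: 128M
```

Per the note above, no `reservations` block is set, since reservations compound the over-provisioning.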

3. Document swap recommendation

For droplets ≤ 4GB RAM, recommend adding swap as a safety net:

fallocate -l 1G /swapfile && chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
