# Overseer

## Why This Exists

**Nothing is more annoying than the shrug.**

Something's broken in production. You ask what happened. You get a shrug. You ask when it started. Another shrug. You ask where the logs are. Shrug. You ask how we can fix it. The biggest fucking shrug you've ever seen.

If something is wrong, I want to know **where**, **when**, **how**, **why**, and **how can we reconcile it**. Christ! Is that too much to ask?

**It shouldn't be so fucking hard to know what happened, when, where.**

You work with Datadog until management decides to migrate to New Relic. Or you're a solo dev who just wants to see if your background jobs are running without paying enterprise prices. Overseer solves this: centralized monitoring that you own, built into every Aegis Stack project from day one.

## What It Is

**Overseer is a read-only health monitoring dashboard** built into your Aegis Stack application. It provides real-time visibility into component and service health through a web UI and CLI commands.

The dashboard displays:

- **Component Cards**: Backend, Database, Worker, Scheduler health
- **Service Cards**: Auth, AI, Comms health (when included)
- **Header**: Overall health summary and theme toggle
- **Auto-refresh**: Polls health endpoint every 30 seconds

## Current Capabilities

- Component health monitoring (Backend, Database, Worker, Scheduler)
- Service health monitoring (Auth, AI, Comms)
- System metrics (CPU, memory, disk usage)
- Status hierarchy (Healthy, Warning, Unhealthy, Info)
- Web dashboard with auto-refresh (30-second polling)
- CLI health commands via your generated app

## How It Works

```mermaid
sequenceDiagram
    participant C as Components/Services
    participant R as Health Registry
    participant E as /health/ Endpoint
    participant D as Dashboard UI

    Note over C,R: Startup: Registration Phase
    C->>R: register_health_check("backend", check_func)
    C->>R: register_health_check("database", check_func)
    C->>R: register_service_health_check("auth", check_func)

    Note over E,D: Runtime: Monitoring Phase
    D->>E: GET /health/ (every 30s)
    E->>R: Run all registered checks
    R->>C: Execute health check functions
    C->>R: Return ComponentStatus
    R->>E: Aggregate into SystemStatus
    E->>D: Return health data
    D->>D: Render component/service cards
```

**The Flow:**

1. **Registration**: During app startup, components and services register their health check functions with the health registry
2. **Aggregation**: The `/health/` endpoint runs all registered checks and aggregates results into a hierarchical status tree
3. **Polling**: The dashboard polls the health endpoint every 30 seconds
4. **Display**: Component and service cards render with real-time status, metrics, and details

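
The registration and aggregation steps can be illustrated with a minimal sketch. The names `register_health_check`, `register_service_health_check`, `ComponentStatus`, and `SystemStatus` come from the diagram above, but their signatures, the dataclass fields, the `Status` enum, and `get_system_status` are assumptions made for illustration and may not match the actual Aegis Stack code.

```python
import asyncio
from dataclasses import dataclass, field
from enum import Enum
from typing import Awaitable, Callable


class Status(str, Enum):
    HEALTHY = "healthy"
    INFO = "info"
    WARNING = "warning"
    UNHEALTHY = "unhealthy"


@dataclass
class ComponentStatus:
    # Field names are assumptions for this sketch.
    name: str
    status: Status
    message: str = ""


@dataclass
class SystemStatus:
    components: dict[str, ComponentStatus] = field(default_factory=dict)
    services: dict[str, ComponentStatus] = field(default_factory=dict)


HealthCheck = Callable[[], Awaitable[ComponentStatus]]

# Hypothetical in-memory registries, filled during app startup.
_component_checks: dict[str, HealthCheck] = {}
_service_checks: dict[str, HealthCheck] = {}


def register_health_check(name: str, check: HealthCheck) -> None:
    _component_checks[name] = check


def register_service_health_check(name: str, check: HealthCheck) -> None:
    _service_checks[name] = check


async def _run_check(name: str, check: HealthCheck) -> ComponentStatus:
    # A check that raises is reported as unhealthy instead of failing the request.
    try:
        return await check()
    except Exception as exc:
        return ComponentStatus(name, Status.UNHEALTHY, f"check raised: {exc}")


async def _run_all(checks: dict[str, HealthCheck]) -> dict[str, ComponentStatus]:
    # gather() preserves argument order, so results line up with the dict keys.
    results = await asyncio.gather(*(_run_check(n, c) for n, c in checks.items()))
    return dict(zip(checks, results))


async def get_system_status() -> SystemStatus:
    # Roughly what a /health/ handler would do on each request.
    return SystemStatus(
        components=await _run_all(_component_checks),
        services=await _run_all(_service_checks),
    )


async def check_backend() -> ComponentStatus:
    return ComponentStatus("backend", Status.HEALTHY, "FastAPI responding")


register_health_check("backend", check_backend)
status = asyncio.run(get_system_status())
print(status.components["backend"].status.value)  # prints "healthy"
```

Catching exceptions inside `_run_check` is a design choice of this sketch: one broken check then shows up as an unhealthy card rather than taking down the whole health response.
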
## Component & Service Cards

Each card shows real-time health status, component-specific metrics, and configuration details. Click any card to open a detailed modal with diagnostics, performance data, and system information.

## Health Status Indicators

Each card displays a status indicator using the Overseer status hierarchy:

| Status | Color | Visual | Meaning |
|--------|-------|--------|---------|
| **✅ Healthy** | Green | Solid green border | Component/service fully operational |
| **ℹ️ Info** | Blue | Solid blue border | Informational status, not a problem |
| **⚠️ Warning** | Yellow | Orange border | Operational but with issues |
| **❌ Unhealthy** | Red | Red border | Component/service down or failing |

**Status Propagation**: Parent components inherit the worst child status:

- Any child **Unhealthy** → Parent **Unhealthy**
- Any child **Warning** (no unhealthy) → Parent **Warning**
- Any child **Info** (no unhealthy/warning) → Parent **Info**
- All children **Healthy** → Parent **Healthy**

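
The propagation rules above amount to taking the maximum over an ordered severity scale. The sketch below assumes a hypothetical `Node` tree and an integer ordering (Healthy < Info < Warning < Unhealthy) derived from those rules; it is not taken from the Overseer source.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Status(IntEnum):
    # Severity order implied by the propagation rules (an assumption).
    HEALTHY = 0
    INFO = 1
    WARNING = 2
    UNHEALTHY = 3


@dataclass
class Node:
    name: str
    status: Status = Status.HEALTHY
    children: list["Node"] = field(default_factory=list)


def effective_status(node: Node) -> Status:
    # Leaves report their own status; parents take the worst child status.
    if not node.children:
        return node.status
    return max(effective_status(child) for child in node.children)


system = Node(
    "system",
    children=[
        Node("backend", Status.HEALTHY),
        Node("database", Status.INFO),
        Node("worker", Status.WARNING),
    ],
)
print(effective_status(system).name)  # prints "WARNING"
```

Using an `IntEnum` keeps the rule to a single `max` call, and the same function handles nested parents because it recurses before comparing.
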
## Theme Support

The dashboard automatically adapts to light and dark themes:

- **Light Mode**: White cards, dark text, subtle shadows
- **Dark Mode**: Dark cards, light text, enhanced contrast
- **Toggle**: Click the theme icon in the header to switch

Images and status colors adjust automatically to maintain visibility in both themes.

## CLI Health Access

The same health data is accessible via CLI:

```bash
# View system health
your-app health

# Example output:
┌────────────────────────────────────────┐
│ System Health                          │
├────────────────────────────────────────┤
│ Components                             │
│  ✅ backend   - FastAPI healthy        │
│  ✅ database  - SQLite connected       │
│  ✅ worker    - arq processing         │
│  ✅ scheduler - 3 jobs scheduled       │
│                                        │
│ Services                               │
│  ✅ auth      - 42 users, HS256        │
│  ✅ ai        - Anthropic/Claude       │
└────────────────────────────────────────┘
```

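
To illustrate how health data could be turned into text like the example output, here is a small, hypothetical formatter. The input shape (name mapped to a status string and a message) and the function names are assumptions for this sketch; the real CLI may use a different data model and renderer.

```python
# Icons follow the status table in this document.
ICONS = {"healthy": "✅", "info": "ℹ️", "warning": "⚠️", "unhealthy": "❌"}

Entries = dict[str, tuple[str, str]]  # name -> (status, message); assumed shape


def render_section(title: str, entries: Entries) -> list[str]:
    lines = [title]
    # Pad names so the " - " separators line up within a section.
    width = max((len(name) for name in entries), default=0)
    for name, (status, message) in entries.items():
        icon = ICONS.get(status, "?")
        lines.append(f"  {icon} {name.ljust(width)} - {message}")
    return lines


def render_health(components: Entries, services: Entries) -> str:
    lines = render_section("Components", components)
    if services:  # services only appear when they are included in the project
        lines.append("")
        lines += render_section("Services", services)
    return "\n".join(lines)


print(
    render_health(
        {"backend": ("healthy", "FastAPI healthy"), "worker": ("warning", "queue backlog")},
        {"auth": ("healthy", "42 users, HS256")},
    )
)
```

Keeping the formatter as a pure function of the health data means the same data can feed both the dashboard and the terminal without either depending on the other.
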
## What's Coming

Overseer is evolving into a full operational control plane. Want to know where this is headed and why I'm so confident it'll work?

**[Read the full story →](story.md)** - How Overseer evolved from solving production problems at iHeartMedia (2022-2024) to becoming the built-in control plane for Aegis Stack.

## Next Steps

- **[The Overseer Story](story.md)** - Evolution from Streamlit to Aegis Stack, roadmap, and vision
- **[Integration Guide](integration.md)** - Add health checks to custom components/services