Cloud-native log analytics platform. Ingests application events, stores them as partitioned Parquet, and answers operational questions (p95 latency, error rate, backlog) with SQL — the same approach used by managed log analytics services.
Status: Local-first MVP working. AWS deployment (S3, Athena, Glue, Lambda, Kinesis/Firehose, CloudWatch) is in progress. See roadmap. Inspired by the AWS Samples web-analytics-on-aws architecture. All code is original (see docs/REFERENCES.md).
- Generates synthetic application logs (services, endpoints, status codes, latencies).
- Persists them to a raw zone (JSONL) and a query-optimized processed zone (Hive-partitioned Parquet by event_date / service_name / status_code).
- Queries operational metrics with DuckDB (the local stand-in for Athena).
- Benchmarks partitioned vs unpartitioned scans to quantify partition pruning.
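
The first two steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's generator: the value pools, the field set beyond those named in this README, and the helper names `generate_event` and `partition_path` are all assumptions.

```python
import json
import random
from datetime import datetime, timedelta, timezone

# Hypothetical value pools; the real generator's services and endpoints may differ.
SERVICES = ["checkout", "search", "auth"]
ENDPOINTS = ["/api/items", "/api/login", "/api/cart"]
STATUS_CODES = [200, 200, 200, 404, 500]  # repeated 200s skew toward success


def generate_event(rng: random.Random, base: datetime) -> dict:
    """Build one synthetic log event from a seeded random source."""
    ts = base + timedelta(seconds=rng.randrange(86_400))  # somewhere within the day
    return {
        "timestamp": ts.isoformat(),
        "event_date": ts.date().isoformat(),
        "service_name": rng.choice(SERVICES),
        "endpoint": rng.choice(ENDPOINTS),
        "status_code": rng.choice(STATUS_CODES),
        "latency_ms": round(rng.lognormvariate(4.0, 0.6), 2),
    }


def partition_path(event: dict, root: str = "data/processed/partitioned") -> str:
    """Return the Hive-style directory (key=value segments) for an event."""
    return (
        f"{root}/event_date={event['event_date']}"
        f"/service_name={event['service_name']}"
        f"/status_code={event['status_code']}"
    )


rng = random.Random(42)  # fixed seed so runs are reproducible
base = datetime(2024, 1, 1, tzinfo=timezone.utc)
events = [generate_event(rng, base) for _ in range(5)]

for event in events:
    print(json.dumps(event))            # one JSON object per line (JSONL)
    print("  ->", partition_path(event))
```

Seeding the random source is what makes the generator reproducible: the same seed yields the same events, so benchmarks and tests can be rerun on identical data.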
Local-first, with a stack that maps 1:1 onto AWS — full diagrams in docs/ARCHITECTURE.md.
```mermaid
flowchart TD
    GEN[Synthetic Log Generator] --> RAW[(data/raw · JSONL)]
    RAW --> PARQ[(data/processed · Partitioned Parquet)]
    PARQ --> DUCK[DuckDB Query Layer]
    DUCK --> M[p95 latency · error rate · volume]
```
Partition pruning on a service + day filter, measured with scripts/run_benchmark.py:
- Data scanned: −92.7% (86.7 MB → 6.3 MB). This demonstrates the same partition-pruning mechanic Athena uses to reduce scanned data and cost. Exact savings in production depend on S3 layout, file sizes, compression, and query selectivity.
- Query results are identical across both layouts. Partitioning changes cost, not answers.
- Local wall-clock did not improve at this scale — DuckDB column projection and small-file overhead dominate at this volume. This is a documented design finding and the reason a compaction job is on the roadmap. The latency win is expected at S3 scale.
Full report: docs/BENCHMARKS.md.
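
To make the pruning mechanic concrete, the sketch below models it in plain Python: files whose key=value directories do not match the filter are never read. The file layout, sizes, and helper names are hypothetical, and a real engine also applies column projection and Parquet metadata, which this model ignores. The last line rechecks the reported percentage from the two figures above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str        # Hive-style path with key=value segments
    size_bytes: int


def partition_values(path: str) -> dict:
    """Parse key=value directory segments out of a Hive-style path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)


def bytes_scanned(files, **filters) -> int:
    """Sum the sizes of files whose partition values match every filter.

    Files in non-matching directories are skipped without being opened,
    which is the essence of partition pruning.
    """
    total = 0
    for f in files:
        values = partition_values(f.path)
        if all(values.get(key) == want for key, want in filters.items()):
            total += f.size_bytes
    return total


def scan_reduction_pct(before: float, after: float) -> float:
    """Percentage of bytes avoided relative to the full scan."""
    return 100.0 * (1.0 - after / before)


# Hypothetical layout and sizes, for illustration only.
files = [
    DataFile("event_date=2024-01-01/service_name=auth/status_code=200/part-0.parquet", 400),
    DataFile("event_date=2024-01-01/service_name=search/status_code=200/part-0.parquet", 300),
    DataFile("event_date=2024-01-02/service_name=auth/status_code=500/part-0.parquet", 200),
    DataFile("event_date=2024-01-02/service_name=auth/status_code=200/part-0.parquet", 100),
]

full = bytes_scanned(files)  # no filters: every file is read
pruned = bytes_scanned(files, event_date="2024-01-02", service_name="auth")
print(full, pruned, f"{scan_reduction_pct(full, pruned):.1f}%")

# Recheck the reported figure: 86.7 MB -> 6.3 MB
print(f"{scan_reduction_pct(86.7, 6.3):.1f}%")
```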
| Choice | Why |
|---|---|
| Parquet (columnar) | Compressed, column-pruned scans for analytical queries over logs. |
| Hive partitioning | Skip whole prefixes on event_date / service_name / status_code → fewer bytes scanned. |
| DuckDB | Local columnar engine with predicate and projection pushdown, so bytes scanned respond to partitioning much as they do under Athena's pay-per-scan pricing. |
| Pydantic | One typed event contract drives the schema, partitions, and tests. |
| AWS (next): S3 + Athena + Glue | Same columnar/partitioned model at scale, serverless, pay-per-scan. |
```bash
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
python scripts/generate_logs.py --count 100000   # -> data/raw/events.jsonl
python scripts/write_parquet.py                  # -> data/processed/{partitioned,unpartitioned}
python scripts/run_queries.py                    # operational metrics
python scripts/run_benchmark.py                  # -> docs/BENCHMARKS.md (real numbers)
pytest -q
```

Local (dev):

```bash
uvicorn signallake.api:app --reload
```

Docker:

```bash
docker compose up --build   # starts on http://localhost:8000
# or without compose:
docker build -t signallake .
docker run -p 8000:8000 signallake
```

The server starts on http://localhost:8000. Interactive docs at http://localhost:8000/docs.
| Method | Path | Description |
|---|---|---|
| GET | /health | Liveness check |
| POST | /ingest | Ingest a single LogEvent (JSON body) |
| POST | /ingest/batch | Ingest a list of LogEvent objects |
| GET | /query/p95-latency | p50/p95/p99 latency by service |
| GET | /query/error-rate | 5xx error-rate % by endpoint |
| GET | /query/slowest-endpoints | Top-10 slowest endpoints by p95 |
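
For readers unfamiliar with the metrics behind the query endpoints, the sketch below spells out their definitions in plain Python. It is only an illustration: the project computes these with DuckDB SQL, the function names here are hypothetical, and the nearest-rank percentile used below can differ slightly from an interpolating quantile function.

```python
import math
from collections import defaultdict


def percentile(values, pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_by_service(events) -> dict:
    """p50/p95/p99 latency per service_name."""
    grouped = defaultdict(list)
    for e in events:
        grouped[e["service_name"]].append(e["latency_ms"])
    return {
        service: {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}
        for service, latencies in grouped.items()
    }


def error_rate_by_endpoint(events) -> dict:
    """Share of 5xx responses per endpoint, as a percentage."""
    totals, errors = defaultdict(int), defaultdict(int)
    for e in events:
        totals[e["endpoint"]] += 1
        if 500 <= e["status_code"] <= 599:
            errors[e["endpoint"]] += 1
    return {endpoint: 100.0 * errors[endpoint] / totals[endpoint] for endpoint in totals}


# Hypothetical sample: 100 requests with latencies 1..100 ms, every tenth one a 500.
sample = [
    {
        "service_name": "auth",
        "endpoint": "/login",
        "status_code": 500 if i % 10 == 0 else 200,
        "latency_ms": float(i),
    }
    for i in range(1, 101)
]

print(latency_by_service(sample))
print(error_rate_by_endpoint(sample))
```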
Ingested events are appended to data/raw/events.jsonl. Query endpoints read from
data/processed/partitioned/ (Parquet) when present, falling back to the raw JSONL.
Override the data root with SIGNALLAKE_DATA_DIR=<path> uvicorn ....
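
The fallback rule can be sketched as a small resolver. This is an assumed shape rather than the project's code: the function name `resolve_data_source`, the default root of `data`, and the reading of "present" as "contains at least one .parquet file" are all assumptions.

```python
import os
import tempfile
from pathlib import Path


def resolve_data_source(env=None):
    """Return ("parquet", dir) if partitioned Parquet exists, else ("jsonl", file)."""
    env = os.environ if env is None else env
    root = Path(env.get("SIGNALLAKE_DATA_DIR", "data"))  # assumed default root
    partitioned = root / "processed" / "partitioned"
    # Assumption: "present" means the directory holds at least one Parquet file.
    if partitioned.is_dir() and any(partitioned.rglob("*.parquet")):
        return "parquet", partitioned
    return "jsonl", root / "raw" / "events.jsonl"


# Demonstrate both branches against a throwaway data root.
with tempfile.TemporaryDirectory() as tmp:
    env = {"SIGNALLAKE_DATA_DIR": tmp}
    before = resolve_data_source(env)[0]          # nothing written yet

    part_dir = Path(tmp) / "processed" / "partitioned" / "event_date=2024-01-01"
    part_dir.mkdir(parents=True)
    (part_dir / "part-0.parquet").touch()         # empty placeholder file
    after = resolve_data_source(env)[0]

print(before, after)
```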
Every request is logged as structured JSON (request_id, endpoint, status_code,
service_name, latency_ms) to stdout.
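
A log line with these fields might be produced as follows. This is a framework-free sketch, not the project's middleware: the `log_request` helper and the `signallake-api` service name are hypothetical.

```python
import json
import sys
import time
import uuid


def log_request(endpoint, status_code, service_name, started, stream=None):
    """Write one structured JSON log line and return the record that was written."""
    stream = sys.stdout if stream is None else stream
    record = {
        "request_id": str(uuid.uuid4()),
        "endpoint": endpoint,
        "status_code": status_code,
        "service_name": service_name,
        # Elapsed time since `started`, a time.perf_counter() reading.
        "latency_ms": round((time.perf_counter() - started) * 1000, 3),
    }
    stream.write(json.dumps(record) + "\n")
    return record


started = time.perf_counter()
# ... handle the request here ...
record = log_request("/health", 200, "signallake-api", started)
```

One JSON object per line keeps the output easy to parse for any downstream log collector.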
- Event schema (Pydantic) + seeded synthetic generator
- Raw JSONL zone + partitioned/unpartitioned Parquet writers
- DuckDB query layer (p50/p95/p99 latency, error rate, slowest endpoints, 5xx by region, hourly volume)
- Partition benchmark report (scan reduction %)
- PyTest suite
- FastAPI /ingest and /query endpoints (health, single and batch ingest, p95 latency, error rate, slowest endpoints)
- Structured JSON request logging (request_id, endpoint, status_code, latency_ms)
- Docker (Dockerfile + docker-compose.yml) + GitHub Actions CI
- Retry/backoff/jitter, dead-letter handling, failed-event replay
- In-process queue + batch worker for buffered ingestion
- Query cost estimator, backlog simulator
- AWS slice: S3 zones, Glue Catalog, Athena partition projection, Lambda, CloudWatch
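
The retry and dead-letter items on this roadmap are not described further here, so the following is only one possible shape: exponential backoff with full jitter, plus a list standing in for a dead-letter store. All names and default values are hypothetical.

```python
import random
import time


def backoff_delays(attempts, base=0.1, cap=5.0, rng=None):
    """Full-jitter delays: each one is uniform in [0, min(cap, base * 2**i)]."""
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]


def send_with_retry(event, send, dead_letters, max_attempts=4, sleep=time.sleep, rng=None):
    """Try `send(event)` up to max_attempts times; dead-letter the event if all fail."""
    delays = backoff_delays(max_attempts - 1, rng=rng)
    for attempt in range(max_attempts):
        try:
            return send(event)
        except Exception:  # broad on purpose for the sketch; narrow this in real code
            if attempt == max_attempts - 1:
                dead_letters.append(event)  # keep the event for later replay
                return None
            sleep(delays[attempt])


# Usage: a sender that fails twice, then succeeds.
calls = {"n": 0}


def flaky_send(event):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


def always_fail(event):
    raise ConnectionError("simulated permanent failure")


def no_sleep(seconds):
    """Skip real waiting in the demo."""


dead_letters = []
result = send_with_retry({"id": 1}, flaky_send, dead_letters, sleep=no_sleep)
failed = send_with_retry({"id": 2}, always_fail, dead_letters, sleep=no_sleep)
print(result, failed, dead_letters)
```

Jitter spreads retries from many clients over time, so a brief outage is less likely to be followed by a synchronized burst of retries.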
```text
src/signallake/            schema · generator · storage · query · api
scripts/                   generate_logs · write_parquet · run_queries · run_benchmark
sql/                       reference DuckDB queries
docs/                      ARCHITECTURE · DESIGN_DECISIONS · BENCHMARKS · REFERENCES
tests/                     schema · generator · storage · api
Dockerfile                 production image (python:3.12-slim + uvicorn)
docker-compose.yml         local stack with named data volume
.github/workflows/ci.yml   GitHub Actions: install + pytest on push/PR
```
Reasoning behind the DuckDB-to-Athena migration path, the partition scheme, and the planned SQS and DynamoDB choices is in docs/DESIGN_DECISIONS.md.