igopalakrishna/signallake

SignalLake

Cloud-native log analytics platform. Ingests application events, stores them as partitioned Parquet, and answers operational questions (p95 latency, error rate, backlog) with SQL — the same approach used by managed log analytics services.

Status: Local-first MVP working. AWS deployment (S3, Athena, Glue, Lambda, Kinesis/Firehose, CloudWatch) is in progress. See roadmap. Inspired by the AWS Samples web-analytics-on-aws architecture. All code is original (see docs/REFERENCES.md).

What it does

  1. Generates synthetic application logs (services, endpoints, status codes, latencies).
  2. Persists them to a raw zone (JSONL) and a query-optimized processed zone (Hive-partitioned Parquet by event_date / service_name / status_code).
  3. Queries operational metrics with DuckDB (the local stand-in for Athena).
  4. Benchmarks partitioned vs unpartitioned scans to quantify partition pruning.
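As a rough illustration of steps 1 and 2, the sketch below generates seeded synthetic events and derives the Hive-style partition directory for each one. It uses a standard-library dataclass as a stand-in for the project's Pydantic model. The field set, the service and endpoint names, and the helper names (`generate_events`, `partition_path`, `to_jsonl`) are assumptions made for this example, not the repository's actual code.

```python
import json
import random
from dataclasses import asdict, dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List


@dataclass(frozen=True)
class LogEvent:
    # Hypothetical field set; the real contract is the project's Pydantic model.
    timestamp: str
    event_date: str
    service_name: str
    endpoint: str
    status_code: int
    latency_ms: float


# Assumed service and endpoint names, for illustration only.
SERVICES: Dict[str, List[str]] = {
    "checkout": ["/cart", "/pay"],
    "search": ["/query", "/suggest"],
}


def generate_events(count: int, seed: int = 42) -> List[LogEvent]:
    """Return `count` reproducible synthetic events (same seed, same output)."""
    rng = random.Random(seed)
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)  # arbitrary fixed start
    events = []
    for _ in range(count):
        ts = start + timedelta(seconds=rng.randrange(3 * 86_400))  # spread over 3 days
        service = rng.choice(sorted(SERVICES))
        events.append(
            LogEvent(
                timestamp=ts.isoformat(),
                event_date=ts.date().isoformat(),
                service_name=service,
                endpoint=rng.choice(SERVICES[service]),
                status_code=rng.choices([200, 404, 500], weights=[90, 6, 4])[0],
                latency_ms=round(rng.lognormvariate(4.0, 0.6), 2),
            )
        )
    return events


def partition_path(event: LogEvent) -> str:
    """Hive-style key=value directory for one event."""
    return (
        f"event_date={event.event_date}/"
        f"service_name={event.service_name}/"
        f"status_code={event.status_code}"
    )


def to_jsonl(events: List[LogEvent]) -> str:
    """Serialize events as one JSON object per line (the raw-zone format)."""
    return "\n".join(json.dumps(asdict(e)) for e in events)
```

Seeding the generator keeps benchmark inputs reproducible, and deriving the partition path from the event itself keeps the raw and processed zones consistent.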

Architecture

Local-first, with a stack that maps 1:1 onto AWS — full diagrams in docs/ARCHITECTURE.md.

flowchart TD
    GEN[Synthetic Log Generator] --> RAW[(data/raw · JSONL)]
    RAW --> PARQ[(data/processed · Partitioned Parquet)]
    PARQ --> DUCK[DuckDB Query Layer]
    DUCK --> M[p95 latency · error rate · volume]

Results (local, 1M synthetic events)

Partition pruning on a service + day filter, measured with scripts/run_benchmark.py:

  • Data scanned: −92.7% (86.7 MB → 6.3 MB). This demonstrates the same partition-pruning mechanic Athena uses to reduce scanned data and cost. Exact savings in production depend on S3 layout, file sizes, compression, and query selectivity.
  • Query results are identical across both layouts. Partitioning changes cost, not answers.
  • Local wall-clock time did not improve at this scale: DuckDB's column projection already keeps the unpartitioned scan cheap, and the many small files in the partitioned layout add per-file overhead. This is a documented design finding and the reason a compaction job is on the roadmap. A latency win is expected at S3 scale.

Full report: docs/BENCHMARKS.md.
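The pruning mechanic itself can be shown without a query engine. The sketch below is a standard-library illustration, not the project's benchmark script: it writes dummy files into a Hive-style `key=value` directory tree and counts only the bytes in partitions that match a filter. The helper names, the two-key layout, and the file sizes are assumptions chosen for the example, so its numbers are unrelated to the measured results above.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def write_partition(root: Path, event_date: str, service_name: str, payload: bytes) -> None:
    """Write one data file under a Hive-style key=value directory."""
    part = root / f"event_date={event_date}" / f"service_name={service_name}"
    part.mkdir(parents=True, exist_ok=True)
    (part / "part-0.bin").write_bytes(payload)


def bytes_scanned(root: Path, **filters: str) -> int:
    """Sum the sizes of files whose partition keys match every filter.

    A real engine skips non-matching directories before opening any file;
    here we only count the bytes of the files it would still have to read.
    """
    total = 0
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # Directory names such as "event_date=2024-01-01" become {key: value}.
        keys = dict(p.split("=", 1) for p in path.relative_to(root).parts[:-1])
        if all(keys.get(k) == v for k, v in filters.items()):
            total += path.stat().st_size
    return total


def demo() -> Tuple[int, int]:
    """Return (bytes without a filter, bytes with a day + service filter)."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for day in ("2024-01-01", "2024-01-02"):
            for service in ("checkout", "search", "auth", "billing"):
                write_partition(root, day, service, b"x" * 1_000)
        full = bytes_scanned(root)
        pruned = bytes_scanned(root, event_date="2024-01-01", service_name="search")
        return full, pruned
```

With two days and four services of equal size, the filtered scan touches one of eight partitions. The saving comes entirely from the directory layout, which is why query results stay identical.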

Tech stack — and why

| Choice | Why |
| --- | --- |
| Parquet (columnar) | Compressed, column-pruned scans for analytical queries over logs. |
| Hive partitioning | Skip whole prefixes on event_date/service_name/status_code → fewer bytes scanned. |
| DuckDB | Local columnar engine with predicate pushdown; bytes scanned locally serve as a proxy for Athena's pay-per-scan cost. |
| Pydantic | One typed event contract drives the schema, partitions, and tests. |
| AWS (next): S3 + Athena + Glue | Same columnar/partitioned model at scale, serverless, pay-per-scan. |

Quickstart

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

python scripts/generate_logs.py --count 100000        # -> data/raw/events.jsonl
python scripts/write_parquet.py                        # -> data/processed/{partitioned,unpartitioned}
python scripts/run_queries.py                          # operational metrics
python scripts/run_benchmark.py                        # -> docs/BENCHMARKS.md (real numbers)
pytest -q

Running the API

Local (dev):

uvicorn signallake.api:app --reload

Docker:

docker compose up --build        # starts on http://localhost:8000
# or without compose:
docker build -t signallake .
docker run -p 8000:8000 signallake

The server starts on http://localhost:8000. Interactive docs at http://localhost:8000/docs.

| Method | Path | Description |
| --- | --- | --- |
| GET | /health | Liveness check |
| POST | /ingest | Ingest a single LogEvent (JSON body) |
| POST | /ingest/batch | Ingest a list of LogEvent objects |
| GET | /query/p95-latency | p50/p95/p99 latency by service |
| GET | /query/error-rate | 5xx error-rate % by endpoint |
| GET | /query/slowest-endpoints | Top-10 slowest endpoints by p95 |
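For reference, the metrics behind the query endpoints can be expressed in a few lines of plain Python. This is a sketch of the definitions only: the project computes them in DuckDB SQL, and the nearest-rank percentile used here is an assumption. SQL engines offer both interpolating and discrete percentile functions, so values may differ slightly on small samples. The function names are hypothetical.

```python
import math
from typing import List


def percentile_nearest_rank(values: List[float], pct: float) -> float:
    """Smallest sample such that at least pct% of the samples are <= it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # Nearest-rank definition: rank = ceil(pct/100 * n), clamped to at least 1.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate_pct(status_codes: List[int]) -> float:
    """Share of responses with a 5xx status, as a percentage."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return 100.0 * errors / len(status_codes)
```

For 100 latencies of 1 to 100 ms, `percentile_nearest_rank(latencies, 95)` returns 95, and `error_rate_pct([200, 200, 500, 503])` returns 50.0.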

Ingested events are appended to data/raw/events.jsonl. Query endpoints read from data/processed/partitioned/ (Parquet) when present, falling back to the raw JSONL. Override the data root with SIGNALLAKE_DATA_DIR=<path> uvicorn ....
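The fallback described above could look roughly like the following. This is a minimal sketch, not the project's code: the function name `resolve_query_source`, the default root of `data`, and the rule that "present" means at least one `.parquet` file exists are all assumptions.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple


def resolve_query_source(env: Optional[Mapping[str, str]] = None) -> Tuple[str, Path]:
    """Pick the Parquet zone when it holds files, otherwise the raw JSONL file."""
    env = os.environ if env is None else env
    # SIGNALLAKE_DATA_DIR overrides the data root; the "data" default is assumed.
    data_root = Path(env.get("SIGNALLAKE_DATA_DIR", "data"))
    parquet_dir = data_root / "processed" / "partitioned"
    # "Present" is interpreted here as: the directory contains a Parquet file.
    if parquet_dir.is_dir() and any(parquet_dir.rglob("*.parquet")):
        return "parquet", parquet_dir
    return "jsonl", data_root / "raw" / "events.jsonl"
```

Passing the environment as a parameter keeps the function easy to test without mutating `os.environ`.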

Every request is logged as structured JSON (request_id, endpoint, status_code, service_name, latency_ms) to stdout.
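One way to produce such log lines with only the standard library is a `logging.Formatter` that emits JSON. The sketch below assumes the five field names listed above. The logger name, the extra `level` and `message` keys, and the helper functions are hypothetical and may not match how the API actually wires its logging.

```python
import json
import logging
import time
import uuid
from typing import TextIO


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object on one line."""

    FIELDS = ("request_id", "endpoint", "status_code", "service_name", "latency_ms")

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        # Values passed via `extra=` become attributes on the record.
        payload.update({f: getattr(record, f, None) for f in self.FIELDS})
        return json.dumps(payload)


def build_logger(stream: TextIO) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (e.g. sys.stdout)."""
    logger = logging.getLogger("signallake.access.sketch")  # hypothetical name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines from the root logger
    return logger


def log_request(
    logger: logging.Logger,
    endpoint: str,
    status_code: int,
    service_name: str,
    started: float,
) -> None:
    """Log one request; `started` is a time.perf_counter() reading."""
    logger.info(
        "request",
        extra={
            "request_id": str(uuid.uuid4()),
            "endpoint": endpoint,
            "status_code": status_code,
            "service_name": service_name,
            "latency_ms": round((time.perf_counter() - started) * 1000, 3),
        },
    )
```

In a web framework this would typically run in middleware, with `started` captured before the handler executes.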

What's built and roadmap

  • Event schema (Pydantic) + seeded synthetic generator
  • Raw JSONL zone + partitioned/unpartitioned Parquet writers
  • DuckDB query layer (p50/p95/p99 latency, error rate, slowest endpoints, 5xx by region, hourly volume)
  • Partition benchmark report (scan reduction %)
  • PyTest suite
  • FastAPI /ingest and /query endpoints (health, single and batch ingest, p95 latency, error rate, slowest endpoints)
  • Structured JSON request logging (request_id, endpoint, status_code, latency_ms)
  • Docker (Dockerfile + docker-compose.yml) + GitHub Actions CI
  • Retry/backoff/jitter, dead-letter handling, failed-event replay
  • In-process queue + batch worker for buffered ingestion
  • Query cost estimator, backlog simulator
  • AWS slice: S3 zones, Glue Catalog, Athena partition projection, Lambda, CloudWatch
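As an illustration of the retry/backoff/jitter item in the list above, here is a minimal sketch of exponential backoff with full jitter. It shows one common approach and is not taken from the repository: the function name, parameters, and defaults are assumptions.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Call `operation` until it succeeds or attempts run out.

    Between tries, waits a random time in [0, min(max_delay, base_delay * 2**attempt)]
    ("full jitter"). The last exception is re-raised on exhaustion so the caller
    can route the failed event to a dead-letter store.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            cap = min(max_delay, base_delay * 2**attempt)
            sleep(rng.uniform(0.0, cap))
    raise AssertionError("unreachable")
```

Injecting `sleep` and `rng` makes the behaviour deterministic in tests. Randomizing the wait spreads retries from many clients so they do not hit a recovering dependency at the same moment.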

Project structure

src/signallake/         schema · generator · storage · query · api
scripts/                generate_logs · write_parquet · run_queries · run_benchmark
sql/                    reference DuckDB queries
docs/                   ARCHITECTURE · DESIGN_DECISIONS · BENCHMARKS · REFERENCES
tests/                  schema · generator · storage · api
Dockerfile              production image (python:3.12-slim + uvicorn)
docker-compose.yml      local stack with named data volume
.github/workflows/ci.yml  GitHub Actions: install + pytest on push/PR

Design decisions

Reasoning behind the DuckDB-to-Athena migration path, the partition scheme, and the planned SQS and DynamoDB choices is in docs/DESIGN_DECISIONS.md.
