SignalLake

Cloud-native log analytics platform. Ingests application events, stores them as partitioned Parquet, and answers operational questions (p95 latency, error rate, backlog) with SQL — the same approach used by managed log analytics services.

Status: Local-first MVP working. AWS deployment (S3, Athena, Glue, Lambda, Kinesis/Firehose, CloudWatch) is in progress. See roadmap. Inspired by the AWS Samples web-analytics-on-aws architecture. All code is original (see docs/REFERENCES.md).

What it does

Generates synthetic application logs (services, endpoints, status codes, latencies).
Persists them to a raw zone (JSONL) and a query-optimized processed zone (Hive-partitioned Parquet by event_date / service_name / status_code).
Queries operational metrics with DuckDB (the local stand-in for Athena).
Benchmarks partitioned vs unpartitioned scans to quantify partition pruning.
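
The generate-then-partition steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's generator: the `LogEvent` dataclass is a stand-in for the Pydantic model, and the service names, endpoints, status-code weights, and latency distribution are all made-up assumptions. Only the partition keys (event_date / service_name / status_code) come from this README.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class LogEvent:
    # Hypothetical field set; a stand-in for the project's Pydantic model.
    event_date: date
    service_name: str
    endpoint: str
    status_code: int
    latency_ms: float


# Made-up services and endpoints, for illustration only.
SERVICES = {"checkout": ["/cart", "/pay"], "search": ["/query"]}


def generate_events(count, seed=0):
    """Return `count` synthetic events from a seeded RNG (reproducible)."""
    rng = random.Random(seed)
    start = date(2024, 1, 1)
    events = []
    for _ in range(count):
        service = rng.choice(sorted(SERVICES))
        events.append(
            LogEvent(
                event_date=start + timedelta(days=rng.randrange(3)),
                service_name=service,
                endpoint=rng.choice(SERVICES[service]),
                status_code=rng.choices([200, 404, 500], weights=[90, 7, 3])[0],
                latency_ms=round(rng.lognormvariate(4.0, 0.6), 1),
            )
        )
    return events


def partition_path(event):
    """Build the Hive-style directory (key=value segments) for one event."""
    return (
        f"event_date={event.event_date.isoformat()}/"
        f"service_name={event.service_name}/"
        f"status_code={event.status_code}"
    )


def group_by_partition(events):
    """Group events by the directory they would be written to."""
    groups = defaultdict(list)
    for event in events:
        groups[partition_path(event)].append(event)
    return dict(groups)


events = generate_events(1000)
partitions = group_by_partition(events)
print(len(events), "events in", len(partitions), "partitions")
print(sorted(partitions)[0])
```

A query engine that understands this layout can skip every directory whose key=value segments do not match the filter, which is what the benchmark measures.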

Architecture

Local-first, with a stack that maps 1:1 onto AWS — full diagrams in docs/ARCHITECTURE.md.

flowchart TD
    GEN[Synthetic Log Generator] --> RAW[(data/raw · JSONL)]
    RAW --> PARQ[(data/processed · Partitioned Parquet)]
    PARQ --> DUCK[DuckDB Query Layer]
    DUCK --> M[p95 latency · error rate · volume]

Results (local, 1M synthetic events)

Partition pruning on a service + day filter, measured with scripts/run_benchmark.py:

Data scanned: −92.7% (86.7 MB → 6.3 MB). This demonstrates the same partition-pruning mechanic Athena uses to reduce scanned data and cost. Exact savings in production depend on S3 layout, file sizes, compression, and query selectivity.
Query results are identical across both layouts. Partitioning changes cost, not answers.
Local wall-clock time did not improve at this scale: DuckDB's column projection and small-file overhead dominate at this volume. This is a documented design finding and the reason a compaction job is on the roadmap. The latency win is expected at S3 scale.

Full report: docs/BENCHMARKS.md.

Tech stack — and why

Choice	Why
Parquet (columnar)	Compressed, column-pruned scans for analytical queries over logs.
Hive partitioning	Skip whole prefixes on `event_date`/`service_name`/`status_code` → fewer bytes scanned.
DuckDB	Local columnar engine with predicate pushdown; bytes scanned locally serve as a proxy for Athena's pay-per-scan cost.
Pydantic	One typed event contract drives the schema, partitions, and tests.
AWS (next): S3 + Athena + Glue	Same columnar/partitioned model at scale, serverless, pay-per-scan.

Quickstart

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

python scripts/generate_logs.py --count 100000        # -> data/raw/events.jsonl
python scripts/write_parquet.py                        # -> data/processed/{partitioned,unpartitioned}
python scripts/run_queries.py                          # operational metrics
python scripts/run_benchmark.py                        # -> docs/BENCHMARKS.md (real numbers)
pytest -q

Running the API

Local (dev):

uvicorn signallake.api:app --reload

Docker:

docker compose up --build        # starts on http://localhost:8000
# or without compose:
docker build -t signallake .
docker run -p 8000:8000 signallake

The server starts on http://localhost:8000. Interactive docs at http://localhost:8000/docs.

Method	Path	Description
`GET`	`/health`	Liveness check
`POST`	`/ingest`	Ingest a single `LogEvent` (JSON body)
`POST`	`/ingest/batch`	Ingest a list of `LogEvent` objects
`GET`	`/query/p95-latency`	p50/p95/p99 latency by service
`GET`	`/query/error-rate`	5xx error-rate % by endpoint
`GET`	`/query/slowest-endpoints`	Top-10 slowest endpoints by p95
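
The metrics behind the query endpoints in the table above can be written out in plain Python to make the definitions concrete. This is a sketch under assumptions: it uses the nearest-rank percentile, whereas the project's DuckDB SQL may use an interpolating quantile and return slightly different values, and the event dictionaries use field names borrowed from the request-log fields in this README.

```python
import math
from collections import defaultdict


def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    if not values:
        raise ValueError("percentile of empty sequence")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def latency_by_service(events):
    """p50/p95/p99 latency per service_name."""
    by_service = defaultdict(list)
    for event in events:
        by_service[event["service_name"]].append(event["latency_ms"])
    return {
        service: {f"p{q}": percentile(latencies, q) for q in (50, 95, 99)}
        for service, latencies in by_service.items()
    }


def error_rate_by_endpoint(events):
    """Percentage of responses with a 5xx status, per endpoint."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        totals[event["endpoint"]] += 1
        if 500 <= event["status_code"] <= 599:
            errors[event["endpoint"]] += 1
    return {ep: 100.0 * errors[ep] / totals[ep] for ep in totals}


# Made-up sample: 100 requests to one endpoint, every 25th one fails.
sample = [
    {
        "service_name": "checkout",
        "endpoint": "/pay",
        "status_code": 500 if i % 25 == 0 else 200,
        "latency_ms": i,
    }
    for i in range(1, 101)
]
latency = latency_by_service(sample)
error_rate = error_rate_by_endpoint(sample)
print(latency)
print(error_rate)
```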

Ingested events are appended to data/raw/events.jsonl. Query endpoints read from data/processed/partitioned/ (Parquet) when present and fall back to the raw JSONL otherwise. To override the data root, set SIGNALLAKE_DATA_DIR when starting the server: SIGNALLAKE_DATA_DIR=<path> uvicorn ...
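
A client call to the ingest endpoint might look like the following. The base URL and the `/ingest` path come from this README; the event fields are hypothetical, since the real `LogEvent` schema lives in src/signallake and may require different or additional fields. The sketch only builds the request so it can be inspected without a running server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default address from this README


def build_ingest_request(event, base_url=BASE_URL):
    """Build a POST /ingest request carrying one event as a JSON body."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/ingest",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payload; check the LogEvent model for the actual contract.
example_event = {
    "service_name": "checkout",
    "endpoint": "/pay",
    "status_code": 200,
    "latency_ms": 42.5,
}

request = build_ingest_request(example_event)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With the server running, `urllib.request.urlopen(request)` would send it. The interactive docs at /docs show the exact schema the server accepts.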

Every request is logged as structured JSON (request_id, endpoint, status_code, service_name, latency_ms) to stdout.
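
One way to produce log lines of this shape with the standard library is a custom `logging.Formatter`. This is a sketch, not the project's middleware: the logger name, the `message` key, and the helper functions are assumptions. Only the five field names come from this README.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object per line."""

    FIELDS = ("request_id", "endpoint", "status_code", "service_name", "latency_ms")

    def format(self, record):
        payload = {"message": record.getMessage()}
        for field in self.FIELDS:
            # Values passed via `extra=` become attributes on the record.
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


def make_logger(stream):
    """Create a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger("signallake.sketch")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_request(logger, endpoint, status_code, service_name, latency_ms):
    """Log one request and return the generated request_id."""
    request_id = uuid.uuid4().hex
    logger.info(
        "request",
        extra={
            "request_id": request_id,
            "endpoint": endpoint,
            "status_code": status_code,
            "service_name": service_name,
            "latency_ms": latency_ms,
        },
    )
    return request_id


# Capture into a buffer here so the line can be parsed back and inspected.
buffer = io.StringIO()
logger = make_logger(buffer)
request_id = log_request(logger, "/ingest", 200, "api", 3.2)
logged = json.loads(buffer.getvalue())
print(logged)
```

Passing `sys.stdout` to `make_logger` instead of the buffer would write the lines to stdout, as the API does.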

What's built and roadmap

Project structure

src/signallake/         schema · generator · storage · query · api
scripts/                generate_logs · write_parquet · run_queries · run_benchmark
sql/                    reference DuckDB queries
docs/                   ARCHITECTURE · DESIGN_DECISIONS · BENCHMARKS · REFERENCES
tests/                  schema · generator · storage · api
Dockerfile              production image (python:3.12-slim + uvicorn)
docker-compose.yml      local stack with named data volume
.github/workflows/ci.yml  GitHub Actions: install + pytest on push/PR

Design decisions

Reasoning behind the DuckDB-to-Athena migration path, the partition scheme, and the planned SQS and DynamoDB choices is in docs/DESIGN_DECISIONS.md.
