Optimo - Deterministic OCR Semantic Refinement Engine in Rust.
Optimo is a deterministic OCR and document analysis pipeline built in Rust. It focuses on replay invariance, semantic refinement, and auditable state transitions for degraded technical documents. Same input -> same replay -> same ROIPlan -> same semantic diff.
Keywords: deterministic OCR, Rust OCR pipeline, semantic refinement, replay invariance, auditability.
Optimo was built to explore repeatable workflows in which noisy inputs (OCR today, structured documents tomorrow) are processed through a deterministic core. The work centers on:
- reducer determinism
- replayability
- pipeline stability
- adversarial tests
- Algebraic fold with proven properties (commutativity, idempotence, monotonicity)
- Fuzzy clustering with Cyrillic homoglyph guardrail
- Hard idempotency via source fingerprint deduplication
- `collision_rate_bps` as a runtime convergence metric
- Configurable similarity threshold per ingestion profile
- Image preprocessing pipeline with Otsu thresholding
- 114 tests across unit, property, adversarial, replay, and integration suites
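
To make the fold properties listed above concrete, here is a minimal sketch of a reducer that is commutative, idempotent, and monotone, with deduplication keyed on a source fingerprint. It is not Optimo's actual reducer: `Observation`, `State`, `apply`, and `fold` are hypothetical names, the tie-break rule (keep the lexicographically smaller text) is an assumption chosen only because `min` is order-independent, and the `collision_rate_bps` formula (duplicates per 10,000 inputs) is one plausible reading of the metric name.

```rust
use std::collections::BTreeMap;

/// Hypothetical input record: a source fingerprint plus recognized text.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Observation {
    pub fingerprint: u64,
    pub text: String,
}

/// Hypothetical reducer state. BTreeMap keeps iteration order deterministic.
#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct State {
    pub entries: BTreeMap<u64, String>,
}

/// Fold one observation into the state.
/// - Idempotent: applying the same observation twice changes nothing.
/// - Commutative: conflicts resolve with `min`, so arrival order is irrelevant.
/// - Monotone: keys are only ever added, never removed.
pub fn apply(mut state: State, obs: &Observation) -> State {
    state
        .entries
        .entry(obs.fingerprint)
        .and_modify(|existing| {
            // Assumed tie-break: keep the lexicographically smaller text.
            if obs.text < *existing {
                *existing = obs.text.clone();
            }
        })
        .or_insert_with(|| obs.text.clone());
    state
}

/// Fold a whole batch, starting from the empty state.
pub fn fold(observations: &[Observation]) -> State {
    observations
        .iter()
        .fold(State::default(), |state, obs| apply(state, obs))
}

/// Assumed definition: share of inputs that hit an already-known
/// fingerprint, expressed in basis points (1 bps = 0.01%).
pub fn collision_rate_bps(total_inputs: u64, state: &State) -> u64 {
    if total_inputs == 0 {
        return 0;
    }
    let unique = state.entries.len() as u64;
    total_inputs.saturating_sub(unique) * 10_000 / total_inputs
}
```

Under these assumptions, folding `[a, b, a]` and `[b, a]` yields equal states, which is the kind of property a replay or property test can check directly.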
Rust • Tokio • Rayon • Tesseract • JSONL
Many document-processing workflows depend on opaque transformations and decisions that are hard to audit.
Optimo explores a simpler model:
Input → Normalize → Reduce → Observe → Persist
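
The five stages can be sketched as plain functions over in-memory data. This is an illustrative sketch, not Optimo's code: the function names, the `Metrics` struct, the whitespace-collapsing normalization, and the `{"text": ...}` record shape are all assumptions. Persistence is reduced to rendering a JSONL string so the example needs no external crates.

```rust
use std::collections::BTreeSet;

/// Normalize: trim and collapse whitespace runs so equivalent inputs compare equal.
pub fn normalize(raw: &str) -> String {
    raw.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Reduce: collect non-empty normalized lines into an ordered set,
/// so the result does not depend on input order or duplicates.
pub fn reduce(inputs: &[&str]) -> BTreeSet<String> {
    inputs
        .iter()
        .map(|raw| normalize(raw))
        .filter(|line| !line.is_empty())
        .collect()
}

/// Observe: read-only metrics derived from the reduced state (hypothetical shape).
#[derive(Debug, PartialEq, Eq)]
pub struct Metrics {
    pub unique_lines: usize,
    pub total_chars: usize,
}

pub fn observe(state: &BTreeSet<String>) -> Metrics {
    Metrics {
        unique_lines: state.len(),
        total_chars: state.iter().map(|line| line.chars().count()).sum(),
    }
}

/// Minimal JSON string escaping (quotes, backslashes, control characters).
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Persist: render one JSON object per line (JSONL), in set order.
pub fn persist(state: &BTreeSet<String>) -> String {
    let mut out = String::new();
    for line in state {
        out.push_str(&format!("{{\"text\":\"{}\"}}\n", json_escape(line)));
    }
    out
}

/// Input -> Normalize -> Reduce -> Observe -> Persist.
pub fn run(inputs: &[&str]) -> (Metrics, String) {
    let state = reduce(inputs);
    (observe(&state), persist(&state))
}
```

Because every stage is a pure function over ordered collections, calling `run` twice on the same input, or on a permutation of it, produces byte-identical JSONL. That is the replay-invariance property in miniature.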
Active prototype under test.
See docs/ for architecture and decisions. Full technical notes and module history: docs/LOGBOOK.md. Architecture and perimeter threat model diagrams: docs/ARCHITECTURE_AND_THREAT_MODEL.md.