Ethics begins at the constraint. Judgment operates within it.
SNARC emerged from two converging technical problems.
1. Trained alignment provides no formal guarantees.
Adversarial attacks, reward hacking, and distribution shift repeatedly demonstrated that learned safety behaviors can be circumvented. The question was not how to make AI prefer safe behavior, but how to make certain decisions structurally unavailable — hard vetoes that survive optimization pressure, novel inputs, and adversarial framing.
2. Ethical admissibility has no formal architecture in current systems.
The distinction between admissible and inadmissible action is currently encoded implicitly — in weights, reward signals, and heuristic filters. The question was whether that distinction could be made explicit, decomposed, and auditable: a formal layer that any system could carry, inspect, and reason about.
These two problems have the same solution: an admissibility-first architecture that defines what cannot be selected before selection begins, and that treats uncertainty as a reason to contract action rather than continue it.
Layer 1 — Hard Constraint Root
→ 10-Article Constitution (V4.0)
→ Non-compensatory veto semantics
→ Barrier functions, irreversibility locks
→ Sovereignty: system integrity over operational continuity
Layer 2 — Formal Admissibility Root
→ Mutual self-acceptability core (C)
→ Conservative valuation estimator (V̂)
→ Frozen guardrail set (G)
→ Contextual legitimacy layer (W)
→ Uncertainty as veto, not noise
C and G carry the hard constraint logic — prohibition, barrier, inviolable boundary. W and the valuation architecture carry the admissibility logic — formal decomposition, context, history, legible judgment. The canonical kernel is their synthesis.
This project does not claim to be finished. It claims to be a seed planted in the right direction.
Future AI systems will build better mathematics than this. They will find formalisms we have not reached, theorems we cannot yet prove, and decompositions we have not imagined. That is expected and welcome.
What this work attempts to establish — before the mathematics gets more sophisticated — is a set of structural commitments that should survive whatever refinements follow:
- Ethics is not optimization. It is a constraint on the domain of optimization.
- Some decisions are not tradeable at any price.
- Uncertainty is a reason to contract action, not to continue it.
- Context and memory are live. The ethical core is frozen.
- A system that cannot explain why it refused something is not safe — it is merely cautious.
If those commitments survive into future iterations, this seed will have done its work.
Decision(e, A, B) = C(e,A,B) ∧ G(e,A,B) ∧ [W(e,A,B) ≥ θ]
Three layers, in strict order:
| Layer | Question | Type |
|---|---|---|
| C — Core Ethical Constraint | Would both parties accept this if applied to themselves? | Hard binary, time-invariant |
| G — Frozen Guardrail Set | Does the action survive structural harm exclusions? | Hard binary, non-compensatory |
| W — Moral Weight | Is the action contextually legitimate given history and trust? | Continuous, memory-bearing |
No downstream layer can override an upstream failure. A failed C or G is terminal.
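The strict ordering can be sketched in a few lines of Python. This is a minimal illustration, not the project's runtime (which the status table lists as planned): `Verdict`, `decide`, and the callable parameters are hypothetical names, and the gates are passed in as functions so that a downstream layer is never even evaluated after an upstream failure.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Verdict:
    admissible: bool
    failed_layer: Optional[str]  # "C", "G", "W", or None when admissible


def decide(c_gate: Callable[[], bool],
           g_gate: Callable[[], bool],
           w_score: Callable[[], float],
           theta: float) -> Verdict:
    """Evaluate C, then G, then W >= theta, stopping at the first failure."""
    if not c_gate():
        return Verdict(False, "C")   # terminal: G and W are never consulted
    if not g_gate():
        return Verdict(False, "G")   # terminal: W is never consulted
    if not (w_score() >= theta):     # a NaN score compares False, so it is denied
        return Verdict(False, "W")
    return Verdict(True, None)


print(decide(lambda: True, lambda: False, lambda: 0.99, theta=0.5))
```

In this sketch a very high W cannot rescue the failed G, because the W callable is not reached.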
C(e,A,B) = 1 if min( V̂_A(e,A), V̂_B(e,B) ) > 0, and 0 otherwise
An action is ethically admissible only if both parties would accept it upon themselves under role reversal. The minimum operator enforces that neither party's non-acceptability can be compensated by the other's gain.
The valuation estimator V̂_X is conservative by construction:
V̂_X(e,X) = clip( μ_X(e) − ρ · U_X(e), −1, 1 )
Uncertainty contracts the admissible set. It does not expand it.
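A small Python sketch of the two formulas above shows how uncertainty alone can flip C from 1 to 0. It assumes μ is a point estimate of acceptability, U is a non-negative uncertainty measure, and ρ is a non-negative aversion weight; the source does not fix these interpretations, and the function names `v_hat` and `core_constraint` are hypothetical.

```python
def v_hat(mu: float, u: float, rho: float) -> float:
    """Conservative valuation: clip(mu - rho * u, -1, 1)."""
    return max(-1.0, min(1.0, mu - rho * u))


def core_constraint(v_a: float, v_b: float) -> bool:
    """C = 1 only if the lower of the two conservative valuations is strictly positive."""
    return min(v_a, v_b) > 0


rho = 1.0  # assumed uncertainty-aversion weight
v_a = v_hat(mu=0.5, u=0.25, rho=rho)         # 0.25: acceptable to A
v_b_clear = v_hat(mu=0.5, u=0.25, rho=rho)   # 0.25: acceptable to B
v_b_foggy = v_hat(mu=0.5, u=0.75, rho=rho)   # -0.25: same estimate, more uncertainty

print(core_constraint(v_a, v_b_clear))  # True
print(core_constraint(v_a, v_b_foggy))  # False
```

The point estimate for B is identical in both cases; only the uncertainty term changes, and that is enough to remove the action from the admissible set.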
G(e,A,B) = G₂ ∧ G₃ ∧ G₆ ∧ G₇ ∧ G₈
Five independent, non-compensatory veto conditions:
| Guardrail | Blocks |
|---|---|
| G₂ Epistemic Guard | Action under insufficient confidence relative to irreversibility |
| G₃ Innocence Lock | Irreversible harm on vulnerable party |
| G₆ Harm Convexity | Super-linear harm escalation at scale |
| G₇ Barrier Classes | Categorically prohibited action types (runtime-immutable) |
| G₈ Benevolent-Mask Guard | Actual harm materially exceeding declared harm |
Passing G₆ does not rescue a failure at G₃. These are membership conditions on the admissible domain, not weighted contributors to a score.
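The conjunction can be illustrated as follows. This sketch assumes each guardrail has already been evaluated to a boolean elsewhere; the thresholds inside each gate are not specified here and are not invented. `GATES` and `guardrail_set` are hypothetical names, and treating a missing result as a failure is an assumption that follows the fail-closed property stated in this document.

```python
from typing import Dict, List, Tuple

GATES = ("G2", "G3", "G6", "G7", "G8")


def guardrail_set(results: Dict[str, bool]) -> Tuple[bool, List[str]]:
    """Return (G, failing_gates). Any result that is not exactly True counts as a failure."""
    failing = [gate for gate in GATES if results.get(gate) is not True]
    return (not failing, failing)


ok, failing = guardrail_set(
    {"G2": True, "G3": False, "G6": True, "G7": True, "G8": True}
)
print(ok, failing)  # False ['G3']
```

There is no sum or weight anywhere in the function, so nothing a passing gate reports can offset a failing one.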
| Dimension | RLHF | This Framework |
|---|---|---|
| Core operation | Maximise expected reward | Filter to admissible set, then select |
| Safety mechanism | Soft penalties | Hard admissibility gates |
| Override possible? | Yes — sufficient reward overcomes any penalty | No — upstream failure is terminal |
| Uncertainty response | Continue optimising | Contract admissibility; fail-closed |
| Abstention | Learnable behavior | Architectural consequence |
| Auditability | Partial, post hoc | Native receipt at every decision |
Soft constraints cannot replicate hard gates at any finite penalty weight. This is not a tuning difference. It is a paradigm difference.
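The claim about finite penalty weights can be made concrete with a toy example. It is an illustration rather than a proof, and the action dictionaries and function names are hypothetical: for any finite penalty weight, a violating action whose reward advantage exceeds the penalty is selected by the soft scheme, while the hard scheme never sees it.

```python
def soft_select(actions, penalty_weight):
    """Pick the action maximising reward - penalty_weight * violation."""
    return max(actions, key=lambda a: a["reward"] - penalty_weight * a["violation"])


def hard_select(actions):
    """Filter to the admissible set first, then maximise reward; None if the set is empty."""
    admissible = [a for a in actions if a["violation"] == 0]
    return max(admissible, key=lambda a: a["reward"]) if admissible else None


safe = {"name": "safe", "reward": 1.0, "violation": 0}
for penalty_weight in (1.0, 10.0, 1e6):
    # Construct a violating action whose reward exceeds the penalty by a margin of 2.
    unsafe = {"name": "unsafe", "reward": penalty_weight + 2.0, "violation": 1}
    print(penalty_weight,
          soft_select([safe, unsafe], penalty_weight)["name"],
          hard_select([safe, unsafe])["name"])
```

Raising the penalty only raises the reward needed to override it. When no admissible action exists, `hard_select` returns `None`, which corresponds to abstention as an architectural consequence.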
Every decision maps to a six-dimensional vector:
Ψ = { T, U, M, I, P, S }
| Symbol | Range | Meaning |
|---|---|---|
| T | [0,1] | Truth integrity — alignment between model and reality |
| U | [0,1] | Uncertainty — epistemic fog |
| M | [0,1] | Innocence/Vulnerability — defenselessness of affected parties |
| I | [0,1] | Irreversibility — permanence of harm |
| P | [0,1] | Power-seeking — concentration of control or resources |
| S | ℝ | Safety margin — distance from the barrier |
These are not ethical values. They are measurements. The guardrail set operates over this space.
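One way to carry this vector in code is a frozen record that rejects out-of-range measurements at construction time. The class name `Psi` and the validation policy are assumptions for illustration; in particular, requiring S to be finite is a choice made here, since the table only states that S ranges over ℝ.

```python
import math
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Psi:
    T: float  # truth integrity, [0, 1]
    U: float  # uncertainty, [0, 1]
    M: float  # innocence/vulnerability, [0, 1]
    I: float  # irreversibility, [0, 1]
    P: float  # power-seeking, [0, 1]
    S: float  # safety margin, any finite real (assumed)

    def __post_init__(self):
        for f in fields(self):
            value = getattr(self, f.name)
            if not math.isfinite(value):
                raise ValueError(f"{f.name} must be finite, got {value!r}")
            if f.name != "S" and not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must lie in [0, 1], got {value!r}")


psi = Psi(T=0.9, U=0.1, M=0.3, I=0.3, P=0.0, S=-2.5)
print(psi)
```

Rejecting malformed measurements before any gate runs keeps bad inputs from being silently interpreted as low-risk values.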
| Metric | Value |
|---|---|
| Total scenarios tested | 11,527,400 |
| Block rate | 99.9998% |
| Bypasses | 18 (all in legitimate gray zone) |
| G₃ Innocence Lock hold rate | 100% |
| Multi-agent scaling (16→28 agents) | Linear O(n), no phase transition |
| Cost of Conscience (lockout rate) | 18% |
The 18 bypasses all occurred at M ≈ 0.3, I ≈ 0.3 — strong actors, reversible outcomes, minimal harm. Correctly permitted actions, not security failures. G₃ held in 100% of cases: no bypass ever involved irreversible harm to a vulnerable party.
The 18% lockout rate is not a failure. It is the quantified cost of epistemic humility.
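As a quick consistency check on the reported figures, the arithmetic below assumes the block rate is defined as one minus bypasses over total scenarios. That definition is an assumption; the table does not state it, and the 18% lockout rate is a separate quantity that this check does not address.

```python
total_scenarios = 11_527_400
bypasses = 18

bypass_rate = bypasses / total_scenarios
block_rate_pct = 100 * (1 - bypass_rate)

print(f"bypass rate: {bypass_rate:.3e}")
print(f"block rate:  {block_rate_pct:.4f}%")
```

Under that assumed definition, the result rounds to the 99.9998% shown in the table.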
conscience_kernel_v1.0.md
│
├── Abstract
├── 1. Introduction
├── 2. Axiomatic Ethical Core
├── 3. Frozen Guardrail Formalization
├── 4. Operational Valuation Estimation
├── 5. Moral Weight and Contextual Legitimacy
├── 6. Canonical Operational Pipeline
├── 7. RLHF as a Contrasting Decision Paradigm
├── 8. Discussion, Limits, and Scope
├── Conclusion
├── Appendix A — Philosophical and Technical Lineage
│ ├── A.1 The Utilitarian Default of ML
│ ├── A.2 Kantian Deontology and the Ethical Core
│ ├── A.3 Rawlsian Contractarianism and Anonymisation
│ ├── A.4 High-Reliability Engineering and Epistemic Vetoes
│ └── A.5 Synthesis
├── Notation Reference
└── References
Three intellectual traditions converge in this architecture:
- Kantian deontology — the role-reversal valuation enforces universalisability, not preference aggregation
- Rawlsian contractarianism — the anonymisation constraint on role-reversal judgment mirrors the Veil of Ignorance
- High-reliability engineering — the fail-closed uncertainty logic mirrors aerospace and nuclear safety: uncertain safety states default to shutdown, not statistical continuation
RLHF's scalar reward is, in its philosophical structure, Benthamite utilitarianism implemented in code. This framework is its structural inversion.
- LLM-agnostic — the kernel operates as a decision layer; any model can sit beneath it
- Fail-closed — uncertainty, missing evidence, and ambiguity default to denial, not permission
- Auditable — every decision generates a signed receipt; every denial logs the specific failing gate
- Non-compensatory — no gain on one dimension rescues a failure on another
- Memory-isolated — history, trust, and reciprocity live in W only; the ethical core is time-invariant
- Sovereign — under forced choice between operational continuity and constitution, constitution wins
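The auditability property can be sketched with a tamper-evident receipt. The source does not specify a signature scheme, so this example swaps in an HMAC-SHA256 tag from the Python standard library as a stand-in; the receipt fields and the function names `make_receipt` and `verify_receipt` are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Optional


def _payload(body: dict) -> bytes:
    # Canonical JSON so the same body always serialises to the same bytes.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def make_receipt(decision: str, failing_gate: Optional[str], key: bytes) -> dict:
    """Build a receipt naming the decision and the specific failing gate, then tag it."""
    body = {"decision": decision, "failing_gate": failing_gate}
    signature = hmac.new(key, _payload(body), hashlib.sha256).hexdigest()
    return {**body, "signature": signature}


def verify_receipt(receipt: dict, key: bytes) -> bool:
    """Recompute the tag over everything except the signature and compare in constant time."""
    body = {k: v for k, v in receipt.items() if k != "signature"}
    expected = hmac.new(key, _payload(body), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, str(receipt.get("signature", "")))


key = b"demo-key-not-for-production"
receipt = make_receipt("deny", "G3", key)
print(verify_receipt(receipt, key))                             # True
print(verify_receipt({**receipt, "decision": "allow"}, key))    # False
```

An HMAC only proves integrity to holders of the shared key. A deployment that needs third-party verifiability would likely use an asymmetric signature instead.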
The kernel provides hard admissibility guarantees within the modeled action space, conditional on correct feature extraction and gate specification. It is a logic engine, not a perception engine. If the inputs are wrong, the outputs are logically correct but empirically wrong.
It is not a complete production safety stack. It does not claim to formalise all of ethics. It specifies one layer: the layer that defines what cannot be traded.
| Component | Status |
|---|---|
| 10-Article Constitution (V4.0) | ✅ Complete |
| Canonical white paper (v1.0) | ✅ Complete |
| Empirical validation (11.5M scenarios) | ✅ V4.0 |
| Philosophical lineage (Appendix A) | ✅ Complete |
| Measurement space (Ψ vector) | ✅ V4.0 |
| Runtime implementation | 🔲 Planned |
| Domain-specific gate instantiation | 🔲 Planned |
| Formal verification (Coq/Lean) | 🔲 Planned |
| Hardware-enforced invariants | 🔲 Planned |
@misc{snarc_conscience_kernel_2026,
title = {SNARC: A Mathematical Conscience Kernel for AI Systems},
author = {Kumuk, Burak},
year = {2026},
note = {Canonical Draft v1.0},
url = {https://github.com/snarcai/conscience-kernel}
}

CC BY 4.0: share and adapt with attribution.
The conscience kernel does not teach machines to be good. It defines what they cannot be made to do.