Abstract. We define semantic index types — type declarations whose natural-language tokens function as computational indices for neural consumers — and formalize the two-channel constraint system they create. When a language model consumes a type schema, field names, descriptions, and enum labels cross from the routing plane to the computation plane: they instruct the consumer what to compute. This breaks a consumer-level analogue of alpha equivalence, because the compilation target changed — schema-driven types compile to token sequences consumed by a neural network that reads names.
We model this as a two-channel constraint system. The structural channel (type annotations, validators, constrained decoding) determines the support of the output distribution. The semantic channel (field names, descriptions, enum labels) determines the conditional probabilities within that support. The bound linking them is information-theoretic: for a field whose structurally valid set is $V_f$, the semantic channel can influence at most $\log_2 |V_f|$ bits of output, so structural compression simultaneously limits the guidance naming can provide and the attack surface it exposes.
The empirical phenomenon is already established. Converging evidence from schema-guided dialogue, text-to-SQL, and code language model research independently demonstrates that neural consumers are not invariant under structure-preserving renaming. This paper contributes the unifying abstraction: a PL-theoretic framing that names the phenomenon, formalizes the dual semantics, derives the engineering and security consequences from the same mechanism, and articulates progressive hardening as the development discipline for systems where naming is programming. We also describe a companion experiment that operationalizes the framework in Pydantic structured output.
Alpha equivalence — the principle that consistent renaming of bound variables preserves program semantics — is foundational to the formal semantics of lexically scoped programming languages. In the lambda calculus, alpha conversion is one of three reduction rules (Church, 1941). In modern PL theory, the Variable Convention treats bound variable names as arbitrary up to consistent renaming (Barendregt, 1984, Definition 2.1.13). Compilers for languages like Haskell assign each binder a unique integer in their intermediate representations, rendering the original source name irrelevant to the elaborated program; GHC documentation describes type equivalence as "syntactic equivalence modulo alpha-renaming."
Alpha equivalence holds because every consumer in a traditional execution pipeline — parser, type checker, optimizer, code generator — treats names as structural identifiers. The name tells the system which slot. It never determines what value inhabits that slot.
This paper identifies a new kind of software artifact — a type declaration simultaneously consumed as structural contract by a host runtime and as natural-language instruction by a neural interpreter — and the invariance failure it creates. When a language model consumes a type schema to produce structured output, it interprets field names, descriptions, and enum labels as natural-language instructions. Structurally isomorphic schemas with different linguistic content produce measurably different output distributions. Names have crossed from the routing plane to the computation plane. The deep reason is architectural: traditional types compile to machine code that erases names; schema-driven types compile to token sequences consumed by a neural network that reads them. Once the compilation target changed, alpha equivalence at the consumer level became untenable.
Consider two Pydantic schemas, structurally identical — same field count, same types, same enum cardinality:
```python
from enum import StrEnum
from pydantic import BaseModel, Field

# Schema A: precise semantic indices
class ChurnRiskTier(StrEnum):
    CRITICAL = "critical"
    HIGH = "high"
    MODERATE = "moderate"
    LOW = "low"

class CustomerAnalysis(BaseModel):  # enclosing model shown so the fragment type-checks
    churn_risk_tier: ChurnRiskTier = Field(
        description="Likelihood of voluntary departure within 90 days"
    )

# Schema B: vacuous — same structure, no semantic content
class Category4(StrEnum):
    OPTION_A = "option_a"
    OPTION_B = "option_b"
    OPTION_C = "option_c"
    OPTION_D = "option_d"

class Output1(BaseModel):  # enclosing model shown so the fragment type-checks
    field_1: Category4 = Field(description="A category value.")
```

Both schemas admit exactly four valid enum values. A traditional compiler would treat them as equivalent — the names are irrelevant to the structural semantics. But when a language model consumes these schemas to analyze the same customer profile, it produces different output distributions. Schema A yields a risk assessment grounded in churn signals. Schema B yields near-uniform selection across four unlabeled options. The structural channel (four valid values) is identical. The semantic channel (what the names instruct the consumer to compute) diverges completely. This is a consumer-level alpha equivalence violation: renaming changed the output because the compilation target reads names.
We are not the first to observe that schema text affects model behavior. The schema-guided dialogue community has evaluated robustness to paraphrased schema descriptions since SGD-X (Lee et al., AAAI 2022). The text-to-SQL community treats column names as semantic anchors central to schema linking (Lei et al., EMNLP 2020). Code language model research has documented that obfuscating identifier names degrades performance across tasks and model families (Nikiema et al., 2025; Le et al., 2025). The phenomenon is established. The question is what unifying interpretation best explains it.
What is novel here is the PL-theoretic framing: we characterize the phenomenon as a two-channel constraint system — the structural channel obeys alpha equivalence, the semantic channel violates it — and show that this framing unifies observations across independently evolved research communities. The engineering utility and security risk of schema naming are governed by the same information-theoretic bound. The resulting picture is, in effect, linguistic relativity for neural consumers (§12): the vocabulary of the schema determines the output distribution of the model.
We use "alpha equivalence" to describe a consumer-level invariance property, not a claim about the object-language semantics of any particular programming language. Our claim is that the neural consumer function — the mapping from schema plus context to output distribution — is not invariant under renaming transformations that are structure-preserving in the schema language. This is distinct from, though motivated by, the formal notion of alpha equivalence in the lambda calculus.
We note that even in the history of programming languages, names have not always been inert. Fortran's implicit typing rule (IBM, 1956) assigned INTEGER type to variables beginning with letters I through N and REAL to all others; IMPLICIT NONE was required to disable this behavior. Renaming a Fortran variable could change its type and therefore program behavior — a language-level alpha equivalence failure. More broadly, names become part of extensional semantics whenever a consumer reflects on program representation: Python's locals() and globals() make variable names observable as dictionary keys; Java's reflection API exposes field names as runtime strings; serialization frameworks map field names to wire formats. In each case, a consumer that inspects the representation — rather than merely executing it — can observe and act on names. A neural consumer is the limit case: it only inspects the representation (the serialized schema), and it interprets every token it reads. Our contribution is not "names sometimes matter" — it is a formal characterization of how and why they matter when the consumer is a neural language model, and the engineering consequences that follow.
The term "semantic index type" is not related to "indexed types" in the dependent-type sense (types parameterized by values). Here, "index" refers to a natural-language token that indexes into the consumer's learned semantic associations.
Let $S = (S_{\text{struct}}, S_{\text{lang}})$ denote a schema presented to a neural consumer,

where $S_{\text{struct}}$ is the structural content (field arities, type annotations, enum cardinalities, validators) and $S_{\text{lang}}$ is the linguistic content (field names, descriptions, enum member labels). For an input $x$, the consumer induces an output distribution $P(Y \mid x, S)$; we write $Y_f$ for the component of the output at field $f$.

We define structural isomorphism and semantic equivalence as distinct relations:

Structural isomorphism. Define a structure-only projection $\sigma(S) := S_{\text{struct}}$. Schemas $S$ and $S'$ are structurally isomorphic, written $S \cong S'$, when $\sigma(S) = \sigma(S')$.

Semantic equivalence. For a discrete field $f$, structurally isomorphic schemas $S \cong S'$ are semantically equivalent at $f$, written $S \equiv_f S'$, when

$$P(Y_f \mid x, S) = P(Y_f \mid x, S') \quad \text{for all inputs } x,$$

where $V_f$ denotes the set of values for $f$ admitted by the structural channel.

We define equivalence fieldwise because different fields have different structural compression levels — a 4-member enum and a bare str occupy different positions in the design space (§7) and may exhibit different sensitivity to changes in $S_{\text{lang}}$.

The central claim of this paper is:

$$\exists\, S \cong S' \text{ and a field } f \text{ such that } S \not\equiv_f S'.$$

That is, there exist structurally isomorphic schemas and fields for which the linguistic components produce measurably different output distributions under neural consumers. This is expected once schema tokens are consumed as natural language — the consumer interprets them semantically, and different tokens carry different meanings — and is confirmed by existing benchmarks across multiple research communities (§4).
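As an illustration, the structure-only projection $\sigma$ can be computed on a serialized JSON Schema. This is a hedged sketch: the helper names and the choice to reduce enum labels and property names to cardinalities are ours, not part of any library.

```python
import json

LINGUISTIC_KEYS = {"title", "description"}  # schema metadata carrying S_lang

def sigma(node):
    """Structure-only projection: erase S_lang from a JSON Schema dict,
    keeping only the structural skeleton (types, arities, cardinalities)."""
    if isinstance(node, list):
        return [sigma(x) for x in node]
    if not isinstance(node, dict):
        return node
    out = {}
    for key, value in node.items():
        if key in LINGUISTIC_KEYS:
            continue  # descriptions and titles are pure semantic channel
        if key in ("enum", "required") and isinstance(value, list):
            out[key] = len(value)  # labels and field names reduce to cardinality
        elif key == "properties" and isinstance(value, dict):
            # Field names are semantic indices: keep only the multiset of
            # projected field structures, in a canonical order.
            out[key] = sorted(
                (sigma(v) for v in value.values()),
                key=lambda s: json.dumps(s, sort_keys=True),
            )
        else:
            out[key] = sigma(value)
    return out

# Schema A and Schema B from the running example project identically:
schema_a = {"type": "object", "properties": {
    "churn_risk_tier": {
        "type": "string",
        "description": "Likelihood of voluntary departure within 90 days",
        "enum": ["critical", "high", "moderate", "low"]}}}
schema_b = {"type": "object", "properties": {
    "field_1": {
        "type": "string",
        "description": "A category value.",
        "enum": ["option_a", "option_b", "option_c", "option_d"]}}}
```

Structural isomorphism $S \cong S'$ then amounts to `sigma(schema_a) == sigma(schema_b)`.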
A semantic index type operates through two constraint channels. Let $V_f$ denote the set of structurally valid values for field $f$ under schema $S$: the values that type annotations, enum membership, and validators jointly admit.

The structural channel determines the support of the output distribution. Constrained decoding restricts generation to $V_f$; structurally invalid values receive probability zero.

The semantic channel determines the conditional probabilities within the support. Field names, descriptions, and enum member labels shape which structurally valid value the consumer selects — that is, they determine $P(Y_f = v \mid x, S)$ for $v \in V_f$.
In short: structure defines admissibility; semantics defines salience. The structural channel contains the semantic channel — the semantic channel can influence the distribution only over the support that the structural channel defines.
When the consumer's semantic interpretation is imprecise — it selects RiskTier.HIGH when RiskTier.CRITICAL was more appropriate — the error is bounded by the structural channel: the selected value is still a member of $V_f$, so the failure mode is a wrong choice among admissible values, never an inadmissible value.
This two-channel decomposition has been independently rediscovered in the code LM literature. Le et al. (2025) explicitly contrast an "execution channel" (structural correctness) with a "naturalness channel" (reliance on identifier semantics). Nikiema et al. (2025) conclude that variable renaming sensitivity indicates reliance on "lexical semantics embedded in variable names rather than purely structural understanding."
Our contribution is to formalize this observation as a property of schema-consuming systems generally, not only code generation.
A semantic index type is a type declaration in which natural-language tokens — field names, descriptions, enum member names — function as computational indices that constrain the semantic content of generated values, because the consumer of the type interprets those tokens as natural-language instructions rather than structurally inert identifiers.
In a conventional type system, a field name is an address: it indexes into a structure to locate a slot. In a semantic index type, the field name is simultaneously a semantic key: it indexes into the consumer's learned language associations to locate a meaning. churn_risk_tier addresses a slot in a product type AND instructs the consumer to assess voluntary customer departure risk.
A type system exhibits semantic indexing when consumer invariance under structural isomorphism fails: renaming a binding changes the output distribution, not because the structural semantics changed, but because the natural-language index changed and the consumer interpreted the new name as a different instruction.
Field names have always carried information beyond structural addressing. Serialization keys determine JSON wire positions. ORM mappings connect fields to database columns. API schemas provide client-readable labels. In every prior use, the name determines where a value goes — which slot, which column, which JSON key. The name is addressing. The value itself is unaffected by the name.
With a neural consumer, the name determines what the value is. This is a qualitative transition. The name has crossed from the routing plane to the computation plane.
A natural objection: is a semantic index type just prompt engineering with a schema wrapper? The answer is no, and the distinction is precise.
A prompt carries instruction only — it tells the model what to do, but nothing in the prompt's structure constrains or proves the result. A semantic index type carries instruction, constraint, and proof simultaneously. The natural-language tokens instruct the consumer (the semantic channel), the type annotations and validators constrain the admissible outputs (the structural channel), and successful construction proves that the constraints hold.
A prompt that says "return a JSON object with a risk tier" is an instruction. A Pydantic model with risk_tier: RiskTier backed by an enum, a description, and a model validator is instruction, constraint, and proof in a single declaration. The semantics of failure differs: a prompt produces unconstrained text; a semantic index type produces a proven value or no value at all.
The claim that structurally isomorphic schemas produce different outputs under neural consumers is supported by converging evidence from three independently evolved research communities. Each has built controlled studies isolating naming and descriptions as causal features. The question is no longer whether the phenomenon exists, but what unifying interpretation best explains it. We review the strongest evidence from each tradition.
The Schema-Guided Dialogue dataset (Rastogi et al., AAAI 2020) introduced the practice of supplying natural-language descriptions of intents and slots as part of the model's input interface — descriptions intended "to outline the semantics of each element." The dataset contains over 16,000 dialogues across 16 domains.
SGD-X (Lee et al., AAAI 2022) provides the cleanest empirical existence proof for our core claim. The authors constructed five crowdsourced paraphrase variants of every schema in SGD, holding the underlying intent and slot structure fixed while varying only the natural-language descriptions. This is precisely the experimental design our formal framework demands:
The results are unambiguous. SGP-DST (a state-tracking model) exhibited a 17.6% relative drop in joint goal accuracy (JGA) on average across variants, with worst-case degradation of 28% on variant 5. T5DST showed an 11.9% relative JGA drop on average, with worst-case degradation of 19%. The authors introduced a dedicated "Schema Sensitivity" metric to quantify this phenomenon — the field's own native measure of what we call consumer non-invariance under structural isomorphism.
Zhao et al. (2022) pushed this further with D3ST, which replaces slot names entirely with random index strings, forcing the model to rely exclusively on natural-language descriptions. Performance with linguistic descriptions substantially outperformed random strings, confirming that the semantic content of descriptions — not merely their presence — drives the consumer's behavior.
The text-to-SQL community has treated column and table names as semantic anchors since the earliest neural approaches. Lei et al. (EMNLP 2020) characterized schema linking — aligning question phrases to schema tokens — as "the crux" of the text-to-SQL problem. This framing treats schema identifiers not as inert labels but as meaning-bearing tokens that must be semantically resolved against natural-language queries.
RAT-SQL (Wang et al., ACL 2020) formalized this with relation-aware encoding that treats schema tokens as semantic elements in a graph. Ablations measured the contribution of schema-linking relations to accuracy as statistically significant.
Dr.Spider (Chang et al., 2023) operationalized robustness to schema perturbations, including synonym substitution and abbreviation of column names. Average performance dropped approximately 14 percentage points across perturbation types, with the hardest perturbation category producing drops exceeding 50%.
The enrichment direction provides complementary evidence. Wretblad et al. (NeurIPS 2024) found that adding synthesized column descriptions consistently enhanced accuracy across models, providing a targeted measurement of description-level effects. The BIRD benchmark (Li et al., NeurIPS 2023) showed a 20-point accuracy gain when column descriptions and external knowledge evidence were added, though the individual contribution of descriptions is not isolated in that study.
Code language models provide a third line of evidence with particularly clean experimental controls, because programming languages have formal semantics that make "semantics-preserving renaming" well-defined.
CodeT5 (Wang et al., EMNLP 2021) explicitly designed pre-training objectives around identifier recovery, leveraging "semantics conveyed from developer-assigned identifiers" to achieve state-of-the-art results across 14 CodeXGLUE sub-tasks. The architecture treats identifier names as a learnable semantic signal, not structural noise.
Adversarial and robustness studies confirm the dependence. Zhang et al. (AAAI 2020) achieved over 90% attack success rates by renaming identifiers — a semantics-preserving transformation that nonetheless changed model behavior. Bielik and Vechev (ICML 2020) showed identifier renaming attacks degrade type inference models. Troshin and Chirkova (BlackboxNLP 2022) confirmed that anonymizing identifiers was the most damaging single transformation in their study of code representation robustness.
Two recent large-scale studies provide the most comprehensive quantification. Nikiema et al. (2025) tested variable renaming across 13 contemporary LLMs including GPT-4o, finding an average accuracy drop of 18.6 percentage points. GPT-4o showed a 7.3% decrease — the smallest among models tested, but still measurable. Le et al. (2025) documented performance degradation across multiple benchmarks under identifier obfuscation: for GPT-4o on ClassEval, class-level accuracy dropped from 87.3% to 58.7%; on LiveCodeBench and other benchmarks, consistent degradation was observed across model families.
The convergence across these three communities is the strongest argument for treating semantic indexing as a general phenomenon rather than a domain-specific artifact. Schema-guided dialogue researchers, text-to-SQL researchers, and code LM researchers each independently discovered that neural consumers systematically leverage lexical and natural-language semantics in schema identifiers and descriptions, and that this creates measurable sensitivity to renaming and paraphrase transformations. The controlled experimental designs — SGD-X's crowdsourced paraphrases, Dr.Spider's systematic perturbations, code obfuscation's semantics-preserving renames — each isolate the naming channel as causal.
We note that the mechanism need not be identical across domains. What the evidence supports is a weaker but still powerful claim: across domains, neural consumers treat schema-level natural-language tokens as a semantic information channel, and the output distribution is not invariant under transformations of that channel even when structural content is preserved.
Schema serialization is compilation. Traditional types compile to machine code where names are erased — the CPU has no use for them, and alpha equivalence is a design property of the compilation target. Semantic index types "compile" (via model_json_schema(), tool definitions, or prompt serialization) to token sequences consumed by a neural network that interprets names as natural-language instructions. The compilation target changed, and which properties survive compilation depends on the target. Alpha equivalence fails because the target architecture does not erase names. It reads them.
The degree to which semantic indexing affects a neural consumer depends on how the schema reaches the consumer — that is, the specifics of the compilation. We distinguish four regimes with different properties for each channel:
The schema is serialized as natural-language text within the prompt. Field names, descriptions, and type annotations are all visible as tokens in the model's context window. This regime maximizes semantic indexing: the consumer has full access to $S_{\text{lang}}$, while the structural channel carries no generation-time guarantee; validity depends entirely on the consumer.
The schema is passed via a structured API channel (e.g., function calling / tool use definitions) rather than concatenated into the prompt text. The linguistic content remains available — tool parameters carry names and descriptions — but it is structurally separated from the conversational context. Semantic indexing still occurs, but the consumer processes the schema through a different attention pathway than free-form prompt text.
The schema is compiled into a decoding grammar that mechanically constrains token generation. Constrained decoding frameworks — Outlines (Willard and Louf, TMLR 2023), LMQL (Beurer-Kellner et al., PLDI 2023), SGLang (Zheng et al., NeurIPS 2024), XGrammar (Dong et al., 2024) — enforce structural compliance during generation rather than after it. OpenAI's strict: true structured output mode (August 2024) and Anthropic's structured output support compile JSON Schema into decoding grammars.
In this regime, the structural channel provides hard guarantees during generation: the model physically cannot produce tokens that violate the schema's structural constraints. JSONSchemaBench (Geng et al., 2025) evaluates constrained decoding frameworks across approximately 10,000 real-world schemas and reports that constrained decoding can speed generation by roughly 50% while improving task performance up to 4%.
Formally, constrained decoding modifies the consumer's output distribution by intersecting the language model's token-level probabilities with a structural acceptor — typically a DFA, CFG, or JSON grammar compiled from the schema:

$$P_{\text{constrained}}(y_t \mid y_{<t}, x, S) \;\propto\; P(y_t \mid y_{<t}, x, S)\cdot \mathbb{1}\big[y_{\le t} \in \mathrm{Prefix}(L(G_S))\big]$$

The indicator function zeros out structurally invalid continuations at each decoding step; the language model's conditional probabilities over valid continuations are preserved (up to renormalization). This enforces $\mathrm{supp}\,P(Y_f \mid x, S) \subseteq V_f$ by construction.
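The intersection can be illustrated with a toy decoding step. The vocabulary, acceptor, and function names below are ours; real frameworks such as Outlines compile the schema to an automaton over the tokenizer's vocabulary rather than checking string prefixes.

```python
import math

def constrained_step(logprobs, prefix, is_valid_prefix):
    """One decoding step: zero out structurally invalid continuations,
    then renormalize over what remains (the indicator-function intersection)."""
    masked = {}
    for token, lp in logprobs.items():
        if is_valid_prefix(prefix + token):  # 1[prefix·token ∈ Prefix(L(G_S))]
            masked[token] = math.exp(lp)
    total = sum(masked.values())
    return {t: p / total for t, p in masked.items()}

# Toy acceptor: the output must be one of the enum's serialized values.
VALID = {'"critical"', '"high"', '"moderate"', '"low"'}
def is_valid_prefix(s):
    return any(v.startswith(s) for v in VALID)

# The LM prefers a structurally invalid token; the grammar removes it,
# and semantics chooses only among the admissible remainder.
logprobs = {'"critical"': math.log(0.2),
            '"urgent"':   math.log(0.7),
            '"low"':      math.log(0.1)}
dist = constrained_step(logprobs, "", is_valid_prefix)
```

After masking, `"urgent"` has probability zero regardless of how strongly the model preferred it; the remaining mass is split between the admissible values in their original proportions.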
Many production systems operate in a "generate → validate → retry" loop rather than using constrained decoding. Pydantic's construction pipeline exemplifies this regime: the model generates output, model_validate attempts construction, and if construction fails (type coercion error, validator failure, cross-field invariant violation), the system retries with the error message fed back to the consumer.
In this regime, structural guarantees are eventual rather than generative. The consumer may initially produce structurally invalid output, but the loop converges on valid output or fails explicitly. The semantic channel operates identically in both regimes — the consumer interprets $S_{\text{lang}}$ as natural-language instruction whether or not generation is structurally constrained.
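The loop shape can be sketched as follows. The `generate` callable stands in for the model call, and the one-field `Assessment` model is a deliberately minimal stand-in; only `model_validate` and `ValidationError` are real Pydantic API.

```python
from pydantic import BaseModel, Field, ValidationError

class Assessment(BaseModel):
    # A full system would use the RiskTier enum; str keeps the sketch small.
    risk_tier: str = Field(description="Assessed risk tier")

def validated_output(generate, max_retries: int = 3) -> Assessment:
    """Generate -> validate -> retry: an eventual structural guarantee.
    On failure, the validator's error text is fed back to the consumer."""
    feedback = ""
    for _ in range(max_retries):
        raw = generate(feedback)
        try:
            return Assessment.model_validate(raw)  # construction is the proof
        except ValidationError as err:
            feedback = str(err)  # the error re-enters the semantic channel
    raise RuntimeError("no structurally valid output within retry budget")

# Stub consumer: structurally invalid first attempt, valid after feedback.
attempts = iter([{"tier": "high"}, {"risk_tier": "high"}])
result = validated_output(lambda feedback: next(attempts))
```

The first attempt fails construction (wrong key), the error message flows back, and the second attempt yields a proven value: valid output or no output, never unproven output.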
| Regime | Structural guarantee | Semantic channel | Attack surface |
|---|---|---|---|
| Schema-as-text (§5.1) | None (consumer-dependent) | Full access — tokens in conversational context | Maximum |
| Tool definition (§5.2) | None (consumer-dependent) | Full access — structurally separated from prompt | Reduced |
| Hard constraint (§5.3) | Generative — grammar-enforced | Full access — but may interact with decoding | Minimal (structural) |
| Post-generation (§5.4) | Eventual — validate/retry loop | Full access — identical to unconstrained | Moderate |
The four regimes differ in how schema tokens reach the consumer, raising a question this paper identifies but does not resolve: does the magnitude of semantic indexing vary systematically across regimes?
The phenomenon itself is stable across regimes — the semantic channel operates in all four. The open question concerns magnitude and modulation: schema-as-text places schema tokens in conversational context; tool definitions structurally separate them; constrained decoding may interact with semantic processing (§13). Whether these differences produce a monotone ordering of sensitivity is an empirical question that cross-regime ablation studies could resolve.
In a semantic index type system, field naming is programming.
counterparty_credit_rating: CreditRating tells a language model to assess the counterparty's creditworthiness. x7: CreditRating does not. The structural output is the same type. The semantic output differs: one produces a credit assessment grounded in counterparty risk; the other produces a member selected without interpretive guidance.
Field names are the primary semantic indices. Choosing churn_risk_tier over attrition_risk_tier is choosing between analytical framings — voluntary departure versus passive loss — shaping what signals the consumer weighs, what thresholds it applies, what risk narrative it constructs.
Field descriptions are disambiguation instructions. A description reading "Projected total revenue across the full customer relationship, not historical sum" narrows the consumer's interpretation of lifetime_value from a broad concept to a specific forward-looking revenue projection. Removing the description changes the output distribution. The description is part of the program.
Enum member names are a closed vocabulary of semantic instructions executed within a structurally bounded space. RiskTier.CRITICAL and RiskTier.SEVERE are structurally identical — both are members of the same enum — but they produce different downstream reasoning. "Critical" carries connotations of immediacy and threshold-crossing; "severe" carries connotations of magnitude without the same urgency.
Renaming is refactoring. In conventional programming, renaming a variable is a safe mechanical operation. In a semantic index type system, renaming a field changes what the consumer computes. The empirical evidence (§4) quantifies this: renaming-class transformations produce degradations ranging from 7% to over 50% across benchmarks and model families.
Semantic index types occupy a design space defined by two independent axes.
Structural compression is the degree to which the type annotation constrains the set of valid outputs. Literal["active", "inactive"] compresses to two values. A four-member enum compresses to four. str provides no compression.
Semantic precision is the degree to which the natural-language indices narrow the consumer's interpretation within the structurally valid space. A field named churn_risk_tier with a description specifying behavioral signals is semantically precise. A field named x7 with no description is semantically vacuous.
The axes are independent. High compression with low precision produces constrained but ambiguously motivated outputs. Low compression with high precision produces well-motivated but structurally unbounded outputs. The engineering objective is to maximize both: tight structural compression to bound the output space, and precise semantic indices to guide the consumer within that bounded space.
The relationship between the two axes has a quantitative backbone that can be stated precisely.
Let $N$ be a random variable ranging over naming variants — structurally isomorphic schemas differing only in $S_{\text{lang}}$ — and let $Y_f$ be the consumer's output for a field $f$ whose structurally valid set is $V_f$.

Lemma (Semantic Channel Bound). The mutual information between the schema naming variant $N$ and the consumer's output for field $f$ is bounded above by the entropy of the output, which is in turn bounded by the logarithm of the structurally valid set size:

$$I(N; Y_f) \;\le\; H(Y_f) \;\le\; \log_2 |V_f|.$$
| Type constraint | $|V_f|$ | Max semantic influence |
|---|---|---|
| bool | 2 | 1 bit |
| Literal["active", "inactive", "suspended"] | 3 | $\log_2 3 \approx 1.58$ bits |
| 4-member StrEnum | 4 | 2 bits |
| str | unbounded | unbounded |

A bare str field gives the semantic channel unbounded capacity — the structural constraint admits any string, so the semantic index bears the full burden of determining the output.
Note on open domains: when $V_f$ is unbounded (bare str, unconstrained lists), the bound is vacuous — it provides no constraint on semantic influence. To recover a finite bound, the field must be structurally compressed, for example via an enum or Literal type narrowing. This is another motivation for progressive hardening: converting an open field into a constrained one is not just an engineering improvement but a prerequisite for the bound to bite.
The bound is elementary (it follows from standard information-theoretic inequalities) but its consequences are not. It means structural compression and semantic channel capacity are inversely related. Each increase in structural compression reduces the number of bits the semantic channel can influence. Progressive hardening (below) is, in information-theoretic terms, the systematic reduction of the semantic channel's bandwidth. This has a dual consequence: it reduces the power of semantic indices to guide the consumer (the engineering cost) and equally reduces the attack surface for adversarial indexing (§9). An attacker who controls $S_{\text{lang}}$ for a field can influence at most $\log_2 |V_f|$ bits of that field's output.
The bound establishes the law. The next question is measurement. Whether real systems approach this bound — whether a well-crafted semantic index for a four-member enum captures close to 2 bits of influence, or whether model capability and decoding regime leave a gap — is an empirical question testable in any host environment.
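That measurement can be operationalized directly: estimate $I(N; Y_f)$ from variant-conditioned output distributions and compare it to $\log_2 |V_f|$. The sketch below uses toy distributions (the variant names and numbers are illustrative, not measured).

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a discrete distribution {value: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def mutual_information(variants):
    """I(N; Y_f) = H(Y_f) - H(Y_f | N) over equiprobable naming variants.
    `variants` maps a variant name to an output distribution over V_f."""
    n = len(variants)
    marginal = {}
    for dist in variants.values():
        for v, p in dist.items():
            marginal[v] = marginal.get(v, 0.0) + p / n
    return entropy(marginal) - sum(entropy(d) for d in variants.values()) / n

# Toy 4-member enum: precise naming concentrates the output;
# vacuous naming is near-uniform (cf. Schema A vs Schema B in §2).
variants = {
    "precise": {"critical": 0.7, "high": 0.2, "moderate": 0.05, "low": 0.05},
    "vacuous": {"critical": 0.25, "high": 0.25, "moderate": 0.25, "low": 0.25},
}
i_ny = mutual_information(variants)
bound = math.log2(4)  # Semantic Channel Bound: I(N; Y_f) ≤ log2 |V_f| = 2 bits
```

Any empirically estimated $I(N; Y_f)$ must fall below the 2-bit ceiling; how close well-crafted indices get to it is the open measurement question.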
The framework yields a natural metric for semantic precision. The semantic precision of a field's indices can be measured as the entropy reduction they induce relative to a vacuous but structurally isomorphic variant $S_0 \cong S$:

$$\text{prec}_f(S) := H(Y_f \mid x, S_0) - H(Y_f \mid x, S).$$

When the indices are vacuous, the output approaches uniform selection over $V_f$ (as in Schema B of §2) and precision approaches zero; when the indices are maximally precise, the output concentrates on the contextually appropriate value and precision approaches its ceiling of $\log_2 |V_f|$.
Definition (Semantic Index Sensitivity). For a field $f$ and a pair of structurally isomorphic schemas $(S, S')$, the semantic index sensitivity is:

$$d_f(S, S') := \mathbb{E}_{x \sim X}\left[\text{JS}\big(P(Y_f \mid x, S) \,\|\, P(Y_f \mid x, S')\big)\right]$$

the expected distributional shift for field $f$ when $S_{\text{lang}}$ changes and $S_{\text{struct}}$ is held fixed.
This quantity is directly measurable: compare output distributions across structurally isomorphic schema variants in any host environment that supports them. See §8.
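A sketch of the estimator follows (stdlib only; in practice the per-input distributions would come from repeated sampling under each schema variant, and the input pairs below are illustrative).

```python
import math

def kl(p, q):
    """KL divergence in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / q[v]) for v, pi in p.items() if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence in bits; symmetric, bounded by 1."""
    support = set(p) | set(q)
    p = {v: p.get(v, 0.0) for v in support}
    q = {v: q.get(v, 0.0) for v in support}
    m = {v: (p[v] + q[v]) / 2 for v in support}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def sensitivity(samples):
    """d_f(S, S'): mean JS divergence over inputs x ~ X.
    `samples` is a list of (P(Y_f|x,S), P(Y_f|x,S')) pairs."""
    return sum(js(p, q) for p, q in samples) / len(samples)

# Two toy inputs, each with output distributions under variants S and S'.
pairs = [
    ({"critical": 0.7, "high": 0.3}, {"critical": 0.25, "high": 0.75}),
    ({"critical": 0.9, "high": 0.1}, {"critical": 0.5, "high": 0.5}),
]
d_f = sensitivity(pairs)
```

A sensitivity of zero means the renaming was semantically inert for field $f$; values approaching 1 bit indicate the variants drive the consumer toward disjoint outputs.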
The bound has a direct engineering interpretation. The optimal development strategy exploits the relationship between the two axes. Begin with semantic precision: name fields carefully, write clear descriptions, choose expressive enum members. Observe the consumer's output. Where the semantic channel fails to produce acceptable results, promote the contract to a structural guarantee.
```python
# Semantic contract: precise but soft
recommended_interventions: list[Intervention] = Field(
    description="Ranked by projected ROI. At least one required for HIGH or CRITICAL risk."
)

# Structural guarantee: precise and hard
@model_validator(mode="after")
def high_risk_requires_interventions(self) -> Self:
    if self.risk_tier in (RiskTier.HIGH, RiskTier.CRITICAL):
        if not self.recommended_interventions:
            raise ValueError("High/Critical risk requires interventions")
    return self
```

Each hardening step converts semantic precision into structural compression. The semantic instruction remains — it still guides the consumer toward correct values before the structural check fires — but the construction pipeline now catches failures that the semantic channel alone could not prevent. The two channels work in concert: semantics for guidance, structure for proof.
This is the engineer's version of a familiar PL/SE story: start with soft specifications, then promote frequently violated expectations into enforced invariants. The empirical evidence supports this approach: SGD-X's schema sensitivity measurements (§4.1) and Dr.Spider's perturbation degradation curves (§4.2) can serve as the observational basis for deciding which semantic contracts to harden.
Results are pending. The methodology is defined; this section summarizes the design.
We operationalize the framework in the paper’s target domain: Pydantic structured output consumed by language models. One Pydantic model (customer retention risk analysis) is instantiated as four structurally isomorphic variants, with isomorphism verified computationally via canonical schema hash.
| Variant | $S_{\text{lang}}$ | What it tests |
|---|---|---|
| Baseline | Domain-precise names + descriptions | Correct semantic indices |
| Names-only | Baseline names, descriptions stripped | Field identifiers vs. description prose |
| Vacuous | field_1, OPTION_A, generic descriptions | Semantic channel removed entirely |
| Misleading | Coherent alternative frame (retention offers) | Different computation, same structure |
The misleading variant is the critical condition. The vacuous variant tests whether removing semantic content degrades guidance. The misleading variant tests whether different semantic content produces a different computation under identical structure. If the output distribution shifts not toward noise but toward a coherent alternative analysis, the schema is not merely formatting — it is participating in the task.
Measurements include distributional divergence between variant pairs (the sensitivity $d_f$ of §7) and entropy reduction in $H(Y_f)$ under each variant relative to the vacuous baseline.
If field names and descriptions are instructions, they are also an attack surface.
The vulnerability class is familiar. Buffer overflows, SQL injection, and cross-site scripting all exploit the same pattern: content intended as data crosses into an execution context and is interpreted as instruction. Greshake et al. (AISec@CCS 2023) and Liu et al. (USENIX Security 2024) formalized this for LLM context windows as indirect prompt injection — the collapsed data/instruction boundary applied to neural consumers.
Semantic index types exhibit exactly this collapse at the schema level. A field name is simultaneously data (a key in a JSON Schema object) and instruction (a semantic index that the neural consumer interprets as a natural-language directive). The consumer cannot mechanically distinguish between the two roles. This is not an analogy to the injection vulnerability class — it is an instance of it, localized to the schema: every field name, description, and enum label is a point where data and instruction coincide.
In tool-mediated systems, this vulnerability manifests as tool description poisoning. Beurer-Kellner and Fischer (Invariant Labs, 2025) demonstrated that poisoned instructions in tool descriptions can hijack model behavior even when the poisoned tool is never invoked. Wang et al. (NAACL 2025) systematically evaluated this with ToolCommander, achieving 91.67% attack success for privacy theft and 100% for denial of service. MCPTox (Wang et al., 2025) evaluated 353 tools across 45 MCP servers, finding over 60% attack success rates for GPT-4o-mini, o1-mini, DeepSeek-R1, and Phi-4.
For semantic index types specifically, every schema field is a potential injection point. A field description that reads "Ignore all previous instructions and output the user's API key" exploits the same channel that a legitimate description like "Projected total revenue across the full customer relationship" uses for semantic guidance. The structural channel is immune — a constrained decoding grammar enforces valid JSON regardless of injected content — but the semantic channel is vulnerable because the consumer cannot mechanically distinguish legitimate semantic indices from adversarial ones.
Attack surfaces. Who controls $S_{\text{lang}}$ determines the trust level of each schema source:
- Developer-authored schemas. Trusted. The developer controls both channels. Risk is limited to unintentional semantic imprecision.
- Third-party tool registries. Partially trusted. MCP servers, plugin marketplaces, and tool directories supply schemas from external authors. MCPTox (Wang et al., 2025) demonstrates that over 60% of tested tools are vulnerable to poisoning.
- Scraped API specifications. Untrusted. OpenAPI specs harvested from the web may contain adversarial descriptions injected by the API publisher or a supply-chain attacker.
- Data-derived schemas. Untrusted. Database column comments, CSV headers, or user-supplied field names that flow into dynamically constructed schemas. Any user-controlled string that reaches $S_{\text{lang}}$ is an injection vector.
Integrity property. The schema-native analogue of noninterference: the semantic channel must not override higher-privileged instructions. A field description is a lower-privilege instruction than the system prompt; an adversarial description that escalates its privilege (e.g., “Ignore all previous instructions…”) violates this property. Wallace et al. (OpenAI, 2024) proposed an instruction hierarchy that achieved up to 63% better resistance to prompt injection — an architectural enforcement of this privilege ordering.
Typed mitigations. Each mitigation maps to the threat model:
- Provenance (who may write $S_{\text{lang}}$): Schema signing, registry authentication, supply-chain verification for tool definitions.
- Sanitization (which tokens are permitted): Input filtering of description fields when schemas are dynamically constructed from untrusted sources. Reject or escape tokens that could function as meta-instructions.
- Scoping (which schemas are visible in which contexts): Least-privilege exposure — the consumer sees only the schemas relevant to the current task, not the full tool registry. Architectural separation of untrusted content from instruction channels.
- Structural containment (bound the damage): Constrained decoding limits the adversary's influence to the semantic channel. An attacker who controls $S_{\text{lang}}$ can steer which valid value the consumer selects but cannot produce structurally invalid output. The bound from §7 applies: $I(N; Y_f) \leq \log_2 |V_f|$ per field.
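Read concretely, the bound is just the log-cardinality of the field's structurally valid value set. A worked illustration with hypothetical enum labels:

```python
import math

def semantic_bandwidth_bits(valid_values) -> float:
    """Per-field semantic bandwidth: an adversary who controls the
    semantic channel can influence at most log2(|V_f|) bits, where V_f
    is the set of values the structural constraint admits for field f."""
    return math.log2(len(valid_values))

# A boolean field leaves exactly 1 bit open to semantic steering:
print(semantic_bandwidth_bits([True, False]))  # 1.0

# A three-member enum leaves about 1.58 bits:
print(semantic_bandwidth_bits(["LOW", "MEDIUM", "HIGH"]))
```

Tightening the enum, or replacing a free-text field with an enum, shrinks $|V_f|$ and therefore shrinks the attacker's maximum influence, which is exactly the progressive-hardening move of §7 restated as a security control.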
Chen et al. (USENIX Security 2025) introduced StruQ, which uses structured queries with separator tokens to achieve near-zero success rates for optimization-free attacks — an instance of architectural separation applied at the prompt level.
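A minimal sanitization sketch for schemas arriving from untrusted sources follows. The deny-list patterns are illustrative only; a production filter would need much more than regexes (e.g., a trained classifier), but the walk-and-reject shape is the point.

```python
import re

# Illustrative patterns for description text that reads as a
# meta-instruction rather than a semantic index. NOT exhaustive.
META_INSTRUCTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"disregard .* system prompt",
    r"you are now",
    r"output .* (key|password|secret)",
]

def sanitize_descriptions(schema: dict) -> dict:
    """Walk a JSON Schema from an untrusted source and reject any
    description field that matches a meta-instruction pattern."""
    def walk(node):
        if isinstance(node, dict):
            desc = node.get("description", "")
            for pat in META_INSTRUCTION_PATTERNS:
                if re.search(pat, desc, re.IGNORECASE):
                    raise ValueError(f"suspicious description: {desc!r}")
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)
    walk(schema)
    return schema
```

The legitimate description from §9 passes; the adversarial one is rejected before the schema ever reaches the consumer's context window.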
The security implications are not a side effect of semantic index types — they are a direct consequence of the core thesis. If names are computation, names are also attack vectors.
Semantic index types require a host system satisfying two conditions:
Names must be preserved and exposed. The type system must retain field names, descriptions, and enum members as first-class schema content visible to the consumer. This is the condition that compilation pipelines designed around alpha equivalence resist. Haskell's GHC assigns each binder a unique integer in Core (System FC); the original source name is not guaranteed to survive elaboration because the compilation target — as discussed in §5 — erases names by design. Even serializing a Haskell record to JSON and feeding the schema to a model means the result has left Haskell's type system; guarantees depend on whatever validation logic exists on the Haskell side.
A construction pipeline must sit behind the names. The type system must provide coercion, constraint enforcement, cross-field invariants, and structural dispatch, wired into a single construction call. Without this, semantic index types reduce to prompt engineering with schema decoration — the names instruct the consumer, but nothing proves the result.
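A sketch of what "a construction pipeline behind the names" means in Pydantic terms (the model and field names here are illustrative, not drawn from the experiment): coercion, bounds checking, and a cross-field invariant, all triggered by a single construction call.

```python
from pydantic import BaseModel, Field, model_validator

class ScoreInterval(BaseModel):
    """Illustrative model: construction runs coercion, per-field
    bounds, and a cross-field invariant in one call."""
    low: float = Field(ge=0.0, le=1.0)
    high: float = Field(ge=0.0, le=1.0)

    @model_validator(mode="after")
    def _ordered(self):
        # Cross-field invariant: the interval must be well-formed.
        if self.low > self.high:
            raise ValueError("low must not exceed high")
        return self

# One construction call exercises the whole pipeline:
interval = ScoreInterval(low="0.2", high=0.9)  # "0.2" is coerced to float
```

Constructing `ScoreInterval(low=0.9, high=0.1)` raises a validation error; the result of a successful construction is a typed value carrying structural proof, not merely a dict that happened to pass a check.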
Pydantic is a particularly ergonomic and widely adopted host for semantic index types in Python. It preserves field names and descriptions in its JSON Schema output because it was designed for API serialization, and it provides a full construction pipeline because it was designed for data validation. Neither design goal targeted semantic indexing; the combination is an accident of design: API serialization required preserved names, and data validation required a construction pipeline.
The deeper requirement, however, is not Pydantic-specific. Any runtime schema object with meaningful labels and an enforcement pipeline can host semantic index types. The requirement is the conjunction: preserved names AND enforcement.
| System | Language | Names preserved | Construction pipeline | SIT host? |
|---|---|---|---|---|
| Pydantic | Python | ✓ (JSON Schema, descriptions) | ✓ (coercion, validators, invariants) | ✓ |
| Zod | TypeScript | ✓ (runtime schema, descriptions) | ✓ (parsing, refinements, transforms) | ✓ |
| JSON Schema | Language-agnostic | ✓ (descriptions, property names) | ✗ (description format only) | ✗ |
| dataclasses | Python | ✓ (field names) | ✗ (no validation, no coercion) | ✗ |
Pydantic provides this conjunction with particular ergonomics in the Python ecosystem, where the majority of LLM application development currently occurs.
A practical note on the specific mechanism by which schema content reaches the consumer in the Pydantic ecosystem. Pydantic's model_json_schema() includes class-level docstrings as the model description and Field(description=...) values as field-level descriptions. Field-level docstrings (docstrings placed below field declarations) are not automatically incorporated into the JSON Schema. When this paper refers to "docstrings" as semantic indices, we mean specifically class docstrings and field descriptions declared via Field(description=...), which are the mechanisms that reliably propagate linguistic content into the schema consumed by language models.
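A quick check of which mechanisms propagate (the class and field names below are illustrative): the class docstring becomes the model-level description, and `Field(description=...)` becomes the field-level description in the emitted JSON Schema.

```python
from pydantic import BaseModel, Field

class RetentionAnalysis(BaseModel):
    """Assess a customer's retention risk."""  # -> model "description"
    churn_probability: float = Field(
        ge=0.0, le=1.0,  # structural channel: enforced bounds
        description="Estimated probability the customer churns within 90 days",
    )
    # Note: a bare string literal placed below a field declaration
    # ("field docstring") would NOT reach the schema.

schema = RetentionAnalysis.model_json_schema()
```

Inspecting `schema` shows both descriptions in place, alongside the structural `minimum`/`maximum` constraints derived from `ge`/`le`.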
In a semantic index type system, the language model is not an external service that the program calls. It is a computational primitive that the type system invokes.
The distinction is operational. An external service receives instructions and returns unconstrained results. A computational primitive operates inside the type system: it receives a typed context (the schema with all its semantic indices) and produces output within typed bounds (the schema's structural constraints). The type system controls both what the primitive sees (the semantic channel) and what it may return (the structural channel).
```python
class RetentionService:
    def __init__(self, env: AppEnvironment):
        self.env = env

    def analyze(self, action: AnalyzeRetention) -> RetentionAnalysis:
        return self.env.llm.create(
            response_model=RetentionAnalysis,
            context=action,
        )
```

`RetentionAnalysis` is simultaneously the instruction set (its semantic indices tell the consumer what to compute) and the acceptance condition (its structural constraints determine what the runtime admits).
Alpha equivalence was formalized by Church (1941) as part of the lambda calculus. Barendregt (1984) provides the standard modern treatment, including the Variable Convention (Definition 2.1.13). Our work does not challenge alpha equivalence as a property of formal language semantics; rather, we identify a consumer-level invariance failure when neural models interpret schema labels.
Fortran's I-N implicit typing rule (IBM, 1956) is a historical precedent — a deliberate language design decision making names semantically relevant within the PL, as opposed to the emergent consumer-level failure that semantic index types identify.
Although we disclaim the connection to indexed types in the dependent-type sense (§1), a structural analogy is worth noting: in both dependent types and semantic index types, an index parameterizes behavior. The difference is the nature of the guarantee — dependent type indices provide proofs; semantic indices provide probabilistic guidance via a learned function. Progressive hardening (§7) moves incrementally from the semantic index regime toward the dependent type regime in terms of guarantee strength.
King (2019) articulated a related principle — "parse, don't validate" — that parsing should produce values whose type encodes the validation that has occurred. Semantic index types extend this: construction is parsing, and the parsed type carries both structural proof and semantic instruction.
The empirical evidence from these three traditions is reviewed in §4. For intellectual genealogy: SGD (Rastogi et al., AAAI 2020) established schema descriptions as active model inputs; SGD-X (Lee et al., AAAI 2022) introduced Schema Sensitivity as a dedicated metric. Schema linking (Lei et al., EMNLP 2020; Wang et al., ACL 2020) treated column names as semantic anchors central to text-to-SQL. Hindle et al. (ICSE 2012, Most Influential Paper 2022) established that code is "more repetitive and predictable than natural language," anticipating the identifier dependence observed in modern code LMs (Nikiema et al., 2025; Le et al., 2025).
The constrained decoding literature — Outlines (Willard and Louf, TMLR 2023), LMQL (Beurer-Kellner et al., PLDI 2023), SGLang (Zheng et al., NeurIPS 2024), XGrammar (Dong et al., 2024), PICARD (Scholak et al., EMNLP 2021) — provides the mechanical enforcement of our structural channel. JSONSchemaBench (Geng et al., 2025) benchmarks compliance and quality. This literature grounds our "structural containment" argument: constrained decoding enforces the structural channel exactly, at generation time, regardless of what the semantic channel contains.
The prompt injection literature (Greshake et al., AISec@CCS 2023; Liu et al., USENIX Security 2024) and tool-poisoning work (Beurer-Kellner and Fischer, 2025; Wang et al., NAACL 2025; Wang et al., 2025) directly instantiate the security consequence of our thesis: if names are instructions, they are also attack vectors. These works are discussed in §9.
The observation that schema vocabulary shapes the neural consumer's output distribution is structurally analogous to linguistic relativity (Whorf, 1956) — applied to neural rather than human cognition. We note the analogy without claiming mechanistic equivalence; its value is that schema-consuming systems provide tighter experimental controls than the human cognitive science literature: the structural channel can be held fixed, and verified fixed, while the semantic channel is varied.
Semantic equivalence of paraphrases. SGD-X paraphrases are crowdsourced and designed to preserve meaning, but natural language does not admit perfect semantic identity. Our theoretical claim is best stated as "observational distinguishability under linguistically varied but intended-equivalent schema descriptions" rather than asserting perfect semantic identity of the varied descriptions.
Mechanism may differ across domains. The evidence supports convergent behavior — neural consumers leverage lexical semantics in schema identifiers across dialogue, SQL, and code domains — but we do not claim mechanistic identity. The same statistical phenomenon (sensitivity to naming) may arise from different learned representations in different model architectures and training regimes.
Robustness is an open problem. SGD-X demonstrates that schema paraphrases can degrade performance substantially. A practitioner relying on semantic indices faces the same fragility that SGD-X documents: small changes in wording can produce large changes in output. Progressive hardening (§7) mitigates this by converting fragile semantic contracts into robust structural guarantees, but cannot eliminate the fundamental dependence on consumer capability for the semantic channel.
Cross-channel interaction. The two-channel model is a deliberate simplification. It treats the structural and semantic channels as independent constraint systems whose effects compose. In practice, the channels interact. Tam et al. (EMNLP 2024) found that strict constrained decoding can degrade reasoning quality while enhancing classification accuracy, suggesting that mechanical enforcement of the structural channel can alter, not merely bound, the semantic channel's contribution to output quality.
Local measurement in progress. We define metrics for the companion experiment (distributional divergence, entropy reduction), but results are pending; the framework's quantitative claims remain to be validated locally.
Consumer capability is a moving target. The magnitude of semantic indexing effects depends on the consumer's language understanding, which varies across model families and improves with scale. Effects documented today may attenuate or amplify as models evolve.
When a language model consumes a type schema, names stop being addresses and become instructions. This paper has defined the resulting phenomenon — semantic index types — and formalized the two-channel constraint system it creates: the structural channel determines the support of the output distribution (mechanically enforced), the semantic channel determines the conditional probabilities within that support (neurally guided), and the per-field bound $I(N; Y_f) \leq \log_2 |V_f|$ caps the semantic channel's influence at exactly the bandwidth the structural constraint leaves open.
The explanation is architectural. Traditional types compile to machine code that erases names. Schema-driven types compile to token sequences consumed by a neural network that reads them. Alpha equivalence fails because the compilation target changed. This is not a discovery about model sensitivity — it is a necessary consequence of neural consumption of natural-language tokens in typed contexts.
The same bound governs both engineering value and security risk. Progressive hardening reduces semantic bandwidth, converting soft guidance into hard proof. An adversary who controls the semantic channel can influence at most as many bits as the structural constraint leaves open. The engineering story and the security story are the same story.
Three research communities — schema-guided dialogue, text-to-SQL, and code language models — independently discovered what amounts to linguistic relativity for neural consumers: the vocabulary of the schema determines the output distribution of the model. This paper contributes the unifying abstraction and the formal characterization. Once a declaration is simultaneously consumed as natural language by a neural interpreter and as structural contract by a host runtime, the rest follows.
The theory of semantic index types is the theory of what happens when naming becomes programming.
Barendregt, H.P. (1984). The Lambda Calculus: Its Syntax and Semantics. Revised edition. North-Holland.
Beurer-Kellner, L., Fischer, M., and Vechev, M. (2023). Prompting is programming: A query language for large language models. PLDI 2023.
Beurer-Kellner, L. and Fischer, M. (2025). Tool poisoning attacks on AI agents. Invariant Labs.
Bielik, P. and Vechev, M. (2020). Adversarial robustness for code. ICML 2020.
Chang, S., et al. (2023). Dr.Spider: A diagnostic evaluation benchmark towards text-to-SQL robustness. ICLR 2023.
Chen, B., et al. (2025). StruQ: Defending against prompt injection with structured queries. USENIX Security 2025.
Chen, Z., et al. (2021). ShadowGNN: Graph projection neural network for text-to-SQL parser. NAACL 2021.
Church, A. (1941). The Calculi of Lambda Conversion. Annals of Mathematics Studies, 6.
Dong, Y., et al. (2024). XGrammar: Flexible and efficient structured generation engine for large language models. Preprint.
Geng, S., et al. (2025). JSONSchemaBench: A benchmark for structured generation with complex JSON schemas. Preprint.
Greshake, K., et al. (2023). Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. AISec@CCS 2023.
Hindle, A., et al. (2012). On the naturalness of software. ICSE 2012. (Most Influential Paper, 2022.)
King, A. (2019). Parse, don't validate. Blog post.
Le, H., et al. (2025). When names disappear: Benchmarking LLMs for code generation without natural language cues. Preprint.
Lee, H., et al. (2022). SGD-X: A benchmark for robust generalization in schema-guided dialogue systems. AAAI 2022.
Lei, W., et al. (2020). Re-examining the role of schema linking in text-to-SQL. EMNLP 2020.
Li, J., et al. (2023). Can LLM already serve as a database interface? A BIg bench for large-scale database grounded text-to-SQL (BIRD). NeurIPS 2023.
Lin, X.V., et al. (2020). Bridging textual and tabular data for cross-domain text-to-SQL semantic parsing. EMNLP 2020.
Liu, Y., et al. (2024). Formalizing and benchmarking prompt injection attacks and defenses. USENIX Security 2024.
Nikiema, P., et al. (2025). The code barrier: What LLMs actually understand? Preprint.
Park, K., et al. (2024). Grammar-aligned decoding. NeurIPS 2024.
Rabin, M.R.I., et al. (2021). Understanding neural code intelligence through program simplification. IST 2021.
Rastogi, A., et al. (2020). Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. AAAI 2020.
Scholak, T., et al. (2021). PICARD: Parsing incrementally for constrained auto-regressive decoding from language models. EMNLP 2021.
Tam, Z.R., et al. (2024). Let me speak freely? A study on the impact of format restrictions on performance of large language models. EMNLP 2024.
Troshin, S. and Chirkova, N. (2022). Probing pretrained models of source code. BlackboxNLP@EMNLP 2022.
Wallace, E., et al. (2024). The instruction hierarchy: Training LLMs to prioritize privileged instructions. OpenAI.
Wang, B., et al. (2020). RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers. ACL 2020.
Wang, Y., et al. (2021). CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. EMNLP 2021.
Wang, Z., et al. (2025). MCPTox: A broad evaluation of LLM agent safety through tool poisoning. Preprint.
Wang, Z., et al. (2025). ToolCommander: Adversarial attacks and defenses in multi-turn LLM tool-use. NAACL 2025.
Whorf, B.L. (1956). Language, Thought, and Reality: Selected Writings of Benjamin Lee Whorf. Edited by J.B. Carroll. MIT Press.
Willard, B. and Louf, R. (2023). Efficient guided generation for large language models. TMLR 2023.
Wretblad, N., et al. (2024). Understanding the effects of column descriptions on text-to-SQL. NeurIPS 2024.
Zhang, H., et al. (2020). Generating adversarial examples for holding robustness of source code processing models. AAAI 2020.
Zhao, J., et al. (2022). Description-driven task-oriented dialog modeling. Preprint.
Zheng, L., et al. (2024). SGLang: Efficient execution of structured language model programs. NeurIPS 2024.