
A team sees an incorrect model response and checks the prompt version. The prompt did not change, so the investigation immediately shifts to the model.
That conclusion is often premature. The effective request may also have changed because retrieval returned a different document version, memory was compacted, a tool schema drifted, template variables resolved differently, instruction order changed, or the deployment used different model parameters.
In this issue, we define a provider-neutral prompt and context lineage contract. It records the dependencies of one generative AI operation, correlates them with distributed tracing, protects sensitive content by default, and establishes exactly what an incident replay can and cannot prove.
What You Are Defining
You are defining an application-owned lineage manifest for each model operation. The manifest is created by deterministic code around the model call. It is not a prompt and it is not model output.
The contract records:
- Request, trace, conversation, service, and deployment identity
- Prompt template ID, immutable version, and content hash
- Protected hashes of resolved prompt variables
- Final instruction order and source versions
- Ordered context items with source, trust, sensitivity, and freshness
- Retrieval query, index version, filter policy, and selected evidence
- Memory, tool definitions, tool calls, and historical tool results
- Requested model, response model, and runtime parameters
- Assembled input hash, token usage, output hash, and policy decision
- Privacy capture mode, retention class, and reconstruction level
The model remains probabilistic. Manifest creation, source identity, ordering, hashing, retention policy, and replay controls remain deterministic.
System Structure
The manifest sits beside the model operation. Request assembly produces the prepared record before the network call. The response path appends the terminal outcome. OpenTelemetry spans carry operational correlation while the complete manifest remains in a separately governed lineage store.
The diagram below shows the high-level control flow:
[Diagram: high-level control flow]
This separation matters. Traces are optimized for operational correlation and bounded attributes. A lineage store is optimized for complete request dependencies, integrity verification, retention controls, and incident reconstruction.
The Manifest Is Not Sent to the Model
The manifest describes the request from outside the probabilistic boundary. Adding it to the model context would consume tokens, expose operational metadata, and allow the model to influence the evidence used to audit its own operation.
An abbreviated lifecycle view looks like this. Fields are omitted for readability, so this fragment is not a complete schema-valid manifest:
{
  "schemaVersion": "1.0.0",
  "recordType": "production",
  "manifestId": "...",
  "lifecycle": "prepared",
  "correlation": {
    "requestId": "...",
    "traceId": "...",
    "spanId": "..."
  },
  "prompt": {
    "templateId": "support-triage",
    "templateVersion": "4",
    "templateHash": {
      "algorithm": "SHA-256",
      "value": "..."
    }
  },
  "contextItems": [],
  "model": {},
  "request": {},
  "outcome": {
    "status": "unknown",
    "policyDecision": "not_evaluated"
  }
}
The application owns this record. The model does not generate it, approve it, or decide whether it should be retained.
Prompt Versioning Alone Is Not Enough
Prompt versioning answers which template was selected. It does not prove which template bytes were loaded, which variables were resolved, which instruction fragments were appended, or which runtime context was placed around the prompt.
The lineage contract therefore requires both version and content identity:
"prompt": {
  "templateId": "support-triage",
  "templateVersion": "4",
  "templateHash": {
    "algorithm": "SHA-256",
    "value": "8a2d973bc1d896f4e857705f7d0517ca07f5dbd5b505285414f47c012a61ac16"
  },
  "variables": [
    {
      "name": "account_region",
      "valueHash": {
        "algorithm": "HMAC-SHA-256",
        "keyId": "lineage-hmac-2026-02",
        "value": "33b4b061420b59765c25373b6754b736b338328b229cc8701e42317612f0b117"
      },
      "sensitivity": "internal"
    }
  ]
}
The version identifies the intended artifact. The hash detects an artifact that changed without a corresponding version change. Variable values use a keyed HMAC in this example because plain hashes do not protect guessable values such as regions, account tiers, email addresses, or short identifiers.
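As a minimal sketch of how such a prompt block might be produced, the Python below hashes the loaded template bytes with SHA-256 and a resolved variable with HMAC-SHA-256. The function names, the template text, the region value, and the demo key are assumptions for illustration; only the field names come from the fragment above.

```python
import hashlib
import hmac


def template_hash(template_bytes: bytes) -> dict:
    """Integrity hash over the exact template bytes that were loaded."""
    return {"algorithm": "SHA-256", "value": hashlib.sha256(template_bytes).hexdigest()}


def variable_hash(value: str, key: bytes, key_id: str) -> dict:
    """Keyed hash of a resolved variable; only the key ID is recorded, never the key."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"algorithm": "HMAC-SHA-256", "keyId": key_id, "value": digest}


# Hypothetical key for illustration only; a real key would come from a key manager.
DEMO_KEY = b"demo-only-key"

prompt_block = {
    "templateId": "support-triage",
    "templateVersion": "4",
    # Hypothetical template text; the hash covers bytes, not the version label.
    "templateHash": template_hash(b"Classify the ticket for {account_region}."),
    "variables": [
        {
            "name": "account_region",
            "valueHash": variable_hash("eu-west", DEMO_KEY, "lineage-hmac-2026-02"),
            "sensitivity": "internal",
        }
    ],
}
```

Hashing the bytes that were actually loaded, rather than the bytes the registry is expected to hold, is what lets the hash catch a template that changed under an unchanged version number.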
Create the Record Before the Model Call
If lineage is written only after a successful response, the evidence disappears for timeouts, provider failures, process crashes, and cancellations. The contract uses a two-phase lifecycle: a prepared record is persisted first, and a terminal state is appended when the operation ends.
- prepared: persisted immediately before network execution
- completed: response and policy outcome were recorded
- failed: the operation ended with a classified failure
- cancelled: execution was intentionally terminated
Completed records should be append-only or revisioned. A correction creates a superseding record or immutable-store revision rather than silently rewriting investigation evidence.
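The two-phase lifecycle can be sketched as a thin wrapper around the model call. This is an illustration under stated assumptions, not the contract's implementation: `LineageLog` is a hypothetical in-memory stand-in for an append-only lineage store, and `call_model` is any callable supplied by the caller.

```python
from typing import Callable, Optional

TERMINAL_STATES = {"completed", "failed", "cancelled"}


class LineageLog:
    """Hypothetical in-memory stand-in for an append-only lineage store."""

    def __init__(self) -> None:
        self.revisions = []  # revisions are only ever appended, never rewritten

    def latest(self, manifest_id: str) -> Optional[dict]:
        for record in reversed(self.revisions):
            if record["manifestId"] == manifest_id:
                return record
        return None

    def prepare(self, manifest_id: str, request: dict) -> None:
        if self.latest(manifest_id) is not None:
            raise ValueError("manifest already exists")
        self.revisions.append({
            "manifestId": manifest_id,
            "lifecycle": "prepared",
            "request": request,
            "outcome": {"status": "unknown"},
        })

    def finish(self, manifest_id: str, lifecycle: str, outcome: dict) -> None:
        if lifecycle not in TERMINAL_STATES:
            raise ValueError(f"not a terminal state: {lifecycle}")
        current = self.latest(manifest_id)
        if current is None or current["lifecycle"] != "prepared":
            raise ValueError("only a prepared record can receive a terminal outcome")
        self.revisions.append({**current, "lifecycle": lifecycle, "outcome": outcome})


def call_with_lineage(log: LineageLog, manifest_id: str, request: dict,
                      call_model: Callable[[dict], object]) -> object:
    # Phase 1: persist the prepared record before any network execution.
    log.prepare(manifest_id, request)
    try:
        response = call_model(request)
    except Exception as exc:
        # Phase 2 (failure): the evidence survives even though no response exists.
        log.finish(manifest_id, "failed", {"status": "error", "errorType": type(exc).__name__})
        raise
    # Phase 2 (success): append the terminal outcome as a new revision.
    log.finish(manifest_id, "completed", {"status": "ok"})
    return response
```

A cancellation path would call `finish` with `"cancelled"` in the same way. A process crash leaves only the prepared revision behind, which is itself useful evidence that the call was attempted.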
Instruction and Context Order Are Part of the Input
Recording a set of source IDs is insufficient. The same content can produce different behavior when its position, instruction role, or surrounding context changes.
"instructions": [
  {
    "position": 0,
    "kind": "system",
    "source": {
      "system": "prompt-registry",
      "id": "support-system-policy",
      "version": "7"
    },
    "contentHash": {
      "algorithm": "SHA-256",
      "value": "..."
    }
  }
],
"contextItems": [
  {
    "position": 1,
    "kind": "retrieval_document",
    "source": {
      "system": "policy-index",
      "id": "refund-policy",
      "version": "7"
    },
    "contentHash": {
      "algorithm": "SHA-256",
      "value": "..."
    },
    "trust": "trusted_internal",
    "sensitivity": "internal",
    "tokenCount": 214
  }
]
The instruction categories are application lineage categories. They do not claim that every provider exposes the same role hierarchy. Provider adapters still own the mapping from the application's instruction model to a concrete API.
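To illustrate why order belongs in the evidence, here is a small sketch of an assembled input hash that changes when positions change. The article does not specify how the assembled input hash is computed, so the construction below (hashing position, kind, and content hash in position order) is an assumption made for illustration.

```python
import hashlib
import json


def assembled_input_hash(items: list) -> str:
    """Hash the ordered sequence of (position, kind, contentHash) for all input parts."""
    ordered = sorted(items, key=lambda item: item["position"])
    positions = [item["position"] for item in ordered]
    if positions != list(range(len(ordered))):
        raise ValueError("positions must be unique and contiguous, starting at 0")
    digest = hashlib.sha256()
    for item in ordered:
        # One compact JSON line per item keeps field boundaries unambiguous.
        line = json.dumps(
            [item["position"], item["kind"], item["contentHash"]["value"]],
            separators=(",", ":"),
        )
        digest.update(line.encode("utf-8") + b"\n")
    return digest.hexdigest()
```

Under this construction, two requests with the same source IDs and content hashes still produce different assembled hashes if a retrieved document moved ahead of a system instruction.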
Trust Labels Describe Origin, Not Truth
The schema supports four trust classifications:
- trusted_internal
- untrusted_external
- user_supplied
- derived
A trusted internal document can still be stale or incorrect. A user-supplied message can still contain valid evidence. Trust controls how content is handled; it is not a quality score and must not be presented as one.
Sensitivity is recorded separately as public, internal, confidential, or restricted. This separation allows a source to be operationally trusted while still requiring strict privacy handling.
Retrieval Needs Index Lineage
A document ID does not fully identify a retrieval result. Reindexing, rechunking, embedding changes, deletion, metadata-filter changes, and access-policy changes can all alter what the same query returns.
"retrieval": {
  "queryHash": {
    "algorithm": "HMAC-SHA-256",
    "keyId": "lineage-hmac-2026-02",
    "value": "..."
  },
  "indexId": "support-policy-index",
  "indexVersion": "2026-06-20T08:00:00Z",
  "topK": 3,
  "filterPolicyVersion": "tenant-region-filter-v5"
}
Selected documents still appear as ordered context items. The retrieval block records how the selection was requested, while each context item records what was actually included.
Tool Lineage Does Not Grant Tool Authority
A tool definition can change model behavior even when the tool never executes. The manifest therefore records the tool name, contract version, and canonical schema hash.
When a call is proposed or executed, the manifest can add the call ID, protected argument hash, protected result hash, and deterministic execution status. That is audit evidence, not authorization. Allowlists, scopes, approval gates, idempotency, and execution policy remain separate controls.
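A sketch of how recorded tool lineage might be compared against the currently deployed tool set during an investigation. The entry fields `name`, `contractVersion`, and `schemaHash` are hypothetical names for the three recorded properties; the comparison itself grants no authority and only reports drift.

```python
def tool_drift(recorded: list, current: list) -> dict:
    """Report tools that were added, removed, or changed since the manifest was written."""
    rec = {tool["name"]: tool for tool in recorded}
    cur = {tool["name"]: tool for tool in current}

    def identity(tool: dict) -> tuple:
        # A tool is "the same" only if both version and schema hash match.
        return (tool["contractVersion"], tool["schemaHash"])

    return {
        "added": sorted(cur.keys() - rec.keys()),
        "removed": sorted(rec.keys() - cur.keys()),
        "changed": sorted(
            name for name in rec.keys() & cur.keys()
            if identity(rec[name]) != identity(cur[name])
        ),
    }
```

A schema hash that differs under an unchanged contract version is the interesting case: it points to a tool definition that drifted without a version bump.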
Requested Model and Response Model Are Different Fields
Applications know which model they requested. Providers may return a more specific model identifier or revision. Both belong in lineage when available.
"model": {
  "provider": "local-openai-compatible",
  "requestedModel": "llama3.2:3b",
  "responseModel": "llama3.2:3b",
  "endpointClass": "local",
  "parameters": {
    "temperature": 0,
    "topP": 0.9,
    "maxOutputTokens": 160,
    "seed": 42
  }
}
A recorded seed and temperature zero improve experiment control only when the runtime supports them. They do not guarantee identical output across different provider versions, hardware, kernels, batching states, or model revisions.
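One way to populate both model fields is sketched below, assuming an OpenAI-compatible response body that reports the serving model in a top-level `model` field. The helper names are hypothetical, and the response model is left as `None` when the provider returned nothing (for example, after a timeout).

```python
from typing import Optional


def model_block(provider: str, requested_model: str, parameters: dict,
                response: Optional[dict]) -> dict:
    """Record what was requested and, when available, what the provider reported."""
    response_model = (response or {}).get("model")
    return {
        "provider": provider,
        "requestedModel": requested_model,
        "responseModel": response_model,  # None when no response was received
        "parameters": dict(parameters),   # copy so later mutation cannot alter evidence
    }


def model_resolution_differs(block: dict) -> bool:
    """True when the provider reported a different identifier than was requested."""
    response_model = block["responseModel"]
    return response_model is not None and response_model != block["requestedModel"]
```

Keeping the two fields separate means a provider-side alias change shows up as a difference between manifests, not as an unexplained behavior change.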
Three Reconstruction Levels
A production system should state what its evidence can reconstruct instead of calling every record replayable.
Exact Input
The complete effective input exists in an immutable, access-controlled snapshot. Its assembled input hash can be verified. This supports exact input reconstruction, not identical output reproduction.
Reference Resolvable
Every dependency has a versioned reference and hash, but content remains in source systems. Reconstruction works only while those source versions remain available and unchanged.
Metadata Only
The manifest preserves versions, hashes, classifications, and operational correlation without resolvable content. This supports change analysis and incident correlation, not request replay.
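The three levels can be expressed as a small classification function. This is a sketch: the level identifiers and the dependency fields `version`, `contentHash`, and `sourceAvailable` are assumptions, not names taken from the schema.

```python
def reconstruction_level(has_verified_snapshot: bool, dependencies: list) -> str:
    """Classify what the available evidence can reconstruct, most capable level first."""
    if has_verified_snapshot:
        # The full effective input exists and its assembled input hash was verified.
        return "exact_input"
    if dependencies and all(
        dep.get("version") and dep.get("contentHash") and dep.get("sourceAvailable")
        for dep in dependencies
    ):
        # Every dependency can still be fetched by version and checked by hash.
        return "reference_resolvable"
    # Versions and hashes may exist, but the content can no longer be resolved.
    return "metadata_only"
```

Because a single expired source drops the whole record to metadata only, the level is best recomputed at investigation time, not trusted from the value stored when the record was written.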
Raw Prompt Content Is Opt-In
Prompt and context observability can easily become a second uncontrolled customer-data store. Raw system instructions, user messages, retrieved documents, memory, tool arguments, and outputs should not enter ordinary traces by default.
Current OpenTelemetry GenAI conventions define system instructions, input messages, output messages, and tool definitions as opt-in attributes. The specification also warns that input messages are likely to contain user or PII data. That is the correct operational default: content capture is a risk decision, not a harmless debugging switch. See the OpenTelemetry GenAI span conventions.
The contract defines three capture modes:
- metadata_only: identities, versions, classifications, counts, and protected hashes
- referenced_content: governed pointers to versioned source systems
- encrypted_content: approved raw snapshots in a separate encrypted content store
The JSON Schema requires a separate content-store reference when raw capture is enabled. Access to that store should be narrower than access to ordinary application logs or traces.
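The conditional rule can be mirrored in application code as a pre-write check. The sketch below is plain Python, not the repository's JSON Schema, and the field names `captureMode` and `contentStoreRef` are hypothetical.

```python
CAPTURE_MODES = {"metadata_only", "referenced_content", "encrypted_content"}


def check_privacy_block(privacy: dict) -> list:
    """Return a list of problems; an empty list means the block passes this check."""
    errors = []
    mode = privacy.get("captureMode")
    if mode not in CAPTURE_MODES:
        errors.append(f"unknown capture mode: {mode!r}")
    elif mode == "encrypted_content" and not privacy.get("contentStoreRef"):
        # Raw capture is only acceptable with a pointer to the governed content store.
        errors.append("encrypted_content requires a contentStoreRef")
    return errors
```

Running a check like this before persisting the prepared record keeps a misconfigured service from silently writing raw content without a governed destination.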
Hashing Is Not Redaction
A plain SHA-256 hash of a short or guessable value does not make that value anonymous. An attacker can hash likely email addresses, regions, account tiers, or status values and compare the results.
Use SHA-256 for integrity over high-entropy artifacts whose content is already appropriately governed. Use HMAC-SHA-256 with a managed key when equality comparison is required for sensitive or low-entropy values. Prefer omission when the value is not operationally necessary.
The HMAC key ID belongs in the manifest. The key itself does not.
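The guessing attack is easy to demonstrate. In this sketch the region list and the key are made up for illustration: an attacker who sees a plain SHA-256 digest recovers the value by hashing a short candidate list, while the same attempt against an HMAC digest fails without the key.

```python
import hashlib
import hmac


def plain_hash(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_hash(value: str, key: bytes) -> str:
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def guess_value(target_digest: str, candidates: list, hasher) -> object:
    """Return the first candidate whose digest matches, or None."""
    for candidate in candidates:
        if hasher(candidate) == target_digest:
            return candidate
    return None


# Hypothetical low-entropy value space and key, for illustration only.
REGIONS = ["us-east", "eu-west", "ap-south"]
SECRET_KEY = b"held-in-a-key-manager"

leaked_plain = plain_hash("eu-west")
leaked_keyed = keyed_hash("eu-west", SECRET_KEY)

# Without the key, only the plain digest can be reversed by enumeration.
recovered = guess_value(leaked_plain, REGIONS, plain_hash)
not_recovered = guess_value(leaked_keyed, REGIONS, plain_hash)

# With the key, equality comparison still works for authorized systems.
matches_with_key = hmac.compare_digest(leaked_keyed, keyed_hash("eu-west", SECRET_KEY))
```

The protection depends entirely on key custody: anyone holding the HMAC key can run the same enumeration, which is why the key belongs in a managed key store and only its ID in the manifest.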
Manifest Integrity Uses Canonical JSON
Hashing ordinary JSON text is fragile because whitespace and object-property order can change without changing the data. The contract uses the JSON Canonicalization Scheme from RFC 8785.
- Remove the top-level integrity object
- Canonicalize the remaining object using RFC 8785
- Encode the canonical representation as UTF-8
- Compute SHA-256
- Store the lowercase hexadecimal digest as the payload hash
The checked-in example has a recomputed payload hash of db0890413bcbd0c1864d3b1c65c5eac5d51c8d8ee19f86c074bf5b0352505d2c. That verifies the example manifest structure. Individual content hashes remain illustrative because the referenced example source systems do not exist.
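The steps above can be sketched with the standard library, with one important caveat: `json.dumps` with sorted keys and compact separators only approximates RFC 8785. It is expected to agree for manifests limited to ASCII keys, strings, small integers, booleans, and null, and it can diverge on floating-point formatting and non-ASCII key ordering, so a real implementation should use a dedicated RFC 8785 library. The `integrity.payloadHash` layout and the `seal` and `verify` helpers are assumptions for illustration.

```python
import hashlib
import hmac
import json


def payload_hash(manifest: dict) -> str:
    """SHA-256 over an approximately canonical form of the manifest, minus integrity."""
    body = {key: value for key, value in manifest.items() if key != "integrity"}
    canonical = json.dumps(
        body,
        sort_keys=True,          # stable property order
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,      # emit UTF-8 text, not \u escapes
        allow_nan=False,         # NaN and Infinity are not valid JSON
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def seal(manifest: dict) -> dict:
    """Return a copy with the payload hash stored under a top-level integrity object."""
    integrity = {"payloadHash": {"algorithm": "SHA-256", "value": payload_hash(manifest)}}
    return {**manifest, "integrity": integrity}


def verify(manifest: dict) -> bool:
    """Recompute the payload hash and compare it with the recorded value."""
    recorded = manifest.get("integrity", {}).get("payloadHash", {}).get("value")
    if not isinstance(recorded, str):
        return False
    return hmac.compare_digest(recorded, payload_hash(manifest))
```

Because the integrity object is excluded before hashing, sealing a manifest does not change the digest that verification recomputes.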
OpenTelemetry Is the Correlation Layer
The lineage contract does not replace OpenTelemetry. It maps standard operational fields into current GenAI conventions and keeps the full dependency record in the manifest.
| Manifest | OpenTelemetry GenAI |
| --- | --- |
| model.provider | gen_ai.provider.name |
| model.requestedModel | gen_ai.request.model |
| model.responseModel | gen_ai.response.model |
| correlation.conversationId | gen_ai.conversation.id |
| request.inputTokenCount | gen_ai.usage.input_tokens |
| outcome.outputTokenCount | gen_ai.usage.output_tokens |
| system instructions | gen_ai.system_instructions (opt-in) |
| input messages | gen_ai.input.messages (opt-in) |
| output messages | gen_ai.output.messages (opt-in) |
| tool definitions | gen_ai.tool.definitions (opt-in) |

The current OpenTelemetry GenAI registry marks these attributes as Development. That means the internal lineage schema should remain stable while the OpenTelemetry mapping is maintained as a versioned adapter. See the GenAI attribute registry.
Manifest IDs, document IDs, request IDs, hashes, and tenant IDs should not become metric dimensions. Keep high-cardinality request evidence in traces and the lineage store. Keep metric labels bounded.
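A versioned adapter for the non-content fields might look like the sketch below. It builds a plain attribute dictionary using the attribute names from the mapping above and deliberately omits the opt-in content attributes; the function name and the adapter version label are assumptions, and no OpenTelemetry SDK call is shown.

```python
# Hypothetical label for the convention snapshot this adapter targets.
ADAPTER_VERSION = "genai-development-snapshot"

# Span attribute name -> (manifest section, manifest field)
ATTRIBUTE_MAPPING = {
    "gen_ai.provider.name": ("model", "provider"),
    "gen_ai.request.model": ("model", "requestedModel"),
    "gen_ai.response.model": ("model", "responseModel"),
    "gen_ai.conversation.id": ("correlation", "conversationId"),
    "gen_ai.usage.input_tokens": ("request", "inputTokenCount"),
    "gen_ai.usage.output_tokens": ("outcome", "outputTokenCount"),
}


def span_attributes(manifest: dict) -> dict:
    """Project bounded, non-content manifest fields onto span attribute names."""
    attributes = {}
    for attribute_name, (section, field) in ATTRIBUTE_MAPPING.items():
        value = manifest.get(section, {}).get(field)
        if value is not None:  # skip fields the manifest has not recorded yet
            attributes[attribute_name] = value
    return attributes
```

The resulting dictionary can be handed to whatever tracing API the service uses. Keeping the mapping in one table makes a convention rename a one-line adapter change that leaves the manifest schema untouched.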
Walking an Incident Investigation
The companion repository includes an incident replay runbook. The investigation starts with evidence verification, not with another model call.
- Retrieve the manifest by manifest ID, request ID, or trace ID
- Validate the schema version and canonical payload hash
- Determine the reconstruction level actually available
- Resolve prompt, instructions, context, memory, retrieval, and tools by immutable version
- Verify each source hash
- Reassemble the recorded order and compare the assembled input hash
- Disable production credentials and replace write tools with recording stubs
- Choose one explicit comparison and change one variable at a time
- Compare structure, policy, grounding, task quality, and operational behavior
- Document supported conclusions and alternative explanations
Replay must not repeat refunds, deployments, deletions, email delivery, exports, or other side effects. Historical tool results may be injected as recorded context. The original tool should not be called merely to reproduce a past response.
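A recording stub for write tools could be sketched as follows. The class name, the call-ID keyed result map, and the `not_recorded` status are assumptions for illustration; the point is that the stub returns historical results and logs what the replay attempted without ever reaching the real tool.

```python
class RecordingToolStub:
    """Stands in for a write tool during replay; never performs the side effect."""

    def __init__(self, name: str, recorded_results: dict) -> None:
        self.name = name
        self._recorded = dict(recorded_results)  # callId -> historical result
        self.calls = []  # everything the replay attempted, kept for later comparison

    def __call__(self, call_id: str, arguments: dict) -> dict:
        self.calls.append({"tool": self.name, "callId": call_id, "arguments": arguments})
        if call_id in self._recorded:
            # Inject the historical result exactly as it was recorded.
            return self._recorded[call_id]
        # A call the original run never made: report it without executing anything.
        return {"status": "not_recorded"}


# Hypothetical replay of a refund tool with one recorded historical call.
refund_stub = RecordingToolStub("issue_refund", {"call-1": {"status": "ok", "refundId": "r-9"}})
replayed = refund_stub("call-1", {"orderId": "o-42"})
unexpected = refund_stub("call-2", {"orderId": "o-43"})
```

Calls that come back as not recorded are themselves a finding: they show the replay diverged from the original run before any output comparison is made.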
The Artifact Is Machine-Validated
The repository includes a JSON Schema Draft 2020-12 contract and an explicitly marked example record. The example validates successfully with strict AJV validation and standard format support:
examples/support-request-lineage.json valid
Schema validation proves document shape, required fields, formats, and conditional rules. It does not prove that source references exist, HMAC values are correct, content is safe, or the model response is good. Those are different validation boundaries.
Why This Architecture Works
- It records the complete request dependency graph instead of only the prompt name
- It captures evidence before network execution can fail
- It keeps source order, versions, trust, sensitivity, and freshness explicit
- It separates integrity hashes from privacy-preserving HMACs
- It defaults to metadata and references rather than raw content
- It uses OpenTelemetry for correlation without forcing full manifests into spans
- It states reconstruction capability honestly
- It contains replay side effects before comparison begins
- It distinguishes input reconstruction from probabilistic output reproduction
The contract does not make models deterministic. It makes the system around the model more explainable, inspectable, and accountable.
Potential Enhancements
To extend the operating model, consider the following:
- Add SDK adapters that emit prepared and terminal manifests automatically
- Add schema conformance tests to CI for every producing service
- Store prompt, policy, and tool artifacts in immutable registries
- Add signed manifests for cross-organization evidence exchange
- Add retrieval chunker, embedding model, and reranker versions
- Add evaluation-run lineage linked back to production manifests
- Add automated reconstruction-level downgrade when sources expire
- Add privacy-budget and tenant-deletion verification
- Add an OpenTelemetry adapter compatibility matrix by convention version
Final Notes
A prompt is only one dependency in a production AI request. Context, retrieval, memory, tools, model configuration, deployment identity, and policy state all contribute to the final behavior.
When those dependencies are versioned, ordered, protected, correlated, and retained under explicit policy, incident review becomes evidence-based. When they are not, teams are left guessing whether the prompt, context, model, or surrounding system actually changed.
Explore the contract repository on GitHub.
See you in the next issue.
Stay curious.