Issue #33: Prompt and Context Lineage for Reproducible AI Systems

13 min read | June 20, 2026

A team sees an incorrect model response and checks the prompt version. The prompt did not change, so the investigation immediately shifts to the model.

That conclusion is often premature. The effective request may also have changed because retrieval returned a different document version, memory was compacted, a tool schema drifted, template variables resolved differently, instruction order changed, or the deployment used different model parameters.

In this issue, we define a provider-neutral prompt and context lineage contract. It records the dependencies of one generative AI operation, correlates them with distributed tracing, protects sensitive content by default, and establishes exactly what an incident replay can and cannot prove.

What You Are Defining

You are defining an application-owned lineage manifest for each model operation. The manifest is created by deterministic code around the model call. It is not a prompt and it is not model output.

The contract records:

Request, trace, conversation, service, and deployment identity
Prompt template ID, immutable version, and content hash
Protected hashes of resolved prompt variables
Final instruction order and source versions
Ordered context items with source, trust, sensitivity, and freshness
Retrieval query, index version, filter policy, and selected evidence
Memory, tool definitions, tool calls, and historical tool results
Requested model, response model, and runtime parameters
Assembled input hash, token usage, output hash, and policy decision
Privacy capture mode, retention class, and reconstruction level

The model remains probabilistic. Manifest creation, source identity, ordering, hashing, retention policy, and replay controls remain deterministic.

System Structure

The manifest sits beside the model operation. Request assembly produces the prepared record before the network call. The response path appends the terminal outcome. OpenTelemetry spans carry operational correlation while the complete manifest remains in a separately governed lineage store.

The diagram below shows the high-level control flow:

[Diagram: high-level control flow of the lineage manifest alongside the model operation]

This separation matters. Traces are optimized for operational correlation and bounded attributes. A lineage store is optimized for complete request dependencies, integrity verification, retention controls, and incident reconstruction.

The Manifest Is Not Sent to the Model

The manifest describes the request from outside the probabilistic boundary. Adding it to the model context would consume tokens, expose operational metadata, and allow the model to influence the evidence used to audit its own operation.

An abbreviated lifecycle view looks like this. Fields are omitted for readability, so this fragment is not a complete schema-valid manifest:

{
  "schemaVersion": "1.0.0",
  "recordType": "production",
  "manifestId": "...",
  "lifecycle": "prepared",
  "correlation": {
    "requestId": "...",
    "traceId": "...",
    "spanId": "..."
  },
  "prompt": {
    "templateId": "support-triage",
    "templateVersion": "4",
    "templateHash": {
      "algorithm": "SHA-256",
      "value": "..."
    }
  },
  "contextItems": [],
  "model": {},
  "request": {},
  "outcome": {
    "status": "unknown",
    "policyDecision": "not_evaluated"
  }
}

The application owns this record. The model does not generate it, approve it, or decide whether it should be retained.

Prompt Versioning Alone Is Not Enough

Prompt versioning answers which template was selected. It does not prove which template bytes were loaded, which variables were resolved, which instruction fragments were appended, or which runtime context was placed around the prompt.

The lineage contract therefore requires both version and content identity:

"prompt": {
  "templateId": "support-triage",
  "templateVersion": "4",
  "templateHash": {
    "algorithm": "SHA-256",
    "value": "8a2d973bc1d896f4e857705f7d0517ca07f5dbd5b505285414f47c012a61ac16"
  },
  "variables": [
    {
      "name": "account_region",
      "valueHash": {
        "algorithm": "HMAC-SHA-256",
        "keyId": "lineage-hmac-2026-02",
        "value": "33b4b061420b59765c25373b6754b736b338328b229cc8701e42317612f0b117"
      },
      "sensitivity": "internal"
    }
  ]
}

The version identifies the intended artifact. The hash detects an artifact that changed without a corresponding version change. Variable values use a keyed HMAC in this example because plain hashes do not protect guessable values such as regions, account tiers, email addresses, or short identifiers.
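As a minimal sketch of that split, the Python below computes a plain SHA-256 over template bytes and a keyed HMAC over a variable value. The function names, the template text, the region value, and the demo key are all hypothetical; in a real system the key would come from a managed key service and would never appear in code.

```python
import hashlib
import hmac


def content_hash(template_bytes: bytes) -> dict:
    """Integrity hash over the exact template bytes that were loaded."""
    return {"algorithm": "SHA-256", "value": hashlib.sha256(template_bytes).hexdigest()}


def protected_value_hash(value: str, key: bytes, key_id: str) -> dict:
    """Keyed hash for a guessable value; only the key ID is recorded, never the key."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"algorithm": "HMAC-SHA-256", "keyId": key_id, "value": digest}


# Hypothetical inputs for illustration only.
template_v4 = b"You are a support triage assistant for {account_region}."
demo_key = b"demo-key-not-for-production"

prompt_lineage = {
    "templateId": "support-triage",
    "templateVersion": "4",
    "templateHash": content_hash(template_v4),
    "variables": [
        {
            "name": "account_region",
            "valueHash": protected_value_hash("eu-west", demo_key, "lineage-hmac-2026-02"),
            "sensitivity": "internal",
        }
    ],
}

# A silent edit under the same version label yields a different content hash.
silently_edited = template_v4 + b" Be brief."
version_unchanged_but_bytes_differ = content_hash(silently_edited) != prompt_lineage["templateHash"]
```

Comparing the stored hash with the hash of the bytes actually loaded is what turns "version 4 was selected" into "these exact bytes were used".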

Create the Record Before the Model Call

If lineage is written only after a successful response, the evidence disappears for timeouts, provider failures, process crashes, and cancellations. The contract uses a two-phase lifecycle: a prepared record is persisted first, and the record then moves to exactly one terminal state.

prepared: persisted immediately before network execution
completed: response and policy outcome were recorded
failed: the operation ended with a classified failure
cancelled: execution was intentionally terminated

Completed records should be append-only or revisioned. A correction creates a superseding record or immutable-store revision rather than silently rewriting investigation evidence.
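One way to picture the lifecycle is an append-only list of revisions. The sketch below is an assumption about how a store could behave, not the contract's implementation: the in-memory list, the function names, the revision counter, and the "timeout" status value are all hypothetical.

```python
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def prepare(store: list, manifest_id: str) -> dict:
    """Phase 1: persist the prepared record before any network call."""
    record = {
        "manifestId": manifest_id,
        "revision": 1,
        "lifecycle": "prepared",
        "outcome": {"status": "unknown"},
    }
    store.append(record)
    return record


def finish(store: list, manifest_id: str, lifecycle: str, status: str) -> dict:
    """Phase 2: append a terminal revision; earlier revisions are never rewritten."""
    if lifecycle not in TERMINAL_STATES:
        raise ValueError(f"not a terminal state: {lifecycle}")
    revisions = [r for r in store if r["manifestId"] == manifest_id]
    if not revisions:
        raise ValueError("no prepared record exists for this manifest")
    latest = revisions[-1]
    if latest["lifecycle"] != "prepared":
        raise ValueError("manifest already reached a terminal state")
    record = {
        **latest,
        "revision": latest["revision"] + 1,
        "lifecycle": lifecycle,
        "outcome": {"status": status},
    }
    store.append(record)
    return record


def call_model() -> str:
    """Stand-in for the provider call; it always times out in this sketch."""
    raise TimeoutError("provider did not answer")


store: list = []
prepare(store, "m-1")
try:
    call_model()
    finish(store, "m-1", "completed", "ok")
except TimeoutError:
    # The prepared evidence already exists, so the failure is still traceable.
    finish(store, "m-1", "failed", "timeout")
```

Because the prepared revision is written first, a crash between the two phases still leaves a record that says what was about to be sent.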

Instruction and Context Order Are Part of the Input

Recording a set of source IDs is insufficient. The same content can produce different behavior when its position, instruction role, or surrounding context changes.

"instructions": [
  {
    "position": 0,
    "kind": "system",
    "source": {
      "system": "prompt-registry",
      "id": "support-system-policy",
      "version": "7"
    },
    "contentHash": {
      "algorithm": "SHA-256",
      "value": "..."
    }
  }
],
"contextItems": [
  {
    "position": 1,
    "kind": "retrieval_document",
    "source": {
      "system": "policy-index",
      "id": "refund-policy",
      "version": "7"
    },
    "contentHash": {
      "algorithm": "SHA-256",
      "value": "..."
    },
    "trust": "trusted_internal",
    "sensitivity": "internal",
    "tokenCount": 214
  }
]

The instruction categories are application lineage categories. They do not claim that every provider exposes the same role hierarchy. Provider adapters still own the mapping from the application's instruction model to a concrete API.
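To see why order has to be part of the evidence, here is a small sketch that folds position into an assembled input hash. The encoding (one compact JSON line per item, in position order) is a hypothetical choice for illustration; the contract's actual assembled-hash format is not specified in this issue, and the short content hash strings are placeholders.

```python
import hashlib
import json


def assembled_input_hash(items: list[dict]) -> str:
    """Hash the items in position order so that reordering changes the digest."""
    digest = hashlib.sha256()
    for item in sorted(items, key=lambda i: i["position"]):
        line = json.dumps(
            [item["position"], item["kind"], item["contentHash"]],
            separators=(",", ":"),
        )
        digest.update(line.encode("utf-8") + b"\n")
    return digest.hexdigest()


# Placeholder content hashes; real values would be full SHA-256 digests.
original = [
    {"position": 0, "kind": "system", "contentHash": "aaa"},
    {"position": 1, "kind": "retrieval_document", "contentHash": "bbb"},
    {"position": 2, "kind": "retrieval_document", "contentHash": "ccc"},
]

# Same three sources, but the two retrieved documents trade places.
reordered = [
    {"position": 0, "kind": "system", "contentHash": "aaa"},
    {"position": 1, "kind": "retrieval_document", "contentHash": "ccc"},
    {"position": 2, "kind": "retrieval_document", "contentHash": "bbb"},
]

same_sources_different_input = assembled_input_hash(original) != assembled_input_hash(reordered)
```

A set of source IDs would treat these two requests as identical. The ordered hash does not.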

Trust Labels Describe Origin, Not Truth

The schema supports four trust classifications:

trusted_internal
untrusted_external
user_supplied
derived

A trusted internal document can still be stale or incorrect. A user-supplied message can still contain valid evidence. Trust controls how content is handled; it is not a quality score and must not be presented as one.

Sensitivity is recorded separately as public, internal, confidential, or restricted. This separation allows a source to be operationally trusted while still requiring strict privacy handling.
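A short sketch of keeping the two axes independent: trust drives how content is framed for the model, sensitivity drives how it may be stored. The enum values come from the schema described above, but the `handling` function and both of its rules are hypothetical policy choices, shown only to illustrate that neither label implies the other.

```python
from enum import Enum


class Trust(Enum):
    TRUSTED_INTERNAL = "trusted_internal"
    UNTRUSTED_EXTERNAL = "untrusted_external"
    USER_SUPPLIED = "user_supplied"
    DERIVED = "derived"


class Sensitivity(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    RESTRICTED = "restricted"


def handling(trust: Trust, sensitivity: Sensitivity) -> dict:
    """Hypothetical policy: each decision reads exactly one of the two labels."""
    return {
        # Origin decides whether content is delimited as data rather than instructions.
        "delimit_as_data": trust is not Trust.TRUSTED_INTERNAL,
        # Privacy class decides whether raw capture may even be considered.
        "raw_capture_eligible": sensitivity in {Sensitivity.PUBLIC, Sensitivity.INTERNAL},
    }


# An operationally trusted source can still demand strict privacy handling.
trusted_but_restricted = handling(Trust.TRUSTED_INTERNAL, Sensitivity.RESTRICTED)
# A user message can be public yet still be treated as untrusted input.
public_but_user_supplied = handling(Trust.USER_SUPPLIED, Sensitivity.PUBLIC)
```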

Retrieval Needs Index Lineage

A document ID does not fully identify a retrieval result. Reindexing, rechunking, embedding changes, deletion, metadata-filter changes, and access-policy changes can all alter what the same query returns.

"retrieval": {
  "queryHash": {
    "algorithm": "HMAC-SHA-256",
    "keyId": "lineage-hmac-2026-02",
    "value": "..."
  },
  "indexId": "support-policy-index",
  "indexVersion": "2026-06-20T08:00:00Z",
  "topK": 3,
  "filterPolicyVersion": "tenant-region-filter-v5"
}

Selected documents still appear as ordered context items. The retrieval block records how the selection was requested, while each context item records what was actually included.

Tool Lineage Does Not Grant Tool Authority

A tool definition can change model behavior even when the tool never executes. The manifest therefore records the tool name, contract version, and canonical schema hash.

When a call is proposed or executed, the manifest can add the call ID, protected argument hash, protected result hash, and deterministic execution status. That is audit evidence, not authorization. Allowlists, scopes, approval gates, idempotency, and execution policy remain separate controls.

Requested Model and Response Model Are Different Fields

Applications know which model they requested. Providers may return a more specific model identifier or revision. Both belong in lineage when available.

"model": {
  "provider": "local-openai-compatible",
  "requestedModel": "llama3.2:3b",
  "responseModel": "llama3.2:3b",
  "endpointClass": "local",
  "parameters": {
    "temperature": 0,
    "topP": 0.9,
    "maxOutputTokens": 160,
    "seed": 42
  }
}

A recorded seed and temperature zero improve experiment control only when the runtime supports them. They do not guarantee identical output across different provider versions, hardware, kernels, batching states, or model revisions.
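During an investigation it helps to diff the model block of two manifests before comparing outputs. The sketch below is one simple way to do that; `flatten`, `model_changes`, and the changed values in the second record are hypothetical.

```python
def flatten(obj: dict, prefix: str = "") -> dict:
    """Turn nested dicts into dotted paths, e.g. parameters.topP."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def model_changes(before: dict, after: dict) -> list[str]:
    """List every model field whose value differs between two manifests."""
    a, b = flatten(before), flatten(after)
    return sorted(path for path in a.keys() | b.keys() if a.get(path) != b.get(path))


recorded = {
    "provider": "local-openai-compatible",
    "requestedModel": "llama3.2:3b",
    "responseModel": "llama3.2:3b",
    "endpointClass": "local",
    "parameters": {"temperature": 0, "topP": 0.9, "maxOutputTokens": 160, "seed": 42},
}

# Hypothetical later request: same requested model, different response model and topP.
later = {
    **recorded,
    "responseModel": "llama3.2:3b-hypothetical-rev2",
    "parameters": {**recorded["parameters"], "topP": 1.0},
}

changed = model_changes(recorded, later)
```

Here `changed` lists `parameters.topP` and `responseModel`, even though `requestedModel` is identical. That is exactly the case a requested-model-only record would miss.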

Three Reconstruction Levels

A production system should state what its evidence can reconstruct instead of calling every record replayable.

Exact Input

The complete effective input exists in an immutable, access-controlled snapshot. Its assembled input hash can be verified. This supports exact input reconstruction, not identical output reproduction.

Reference Resolvable

Every dependency has a versioned reference and hash, but content remains in source systems. Reconstruction works only while those source versions remain available and unchanged.

Metadata Only

The manifest preserves versions, hashes, classifications, and operational correlation without resolvable content. This supports change analysis and incident correlation, not request replay.
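The three levels can be derived from evidence rather than asserted. The following sketch assumes hypothetical inputs: a flag saying whether a verified snapshot exists, and a per-dependency `resolvable` flag that some resolver would have to supply. The level strings are illustrative names for the three headings above.

```python
def reconstruction_level(has_verified_snapshot: bool, dependencies: list[dict]) -> str:
    """Report the strongest level the available evidence actually supports."""
    if has_verified_snapshot:
        return "exact_input"
    fully_referenced = bool(dependencies) and all(
        dep.get("version") and dep.get("contentHash") and dep.get("resolvable")
        for dep in dependencies
    )
    if fully_referenced:
        return "reference_resolvable"
    return "metadata_only"


deps = [
    {"id": "support-system-policy", "version": "7", "contentHash": "...", "resolvable": True},
    {"id": "refund-policy", "version": "7", "contentHash": "...", "resolvable": True},
]

level_now = reconstruction_level(False, deps)

# If one source version is later purged, the level must drop rather than stay optimistic.
deps_after_purge = [deps[0], {**deps[1], "resolvable": False}]
level_after_purge = reconstruction_level(False, deps_after_purge)
```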

Raw Prompt Content Is Opt-In

Prompt and context observability can easily become a second uncontrolled customer-data store. Raw system instructions, user messages, retrieved documents, memory, tool arguments, and outputs should not enter ordinary traces by default.

Current OpenTelemetry GenAI conventions define system instructions, input messages, output messages, and tool definitions as opt-in attributes. The specification also warns that input messages are likely to contain user or PII data. That is the correct operational default: content capture is a risk decision, not a harmless debugging switch. See the OpenTelemetry GenAI span conventions.

The contract defines three capture modes:

metadata_only: identities, versions, classifications, counts, and protected hashes
referenced_content: governed pointers to versioned source systems
encrypted_content: approved raw snapshots in a separate encrypted content store

The JSON Schema requires a separate content-store reference when raw capture is enabled. Access to that store should be narrower than access to ordinary application logs or traces.
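The conditional rule can also be enforced in producing code before a record ever reaches schema validation. In this sketch the field names `captureMode` and `contentStoreRef` and the store URI are assumptions; the actual schema may name them differently.

```python
CAPTURE_MODES = {"metadata_only", "referenced_content", "encrypted_content"}


def privacy_errors(privacy: dict) -> list[str]:
    """Return rule violations for the privacy block; an empty list means acceptable."""
    errors = []
    mode = privacy.get("captureMode")
    if mode not in CAPTURE_MODES:
        errors.append(f"unknown captureMode: {mode!r}")
    # Raw capture is only acceptable with a pointer to the separate content store.
    if mode == "encrypted_content" and not privacy.get("contentStoreRef"):
        errors.append("encrypted_content requires contentStoreRef")
    return errors


default_block = {"captureMode": "metadata_only"}
raw_without_store = {"captureMode": "encrypted_content"}
raw_with_store = {
    "captureMode": "encrypted_content",
    "contentStoreRef": "content-store://hypothetical/snapshots/abc",
}

default_ok = privacy_errors(default_block) == []
raw_rejected = privacy_errors(raw_without_store)
raw_accepted = privacy_errors(raw_with_store) == []
```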

Hashing Is Not Redaction

A plain SHA-256 hash of a short or guessable value does not make that value anonymous. An attacker can hash likely email addresses, regions, account tiers, or status values and compare the results.

Use SHA-256 for integrity over high-entropy artifacts whose content is already appropriately governed. Use HMAC-SHA-256 with a managed key when equality comparison is required for sensitive or low-entropy values. Prefer omission when the value is not operationally necessary.

The HMAC key ID belongs in the manifest. The key itself does not.
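The guessing attack is easy to demonstrate. In this sketch the account tiers and the key are made-up values: an observer who sees a plain SHA-256 digest recovers the value from a three-item candidate list, while the same attempt against an HMAC digest fails without the key.

```python
import hashlib
import hmac

# Hypothetical low-entropy domain: there are only a few possible account tiers.
candidates = ["free", "pro", "enterprise"]

# What an observer would see if the manifest stored a plain hash.
leaked_plain = hashlib.sha256(b"enterprise").hexdigest()
recovered = [c for c in candidates if hashlib.sha256(c.encode("utf-8")).hexdigest() == leaked_plain]

# What the observer sees when the value is protected with a managed key.
managed_key = b"demo-key-not-for-production"  # assumption: held in a key service, never logged
leaked_hmac = hmac.new(managed_key, b"enterprise", hashlib.sha256).hexdigest()
guessed_without_key = [
    c for c in candidates if hashlib.sha256(c.encode("utf-8")).hexdigest() == leaked_hmac
]

# Holders of the key can still test equality, which is the point of keeping a hash at all.
same_tier = hmac.compare_digest(
    leaked_hmac, hmac.new(managed_key, b"enterprise", hashlib.sha256).hexdigest()
)
```

`recovered` contains `enterprise`, `guessed_without_key` is empty, and `same_tier` is true for anyone holding the key.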

Manifest Integrity Uses Canonical JSON

Hashing ordinary JSON text is fragile because whitespace and object-property order can change without changing the data. The contract uses the JSON Canonicalization Scheme from RFC 8785.

Remove the top-level integrity object
Canonicalize the remaining object using RFC 8785
Encode the canonical representation as UTF-8
Compute SHA-256
Store the lowercase hexadecimal digest as the payload hash

The checked-in example has a recomputed payload hash of db0890413bcbd0c1864d3b1c65c5eac5d51c8d8ee19f86c074bf5b0352505d2c. That verifies the example manifest structure. Individual content hashes remain illustrative because the referenced example source systems do not exist.
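The five steps can be sketched in a few lines of Python. One loud caveat: Python's standard `json` module is not an RFC 8785 implementation. Sorted keys with compact separators only approximate the canonical form for manifests limited to ASCII keys, strings, integers, booleans, and nulls, so a dedicated canonicalization library would be needed to match the contract exactly. The function name and the sample manifest are hypothetical.

```python
import hashlib
import json


def payload_hash(manifest: dict) -> str:
    """Approximate the integrity procedure for simple manifests (not full RFC 8785)."""
    # Step 1: remove the top-level integrity object.
    body = {key: value for key, value in manifest.items() if key != "integrity"}
    # Step 2: approximate canonical form with sorted keys and no insignificant whitespace.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    # Steps 3 to 5: UTF-8 encode, SHA-256, lowercase hexadecimal digest.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


manifest = {
    "schemaVersion": "1.0.0",
    "lifecycle": "prepared",
    "prompt": {"templateId": "support-triage", "templateVersion": "4"},
}
manifest["integrity"] = {"payloadHash": payload_hash(manifest)}

# Property order and the integrity block itself do not affect the digest.
shuffled = {
    "prompt": {"templateVersion": "4", "templateId": "support-triage"},
    "lifecycle": "prepared",
    "schemaVersion": "1.0.0",
}
stable = payload_hash(shuffled) == manifest["integrity"]["payloadHash"]

# Any change to the data does.
tampered = {**manifest, "lifecycle": "completed"}
detected = payload_hash(tampered) != manifest["integrity"]["payloadHash"]
```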

OpenTelemetry Is the Correlation Layer

The lineage contract does not replace OpenTelemetry. It maps standard operational fields into current GenAI conventions and keeps the full dependency record in the manifest.

Manifest                         OpenTelemetry GenAI
model.provider                  gen_ai.provider.name
model.requestedModel            gen_ai.request.model
model.responseModel             gen_ai.response.model
correlation.conversationId      gen_ai.conversation.id
request.inputTokenCount         gen_ai.usage.input_tokens
outcome.outputTokenCount        gen_ai.usage.output_tokens
system instructions             gen_ai.system_instructions (opt-in)
input messages                  gen_ai.input.messages (opt-in)
output messages                 gen_ai.output.messages (opt-in)
tool definitions                gen_ai.tool.definitions (opt-in)

The current OpenTelemetry GenAI registry marks these attributes as Development, so their names and semantics may still change. The internal lineage schema should therefore remain stable while the OpenTelemetry mapping is maintained as a versioned adapter. See the GenAI attribute registry.

Manifest IDs, document IDs, request IDs, hashes, and tenant IDs should not become metric dimensions. Keep high-cardinality request evidence in traces and the lineage store. Keep metric labels bounded.
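A versioned adapter can be as small as a lookup table. The sketch below uses only the non-opt-in rows of the mapping above and builds a plain dictionary instead of calling the OpenTelemetry SDK; the `span_attributes` function, the adapter version label, and the sample manifest are hypothetical.

```python
# Hypothetical label identifying which convention snapshot this table targets.
ADAPTER_VERSION = "genai-development-snapshot"

ATTRIBUTE_MAP = {
    "model.provider": "gen_ai.provider.name",
    "model.requestedModel": "gen_ai.request.model",
    "model.responseModel": "gen_ai.response.model",
    "correlation.conversationId": "gen_ai.conversation.id",
    "request.inputTokenCount": "gen_ai.usage.input_tokens",
    "outcome.outputTokenCount": "gen_ai.usage.output_tokens",
}


def span_attributes(manifest: dict) -> dict:
    """Project bounded operational fields onto span attributes; skip missing values."""
    attributes = {}
    for path, attribute in ATTRIBUTE_MAP.items():
        value = manifest
        for key in path.split("."):
            value = value.get(key) if isinstance(value, dict) else None
        if value is not None:
            attributes[attribute] = value
    return attributes


prepared_manifest = {
    "model": {
        "provider": "local-openai-compatible",
        "requestedModel": "llama3.2:3b",
        "responseModel": "llama3.2:3b",
    },
    "correlation": {"conversationId": "conv-123"},
    "request": {"inputTokenCount": 512},
    "outcome": {"status": "unknown"},  # no output tokens before a terminal state
}

attrs = span_attributes(prepared_manifest)
```

Raw instructions, messages, and tool definitions are deliberately absent from the table, so they cannot reach a span unless someone adds an explicit opt-in path.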

Walking an Incident Investigation

The companion repository includes an incident replay runbook. The investigation starts with evidence verification, not with another model call.

Retrieve the manifest by manifest ID, request ID, or trace ID
Validate the schema version and canonical payload hash
Determine the reconstruction level actually available
Resolve prompt, instructions, context, memory, retrieval, and tools by immutable version
Verify each source hash
Reassemble the recorded order and compare the assembled input hash
Disable production credentials and replace write tools with recording stubs
Choose one explicit comparison and change one variable at a time
Compare structure, policy, grounding, task quality, and operational behavior
Document supported conclusions and alternative explanations

Replay must not repeat refunds, deployments, deletions, email delivery, exports, or other side effects. Historical tool results may be injected as recorded context. The original tool should not be called merely to reproduce a past response.
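A recording stub makes that rule mechanical. In this sketch the class name, the call ID format, the tool name, and the recorded result are all hypothetical: the stub serves historical results by call ID, logs what the replayed model proposed, and refuses anything it has no recording for instead of falling through to a live tool.

```python
class RecordingToolStub:
    """Replay-only tool endpoint: returns recorded results and never executes anything."""

    def __init__(self, recorded_results: dict):
        self._recorded = dict(recorded_results)  # callId -> historical result
        self.proposed_calls = []  # what the replay tried to do, kept for comparison

    def call(self, call_id: str, tool_name: str, arguments: dict) -> dict:
        self.proposed_calls.append(
            {"callId": call_id, "tool": tool_name, "arguments": arguments}
        )
        if call_id not in self._recorded:
            # No silent fallback to the real tool: an unrecorded call is a finding.
            raise LookupError(f"no recorded result for {call_id}; refusing live execution")
        return self._recorded[call_id]


stub = RecordingToolStub({"call-1": {"status": "refund_issued"}})
result = stub.call("call-1", "issue_refund", {"orderId": "o-42"})

try:
    stub.call("call-2", "send_email", {"to": "customer"})
    unrecorded_blocked = False
except LookupError:
    unrecorded_blocked = True
```

The refund result is injected from history and the unrecorded email attempt is blocked, yet both proposals are retained so you can compare the replay's tool behavior with the original.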

The Artifact Is Machine-Validated

The repository includes a JSON Schema Draft 2020-12 contract and an explicitly marked example record. The example validates successfully with strict AJV validation and standard format support:

examples/support-request-lineage.json valid

Schema validation proves document shape, required fields, formats, and conditional rules. It does not prove that source references exist, HMAC values are correct, content is safe, or the model response is good. Those are different validation boundaries.

Why This Architecture Works

It records the complete request dependency graph instead of only the prompt name
It captures evidence before network execution can fail
It keeps source order, versions, trust, sensitivity, and freshness explicit
It separates integrity hashes from privacy-preserving HMACs
It defaults to metadata and references rather than raw content
It uses OpenTelemetry for correlation without forcing full manifests into spans
It states reconstruction capability honestly
It contains replay side effects before comparison begins
It distinguishes input reconstruction from probabilistic output reproduction

The contract does not make models deterministic. It makes the system around the model more explainable, inspectable, and accountable.

Potential Enhancements

To extend the operating model, consider:

Add SDK adapters that emit prepared and terminal manifests automatically
Add schema conformance tests to CI for every producing service
Store prompt, policy, and tool artifacts in immutable registries
Add signed manifests for cross-organization evidence exchange
Add retrieval chunker, embedding model, and reranker versions
Add evaluation-run lineage linked back to production manifests
Add automated reconstruction-level downgrade when sources expire
Add privacy-budget and tenant-deletion verification
Add an OpenTelemetry adapter compatibility matrix by convention version

Final Notes

A prompt is only one dependency in a production AI request. Context, retrieval, memory, tools, model configuration, deployment identity, and policy state all contribute to the final behavior.

When those dependencies are versioned, ordered, protected, correlated, and retained under explicit policy, incident review becomes evidence-based. When they are not, teams are left guessing whether the prompt, context, model, or surrounding system actually changed.

Explore the contract in the companion GitHub repository.

See you in the next issue.

Stay curious.
