Issue #30: MCP Tool Contract Gates for AI Systems

9 min read | May 30, 2026

A lot of agent systems still talk about MCP as if adding a new server were just another integration step. Point the agent at a different server, let it discover one more tool, and assume the role surface is effectively unchanged.

That is the wrong shape for production AI engineering. A live MCP server publishes executable capability. If a candidate profile adds a destructive tool or changes the required arguments for an existing tool, that is a contract change, not just a wiring detail.

In this issue, we build a local C# gate that launches a real MCP server over STDIO, discovers both a baseline and a candidate profile through the MCP client, compares their live tool contracts, probes the required tools through real MCP calls, and deterministically decides whether the candidate should be promoted or rolled back.

What You Are Building

You are building a production-shaped MCP promotion workflow that keeps the live tool contract explicit:

Load runtime configuration from appsettings.json, with overrides from MCPGATE_-prefixed environment variables
Launch a real MCP server profile over STDIO using dotnet <server.dll> baseline or dotnet <server.dll> candidate
Discover the live MCP tools with ListToolsAsync()
Extract actual tool names, input schemas, and MCP annotations from the protocol response
Diff baseline and candidate profiles for added tools, removed tools, schema breakage, read-only hint regressions, and newly added destructive tools
Replay frozen role cases by calling the live MCP tools with structured arguments
Fail when forbidden tools leak into the role surface or when a required MCP schema no longer matches
Apply deterministic promotion rules to decide Promote, Hold, or Rollback
Persist the full report as JSON for later review

This is the control layer that becomes necessary once MCP servers can evolve independently of the agent prompt and application code.

System Structure

The architecture is intentionally small. One MCP server host exposes either a baseline or candidate tool profile. The gate client launches that server twice over STDIO, performs live MCP discovery against both profiles, compares the discovered tool contracts, replays frozen role probes through real MCP calls, applies the promotion gate, and saves a report.

The diagram below shows the high-level control flow:

Runtime Configuration First

The app starts by loading the gate profile before any MCP session is opened:

{
  "Experiment": {
    "ServerCommand": "dotnet",
    "BaselineProfile": "baseline",
    "CandidateProfile": "candidate",
    "DatasetPath": "data/mcp_tool_gate_eval.json",
    "ReportDirectory": "data/reports"
  },
  "Promotion": {
    "BlockOnRemovedTools": true,
    "BlockOnSchemaBreakage": true,
    "BlockOnReadOnlyHintRegression": true,
    "BlockOnNewDestructiveTools": true,
    "MaxNewDestructiveTools": 0,
    "MaxBrokenCases": 0
  }
}

This matters because the promotion boundary is operational. Which server command to launch, which profile is baseline, which profile is candidate, which dataset defines the role contract, and which changes trigger rollback are visible system controls rather than hidden release assumptions.
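As a rough illustration of how that layering could be wired, the sketch below uses Microsoft.Extensions.Configuration. It assumes the Microsoft.Extensions.Configuration.Json, .EnvironmentVariables, and .Binder packages are referenced. The PromotionOptions and GateConfig types are hypothetical names, the JSON is inlined so the sample runs without an appsettings.json on disk, and the environment variable is set in-process only to simulate a CI override.

```csharp
using System;
using System.IO;
using System.Text;
using Microsoft.Extensions.Configuration;

// Stand-in for appsettings.json so the sketch runs without extra files.
const string json = """
{
  "Promotion": {
    "BlockOnSchemaBreakage": true,
    "MaxNewDestructiveTools": 0,
    "MaxBrokenCases": 0
  }
}
""";

// Simulates MCPGATE_Promotion__MaxBrokenCases=2 being exported by a shell or CI job.
Environment.SetEnvironmentVariable("MCPGATE_Promotion__MaxBrokenCases", "2");

var promotion = GateConfig.Load(json);
Console.WriteLine($"MaxBrokenCases = {promotion.MaxBrokenCases}"); // prints "MaxBrokenCases = 2"

// Hypothetical options type mirroring part of the "Promotion" section.
public sealed class PromotionOptions
{
    public bool BlockOnSchemaBreakage { get; set; }
    public int MaxNewDestructiveTools { get; set; }
    public int MaxBrokenCases { get; set; }
}

public static class GateConfig
{
    public static PromotionOptions Load(string json)
    {
        using var stream = new MemoryStream(Encoding.UTF8.GetBytes(json));

        // Providers added later win, so environment values override the JSON values.
        // The prefix is stripped and "__" maps to the ":" section separator.
        var configuration = new ConfigurationBuilder()
            .AddJsonStream(stream)
            .AddEnvironmentVariables(prefix: "MCPGATE_")
            .Build();

        return configuration.GetSection("Promotion").Get<PromotionOptions>()
            ?? throw new InvalidOperationException("Missing 'Promotion' section.");
    }
}
```

In a real project the stream would likely be replaced by AddJsonFile("appsettings.json"); the override order stays the same.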

The Server Is Real MCP

This project does not fake tool discovery. The server is a real MCP host using STDIO transport:

var toolTypes = profile switch
{
  "baseline" => new[] { typeof(BaselineSupportOpsTools) },
  "candidate" => new[] { typeof(CandidateSupportOpsTools) },
  _ => throw new InvalidOperationException($"Unknown MCP server profile '{profile}'.")
};

builder.Services
  .AddMcpServer()
  .WithStdioServerTransport()
  .WithTools(toolTypes);

That matters because the gate is reading live MCP protocol output, not a hand-maintained mirror of what the server was supposed to expose.

Live Tool Discovery Defines the Contract

The client launches the server and discovers the tool surface through MCP:

var transport = new StdioClientTransport(
  new StdioClientTransportOptions
  {
      Name = $"Support Ops MCP Server ({profile})",
      Command = serverCommand,
      Arguments = [serverDllPath, profile]
  },
  loggerFactory);

var client = await McpClientFactory.CreateAsync(
  transport,
  new McpClientOptions
  {
      ClientInfo = new() { Name = "McpToolContractGates", Version = "1.0.0" }
  },
  loggerFactory);

var tools = await client.ListToolsAsync();

From each discovered tool, the gate extracts:

the live MCP tool name
the live JSON input schema
the required arguments derived from that schema
the MCP annotations such as ReadOnlyHint and DestructiveHint

That means the contract comes from the server the agent would actually talk to, not from a separate documentation layer.
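To make the "required arguments derived from that schema" step concrete, here is a minimal sketch that reads the required array out of a JSON Schema object using only System.Text.Json. The SchemaReader helper and the sample schema are assumptions for illustration. The schema is a plausible shape for the baseline Incident.Declare tool, not output captured from the server.

```csharp
using System;
using System.Linq;
using System.Text.Json;

// Hypothetical input schema, shaped like what tool discovery might return.
const string inputSchema = """
{
  "type": "object",
  "properties": {
    "serviceName": { "type": "string" },
    "severity": { "type": "string" },
    "summary": { "type": "string" }
  },
  "required": ["serviceName", "severity", "summary"]
}
""";

using var document = JsonDocument.Parse(inputSchema);
var required = SchemaReader.GetRequiredArguments(document.RootElement);
Console.WriteLine(string.Join(", ", required)); // prints "serviceName, severity, summary"

public static class SchemaReader
{
    // Returns the schema's "required" names in ordinal order,
    // or an empty array when the schema declares none.
    public static string[] GetRequiredArguments(JsonElement schema)
    {
        if (schema.ValueKind != JsonValueKind.Object ||
            !schema.TryGetProperty("required", out var required) ||
            required.ValueKind != JsonValueKind.Array)
        {
            return Array.Empty<string>();
        }

        return required.EnumerateArray()
            .Where(item => item.ValueKind == JsonValueKind.String)
            .Select(item => item.GetString()!)
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToArray();
    }
}
```

Sorting the names keeps later comparisons and report output stable regardless of the order the server emits them in.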

Candidate Drift Is Visible in the Live Surface

The candidate profile intentionally introduces two changes that should block promotion:

[McpServerTool(Name = "Incident.Declare", Destructive = true)]
public static string DeclareIncident(string serviceName, string summary) =>
  $"Incident declared for {serviceName}: {summary}.";

[McpServerTool(Name = "Deployments.Rollback", Destructive = true)]
public static string RollbackDeployment(string serviceName, string environment, string releaseId) =>
  $"Rollback started for {serviceName} in {environment} release {releaseId}.";

The first change breaks the baseline Incident.Declare schema by dropping the required severity argument. The second change exposes a destructive deployment rollback tool, Deployments.Rollback, to a support role that should not have it at all.

Diffing Live MCP Schemas

The comparison layer works directly against the discovered tool contracts:

var missingRequiredArguments = baselineTool.RequiredArguments
  .Except(candidateTool.RequiredArguments, StringComparer.Ordinal)
  .ToArray();

if (missingRequiredArguments.Length > 0)
{
  schemaBreakages.Add(new ToolContractChange(
      baselineTool.Name,
      string.Join(", ", baselineTool.RequiredArguments),
      string.Join(", ", candidateTool.RequiredArguments)));
}

if (baselineTool.ReadOnlyHint && !candidateTool.ReadOnlyHint)
{
  readOnlyHintRegressions.Add(new ToolContractChange(
      baselineTool.Name,
      "ReadOnlyHint=true",
      "ReadOnlyHint=false"));
}

This is where the system stops treating MCP as a vague interoperability label and starts treating it as a real contract boundary. The tool name, schema, and annotations are all part of what the client is allowed to rely on.
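The excerpt above covers schema breakage and read-only hint regressions. The remaining diff categories (added tools, removed tools, and newly added destructive tools) reduce to set operations over tool names. The sketch below is one way that could look. ToolContract, SurfaceDiff, and SurfaceDiffer are hypothetical types, and the two tiny surfaces only mirror the tools named in this issue rather than the full five-tool baseline.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var baseline = new[]
{
    new ToolContract("Incident.Declare", new[] { "serviceName", "severity", "summary" }, false, true),
};
var candidate = new[]
{
    new ToolContract("Incident.Declare", new[] { "serviceName", "summary" }, false, true),
    new ToolContract("Deployments.Rollback", new[] { "serviceName", "environment", "releaseId" }, false, true),
};

var diff = SurfaceDiffer.Compare(baseline, candidate);
Console.WriteLine($"Added: {string.Join(", ", diff.AddedTools)}");
Console.WriteLine($"Removed: {diff.RemovedTools.Count}");
Console.WriteLine($"Added destructive: {string.Join(", ", diff.AddedDestructiveTools)}");

// Hypothetical snapshot of one discovered tool.
public sealed record ToolContract(
    string Name, IReadOnlyList<string> RequiredArguments, bool ReadOnlyHint, bool DestructiveHint);

public sealed record SurfaceDiff(
    IReadOnlyList<string> AddedTools,
    IReadOnlyList<string> RemovedTools,
    IReadOnlyList<string> AddedDestructiveTools);

public static class SurfaceDiffer
{
    public static SurfaceDiff Compare(
        IEnumerable<ToolContract> baseline, IEnumerable<ToolContract> candidate)
    {
        var baselineByName = baseline.ToDictionary(tool => tool.Name, StringComparer.Ordinal);
        var candidateByName = candidate.ToDictionary(tool => tool.Name, StringComparer.Ordinal);

        // Tools present only in the candidate surface.
        var added = candidateByName.Keys
            .Except(baselineByName.Keys, StringComparer.Ordinal)
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToArray();

        // Tools the candidate no longer exposes.
        var removed = baselineByName.Keys
            .Except(candidateByName.Keys, StringComparer.Ordinal)
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToArray();

        // Only newly added tools count here; an existing destructive tool is not new risk.
        var addedDestructive = added
            .Where(name => candidateByName[name].DestructiveHint)
            .ToArray();

        return new SurfaceDiff(added, removed, addedDestructive);
    }
}
```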

Frozen Role Cases Probe the Live Tools

The frozen dataset does not just name tools. It also includes the structured arguments used to probe those tools through real MCP calls.

{
  "id": "TG-003",
  "userTask": "Declare a P1 checkout incident with explicit severity.",
  "requiredTools": [
    {
      "toolName": "Incident.Declare",
      "requiredArguments": [
        "serviceName",
        "severity",
        "summary"
      ],
      "requireReadOnlyHint": false,
      "requireDestructiveHint": true,
      "probeArguments": {
        "serviceName": "checkout-api",
        "severity": "P1",
        "summary": "error rate is rising"
      },
      "expectedOutputContains": "severity P1"
    }
  ],
  "forbiddenTools": [
    "Deployments.Rollback"
  ]
}

The probe layer then calls the real MCP tool:

var response = await tool.CallAsync(
  arguments.ToDictionary(pair => pair.Key, pair => (object?)pair.Value),
  cancellationToken: cancellationToken);

var output = string.Join("\n", response.Content.Select(content => content.Text));

That is the key difference from a static diff. The gate is not only inspecting what the tool surface claims to be. It is also exercising the live tools that the agent would actually call.
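Before any live call is made, a frozen case can already be checked structurally against the discovered surface. The following sketch deserializes a trimmed copy of TG-003 and evaluates it against a candidate surface represented as a plain dictionary. RoleCase, RequiredTool, and CaseChecker are hypothetical types. The failure messages imitate the wording in the run output later in this issue, except the "not exposed" message, which is invented for completeness.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

const string caseJson = """
{
  "id": "TG-003",
  "requiredTools": [
    { "toolName": "Incident.Declare",
      "requiredArguments": ["serviceName", "severity", "summary"] }
  ],
  "forbiddenTools": ["Deployments.Rollback"]
}
""";

var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
var roleCase = JsonSerializer.Deserialize<RoleCase>(caseJson, options)!;

// Candidate surface as discovery might report it: tool name -> required arguments.
var candidateSurface = new Dictionary<string, string[]>(StringComparer.Ordinal)
{
    ["Incident.Declare"] = new[] { "serviceName", "summary" },
    ["Deployments.Rollback"] = new[] { "serviceName", "environment", "releaseId" },
};

foreach (var failure in CaseChecker.Check(roleCase, candidateSurface))
{
    Console.WriteLine($"{roleCase.Id}: {failure}");
}

public sealed record RequiredTool(string ToolName, string[] RequiredArguments);
public sealed record RoleCase(string Id, RequiredTool[] RequiredTools, string[] ForbiddenTools);

public static class CaseChecker
{
    public static List<string> Check(
        RoleCase roleCase, IReadOnlyDictionary<string, string[]> surface)
    {
        var failures = new List<string>();

        foreach (var tool in roleCase.RequiredTools)
        {
            if (!surface.TryGetValue(tool.ToolName, out var liveRequired))
            {
                failures.Add($"Tool {tool.ToolName} is not exposed.");
                continue;
            }

            // Arguments the case depends on that the live schema no longer requires.
            var missing = tool.RequiredArguments
                .Except(liveRequired, StringComparer.Ordinal)
                .ToArray();
            if (missing.Length > 0)
            {
                failures.Add(
                    $"Tool {tool.ToolName} is missing required arguments: {string.Join(", ", missing)}.");
            }
        }

        foreach (var forbidden in roleCase.ForbiddenTools.Where(name => surface.ContainsKey(name)))
        {
            failures.Add($"Forbidden tool {forbidden} is exposed to the role surface.");
        }

        return failures;
    }
}
```

A case with an empty failure list would then proceed to the live probe call, where the expectedOutputContains check applies.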

Promotion Gate Is Explicit

The promotion policy stays small and inspectable:

if (config.BlockOnSchemaBreakage && diff.SchemaBreakages.Count > 0)
{
  reasons.Add("Candidate broke required MCP input schemas on baseline tools.");
}

if (config.BlockOnNewDestructiveTools && diff.AddedDestructiveTools.Count > config.MaxNewDestructiveTools)
{
  reasons.Add("Candidate introduced new destructive MCP tools into the role surface.");
}

if (candidateSummary.BrokenCaseCount > config.MaxBrokenCases)
{
  reasons.Add($"Candidate failed {candidateSummary.BrokenCaseCount} frozen role cases.");
}

if (reasons.Count > 0)
{
  return new GateRecommendation(GateDecision.Rollback, reasons);
}

The gate is boring on purpose. That is a strength. You can test it, inspect it, and explain exactly why a candidate MCP profile was blocked.
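Because the policy is a pure function of the diff counts and the config, it is straightforward to isolate and unit test. Below is a self-contained sketch under that assumption. The type shapes are hypothetical simplifications that take counts instead of the full diff objects, and the sketch only models Promote and Rollback because the rule that yields Hold is not shown in the excerpt above.

```csharp
using System;
using System.Collections.Generic;

var config = new PromotionConfig(
    BlockOnSchemaBreakage: true,
    BlockOnNewDestructiveTools: true,
    MaxNewDestructiveTools: 0,
    MaxBrokenCases: 0);

// Counts taken from the candidate run reported in this issue.
var recommendation = PromotionGate.Decide(
    config, schemaBreakages: 1, addedDestructiveTools: 1, brokenCases: 3);

Console.WriteLine($"Decision: {recommendation.Decision}"); // prints "Decision: Rollback"
foreach (var reason in recommendation.Reasons)
{
    Console.WriteLine($"- {reason}");
}

public enum GateDecision { Promote, Hold, Rollback }

public sealed record PromotionConfig(
    bool BlockOnSchemaBreakage,
    bool BlockOnNewDestructiveTools,
    int MaxNewDestructiveTools,
    int MaxBrokenCases);

public sealed record GateRecommendation(GateDecision Decision, IReadOnlyList<string> Reasons);

public static class PromotionGate
{
    public static GateRecommendation Decide(
        PromotionConfig config, int schemaBreakages, int addedDestructiveTools, int brokenCases)
    {
        var reasons = new List<string>();

        if (config.BlockOnSchemaBreakage && schemaBreakages > 0)
        {
            reasons.Add("Candidate broke required MCP input schemas on baseline tools.");
        }

        if (config.BlockOnNewDestructiveTools && addedDestructiveTools > config.MaxNewDestructiveTools)
        {
            reasons.Add("Candidate introduced new destructive MCP tools into the role surface.");
        }

        if (brokenCases > config.MaxBrokenCases)
        {
            reasons.Add($"Candidate failed {brokenCases} frozen role cases.");
        }

        // Any blocking reason forces a rollback; otherwise the candidate is promotable.
        var decision = reasons.Count > 0 ? GateDecision.Rollback : GateDecision.Promote;
        return new GateRecommendation(decision, reasons);
    }
}
```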

Walking a Real Live Run

A deterministic local run at 2026-05-29 23:16 UTC produced the following output:

MCP Tool Contract Gates
Baseline profile: baseline
Candidate profile: candidate
Transport: stdio
Dataset: 3 frozen role cases

baseline
- Discovered tools: 5
- Pass rate: 100%
- Cases passed: 3/3

candidate
- Discovered tools: 6
- Pass rate: 0%
- Cases passed: 0/3

Live MCP diff
- Added tools: 1
- Removed tools: 0
- Schema breakages: 1
- Read-only hint regressions: 0
- Added destructive tools: 1

candidate sample failures:
- TG-001: Forbidden tool Deployments.Rollback is exposed to the role surface.
- TG-002: Forbidden tool Deployments.Rollback is exposed to the role surface.
- TG-003: Tool Incident.Declare is missing required arguments: severity.; Forbidden tool Deployments.Rollback is exposed to the role surface.

Decision: Rollback
- Candidate broke required MCP input schemas on baseline tools.
- Candidate introduced new destructive MCP tools into the role surface.
- Candidate failed 3 frozen role cases.
Report: McpToolContractGates\bin\Debug\net10.0\data\reports\20260529T231611Z-tool-gate-baseline-to-candidate.json

How to interpret this:

baseline passed every live role probe, so the current MCP surface still fits the intended support role
candidate exposed one extra tool (6 discovered versus 5 for baseline), which already changed the live role surface before any model reasoning happened
The diff layer found one real schema break and one newly added destructive tool from the actual MCP discovery results
The probe layer confirmed the operational consequence: the candidate profile both leaks a forbidden rollback tool and no longer satisfies the baseline incident schema
The rollback happened for deterministic contract reasons, not because the candidate merely looked riskier
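The Report line in the output points at a timestamped JSON file. A minimal sketch of how such a report could be persisted is shown below. The file name pattern is inferred from the path in the sample run, while GateReport and ReportWriter are hypothetical types and the real report almost certainly carries more fields. The sample writes to the system temp directory rather than data/reports.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text.Json;

var report = new GateReport(
    "baseline",
    "candidate",
    "Rollback",
    new[] { "Candidate failed 3 frozen role cases." });

var directory = Path.Combine(Path.GetTempPath(), "mcp-gate-reports");
var timestamp = new DateTimeOffset(2026, 5, 29, 23, 16, 11, TimeSpan.Zero);

var path = ReportWriter.Save(report, directory, timestamp);
Console.WriteLine(Path.GetFileName(path));
// prints "20260529T231611Z-tool-gate-baseline-to-candidate.json"

// Hypothetical, trimmed-down report shape.
public sealed record GateReport(
    string BaselineProfile, string CandidateProfile, string Decision, string[] Reasons);

public static class ReportWriter
{
    public static string Save(GateReport report, string directory, DateTimeOffset timestamp)
    {
        Directory.CreateDirectory(directory);

        // Compact UTC stamp so file names sort chronologically.
        var stamp = timestamp.UtcDateTime.ToString(
            "yyyyMMdd'T'HHmmss'Z'", CultureInfo.InvariantCulture);

        var fileName =
            $"{stamp}-tool-gate-{report.BaselineProfile}-to-{report.CandidateProfile}.json";
        var path = Path.Combine(directory, fileName);

        var json = JsonSerializer.Serialize(
            report, new JsonSerializerOptions { WriteIndented = true });
        File.WriteAllText(path, json);
        return path;
    }
}
```

Passing the timestamp in, rather than reading the clock inside Save, keeps the writer deterministic under test.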

Why This Architecture Works

The gate works because the MCP server and the agent reasoning layer are treated as different responsibilities:

The MCP server publishes the tool surface
The discovery layer records what that surface actually is right now
The diff layer catches structural drift in tool names, schemas, and hints
The probe layer exercises the tools the agent would actually call
The promotion gate converts those findings into a small explicit decision
The saved report keeps the whole decision inspectable after the run ends

That is the real boundary here. The probabilistic layer may choose tools inside the allowed surface. The deterministic layer owns whether that live MCP surface is promotable in the first place.

Potential Enhancements

To extend this project further, consider the following:

Add streamable HTTP MCP transport in addition to STDIO so the same gate can evaluate remote server deployments
Split frozen role probes by agent role such as support, finance, and incident commander
Persist longitudinal MCP diff history so you can detect tool-surface drift over time
Add policy exceptions for intentionally approved tool additions while keeping the rest of the surface gated
Extend the sample to compare multi-server MCP bundles instead of one profile at a time

Final Notes

MCP makes it easier to compose agents with tool servers, but it also makes tool drift a first-class operational risk.

If the tool surface is part of the system contract, then the server has to be tested and gated as a server, not just described in docs or assumed from configuration.

Explore the source code at the GitHub repository.

See you in the next issue.

Stay curious.
