SEP-2001: Optional High Availability Patterns for Stateful Streaming in MCP Deployments #2001

jizhuozhi · 2025-12-20T20:04:18Z

Status: Draft
Type: Informational
Created: 2025-12-21
Author(s): Zhuozhi Ji jizhuozhi.george@gmail.com (@jizhuozhi)
Sponsor: None
PR: SEP-2001: Optional High Availability Patterns for Stateful Streaming in MCP Deployments #2001

Abstract

This SEP proposes optional high availability (HA) best practices for MCP deployments with stateful streaming sessions (e.g., SSE). While the MCP protocol itself remains unchanged, production deployments often face challenges in maintaining session continuity and resilience when using multiple replicas behind load balancers. This proposal outlines optional patterns, including pub-sub event buses, cluster coordination with P2P forwarding, middleware/SDK abstraction, and session partitioning. These patterns provide guidance for implementers to achieve HA without breaking protocol compatibility or requiring client modifications.

Motivation

Production MCP deployments increasingly target multi-node, horizontally scalable environments. Long-lived streaming sessions (SSE) introduce challenges when routed through stateless HTTP ingress or load balancers:

Session continuity may break if connections are routed to a different replica.
Node failure or restart can interrupt ongoing streaming sessions.
Resuming sessions across replicas is non-trivial without coordination.

Community discussions, including GitHub PR #325, have highlighted these issues. Contributors concluded that session stickiness or shared session stores are practical implementation considerations, but not mandated by the protocol. This creates an opportunity for informational guidance on HA patterns that are optional and non-intrusive.

Specification

This SEP does not introduce protocol-level changes. The following optional HA patterns are proposed for implementers:

1. Core HA Patterns

1.1 Event Bus / Pub-Sub

Externalize session events to a distributed pub-sub system.
MCP replicas subscribe to session events to enable failover and session recovery.
Decouples session lifetime from any single node.
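
The pattern above can be sketched in-process as follows. This is only an illustration under stated assumptions: `InMemoryEventBus`, `subscribe`, `publish`, and `replay` are hypothetical names, and a real deployment would back them with an external broker rather than process memory.

```python
import asyncio
from collections import defaultdict


class InMemoryEventBus:
    """Stand-in for an external pub-sub system (hypothetical, in-process only)."""

    def __init__(self):
        self._log = defaultdict(list)          # session_id -> [(event_id, event)]
        self._subscribers = defaultdict(list)  # session_id -> [asyncio.Queue]

    def subscribe(self, session_id):
        """Register a replica's interest in one session's events."""
        queue = asyncio.Queue()
        self._subscribers[session_id].append(queue)
        return queue

    def publish(self, session_id, event):
        """Append the event to the session log and fan it out to subscribers."""
        event_id = len(self._log[session_id]) + 1
        self._log[session_id].append((event_id, event))
        for queue in self._subscribers[session_id]:
            queue.put_nowait((event_id, event))
        return event_id

    def replay(self, session_id, after_event_id=0):
        """Return logged events newer than after_event_id, for recovery."""
        return [(i, e) for i, e in self._log[session_id] if i > after_event_id]


async def demo():
    bus = InMemoryEventBus()
    replica_a = bus.subscribe("session-1")
    bus.publish("session-1", {"type": "progress", "value": 1})
    bus.publish("session-1", {"type": "progress", "value": 2})
    first = await replica_a.get()
    # Assume replica A fails here; another replica replays everything
    # after the last event ID that was delivered.
    recovered = bus.replay("session-1", after_event_id=first[0])
    return first, recovered


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because the log is keyed by session ID rather than by node, any replica that knows the last delivered event ID can pick up where the failed one stopped.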

1.2 Cluster Coordination & P2P Forwarding

MCP nodes maintain lightweight cluster state via gossip, shared stores, or database-backed discovery (e.g., JGroups JDBC_PING).
Session messages can be forwarded to the node currently handling the session.
Avoids heavy consensus mechanisms to preserve throughput.
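
A minimal sketch of the forwarding idea, assuming a shared ownership map that stands in for the gossip or shared-store state. `Node`, `owners`, and `peers` are hypothetical names, and peers are called in-process here where a real cluster would use a network hop.

```python
import asyncio


class Node:
    """One MCP replica (hypothetical model; peers are called in-process)."""

    def __init__(self, name, owners, peers):
        self.name = name
        self.owners = owners  # shared view: session_id -> owning node name
        self.peers = peers    # node name -> Node
        self.handled = []     # messages processed locally
        peers[name] = self

    async def receive(self, session_id, message):
        # The first node to see a session claims it; later callers get the
        # existing owner back from setdefault.
        owner = self.owners.setdefault(session_id, self.name)
        if owner != self.name:
            # Not ours: forward to the node currently handling the session.
            return await self.peers[owner].receive(session_id, message)
        self.handled.append((session_id, message))
        return self.name


async def demo():
    owners, peers = {}, {}
    a = Node("a", owners, peers)
    b = Node("b", owners, peers)
    first = await a.receive("s1", "initialize")   # a claims the session
    second = await b.receive("s1", "tools/call")  # b forwards to a
    return first, second, a.handled, b.handled


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The ownership map only needs to be eventually visible to all nodes; no consensus round is involved in the forwarding path.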

2. Implementation & Optimization Support

2.1 Middleware / SDK Abstraction

Encapsulates HA logic (pub-sub, P2P forwarding) within SDK or middleware.
Keeps protocol handlers and business logic unchanged.
Provides a transparent API to clients, allowing gradual adoption.

2.2 Session Partitioning / Affinity Hints

Session IDs may encode partitioning or affinity hints.
Reduces coordination overhead.
Affinity is advisory and must not impact correctness.
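
One possible encoding, shown as a sketch only: the `p<partition>-<token>` layout and the function names below are assumptions for illustration, not part of MCP. The router honors a well-formed hint and otherwise falls back to hashing the whole ID, so routing still works when the hint is missing or invalid.

```python
import secrets
import zlib


def new_session_id(partition_count):
    """Create an ID of the assumed form 'p<partition>-<random token>'."""
    token = secrets.token_hex(16)
    partition = zlib.crc32(token.encode()) % partition_count
    return f"p{partition}-{token}"


def partition_hint(session_id, partition_count):
    """Return the hinted partition, or None if the hint is absent or out of range."""
    prefix, _, token = session_id.partition("-")
    digits = prefix[1:]
    if not prefix.startswith("p") or not token:
        return None
    if not (digits.isascii() and digits.isdigit()):
        return None
    hint = int(digits)
    return hint if hint < partition_count else None


def route(session_id, partition_count):
    """Pick a partition; the hint is advisory and never required."""
    hint = partition_hint(session_id, partition_count)
    if hint is not None:
        return hint
    # Fallback: hash the full ID so opaque or malformed IDs still route.
    return zlib.crc32(session_id.encode()) % partition_count
```

For example, `route(new_session_id(8), 8)` returns the partition embedded in the ID, while `route("opaque-id", 8)` ignores the missing hint and hashes instead.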

3. Illustrative Middleware-Oriented Model (Python, Non-Normative)

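import asyncio


async def run_tool(payload):
    # Placeholder tool execution; a real server would dispatch to its tool registry.
    await asyncio.sleep(0)
    return {"echo": payload}


async def handle_mcp_message(message, send):
    # Plain protocol handler: it has no knowledge of HA concerns.
    if message["type"] == "tool_call":
        result = await run_tool(message["payload"])
        await send({
            "type": "tool_result",
            "payload": result
        })


class MCPHAMiddleware:
    def __init__(self, ha_backend):
        # ha_backend supplies ensure_session() and bind_session();
        # its implementation is deployment-specific.
        self.ha = ha_backend

    def wrap(self, handler):
        async def wrapped(message, send):
            # Resolve (or create) the session this message belongs to.
            session_id = self.ha.ensure_session(message)

            # bind_session yields an HA-aware send that the handler uses transparently.
            async with self.ha.bind_session(session_id, send) as ha_send:
                await handler(message, ha_send)

        return wrapped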
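
The middleware above leaves `ha_backend` unspecified. As a rough sketch of what such a backend might look like, the class below provides the two methods the middleware calls. `InMemoryHABackend`, the `session_id` message field, and the `outbox` log are all assumptions made for illustration; a production backend would persist outgoing messages somewhere other replicas can read them.

```python
import asyncio
import uuid
from contextlib import asynccontextmanager


class InMemoryHABackend:
    """Hypothetical backend exposing ensure_session() and bind_session()."""

    def __init__(self):
        self.outbox = {}  # session_id -> list of messages sent on that session

    def ensure_session(self, message):
        # Assumes the message may carry a "session_id" field; mint one otherwise.
        session_id = message.get("session_id") or uuid.uuid4().hex
        self.outbox.setdefault(session_id, [])
        return session_id

    @asynccontextmanager
    async def bind_session(self, session_id, send):
        async def ha_send(outgoing):
            # Record before delivering so the message could be replayed later.
            self.outbox[session_id].append(outgoing)
            await send(outgoing)

        yield ha_send


async def demo():
    backend = InMemoryHABackend()
    delivered = []

    async def send(outgoing):
        delivered.append(outgoing)

    message = {"type": "tool_call", "payload": {"name": "echo"}, "session_id": "s1"}
    session_id = backend.ensure_session(message)
    async with backend.bind_session(session_id, send) as ha_send:
        await ha_send({"type": "tool_result", "payload": "ok"})
    return delivered, backend.outbox[session_id]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

An instance of this class could be passed as `ha_backend` to `MCPHAMiddleware`, since it matches the synchronous `ensure_session` call and the `async with` usage of `bind_session`.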

Rationale

Alternate designs considered: Sticky sessions at load balancer, full Raft replication, central shared state.
Why chosen approach: Optional patterns allow HA without protocol changes, preserve throughput, and provide flexibility.
Related work: Community PR Add best practices when using load balancer #325; common HA patterns in distributed systems.
Community consensus: PR discussion supports optional, non-normative guidance for HA.

Backward Compatibility

No protocol changes are introduced. Existing clients and servers remain fully compatible. Adoption of HA patterns is optional and implementation-defined.

Security Implications

No new security surfaces are introduced by this SEP. Implementers should consider standard security practices for distributed coordination, pub-sub, and session forwarding.

Reference Implementation

Prototype Python middleware shown above.
No full reference implementation is required to mark the SEP as a draft.

Additional Optional Sections

Performance Implications

Optional HA patterns may introduce additional latency or coordination overhead, but throughput is preserved by avoiding heavy consensus.

Testing Plan

Implementers should validate session continuity during failover, replica restart, and load balancer routing.

Alternatives Considered

Sticky sessions at LB (less flexible, not always feasible)
Full Raft replication (high latency, throughput penalty)
Central shared store (adds infrastructure complexity)

Open Questions

Best practices for large clusters with thousands of concurrent streaming sessions.
Integration guidance for Streamable HTTP once adoption increases.

Acknowledgments

Community contributors to PR Add best practices when using load balancer #325 for highlighting HA challenges in production MCP deployments.

jizhuozhi · 2025-12-22T05:52:33Z

One additional point worth clarifying is where HA responsibility should live.

While PR #206 acknowledges stateful servers and session IDs, it implicitly leaves routing and session continuity to external components (e.g. sticky routing at a proxy or message bus–based routing). In practice, relying on load balancers or proxies to provide correctness guarantees for stateful streaming sessions introduces operational uncertainty and deployment-specific behavior.

This SEP intentionally frames HA as something that can be implemented and controlled by the MCP server itself, rather than being dependent on proxy-level stickiness or opaque middleware behavior. By doing so, MCP servers can provide predictable session affinity, failover handling, and recovery semantics that are consistent across environments, independent of ingress or proxy configuration.

In that sense, the proposal does not replace existing deployment options, but highlights that server-managed HA stickiness is both feasible and preferable for stateful streaming use cases, especially in production-grade, multi-replica deployments.

cliffhall · 2026-01-05T20:55:49Z

Hi @jizhuozhi!

What does this SEP seek to modify? SEPs are intended to enhance the specification. If this is best practices, perhaps it's best implemented as either a blog post, doc change, or tutorial, e.g., a PR here: https://github.com/modelcontextprotocol/modelcontextprotocol/tree/main/docs/tutorials

randybias · 2026-01-07T00:21:43Z

FWIW, I've done a little experimentation with sending SSE over a different transport (NATS), and I think the event Pub/Sub approach makes a ton of sense and will be very resilient. There are some quirks to work through, but it feels clean, and has precedent, to keep the control plane on HTTP/S while providing an alternative signaling/data plane transport for events. The existing mechanism really does have problems when connectivity is intermittent, as a new session ID is spawned every time. A session ID should really be more like a conversation identifier.

jizhuozhi · 2026-01-07T03:27:54Z

> What does this SEP seek to modify? SEPs are intended to enhance the specification. If this is best practices, perhaps it's best implemented as either a blog post, doc change, or tutorial, e.g., a PR here: https://github.com/modelcontextprotocol/modelcontextprotocol/tree/main/docs/tutorials

Hello @cliffhall, thank you for pointing out this issue. I will carefully consider whether it should be treated as a SEP or a tutorial.

jizhuozhi · 2026-01-07T03:29:57Z

> The existing mechanism really does have problems when connectivity is intermittent as a new session ID is spawned every time. A session ID should really be more like a conversation identifier.

Hello @randybias, thank you for pointing this out. I hadn't considered the dropped-connection scenario before. Perhaps I need to consider a solution that can take over the previous session ID, similar to how WebSocket availability works.

jizhuozhi · 2026-01-07T04:00:36Z

In my current thinking, we might need a reconnection protocol that allows the client to re-establish the SSE transport and resume the previous session when the connection is interrupted.

randybias · 2026-01-07T19:30:12Z

In the event pub/sub case, the session ID can just be an identifier for a topic/queue, can't it? Reconnect just means talking to the same topic by either end.

jizhuozhi · 2026-01-08T11:39:18Z

> In the event pub/sub case, the session ID can just be an identifier for a topic/queue, can't it? Reconnect just means talking to the same topic by either end.

However, the specification states that disconnection does not mean session cancellation:

Disconnection MAY occur at any time (e.g., due to network conditions). Therefore:

Disconnection SHOULD NOT be interpreted as the client canceling its request.

To cancel, the client SHOULD explicitly send an MCP CancelledNotification.

To avoid message loss due to disconnection, the server MAY make the stream resumable.

In a multi-instance high-availability scenario, after reconnection, multiple instances will simultaneously hold this session, which can lead to out-of-order message processing or message loss.

For a simple example, consider Redis pub/sub. If there's no mechanism to manage the sessions, the original instance holding the session might compete with the new instance for messages. In other words, a barrier needs to be inserted after reconnection to ensure the happens-before relationship between different instances processing the same session.
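
One way to picture such a barrier is an epoch (fencing) counter per session: the instance that takes over after a reconnect bumps the epoch, and any instance still holding an older epoch drops its work. The sketch below is only an illustration; `SessionFence`, `acquire`, and `process` are hypothetical names, and in a real shared store the check and the processing step would need to be atomic.

```python
class SessionFence:
    """Tracks the current epoch per session (hypothetical; would live in a shared store)."""

    def __init__(self):
        self._epochs = {}  # session_id -> latest epoch handed out

    def acquire(self, session_id):
        """Called by the instance that takes over a session after a reconnect."""
        epoch = self._epochs.get(session_id, 0) + 1
        self._epochs[session_id] = epoch
        return epoch

    def is_current(self, session_id, epoch):
        """True only for the holder of the newest epoch."""
        return self._epochs.get(session_id) == epoch


def process(fence, session_id, epoch, message, sink):
    """Process a message only if the caller still holds the newest epoch."""
    if not fence.is_current(session_id, epoch):
        return False  # stale holder: drop instead of competing for the message
    sink.append(message)
    return True


if __name__ == "__main__":
    fence = SessionFence()
    sink = []
    old_epoch = fence.acquire("s1")               # original instance
    process(fence, "s1", old_epoch, "m1", sink)
    new_epoch = fence.acquire("s1")               # new instance after reconnect
    process(fence, "s1", old_epoch, "m2", sink)   # rejected: stale epoch
    process(fence, "s1", new_epoch, "m2", sink)   # accepted
    print(sink)
```

Everything accepted under the old epoch is ordered before everything accepted under the new one, which gives the happens-before relationship between the two instances.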

randybias · 2026-01-12T19:51:39Z

Ah... then maybe a key part of the spec needs to change, namely:

The MCP spec says:

"The server SHALL tie each subscription to the originating MCP session."

This assumes that event subscriptions, my use case, are tied to sessions. Perhaps they shouldn't be?

I hear what you are saying regarding the out-of-order situation. My use case doesn't require durability or even ordered messages. But that makes sense.

FWIW, I just did a bit more of a full prototype to push SSE over NATS. Just the SSE events. Everything else as normal in the MCP protocol, and it almost works, except for some violations of the protocol (as above).

It feels to me like SSE should be out of band. It just makes more sense, rather than holding open an HTTP Streaming socket for a long period of time. It introduces new problems, for sure, but it's more conducive to "events" in the typical way we understand event driven architecture.

jizhuozhi changed the title ~~SEP-0000: Optional High Availability Patterns for Stateful Streaming in MCP Deployments~~ SEP-2001: Optional High Availability Patterns for Stateful Streaming in MCP Deployments Dec 20, 2025

jizhuozhi requested a review from a team as a code owner December 22, 2025 06:03

jizhuozhi requested a review from a team December 22, 2025 06:03

SEP-2001: Optional High Availability Patterns for Stateful Streaming …

2deeae0

…in MCP Deployments modelcontextprotocol#2001

jizhuozhi force-pushed the main branch from 543d7e6 to 2deeae0 Compare December 22, 2025 06:05

jonathanhefner mentioned this pull request Dec 27, 2025

Proposal: Optional High Availability Best Practices for MCP Deployments with Stateful Streaming (SSE) Connections #2000

Closed

Merge branch 'main' into main

e2ebf00

Merge branch 'main' into main

e892b65

dsp-ant added SEP proposal SEP proposal without a sponsor. labels Jan 14, 2026
