SEP-2001: Optional High Availability Patterns for Stateful Streaming in MCP Deployments #2001
One additional point worth clarifying is where HA responsibility should live. While PR #206 acknowledges stateful servers and session IDs, it implicitly leaves routing and session continuity to external components (e.g. sticky routing at a proxy or message bus-based routing). In practice, relying on load balancers or proxies to provide correctness guarantees for stateful streaming sessions introduces operational uncertainty and deployment-specific behavior. This SEP intentionally frames HA as something that can be implemented and controlled by the MCP server itself, rather than being dependent on proxy-level stickiness or opaque middleware behavior. By doing so, MCP servers can provide predictable session affinity, failover handling, and recovery semantics that are consistent across environments, independent of ingress or proxy configuration. In that sense, the proposal does not replace existing deployment options, but highlights that server-managed HA stickiness is both feasible and preferable for stateful streaming use cases, especially in production-grade, multi-replica deployments.
Hi @jizhuozhi! What does this SEP seek to modify? SEPs are intended to enhance the specification. If this is a set of best practices, it may be better implemented as a blog post, doc change, or tutorial, e.g., a PR here: https://github.com/modelcontextprotocol/modelcontextprotocol/tree/main/docs/tutorials
FWIW, I've done a little experimentation with sending SSE over a different transport (NATS), and I think the event pub/sub approach makes a ton of sense and will be very resilient. There are some quirks to work through, but it feels clean, and there is precedent for keeping the control plane on HTTP/S while providing an alternative signaling/data plane transport for events. The existing mechanism really does have problems when connectivity is intermittent, as a new session ID is spawned every time. A session ID should really be more like a conversation identifier.
Hello @cliffhall, thank you for pointing out this issue. I will carefully consider whether it should be treated as a SEP or a tutorial.
Hello @randybias, thank you for pointing this out. I hadn't considered the dropped-connection scenario before. Perhaps I need to consider a solution that lets a new connection take over the previous session ID, similar to how WebSocket availability works.
In my current thinking, we might need a reconnection protocol that allows the client to re-establish the SSE transport and resume the previous session when the connection is interrupted.
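One way to picture resumption is a per-session replay buffer: the server numbers each event, and a reconnecting client reports the last ID it saw (standard SSE does this with the `Last-Event-ID` request header). The sketch below is only an illustration under those assumptions; `ReplayBuffer`, `append`, and `replay_after` are hypothetical names, not part of MCP or any SDK, and a real deployment would keep the buffer in a store shared by all replicas.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


class ReplayBuffer:
    """Hypothetical bounded buffer of (event_id, data) pairs for one session."""

    def __init__(self, max_events: int = 256) -> None:
        # Oldest events are dropped once max_events is exceeded.
        self._events: Deque[Tuple[int, str]] = deque(maxlen=max_events)
        self._next_id = 1

    def append(self, data: str) -> int:
        """Record an outgoing event and return the ID sent with it."""
        event_id = self._next_id
        self._next_id += 1
        self._events.append((event_id, data))
        return event_id

    def replay_after(self, last_event_id: Optional[int]) -> List[Tuple[int, str]]:
        """Return buffered events newer than the client's last seen ID.

        With no ID (a fresh connection), everything still buffered is returned.
        Events that already fell out of the buffer cannot be recovered here.
        """
        if last_event_id is None:
            return list(self._events)
        return [(i, d) for i, d in self._events if i > last_event_id]
```

On reconnect, the server would call `replay_after(...)` with the ID reported by the client and send the result before resuming live delivery.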
In the event pub/sub case, the session ID can just be an identifier for a topic/queue, can't it? Reconnecting just means either end talking to the same topic again.
However, disconnection does not mean session cancellation:
In a multi-instance high-availability scenario, multiple instances can end up holding the same session simultaneously after a reconnection, which can lead to out-of-order message processing or message loss. As a simple example, consider Redis pub/sub. If there's no mechanism to manage the sessions, the original instance holding the session might compete with the new instance for messages. In other words, a barrier needs to be inserted after reconnection to ensure a happens-before relationship between the different instances processing the same session.
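The barrier described here could be approximated with a fencing token: each takeover of a session bumps an epoch in a shared store, and an instance refuses to process messages once its epoch is stale. The following is a minimal sketch under that assumption. `SessionOwnership` and `Instance` are hypothetical names, the in-memory dict stands in for a shared store such as Redis, and a production version would need the epoch check to be atomic with the side effect it guards.

```python
import threading
from typing import Dict, List, Tuple


class SessionOwnership:
    """In-memory stand-in for a shared store that tracks session epochs."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._epochs: Dict[str, int] = {}

    def claim(self, session_id: str) -> int:
        """Take over a session; returns a new, strictly larger epoch."""
        with self._lock:
            epoch = self._epochs.get(session_id, 0) + 1
            self._epochs[session_id] = epoch
            return epoch

    def is_current(self, session_id: str, epoch: int) -> bool:
        with self._lock:
            return self._epochs.get(session_id) == epoch


class Instance:
    """Hypothetical server replica that only processes sessions it still owns."""

    def __init__(self, name: str, store: SessionOwnership) -> None:
        self.name = name
        self.store = store
        self.epochs: Dict[str, int] = {}
        self.processed: List[Tuple[str, str]] = []

    def attach(self, session_id: str) -> None:
        # Called when a client (re)connects to this replica.
        self.epochs[session_id] = self.store.claim(session_id)

    def handle(self, session_id: str, message: str) -> bool:
        epoch = self.epochs.get(session_id)
        if epoch is None or not self.store.is_current(session_id, epoch):
            # Another replica claimed the session later: stand down.
            self.epochs.pop(session_id, None)
            return False
        self.processed.append((session_id, message))
        return True
```

Once a second replica attaches to the same session, the first replica's `handle` returns `False`, so only the most recent owner processes messages.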
Ah... then maybe a key part of the spec needs to change. The MCP spec says: "The server SHALL tie each subscription to the originating MCP session." This assumes that event subscriptions, which are my use case, are tied to sessions. Perhaps they shouldn't be? I hear what you are saying regarding the out-of-order situation. My use case doesn't require durability or even ordered messages, but the concern makes sense. FWIW, I just built a fuller prototype that pushes SSE over NATS: just the SSE events, with everything else in the MCP protocol as normal. It almost works, except for some violations of the protocol (as above). It feels to me like SSE should be out of band. That makes more sense than holding open an HTTP streaming socket for a long period of time. It introduces new problems, for sure, but it's more conducive to "events" in the typical way we understand event-driven architecture.
Abstract
This SEP proposes optional high availability (HA) best practices for MCP deployments with stateful streaming sessions (e.g., SSE). While the MCP protocol itself remains unchanged, production deployments often face challenges in maintaining session continuity and resilience when using multiple replicas behind load balancers. This proposal outlines optional patterns, including pub-sub event buses, cluster coordination with P2P forwarding, middleware/SDK abstraction, and session partitioning. These patterns provide guidance for implementers to achieve HA without breaking protocol compatibility or requiring client modifications.
Motivation
Production MCP deployments increasingly target multi-node, horizontally scalable environments. Long-lived streaming sessions (SSE) introduce challenges when routed through stateless HTTP ingress or load balancers.
Community discussions, including GitHub PR #325, have highlighted these issues. Contributors concluded that session stickiness or shared session stores are practical implementation considerations, but not mandated by the protocol. This creates an opportunity for informational guidance on HA patterns that are optional and non-intrusive.
Specification
This SEP does not introduce protocol-level changes. The following optional HA patterns are proposed for implementers:
1. Core HA Patterns
1.1 Event Bus / Pub-Sub
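As a rough illustration of the pub-sub pattern, the sketch below derives one topic per session ID, so that any replica can publish an event and whichever replica currently holds the client's stream receives it. Everything here is an assumption made for the example: `InMemoryEventBus`, `session_topic`, and the `mcp.session.` prefix are hypothetical, and the in-process bus stands in for a real broker such as NATS or Redis.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


def session_topic(session_id: str) -> str:
    """Map a session ID to a topic name (the prefix is an arbitrary choice)."""
    return "mcp.session." + session_id


class InMemoryEventBus:
    """In-process stand-in for a broker, for illustration only."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> Callable[[], None]:
        """Register a handler and return a function that removes it again."""
        self._subscribers[topic].append(handler)

        def unsubscribe() -> None:
            handlers = self._subscribers.get(topic, [])
            if handler in handlers:
                handlers.remove(handler)

        return unsubscribe

    def publish(self, topic: str, event: Dict) -> int:
        """Deliver an event to current subscribers; returns how many received it."""
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(event)
        return len(handlers)
```

The replica holding the stream would subscribe when the client connects and call the returned `unsubscribe` function when the stream closes. A return value of `0` from `publish` indicates that no replica currently holds the session.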
1.2 Cluster Coordination & P2P Forwarding
2. Implementation & Optimization Support
2.1 Middleware / SDK Abstraction
2.2 Session Partitioning / Affinity Hints
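One possible way to compute an affinity hint is rendezvous (highest-random-weight) hashing: every replica scores each session ID, and the highest score wins. This is a sketch of that general technique rather than anything the SEP prescribes. The function names and replica identifiers are hypothetical, and it assumes every node sees the same replica list.

```python
import hashlib
from typing import List


def _score(session_id: str, replica: str) -> int:
    """Deterministic pseudo-random weight for a (replica, session) pair."""
    digest = hashlib.sha256((replica + "|" + session_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner_for(session_id: str, replicas: List[str]) -> str:
    """Pick the replica that should own a session.

    Removing a replica only moves the sessions that replica owned;
    all other sessions keep their current owner.
    """
    if not replicas:
        raise ValueError("no replicas available")
    return max(replicas, key=lambda replica: _score(session_id, replica))
```

A replica that receives a request for a session it does not own could forward it to `owner_for(session_id, replicas)`, or return that value as a routing hint to the ingress layer.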
3. Illustrative Middleware-Oriented Model (Python, Non-Normative)
Rationale
Backward Compatibility
No protocol changes are introduced. Existing clients and servers remain fully compatible. Adoption of HA patterns is optional and implementation-defined.
Security Implications
No new security surfaces are introduced by this SEP. Implementers should consider standard security practices for distributed coordination, pub-sub, and session forwarding.
Reference Implementation
Additional Optional Sections
Performance Implications
Testing Plan
Alternatives Considered
Open Questions
Acknowledgments