MCP Toolset cleanup race condition when running parallel evaluations #4155

@alexhermida

Description

When running an agent eval set with multiple eval cases and an MCPToolset, the evaluation fails most of the time with CancelledError and warnings about cancel scope violations. This also fails in version 1.22.1.

This happens from the UI, adk eval, or pytest. The root cause is a race condition: multiple parallel Runner instances share the same MCPToolset (via a shared root_agent), and each Runner independently calls close() on exit.

As mentioned, the failure is intermittent: whether it triggers depends on your agent, the tools used, and the eval set.

I think this is also related to #3161.

Bug Behavior

The evaluation fails intermittently with errors like:

asyncio.exceptions.CancelledError: Cancelled via cancel scope <id> by <Task ...>

And warnings:

WARNING - Toolset MCPToolset cleanup cancelled: Cancelled via cancel scope...
Warning: Error during MCP session cleanup for stdio_session: Attempted to exit cancel scope in a different task than it was entered in
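
For context, the second warning is anyio enforcing its rule that a cancel scope must be exited by the same task that entered it. Here is a minimal sketch, independent of ADK, that reproduces the same RuntimeError (with anyio 4.x it surfaces wrapped in an ExceptionGroup):

import anyio

async def main():
    scope = anyio.CancelScope()
    scope.__enter__()  # cancel scope entered in the parent task

    async def exit_in_other_task():
        # anyio forbids this: raises RuntimeError("Attempted to exit cancel
        # scope in a different task than it was entered in")
        scope.__exit__(None, None, None)

    try:
        async with anyio.create_task_group() as tg:
            tg.start_soon(exit_in_other_task)
    finally:
        scope.__exit__(None, None, None)  # proper exit, in the entering task

anyio.run(main)

This appears to be exactly what happens when one Runner's close() tears down an MCP stdio session that was opened inside a cancel scope belonging to another task.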

Steps to Reproduce

  1. Create an agent that uses MCPToolset; either stdio or streamable-http works.
  2. Create an eval set with at least two eval cases; with only one case nothing runs in parallel, so the failure won't trigger.
  3. Run the evaluation with adk eval, with pytest using AgentEvaluator.evaluate(), or from the UI selecting both eval cases.
  4. The test fails intermittently; depending on the setup, it fails most of the time.

Inspired by the example shared in PR #3161 in the following comment.

And following the approach from https://google.github.io/adk-docs/evaluate/#recommendations-on-criteria.

Files needed:

  • test_agent/__init__.py
  • test_agent/agent.py
  • test_agent/eval/test_eval.py
  • mcp_server.py
  • and your eval_sets

Example agent (agent.py):

from google.adk.agents import Agent
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset, StdioServerParameters

root_agent = Agent(
    name="test_agent",
    model="gemini-2.5-flash",
    instruction="You are a calculator assistant.",
    tools=[
        MCPToolset(
            connection_params=StdioServerParameters(
                command="python",
                args=["mcp_server.py"],
            ),
        )
    ],
)
Example MCP server (mcp_server.py):

from fastmcp import FastMCP

# Initialize the MCP server
mcp = FastMCP("Addition Server")


# Define a tool to add two numbers
@mcp.tool()
def add_two_numbers(a: int, b: int) -> int:
    """
    Adds two numbers together.
    """
    return a + b


# Run the MCP server
if __name__ == "__main__":
    mcp.run()

You can create an eval set and two eval cases using the adk web UI.

Then, either run both eval cases in the UI or create the following test:

Example test (test_agent/eval/test_eval.py):

import pytest

from google.adk.evaluation.agent_evaluator import AgentEvaluator


@pytest.mark.asyncio
async def test_basic_mcp_connection():
    await AgentEvaluator.evaluate(
        "test_agent",
        "test_agent/eval/eval_data/basic_test.evalset.json",
        num_runs=1,
    )

Root Cause Analysis

The issue is in the evaluation flow:

  1. LocalEvalService.perform_inference() (local_eval_service.py#L175-L189) runs multiple inferences in parallel (default parallelism=4) using the same root_agent

  2. EvaluationGenerator._generate_inferences_from_root_agent() (evaluation_generator.py#L236-L243) creates a new Runner for each parallel inference, but passes the shared root_agent

  3. Each Runner's close() (runners.py#L1489-L1493) calls toolset.close() on all toolsets from the shared agent

  4. Race condition: when Runner 1 finishes first, it closes the shared MCPToolset. Runners 2, 3, and 4 are still running and either (see the sketch after this list):

    • Try to use the now-closed MCP connection
    • Also try to close the already-closed toolset
    • Violate anyio's CancelScope task context rules
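
To make the interleaving concrete, here is a minimal sketch of the race, independent of ADK; SharedToolset and run_eval_case are hypothetical stand-ins for the shared MCPToolset and the per-Runner inference:

import asyncio

class SharedToolset:
    """Hypothetical stand-in for the MCPToolset shared via root_agent."""

    def __init__(self):
        self.closed = False

    async def call_tool(self):
        if self.closed:
            # With the real toolset this surfaces as CancelledError or
            # cancel scope warnings rather than a clean RuntimeError.
            raise RuntimeError("MCP connection already closed")
        await asyncio.sleep(0.01)  # simulate an MCP round trip

    async def close(self):
        self.closed = True

async def run_eval_case(toolset, delay):
    await asyncio.sleep(delay)  # eval cases finish at different times
    await toolset.call_tool()   # may hit a toolset another Runner closed
    await toolset.close()       # every Runner closes the *shared* toolset

async def main():
    toolset = SharedToolset()
    # parallelism=4: the fastest task closes the toolset under the others
    await asyncio.gather(*(run_eval_case(toolset, d) for d in (0.0, 0.02, 0.04, 0.06)))

asyncio.run(main())  # raises RuntimeError from the slower eval cases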

Proposed Solution

I'm not very familiar with the code base yet, but my approach would be to update LocalEvalService and MCPToolset: make MCPToolset.close() idempotent and manage the toolset lifecycle explicitly in LocalEvalService, since that would have minimal impact. A sketch of the idempotent close() follows below.
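
A minimal sketch of the idempotent close(), assuming a guard flag and lock (the _closed and _close_lock names are hypothetical, not existing MCPToolset attributes):

import asyncio

class IdempotentToolset:
    """Hypothetical stand-in for MCPToolset with an idempotent close()."""

    def __init__(self):
        self._closed = False
        self._close_lock = asyncio.Lock()

    async def close(self):
        # Only the first caller performs cleanup; later callers return early.
        async with self._close_lock:
            if self._closed:
                return
            self._closed = True
            await self._do_cleanup()

    async def _do_cleanup(self):
        print("cleaning up MCP sessions once")

async def main():
    toolset = IdempotentToolset()
    # Four "Runners" closing the shared toolset: cleanup runs exactly once.
    await asyncio.gather(*(toolset.close() for _ in range(4)))

asyncio.run(main())

With this guard the duplicate close() calls from Runners 2-4 become no-ops; moving lifecycle ownership into LocalEvalService would additionally help ensure cleanup runs in the task that opened the MCP session.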

I know the project is very active and I'm not sure whether there is any related progress on this issue, but I'm happy to open a PR to fix it if that's helpful.


Environment

  • Python: 3.13.0
  • OS: macOS (Darwin)
  • ADK version: 1.22.1
  • Related packages: anyio, fastmcp

Additional Context

  • With few tools and eval cases, tests sometimes pass if the parallel tasks happen to complete in a "safe" order
  • Setting parallelism=1 in evaluation config works around the issue but defeats the purpose of parallel evaluation
  • The issue affects both adk eval CLI and AgentEvaluator.evaluate() in pytest
