Description
When running an agent eval set with multiple eval cases and an MCPToolset, the evaluation fails most of the time with CancelledError and warnings about cancel scope violations. This also fails in version 1.22.1.
It happens from the UI, adk eval, and pytest. The root cause is a race condition: multiple parallel Runner instances share the same MCPToolset (via a shared root_agent), and each Runner independently calls close() on exit.
As mentioned, the failure is intermittent; whether it reproduces depends on your agent, the tools used, and the eval set.
I think this is also related to #3161.
Bug Behavior
The evaluation fails intermittently with errors like:
```
asyncio.exceptions.CancelledError: Cancelled via cancel scope <id> by <Task ...>
```
And warnings:
```
WARNING - Toolset MCPToolset cleanup cancelled: Cancelled via cancel scope...
Warning: Error during MCP session cleanup for stdio_session: Attempted to exit cancel scope in a different task than it was entered in
```
Steps to Reproduce
- Create an agent that uses `MCPToolset`; either stdio or streamable-http works.
- Create an eval set with at least two eval cases. With only one case, nothing runs in parallel, so it won't fail.
- Run the evaluation with `adk eval`, with pytest using `AgentEvaluator.evaluate()`, or from the UI selecting both eval cases.
- The test fails randomly; depending on the setup, it fails most of the time.
This is inspired by the example shared in PR #3161 in the following comment, and follows the approach from https://google.github.io/adk-docs/evaluate/#recommendations-on-criteria.
Files needed:
- `test_agent/__init__.py`
- `test_agent/agent.py`
- `test_agent/eval/test_eval.py`
- `mcp_server.py`
- your eval sets
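For reference, assuming a layout like the following (the eval_data path matches what the test below expects; mcp_server.py sits wherever the stdio command can find it, here the project root):
```
.
├── mcp_server.py
└── test_agent/
    ├── __init__.py
    ├── agent.py
    └── eval/
        ├── test_eval.py
        └── eval_data/
            └── basic_test.evalset.json
```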
Example agent (agent.py):
```python
from google.adk.agents import Agent
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset, StdioServerParameters

root_agent = Agent(
    name="test_agent",
    model="gemini-2.5-flash",
    instruction="You are a calculator assistant.",
    tools=[
        MCPToolset(
            connection_params=StdioServerParameters(
                command="python",
                args=["mcp_server.py"],
            ),
        )
    ],
)
```
Example MCP server (mcp_server.py):
```python
from fastmcp import FastMCP

# Initialize the MCP server
mcp = FastMCP("Addition Server")


# Define a tool to add two numbers
@mcp.tool()
def add_two_numbers(a: int, b: int) -> int:
    """
    Adds two numbers together.
    """
    return a + b


# Run the MCP server
if __name__ == "__main__":
    mcp.run()
```
You can create an eval set and two eval cases using the adk web UI.
Then either run both eval cases in the UI or create the following test:
Example test (test_agent/eval/test_eval.py):
```python
import pytest

from google.adk.evaluation.agent_evaluator import AgentEvaluator


@pytest.mark.asyncio
async def test_basic_mcp_connection():
    await AgentEvaluator.evaluate(
        "test_agent",
        "test_agent/eval/eval_data/basic_test.evalset.json",
        num_runs=1,
    )
```
Root Cause Analysis
The issue is in the evaluation flow:
- `LocalEvalService.perform_inference()` (local_eval_service.py#L175-L189) runs multiple inferences in parallel (default `parallelism=4`) using the same `root_agent`.
- `EvaluationGenerator._generate_inferences_from_root_agent()` (evaluation_generator.py#L236-L243) creates a new Runner for each parallel inference, but passes the shared `root_agent`.
- Each Runner's `close()` (runners.py#L1489-L1493) calls `toolset.close()` on all toolsets from the shared agent.
- Race condition: when Runner 1 finishes first, it closes the shared `MCPToolset`. Runners 2, 3, and 4 are still running and either:
  - try to use the now-closed MCP connection,
  - also try to close the already-closed toolset, or
  - violate anyio's `CancelScope` task-context rules.
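To make the race concrete, here is a minimal, self-contained asyncio sketch of the same pattern. It is plain Python, not ADK code; names like SharedToolset and run_one_eval_case are illustrative only. Several tasks share one object, each closes it on exit, and the slower tasks find it already closed:
```python
import asyncio


class SharedToolset:
    """Stand-in for the shared MCPToolset (illustration only, not ADK code)."""

    def __init__(self):
        self.closed = False

    async def call_tool(self, x: int) -> int:
        if self.closed:
            raise RuntimeError("toolset already closed by another runner")
        await asyncio.sleep(0.01)  # simulate an MCP round trip
        return x + 1

    async def close(self):
        # No guard: a second close, or a close while another task is mid-call,
        # is the analogue of the CancelledError / cancel-scope warnings above.
        self.closed = True


async def run_one_eval_case(toolset: SharedToolset, case_id: int) -> int:
    # Mirrors EvaluationGenerator creating one Runner per eval case while
    # reusing the shared toolset from root_agent. Cases take different
    # amounts of time, as real eval cases do.
    await asyncio.sleep(0.01 * case_id)
    result = await toolset.call_tool(case_id)
    await toolset.close()  # each "Runner" closes the shared toolset on exit
    return result


async def main():
    toolset = SharedToolset()
    # parallelism=4, matching LocalEvalService.perform_inference()'s default
    results = await asyncio.gather(
        *(run_one_eval_case(toolset, i) for i in range(4)),
        return_exceptions=True,
    )
    print(results)  # the fastest case wins; slower ones see a closed toolset


asyncio.run(main())
```
In the real flow the cleanup also goes through anyio cancel scopes tied to the MCP session, which is why the failure surfaces as CancelledError and cancel-scope warnings rather than a plain exception.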
Proposed Solution
I'm not very familiar with the code base yet, but my approach would be to update LocalEvalService and MCPToolset: make MCPToolset.close() idempotent and manage the toolset lifecycle explicitly in LocalEvalService, since that would have minimal impact.
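A minimal sketch of the idempotent part, assuming a simple guard around close(); the class and method names here are hypothetical and do not reflect the actual MCPToolset internals:
```python
import asyncio


class IdempotentCloseSketch:
    """Hypothetical sketch of an idempotent close() guard, not the real MCPToolset API."""

    def __init__(self):
        self._close_lock = asyncio.Lock()
        self._closed = False

    async def close(self):
        async with self._close_lock:
            if self._closed:
                return  # later calls from other Runners become no-ops
            self._closed = True
            await self._do_close()

    async def _do_close(self):
        # The real cleanup (exiting the MCP session / exit stack) would go here,
        # ideally on the same task that opened the session so anyio's
        # cancel-scope rules are respected.
        pass
```
The explicit lifecycle management would then live in LocalEvalService, which would close the toolsets only after all parallel inferences have finished.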
I know the project is very active and I'm not sure whether there is already related progress on this issue, but I'm happy to open a PR to fix it if that's helpful.
Environment
- Python: 3.13.0
- OS: macOS (Darwin)
- ADK version: 1.22.1
- Related packages: anyio, fastmcp
Additional Context
- With few tools and eval cases, the tests sometimes pass if the parallel tasks happen to complete in a "safe" order.
- Setting `parallelism=1` in the evaluation config works around the issue but defeats the purpose of parallel evaluation.
- The issue affects both the `adk eval` CLI and `AgentEvaluator.evaluate()` in pytest.