Description
When running an agent eval set with multiple eval cases and an MCPToolset, the evaluation fails most of the time with CancelledError and warnings about cancel scope violations. This also fails in version 1.22.1.
It happens from the UI, adk eval, and pytest. The root cause is a race condition: multiple parallel Runner instances share the same MCPToolset (via a shared root_agent), and each Runner independently calls close() on exit.
As mentioned, the failure is intermittent; whether it reproduces depends on your agent, the tools used, and the eval set.
I think this is also related to #3161.
Bug Behavior
The evaluation fails intermittently with errors like:
```
asyncio.exceptions.CancelledError: Cancelled via cancel scope <id> by <Task ...>
```
And warnings:
```
WARNING - Toolset MCPToolset cleanup cancelled: Cancelled via cancel scope...
Warning: Error during MCP session cleanup for stdio_session: Attempted to exit cancel scope in a different task than it was entered in
```
Steps to Reproduce
- Create an agent that uses `MCPToolset`; either stdio or streamable-http works.
- Create an eval set with at least two eval cases. With only one case, nothing runs in parallel, so it won't fail.
- Run the evaluation with `adk eval`, with pytest using `AgentEvaluator.evaluate()`, or from the UI selecting both eval cases.
- The test fails randomly; depending on the setup, it fails most of the time.
This is inspired by the example shared in PR #3161 in the following comment, and follows the approach from https://google.github.io/adk-docs/evaluate/#recommendations-on-criteria.
Files needed:
- `test_agent/__init__.py`
- `test_agent/agent.py`
- `test_agent/eval/test_eval.py`
- `mcp_server.py`
- your eval sets
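For reference, assuming a layout like the following (the eval_data path matches what the test below expects; mcp_server.py sits wherever the stdio command can find it, here the project root):
```
.
├── mcp_server.py
└── test_agent/
    ├── __init__.py
    ├── agent.py
    └── eval/
        ├── test_eval.py
        └── eval_data/
            └── basic_test.evalset.json
```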
Example agent (agent.py):
```python
from google.adk.agents import Agent
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset, StdioServerParameters

root_agent = Agent(
    name="test_agent",
    model="gemini-2.5-flash",
    instruction="You are a calculator assistant.",
    tools=[
        MCPToolset(
            connection_params=StdioServerParameters(
                command="python",
                args=["mcp_server.py"],
            ),
        )
    ],
)
```
Example MCP server (mcp_server.py):
```python
from fastmcp import FastMCP

# Initialize the MCP server
mcp = FastMCP("Addition Server")


# Define a tool to add two numbers
@mcp.tool()
def add_two_numbers(a: int, b: int) -> int:
    """
    Adds two numbers together.
    """
    return a + b


# Run the MCP server
if __name__ == "__main__":
    mcp.run()
```
You can create an eval set and two eval cases using the adk web UI.
Then either run both eval cases in the UI or create the following test:
Example test (test_agent/eval/test_eval.py):
```python
import pytest

from google.adk.evaluation.agent_evaluator import AgentEvaluator


@pytest.mark.asyncio
async def test_basic_mcp_connection():
    await AgentEvaluator.evaluate(
        "test_agent",
        "test_agent/eval/eval_data/basic_test.evalset.json",
        num_runs=1,
    )
```
Root Cause Analysis
The issue is in the evaluation flow:
- `LocalEvalService.perform_inference()` (local_eval_service.py#L175-L189) runs multiple inferences in parallel (default `parallelism=4`) using the same `root_agent`.
- `EvaluationGenerator._generate_inferences_from_root_agent()` (evaluation_generator.py#L236-L243) creates a new Runner for each parallel inference, but passes the shared `root_agent`.
- Each Runner's `close()` (runners.py#L1489-L1493) calls `toolset.close()` on all toolsets from the shared agent.
- Race condition: when Runner 1 finishes first, it closes the shared `MCPToolset`. Runners 2, 3, and 4 are still running and either:
  - try to use the now-closed MCP connection,
  - also try to close the already-closed toolset, or
  - violate anyio's `CancelScope` task-context rules.
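To make the race concrete, here is a minimal, self-contained asyncio sketch of the same pattern. It is plain Python, not ADK code; names like SharedToolset and run_one_eval_case are illustrative only. Several tasks share one object, each closes it on exit, and the slower tasks find it already closed:
```python
import asyncio


class SharedToolset:
    """Stand-in for the shared MCPToolset (illustration only, not ADK code)."""

    def __init__(self):
        self.closed = False

    async def call_tool(self, x: int) -> int:
        if self.closed:
            raise RuntimeError("toolset already closed by another runner")
        await asyncio.sleep(0.01)  # simulate an MCP round trip
        return x + 1

    async def close(self):
        # No guard: a second close, or a close while another task is mid-call,
        # is the analogue of the CancelledError / cancel-scope warnings above.
        self.closed = True


async def run_one_eval_case(toolset: SharedToolset, case_id: int) -> int:
    # Mirrors EvaluationGenerator creating one Runner per eval case while
    # reusing the shared toolset from root_agent. Cases take different
    # amounts of time, as real eval cases do.
    await asyncio.sleep(0.01 * case_id)
    result = await toolset.call_tool(case_id)
    await toolset.close()  # each "Runner" closes the shared toolset on exit
    return result


async def main():
    toolset = SharedToolset()
    # parallelism=4, matching LocalEvalService.perform_inference()'s default
    results = await asyncio.gather(
        *(run_one_eval_case(toolset, i) for i in range(4)),
        return_exceptions=True,
    )
    print(results)  # the fastest case wins; slower ones see a closed toolset


asyncio.run(main())
```
In the real flow the cleanup also goes through anyio cancel scopes tied to the MCP session, which is why the failure surfaces as CancelledError and cancel-scope warnings rather than a plain exception.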
Proposed Solution
I'm not very familiar with the code base yet, but my approach would be to update LocalEvalService and MCPToolset: make MCPToolset.close() idempotent and manage the toolset lifecycle explicitly in LocalEvalService, since that would have minimal impact.
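A minimal sketch of the idempotent part, assuming a simple guard around close(); the class and method names here are hypothetical and do not reflect the actual MCPToolset internals:
```python
import asyncio


class IdempotentCloseSketch:
    """Hypothetical sketch of an idempotent close() guard, not the real MCPToolset API."""

    def __init__(self):
        self._close_lock = asyncio.Lock()
        self._closed = False

    async def close(self):
        async with self._close_lock:
            if self._closed:
                return  # later calls from other Runners become no-ops
            self._closed = True
            await self._do_close()

    async def _do_close(self):
        # The real cleanup (exiting the MCP session / exit stack) would go here,
        # ideally on the same task that opened the session so anyio's
        # cancel-scope rules are respected.
        pass
```
The explicit lifecycle management would then live in LocalEvalService, which would close the toolsets only after all parallel inferences have finished.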
I know the project is very active and I'm not sure whether there is already related progress on this issue, but I'm happy to open a PR to fix it if that's helpful.
Environment
- Python: 3.13.0
- OS: macOS (Darwin)
- ADK version: 1.22.1
- Related packages: anyio, fastmcp
Additional Context
- With few tools and eval cases, the tests sometimes pass if the parallel tasks happen to complete in a "safe" order.
- Setting `parallelism=1` in the evaluation config works around the issue but defeats the purpose of parallel evaluation.
- The issue affects both the `adk eval` CLI and `AgentEvaluator.evaluate()` in pytest.