multiprocessing ResourceTracker can deadlock at shutdown when os.fork was used #146313

@gpshead

Bug report

Bug description:

A regression caused by #88887

The best context for this issue comes from two places: my (1) #88887 (comment) report and independent confirmation from @itamaro in (2) #88887 (comment)


(1) """a regression in processes using fork() where a reference to the resource_tracker's pipe remains alive in another process. https://github.com/gpshead/cpython/blob/00d16dca6e911fb69c055aa874a2d25cb5e5fe6a/Lib/test/_test_multiprocessing.py#L6293-L6306 has an example of a regression test that demonstrates it.

Basically, at process shutdown the new __del__ finalizer is called and can hang in waitpid on a child process that is not exiting.

We could sever that relationship so the fd isn't inherited. The shared resource_tracker used by multiple sub-child processes when the "fork" start_method is used would then no longer be a feature. That would undo #80849's #5172, which added it as a feature (cc: @pitrou & @tomMoral). "fork" as a start_method is also rather frowned upon these days, and people are better off avoiding it. But the default only changed away from it in 3.14, so a lot of people still use it; I encountered this in 3.13.9 and 3.13.11.
I would not undo a feature in a bugfix regardless.

One "easy" workaround for anyone actually hitting this is probably to restore the previous behavior and re-gain the original #88887 issue, which seems to have been uncommon:

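import multiprocessing.resource_tracker
# Remove the finalizer added for #88887, if this Python version has it.
if hasattr(multiprocessing.resource_tracker.ResourceTracker, "__del__"):
    del multiprocessing.resource_tracker.ResourceTracker.__del__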

A fix forward could basically be to undo #5172's feature."""
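The hang described above comes down to pipe semantics: a reader only sees EOF once every copy of the write end is closed, including copies inherited across fork(). The following is a minimal sketch of that behavior using plain pipes, not the actual resource_tracker code. The function name `demo_inherited_write_end` and the control pipe are hypothetical, introduced only for illustration, and the sketch assumes a POSIX system where `os.fork()` is available.

```python
import os
import select


def demo_inherited_write_end(timeout=0.2):
    """Show that EOF is withheld while a forked child holds the write end."""
    data_r, data_w = os.pipe()  # stands in for the tracker pipe
    ctl_r, ctl_w = os.pipe()    # hypothetical: lets the parent release the child

    pid = os.fork()
    if pid == 0:
        # Child: keep the inherited data_w open and block until released.
        os.close(ctl_w)
        os.read(ctl_r, 1)  # returns b"" once the parent closes ctl_w
        os._exit(0)

    os.close(ctl_r)
    # Parent closes its own copy of the write end, much like the finalizer
    # closing the tracker fd at shutdown.
    os.close(data_w)

    # The child still holds a copy of data_w, so the reader sees no EOF yet.
    ready_before, _, _ = select.select([data_r], [], [], timeout)

    # Release the child and wait for it; its copy of data_w closes on exit.
    os.close(ctl_w)
    os.waitpid(pid, 0)

    ready_after, _, _ = select.select([data_r], [], [], timeout)
    eof = os.read(data_r, 1) if ready_after else None
    os.close(data_r)
    return bool(ready_before), eof


if __name__ == "__main__":
    print(demo_inherited_write_end())
```

In the real bug the reader is the tracker process and the waiter is the parent's `waitpid()` call, so the parent blocks for as long as any forked process keeps the inherited fd open.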


(2) """hey @gpshead, I believe I ran into this at least twice now, while migrating Meta to 3.12.

Trying to create a minimal reproducer, here's what I got:

import os
import sys
import time
from multiprocessing.resource_tracker import ensure_running

# Step 1: Start the resource tracker (creates the pipe with fds r, w).
ensure_running()
print("Resource tracker started.", flush=True)

# Step 2: Fork. The child inherits the write-end fd of the tracker pipe.
pid = os.fork()

if pid == 0:
    # Child: stay alive so the inherited write-end fd remains open,
    # preventing the tracker from seeing EOF.
    print(f"[child {os.getpid()}] sleeping (holds write fd open)...", flush=True)
    time.sleep(100.0)
    print(f"[child {os.getpid()}] exiting...", flush=True)
    sys.exit(0)
else:
    # Parent: exit normally. During shutdown, ResourceTracker.__del__
    # closes the write fd and calls waitpid() on the tracker process.
    # The tracker never exits because the child still has the fd open.
    print(f"[parent {os.getpid()}] exiting normally (child={pid})...", flush=True)

and here's what I ended up doing in our global sitecustomize.py to work around it:
https://github.com/facebook/buck2/blob/271de04a2a00041cee2e9e18d896fcd24f241598/prelude/python/tools/make_par/sitecustomize.py#L203-L246
(briefly: register an at-fork callback that runs after fork in the child and resets the resource tracker inherited from the parent, if it was started)"""
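For readers who want a feel for that kind of workaround, here is a rough sketch of an at-fork reset. It is not a copy of the linked sitecustomize.py. It assumes the CPython implementation details `resource_tracker._resource_tracker`, `_fd`, and `_pid`, which are private and may change between versions, and the function name `reset_inherited_tracker` is hypothetical. `os.register_at_fork` is the documented hook used to run it in the child.

```python
import os
from multiprocessing import resource_tracker


def reset_inherited_tracker(tracker=None):
    """Drop the tracker pipe fd that a forked child inherited from its parent.

    Assumption: the tracker object keeps its state in the private attributes
    ``_fd`` and ``_pid`` (CPython implementation details).
    """
    if tracker is None:
        # Private module-level singleton; an implementation detail.
        tracker = resource_tracker._resource_tracker
    fd = getattr(tracker, "_fd", None)
    if fd is None:
        return  # the tracker was never started in the parent
    try:
        os.close(fd)  # close this process's copy of the write end
    except OSError:
        pass
    # Forget the parent's tracker so shutdown in the child has nothing to wait on.
    tracker._fd = None
    tracker._pid = None


# Run the reset in every child created via os.fork().
os.register_at_fork(after_in_child=reset_inherited_tracker)
```

The trade-off is the one discussed above: a forked child no longer shares the parent's tracker, so if it uses multiprocessing resources it will presumably start its own.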


My first draft of a fix, with a regression test that tries to reproduce the hang, was in main...gpshead:cpython:claude/fix-resource-tracker-hang-XZw5P from January.

I'll turn something here into a real fix.

CPython versions tested on:

3.12, 3.13

Operating systems tested on:

Linux

Labels

stdlib, topic-multiprocessing, type-bug
