gh-146313: Fix multiprocessing ResourceTracker deadlock after os.fork() #146316

Draft
gpshead wants to merge 1 commit into python:main from gpshead:gh-146313-single

@gpshead gpshead commented Mar 23, 2026

Problem

ResourceTracker.__del__ (added in gh-88887) calls os.waitpid(pid, 0),
which blocks indefinitely if a process created via os.fork() still
holds the tracker pipe's write end. The tracker never sees EOF, never
exits, and the parent hangs at interpreter shutdown.

Root cause

Three requirements conflict:

- gh-88887 wants the parent to reap the tracker to prevent zombies
- gh-80849 wants mp.Process(fork) children to reuse the parent's
  tracker via the inherited pipe fd
- gh-146313 shows the parent can't block in waitpid() if a child
  holds the fd -- the tracker won't see EOF until all copies close
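
The "all copies must close" point can be illustrated without forking at all.
The sketch below is not taken from the patch: it uses os.dup() as a stand-in
for the copy of the write end that a forked child would inherit, and the
helper name reader_at_eof is hypothetical. It assumes a POSIX system and
that nothing is ever written to the pipe.

```python
import os
import select


def reader_at_eof(read_fd, timeout=0.1):
    """Return True if a read on read_fd would hit EOF right now.

    Illustrative helper (hypothetical name). It assumes no data is ever
    written to the pipe, so "readable" can only mean EOF.
    """
    ready, _, _ = select.select([read_fd], [], [], timeout)
    if not ready:
        return False  # some write end is still open somewhere
    return os.read(read_fd, 1) == b""


def demo():
    r, w = os.pipe()
    w_copy = os.dup(w)   # stands in for the fd a forked child inherits
    os.close(w)          # the "parent" closes its own write end
    before = reader_at_eof(r)   # copy still open, so no EOF yet
    os.close(w_copy)     # the "child" finally closes its copy
    after = reader_at_eof(r)    # last write end gone, reader sees EOF
    os.close(r)
    return before, after


if __name__ == "__main__":
    print(demo())  # (False, True)
```

The reader only observes EOF after the last duplicate of the write end is
closed, which is why a lingering os.fork() child keeps the tracker alive.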

Fix

Two layers:

Timeout safety-net. _stop_locked() gains a wait_timeout parameter.
When called from __del__, it polls with WNOHANG using exponential
backoff for up to 1 second instead of blocking indefinitely.
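
A rough sketch of what such a bounded wait could look like follows. It is
not the code from this PR: the function name reap_with_timeout and the
initial_delay and max_delay values are assumptions chosen for illustration;
only the WNOHANG polling, the exponential backoff, and the 1 second bound
come from the description above.

```python
import os
import time


def reap_with_timeout(pid, timeout=1.0, initial_delay=0.001, max_delay=0.05):
    """Poll os.waitpid(pid, WNOHANG) with exponential backoff.

    Returns True if the child was reaped (or was already reaped elsewhere),
    False if it was still running when the timeout expired.
    Hypothetical helper; the delay constants are illustrative assumptions.
    """
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        try:
            reaped_pid, _status = os.waitpid(pid, os.WNOHANG)
        except ChildProcessError:
            return True  # no such child: someone else already reaped it
        if reaped_pid == pid:
            return True  # child exited and has now been reaped
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # give up instead of blocking forever
        time.sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay)  # back off, but cap the sleep
```

Capping the per-iteration sleep keeps the common case (the tracker exits
within a few milliseconds) fast while bounding the worst case by the timeout.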

At-fork handler. An os.register_at_fork(after_in_child=...) handler
closes the inherited pipe fd in the child unless a preserve flag is
set. popen_fork.Popen._launch() sets the flag before its fork so
mp.Process(fork) children keep the fd and reuse the parent's tracker
(preserving gh-80849). Raw os.fork() children close the fd, letting
the parent reap promptly.
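
The preserve-flag idea can be sketched roughly as below. This is a minimal
illustration under stated assumptions, not the patch itself: the module
globals _tracker_fd and _preserve_fd and the helpers set_tracker_fd and
fork_preserving_fd are hypothetical names, and the real ResourceTracker
keeps its own state. Only os.register_at_fork(after_in_child=...) is the
actual API named above.

```python
import os

# Hypothetical module state standing in for the tracker's own attributes.
_tracker_fd = None     # write end of the tracker pipe, if one exists
_preserve_fd = False   # True only around forks that should keep the fd


def set_tracker_fd(fd):
    """Record which fd the at-fork handler should manage (illustrative)."""
    global _tracker_fd
    _tracker_fd = fd


def _after_fork_in_child():
    """Runs in the child: drop the inherited fd unless asked to keep it."""
    global _tracker_fd
    if _tracker_fd is not None and not _preserve_fd:
        try:
            os.close(_tracker_fd)
        except OSError:
            pass  # already closed; nothing more to do
        _tracker_fd = None


os.register_at_fork(after_in_child=_after_fork_in_child)


def fork_preserving_fd():
    """Fork while keeping the fd in the child (the mp.Process(fork) path)."""
    global _preserve_fd
    _preserve_fd = True
    try:
        return os.fork()
    finally:
        # Runs in both parent and child, after the handler has already run.
        _preserve_fd = False
```

With this shape, a plain os.fork() child loses the write end immediately,
while a fork made through fork_preserving_fd() keeps it, because the handler
runs in the child before os.fork() returns and the flag is still set then.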

Result

  Scenario                                       Before      After
  Raw os.fork(), parent exits while child alive  deadlock    ~30ms reap
  mp.Process(fork), parent joins then exits      ~30ms reap  ~30ms reap
  mp.Process(fork), parent exits abnormally      deadlock    1s bounded wait
  No fork (gh-88887 scenario)                    ~30ms reap  ~30ms reap

The at-fork handler makes the timeout unreachable in all well-behaved
paths. The timeout remains as a safety net for abnormal shutdowns.