gh-130895: fix multiprocessing.Process join/wait/poll races #131440

duaneg · 2025-03-19T02:06:48Z

This bug is caused by race conditions in the poll implementations (which are called by join/wait): if multiple threads try to reap the dead process, only one "wins" and gets the exit code, while the others get an error.
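To illustrate the underlying race outside of multiprocessing, here is a minimal sketch assuming a POSIX system and Python 3.9+. It reaps one child twice with `os.waitpid`: the first call collects the status and the second fails. The `reap_twice` helper and the child command are made up for this illustration; they are not part of the PR.

```python
import os
import subprocess
import sys


def reap_twice():
    """Reap one child twice and report what each attempt observed."""
    # Hypothetical child that exits immediately with status 3.
    child = subprocess.Popen([sys.executable, "-c", "raise SystemExit(3)"])

    # First reaper: blocks until the child exits and collects its status.
    _, status = os.waitpid(child.pid, 0)
    first = os.waitstatus_to_exitcode(status)

    # Second reaper: the kernel has already discarded the child's status.
    try:
        os.waitpid(child.pid, 0)
        second = "status"
    except ChildProcessError:
        second = "ChildProcessError"

    # Record the code on the Popen object so it does not try to reap again.
    child.returncode = first
    return first, second
```

When two threads play the roles of the first and second reaper, whichever loses has no exit code to report, which is the situation the poll implementations have to cope with.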

In the forkserver implementation the losing thread(s) set the code to an error, possibly overwriting the correct code set by the winning thread. This is relatively easy to fix: we can just take a lock before waiting for the process, since at that point we know the call should not block.
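The idea of taking a lock before reading the status can be sketched with plain threads. This is a simulation under stated assumptions, not the code in popen_forkserver.py: `OneShotStatus` is a hypothetical stand-in for a sentinel whose status can be read only once, and `LockedPoller` and `race` are names invented for the example.

```python
import threading


class OneShotStatus:
    """Hypothetical stand-in for the sentinel: the status can be read once."""

    def __init__(self, status):
        self._status = status
        self._guard = threading.Lock()
        self._consumed = False

    def read(self):
        with self._guard:
            if self._consumed:
                raise EOFError("status already consumed")
            self._consumed = True
            return self._status


class LockedPoller:
    def __init__(self, source):
        self._source = source
        self._lock = threading.Lock()
        self.returncode = None

    def poll(self):
        # Assumes the caller already knows the process is dead, so read()
        # cannot block while the lock is held.
        with self._lock:
            if self.returncode is None:
                try:
                    self.returncode = self._source.read()
                except (OSError, EOFError):
                    # Error code; with the lock held, a losing thread never
                    # gets here because the winner has already set the code.
                    self.returncode = 255
            return self.returncode


def race(poller, n_threads=8):
    """Call poll() from several threads at once and collect the results."""
    results = []
    barrier = threading.Barrier(n_threads)

    def worker():
        barrier.wait()  # release all threads together to provoke the race
        results.append(poller.poll())

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Without the `with self._lock:` line, a losing thread could see `returncode is None`, fail the read, and overwrite the winner's code with 255.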

In the fork and spawn implementations the losers of the race return before the exit code is set, meaning the process may still report itself as alive after join returns. Fixing this is trickier as we have to support a mixture of blocking and non-blocking calls to poll, and we cannot have the latter waiting to take a lock held by the former.

The approach taken is to split the blocking and non-blocking call variants. The non-blocking variant does its work with the lock held: since it won't block this should be safe. The blocking variant releases the lock before making the blocking operating system call. It then retakes the lock and either sets the code if it wins or waits for a potentially racing thread to do so otherwise.
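The split described above might look roughly like the following sketch. It is a thread-only simulation, not the code in popen_fork.py: `FakeChild` is a hypothetical stand-in for a child whose status can be reaped once, and `SplitPoller`, `poll_nonblocking` and `wait_blocking` are illustrative names.

```python
import threading


class FakeChild:
    """Hypothetical stand-in for a child whose status can be reaped once."""

    def __init__(self):
        self._exited = threading.Event()
        self._guard = threading.Lock()
        self._status = None
        self._reaped = False

    def exit(self, status):
        self._status = status
        self._exited.set()

    def waitpid(self, block):
        if block:
            self._exited.wait()
        with self._guard:
            if self._reaped:
                raise ChildProcessError("already reaped")
            if not self._exited.is_set():
                return None  # still running
            self._reaped = True
            return self._status


class SplitPoller:
    def __init__(self, child):
        self._child = child
        self._cond = threading.Condition()
        self.returncode = None

    def poll_nonblocking(self):
        # All work happens with the lock held; nothing here can block.
        with self._cond:
            if self.returncode is None:
                try:
                    status = self._child.waitpid(block=False)
                except ChildProcessError:
                    return None  # lost to a blocking caller that has not published yet
                if status is not None:
                    self.returncode = status
                    self._cond.notify_all()
            return self.returncode

    def wait_blocking(self):
        with self._cond:
            if self.returncode is not None:
                return self.returncode
        # Lock released: the blocking wait must not hold it.
        try:
            status = self._child.waitpid(block=True)
        except ChildProcessError:
            status = None  # another thread reaped the child first
        with self._cond:
            if status is not None:
                self.returncode = status  # we won: publish the code
                self._cond.notify_all()
            else:
                # We lost: wait until the winner has published the code.
                self._cond.wait_for(lambda: self.returncode is not None)
            return self.returncode
```

The condition variable is one possible way to make the losers wait for the winner; the PR may synchronise differently.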

If a non-blocking call is racing with the unlocked part of a blocking call it may still "lose" the race, and return None instead of the exit code, even though the process is dead. However, as the process could be alive at the time the call is made but die immediately afterwards, this situation should already be handled by correctly written code.
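As an illustration of what "correctly written code" could mean here, the sketch below treats a `None` exit code as "not known yet" rather than as proof that the process is alive, and falls back to a bounded `join`. The helper name `wait_for_exitcode` and its parameters are assumptions made for this example; it relies only on the `exitcode` attribute and `join(timeout)` method of `multiprocessing.Process`.

```python
import time


def wait_for_exitcode(proc, timeout=5.0, interval=0.05):
    """Return proc.exitcode, or None if it is still unknown after timeout."""
    deadline = time.monotonic() + timeout
    while True:
        code = proc.exitcode  # non-blocking; None means "not known yet"
        if code is not None:
            return code
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None
        # Bounded blocking wait, then look at the exit code again.
        proc.join(min(interval, remaining))
```

A caller written this way behaves the same whether `None` came from a live process or from a non-blocking poll that lost the race.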

To verify the behaviour a test is added which reliably triggers failures for all three implementations. A work-around for this bug in a test added for gh-128041 is also reverted.

Issue: multiprocessing.Process.is_alive() can incorrectly return True after join() #130895

ghost · 2025-03-19T02:06:51Z

All commit authors signed the Contributor License Agreement.

bedevere-app · 2025-03-19T02:06:52Z

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

duaneg · 2025-11-03T00:05:59Z

It might be a good idea to open a PR with just the fix for forkserver under #140867. That issue describes the problem with forkserver better, and the fix for forkserver is much simpler and safer than the fix for fork/spawn. It makes sense to treat them separately.

@zmedico if you want to do that, please go ahead, and feel free to copy the unit test from this PR if it is helpful 🙂

YvesDup · 2025-12-09T09:40:41Z

This fix looks good to me.

duaneg requested a review from gpshead as a code owner March 19, 2025 02:06

bedevere-app bot added the awaiting review label Mar 19, 2025

bedevere-app bot mentioned this pull request Mar 19, 2025

multiprocessing.Process.is_alive() can incorrectly return True after join() #130895

Open

Add blurb

ad102f4

zmedico reviewed Nov 2, 2025

Lib/multiprocessing/popen_forkserver.py (outdated)

zmedico reviewed Nov 2, 2025

Lib/multiprocessing/popen_fork.py

Return status code after timeout: usually this will be None, as we are returning now, except if we raced with another thread that set it just after our timeout expired.

87f391f

duaneg added 2 commits November 3, 2025 13:07

Merge remote-tracking branch 'origin/main' into waiting/pythongh-130895

6ff7c04

Ignore fork-in-thread deprecation warnings in test, as now required

5b4dbb0

YvesDup reviewed Nov 25, 2025

Lib/multiprocessing/popen_fork.py

YvesDup reviewed Nov 25, 2025

Lib/multiprocessing/popen_fork.py

YvesDup reviewed Nov 28, 2025

Lib/multiprocessing/popen_forkserver.py (outdated)

Initialise synchronisation-related attributes in polymorphic method so each class can do so as it requires. Co-authored-by: Duprat <yduprat@gmail.com>

e7f7144
