Great to know that this can help resolve SPARK-53759. I think backporting
the minimal fix is a good idea. If you can prepare the backport PRs and keep
them simple, they will be easier to review.

I wonder if Hyukjin or Dongjoon have any concerns about this.

Best regards,
Yicong Huang <https://yicong-huang.github.io>
[email protected]
On Apr 4, 2026 at 7:23 PM -0700, Antonio Blanco <[email protected]> wrote:


Hi all,

I'd like to propose backporting the fix for SPARK-53759 to the active
release branches (4.1, 4.0, and 3.5).

SPARK-53759 is a critical bug where PySpark crashes deterministically on
Windows with Python 3.12+. Windows always uses the simple-worker codepath
(because os.fork() is unavailable), and the worker's socket connection was
missing an explicit flush() before close(). On Python 3.12+, changed GC
finalization ordering [1] causes the underlying socket to close before the
write buffer is flushed, silently losing task results. The JVM sees
EOFException.
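
To illustrate the buffering behavior described above, here is a minimal
sketch using only the Python standard library. It is not PySpark code: the
function name, the payload, and the 0.5 second timeout are assumptions made
for illustration. It shows that bytes written to a buffered socket file
stay in userspace until flush() is called, so anything that closes the
socket first can lose them.

```python
import socket


def peer_sees_data(flush_first: bool) -> bool:
    """Return True if the peer can read the payload right after the write."""
    writer_sock, reader_sock = socket.socketpair()
    # Buffered file object on top of the socket (hypothetical buffer size).
    out = writer_sock.makefile("wb", buffering=65536)
    try:
        out.write(b"task-result")  # small write: stays in the userspace buffer
        if flush_first:
            out.flush()  # hands the buffered bytes to the OS socket
        reader_sock.settimeout(0.5)
        try:
            return reader_sock.recv(64) == b"task-result"
        except socket.timeout:
            return False  # nothing arrived: the data is still buffered
    finally:
        out.close()
        writer_sock.close()
        reader_sock.close()
```

Calling peer_sees_data(True) and peer_sees_data(False) contrasts the two
cases: only the flushed write is visible to the reading side.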

This was incidentally fixed on master by PR #54458 (SPARK-55665), which
unified worker socket handling across 14 files. I confirmed the fix is
present in pyspark==4.2.0.dev3 on PyPI but not in any stable release: all
versions through 4.1.1 are affected.

Since PR #54458 is a large refactor (14 files), a clean cherry-pick to
release branches may not be straightforward. However, the actual fix for
SPARK-53759 is small: just adding flush() before close() in the worker's
finally block, mirroring what daemon.py already does. I've prepared minimal
backport branches for review:

- branch-4.1:
https://github.com/anblanco/spark/tree/fix/SPARK-53759-simple-worker-flush
- (Can prepare branch-4.0 and branch-3.5 variants if there's interest)
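
The shape of the minimal fix (flush() before close() in a finally block)
can be sketched as follows. This is a standalone illustration under stated
assumptions, not the code from the backport branch: send_result and
round_trip are hypothetical names, and a socketpair stands in for the
worker-to-JVM connection.

```python
import socket


def send_result(sock: socket.socket, payload: bytes) -> None:
    """Write payload through a buffered file, flushing explicitly before close."""
    out = sock.makefile("wb", buffering=65536)
    try:
        out.write(payload)  # lands in the userspace buffer first
    finally:
        out.flush()  # explicit flush, so delivery does not depend on finalizer order
        out.close()
        sock.close()


def round_trip(payload: bytes) -> bytes:
    """Send payload over a local socket pair and return what the peer reads."""
    worker_side, jvm_side = socket.socketpair()  # stand-in for the real connection
    send_result(worker_side, payload)
    received = b""
    while True:
        chunk = jvm_side.recv(4096)
        if not chunk:  # empty read means the sender closed its end
            break
        received += chunk
    jvm_side.close()
    return received
```

With the explicit flush in place, round_trip(b"task-result") returns the
full payload before the reader sees end-of-stream.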

I put together a reproducer with a test matrix and full root cause analysis
here: https://github.com/anblanco/spark53759-reproducer

The bug has been open since September 2025 and affects all Windows users on
Python 3.12+, which is now the default Python on most systems. I think the
impact warrants backporting, especially given how small the fix is.

Note that branch-3.5 LTS ends April 12 — if a backport is appropriate there,
it would need to happen soon.

Happy to prepare the backport PRs if maintainers agree this is worth doing.

Thank you for your time,
Antonio Blanco

[1] https://github.com/python/cpython/issues/97922
[2] https://issues.apache.org/jira/browse/SPARK-53759
[3] https://github.com/apache/spark/pull/54458
--
Antonio
<witty signature />
