Great to know that this can help resolve SPARK-53759. I think it is a good idea to backport the minimal fix. If you can prepare the backport PRs (and keep them simple), they will be easier to get reviewed.
I wonder if Hyukjin or Dongjoon have any concern about this.

Best regards,
Yicong Huang <https://yicong-huang.github.io>
[email protected]

On Apr 4, 2026 at 7:23 PM -0700, Antonio Blanco <[email protected]>, wrote:

Hi all,

I'd like to propose backporting the fix for SPARK-53759 to the active release branches (4.1, 4.0, and 3.5).

SPARK-53759 is a critical bug where PySpark crashes deterministically on Windows with Python 3.12+. Windows always uses the simple-worker codepath (because os.fork() is unavailable), and the worker's socket connection was missing an explicit flush() before close(). On Python 3.12+, changed GC finalization ordering [1] causes the underlying socket to close before the write buffer is flushed, silently losing task results. The JVM sees EOFException.

This was incidentally fixed on master by PR #54458 (SPARK-55665), which unified worker socket handling across 14 files. I confirmed the fix is present in pyspark==4.2.0.dev3 on PyPI but not in any stable release; all versions through 4.1.1 are affected.

Since PR #54458 is a large refactor (14 files), a clean cherry-pick to release branches may not be straightforward. However, the actual fix for SPARK-53759 is small: just adding flush() before close() in the worker's finally block, mirroring what daemon.py already does.
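
For illustration, the flush-before-close pattern described above can be sketched as follows. This is a minimal standalone sketch, not the actual Spark worker code: the helper names send_result and read_all are hypothetical, and it uses socket.socketpair() in place of the real worker-to-JVM connection. It shows only the shape of the fix (an explicit flush() in the finally block before close()) and does not attempt to reproduce the GC ordering issue itself.

```python
import socket


def send_result(sock: socket.socket, payload: bytes) -> None:
    """Write payload through a buffered file wrapper, flushing before close.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    outfile = sock.makefile("wb", 65536)
    try:
        # The bytes sit in the userspace buffer until they are flushed.
        outfile.write(payload)
    finally:
        try:
            # Push buffered bytes to the socket explicitly, rather than
            # relying on close() or finalization order to do it.
            outfile.flush()
        finally:
            outfile.close()
            sock.close()


def read_all(sock: socket.socket) -> bytes:
    """Read from sock until the peer closes its end."""
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    writer, reader = socket.socketpair()
    send_result(writer, b"task-result" * 10)
    received = read_all(reader)
    reader.close()
    print(len(received))
```

The nested try/finally is a design choice in this sketch: the socket is still closed even if flush() itself raises.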

I've prepared minimal backport branches for review:
- branch-4.1: https://github.com/anblanco/spark/tree/fix/SPARK-53759-simple-worker-flush
- (Can prepare branch-4.0 and branch-3.5 variants if there's interest)

I put together a reproducer with a test matrix and full root cause analysis here: https://github.com/anblanco/spark53759-reproducer

The bug has been open since September 2025 and affects all Windows users on Python 3.12+, which is now the default Python on most systems. I think the impact warrants backporting, especially given how small the fix is.

Note that branch-3.5 LTS ends April 12; if a backport is appropriate there, it would need to happen soon.

Happy to prepare the backport PRs if maintainers agree this is worth doing.

Thank you for your time,
Antonio Blanco

[1] https://github.com/python/cpython/issues/97922
[2] https://issues.apache.org/jira/browse/SPARK-53759
[3] https://github.com/apache/spark/pull/54458

--
Antonio
<witty signature />
