> How about as a compromise, we publish (but don’t lock to) the pip freeze
> outputs of the venvs we use for testing?
> Where do you propose to publish? Spark website? Maybe in our github repo
> somewhere?
> I was thinking just in the publisher artifacts directory we already do.

+1, I'm fine with any approach, as long as it provides sufficient info to let
users know exactly which versions of the dependencies were used for testing.
For Java/Scala, we have a script[1]-generated dependency list in the code
repo, at [2].

[1] https://github.com/apache/spark/blob/branch-4.1/dev/test-dependencies.sh
[2] https://github.com/apache/spark/blob/branch-4.1/dev/deps/spark-deps-hadoop-3-hive-2.3

Thanks,
Cheng Pan

> On Mar 31, 2026, at 03:12, Holden Karau <[email protected]> wrote:
>
> I was thinking just in the publisher artifacts directory we already do.
>
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>
> On Mon, Mar 30, 2026 at 10:26 AM Tian Gao <[email protected]> wrote:
>> Where do you propose to publish? Spark website? Maybe in our github repo
>> somewhere? For python packages, users rarely look for artifacts (and they
>> are difficult to find).
>>
>> Tian
>>
>> On Mon, Mar 30, 2026 at 10:04 AM Holden Karau <[email protected]> wrote:
>>> I hear that. How about as a compromise, we publish (but don’t lock to)
>>> the pip freeze outputs of the venvs we use for testing?
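For concreteness, the published artifact would presumably just be the raw
`pip freeze` output of the release-testing venv, something like the following
(package set and versions here are purely illustrative, not the real tested
set):

```
grpcio==1.62.0
numpy==2.1.3
pandas==2.2.3
pyarrow==17.0.0
```

Users who want the tested set could then pass it to pip as a constraints
file, e.g. `pip install pyspark -c spark-deps-pyspark.txt` (the filename is
hypothetical). pip's `-c/--constraint` option caps versions without adding
new requirements, so users who don't opt in can simply ignore the file.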
>>>
>>> On Mon, Mar 30, 2026 at 8:04 AM Nicholas Chammas
>>> <[email protected]> wrote:
>>>> I think supply chain attacks are a problem, but I don’t think we want to
>>>> be on the hook for a solution here, even if it’s meant just for our
>>>> project.
>>>>
>>>> There are “good enough” approaches available today for Python that
>>>> mitigate most of the risk by excluding recent releases when resolving
>>>> which package versions to install.
>>>>
>>>> uv offers exclude-newer
>>>> <https://docs.astral.sh/uv/reference/settings/#exclude-newer>. pip offers
>>>> uploaded-prior-to
>>>> <https://pip.pypa.io/en/stable/cli/pip_index/#cmdoption-uploaded-prior-to>.
>>>> Poetry has an issue open
>>>> <https://github.com/python-poetry/poetry/issues/10646> for a similar
>>>> feature, plus at least one open PR to close it.
>>>>
>>>> Users concerned about supply chain attacks would probably get better
>>>> results from using these options than from installing pinned
>>>> dependencies provided by the projects they use.
>>>>
>>>> Nick
>>>>
>>>>> On Mar 30, 2026, at 3:31 AM, Holden Karau <[email protected]> wrote:
>>>>>
>>>>> So I think we can ship it as an optional distribution element (it's
>>>>> literally just another file folks can choose to download/use if they
>>>>> want).
>>>>>
>>>>> Asking users is an idea too, I could put together a survey if we want?
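For reference, the cutoff-date mitigation Nick describes is a one-line
setting. A minimal sketch for uv follows (the date is illustrative, not a
recommendation):

```toml
# pyproject.toml -- tell uv to ignore any distribution uploaded after this date
[tool.uv]
exclude-newer = "2026-03-01T00:00:00Z"
```

pip exposes the same idea as a command-line flag, `--uploaded-prior-to`, per
the docs linked above.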
>>>>>
>>>>> On Sun, Mar 29, 2026 at 11:14 PM Tian Gao via dev
>>>>> <[email protected]> wrote:
>>>>>> I believe "foo~=2.0.1" is syntax sugar for "foo>=2.0.1, foo==2.0.*".
>>>>>> Similarly, "foo>=2.0.0, <3.0.0" is "foo~=2.0". This is a nit and we
>>>>>> don't need to focus on the syntax.
>>>>>>
>>>>>> I don't believe we can ship pyspark with an env lock file. That's what
>>>>>> users do in their own projects; it's not part of the Python packaging
>>>>>> system. What users normally do is install packages, test things out,
>>>>>> then lock with either pip or uv - generate a lock file for all
>>>>>> dependencies and use it across their systems. It's not common for
>>>>>> packages to publish a "known working dependency list" for users.
>>>>>>
>>>>>> However, if we really want to try it out, we can do something like
>>>>>> `pip install pyspark[full-pinned]` and install every dependency
>>>>>> pyspark requires at a pinned version. If our users need an
>>>>>> out-of-the-box solution they can do that. We can also collect feedback
>>>>>> and gauge user sentiment.
>>>>>>
>>>>>> Tian
>>>>>>
>>>>>> On Sun, Mar 29, 2026 at 10:29 PM Cheng Pan <[email protected]> wrote:
>>>>>>> > If we consider PySpark the dominant package - meaning that if a
>>>>>>> > user employs it, it must be the most important element in their
>>>>>>> > project and everything else must comply with it - pinning versions
>>>>>>> > might be viable.
>>>>>>>
>>>>>>> This is not always true, but it's definitely a major case.
>>>>>>>
>>>>>>> > I'm not familiar with Java dependency solutions or how users use
>>>>>>> > spark with Java
>>>>>>>
>>>>>>> In Java/Scala, it's rare to use dynamic versions for dependency
>>>>>>> management. Products declare transitive dependencies with pinned
>>>>>>> versions, and the package manager (Maven, SBT, Gradle, etc.) picks
>>>>>>> the most reasonable version based on resolution rules.
>>>>>>> The rules are a
>>>>>>> little different across Maven, SBT, and Gradle; the Maven docs[1]
>>>>>>> explain how it works.
>>>>>>>
>>>>>>> In short, in Java/Scala dependency management, the pinned version is
>>>>>>> more like a suggested version; it's easy for users to override.
>>>>>>>
>>>>>>> As Owen pointed out, things are completely different in the Python
>>>>>>> world. Both pinned versions and latest versions seem less than ideal.
>>>>>>> Given the options
>>>>>>>
>>>>>>> 1. pinned version (foo==2.0.0)
>>>>>>> 2. allow maintenance releases (foo~=2.0.0)
>>>>>>> 3. allow minor feature releases (foo>=2.0.0,<3.0.0)
>>>>>>> 4. latest version (foo>=2.0.0, or foo)
>>>>>>>
>>>>>>> it seems 2 or 3 might be an acceptable solution? And I still believe
>>>>>>> we should add a disclaimer that this compatibility only holds under
>>>>>>> the assumption that 3rd-party packages strictly adhere to semantic
>>>>>>> versioning.
>>>>>>>
>>>>>>> > You can totally produce a sort of 'lock' file -- uv.lock,
>>>>>>> > requirements.txt -- expressing a known good / recommended specific
>>>>>>> > resolved environment. That is _not_ what Python dependency
>>>>>>> > constraints are for. It's what env lock files are for.
>>>>>>>
>>>>>>> We definitely need such a dependency list in the PySpark release.
>>>>>>> It's really important for users to be able to set up a reproducible
>>>>>>> environment several years after the release, and it's also a good
>>>>>>> reference for users who hit 3rd-party package bugs, or who battle
>>>>>>> dependency conflicts when they install lots of packages in a single
>>>>>>> environment.
>>>>>>>
>>>>>>> [1] https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Cheng Pan
>>>>>>>
>>>>>>>> On Mar 30, 2026, at 11:13, Sean Owen <[email protected]> wrote:
>>>>>>>>
>>>>>>>> TL;DR Tian is more correct, and pinning versions with == does not
>>>>>>>> achieve the desired outcome.
>>>>>>>> There are other ways to do it; I can't think of
>>>>>>>> any other Python package that works that way. This thread is
>>>>>>>> conflating different things.
>>>>>>>>
>>>>>>>> While expressing dependence on "foo>=2.0.0" can indeed be an
>>>>>>>> overly broad claim -- do you really think it works with 5.x in 10
>>>>>>>> years? -- expressing "foo==2.0.0" is very likely overly narrow. That
>>>>>>>> says "does not work with any other version at all", which is likely
>>>>>>>> more incorrect and more problematic for users.
>>>>>>>>
>>>>>>>> You can totally produce a sort of 'lock' file -- uv.lock,
>>>>>>>> requirements.txt -- expressing a known good / recommended specific
>>>>>>>> resolved environment. That is _not_ what Python dependency
>>>>>>>> constraints are for. It's what env lock files are for.
>>>>>>>>
>>>>>>>> To be sure, there is an art to figuring out the right dependency
>>>>>>>> bounds. A reasonable compromise is to allow maintenance releases as
>>>>>>>> a default when nothing more specific is known. That is, write
>>>>>>>> "foo~=2.0.2" to mean ">=2.0.2 and <2.1".
>>>>>>>>
>>>>>>>> The analogy to Scala/Java/Maven land does not quite work, partly
>>>>>>>> because Maven resolution is just pretty different, but mostly
>>>>>>>> because the core Spark distribution is the 'server side' and is
>>>>>>>> necessarily a 'fat jar' -- a sort of statically-compiled artifact
>>>>>>>> that simply has specific versions baked in and can never see
>>>>>>>> different versions due to runtime resolution differences.
>>>>>>>>
>>>>>>>> On Sun, Mar 29, 2026 at 10:02 PM Tian Gao via dev
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> I agree that a product must be usable first. Pinning the version
>>>>>>>>> (to a specific number with `==`) will make pyspark unusable.
>>>>>>>>>
>>>>>>>>> First of all, I think we can agree that many users use PySpark
>>>>>>>>> with other Python packages.
>>>>>>>>> If we conflict with other packages,
>>>>>>>>> `pip install -r requirements.txt` won't work. It will complain that
>>>>>>>>> the dependencies can't be resolved, which completely breaks our
>>>>>>>>> users' workflow. Even if the user locks the dependency versions, it
>>>>>>>>> won't work. So the user has to install PySpark first, then the
>>>>>>>>> other packages, to override PySpark's dependencies. They can't put
>>>>>>>>> their dependency list in a single file - that is a horrible user
>>>>>>>>> experience.
>>>>>>>>>
>>>>>>>>> When I look at controversial topics, I always hold a strong belief
>>>>>>>>> that I can't be the only smart person in the world. If an idea is
>>>>>>>>> good, others must already be doing it. Can we find any recognized
>>>>>>>>> package in the market that pins its dependencies to specific
>>>>>>>>> versions? The only case where it works is when this package is
>>>>>>>>> *all* the user needs. That's why we pin versions for docker images,
>>>>>>>>> HTTP services, or standalone tools - users just need something that
>>>>>>>>> works out of the box. If we consider PySpark the dominant package -
>>>>>>>>> meaning that if a user employs it, it must be the most important
>>>>>>>>> element in their project and everything else must comply with it -
>>>>>>>>> pinning versions might be viable.
>>>>>>>>>
>>>>>>>>> I'm not familiar with Java dependency solutions or how users use
>>>>>>>>> spark with Java, but I'm familiar with the Python ecosystem and
>>>>>>>>> community. If we pin to specific versions, we will face significant
>>>>>>>>> criticism. If we must do it, at least don't make it the default.
>>>>>>>>> Like I said above, I don't have a strong opinion about having a
>>>>>>>>> `pyspark[pinned]` - if users only need pyspark and no other
>>>>>>>>> packages they could use that. But that's extra maintenance effort,
>>>>>>>>> and we need to think about what gets pinned. We have a lot of
>>>>>>>>> pyspark install options.
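As an aside, the semantics of the `~=` specifier debated earlier in the
thread can be sanity-checked with a toy model. The sketch below is pure
stdlib Python; it is not the real PEP 440 / `packaging` implementation and
ignores pre-releases, epochs, and other edge cases:

```python
# Toy model of PEP 440 "compatible release" semantics (the ~= operator),
# just to illustrate why ~= sits between == pinning and a bare >= bound.
# Not a substitute for the `packaging` library.

def parse(version: str) -> tuple:
    """Split a simple dotted version like '2.0.1' into an int tuple."""
    return tuple(int(part) for part in version.split("."))

def compatible(spec: str, candidate: str) -> bool:
    """foo~=X.Y.Z accepts candidate iff candidate >= X.Y.Z and candidate == X.Y.*"""
    base, cand = parse(spec), parse(candidate)
    return cand >= base and cand[: len(base) - 1] == base[:-1]

# foo~=2.0.1 accepts patch releases only:
assert compatible("2.0.1", "2.0.5")       # maintenance release: accepted
assert not compatible("2.0.1", "2.1.0")   # minor bump: rejected
assert not compatible("2.0.1", "3.0.0")   # major bump: rejected

# foo~=2.0 (one fewer component) widens to the whole 2.x series:
assert compatible("2.0", "2.7.3")
assert not compatible("2.0", "3.0.0")
```

This matches Tian's reading: `foo~=2.0.1` is sugar for `foo>=2.0.1,
foo==2.0.*`, while `foo~=2.0` is sugar for `foo>=2.0, <3.0` (option 3 in
Cheng Pan's list).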
>>>>>>>>>
>>>>>>>>> Tian Gao
>>>>>>>>>
>>>>>>>>> On Sun, Mar 29, 2026 at 7:12 PM Cheng Pan <[email protected]> wrote:
>>>>>>>>>> I think the community has already reached consensus on freezing
>>>>>>>>>> dependencies in minor releases.
>>>>>>>>>>
>>>>>>>>>> SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence [1]
>>>>>>>>>>
>>>>>>>>>> > Clear rules for changes allowed in minor vs. major releases:
>>>>>>>>>> > - Dependencies are frozen and behavioral changes are minimized
>>>>>>>>>> >   in minor releases.
>>>>>>>>>>
>>>>>>>>>> I would interpret the proposed dependency policy as applying to
>>>>>>>>>> both Java/Scala and Python dependency management for Spark. If so,
>>>>>>>>>> that means PySpark will always use pinned dependency versions from
>>>>>>>>>> 4.3.0 on. But if the intention is to only apply such a dependency
>>>>>>>>>> policy to Java/Scala, then it creates a very strange situation -
>>>>>>>>>> an extremely conservative dependency management strategy for
>>>>>>>>>> Java/Scala, and an extremely liberal one for Python.
>>>>>>>>>>
>>>>>>>>>> To Tian Gao,
>>>>>>>>>>
>>>>>>>>>> > Pinning versions is a double-edged sword, it doesn't always make
>>>>>>>>>> > us more secure - that's my major point.
>>>>>>>>>>
>>>>>>>>>> A product must be usable first, then come security, performance,
>>>>>>>>>> etc. If it claims to require `foo>=2.0.0`, how do you ensure it is
>>>>>>>>>> compatible with foo `2.3.4`, `3.x.x`, `4.x.x`? Such
>>>>>>>>>> incompatibility failures have actually occurred many times,
>>>>>>>>>> e.g. [2]. On the contrary, if it claims to require `foo==2.0.0`,
>>>>>>>>>> that means it was thoroughly tested with `foo==2.0.0`, and users
>>>>>>>>>> take on their own risk when using it with other `foo` versions.
>>>>>>>>>> For example, if `foo` strictly follows semantic versioning, it
>>>>>>>>>> should work with `foo<3.0.0`, but this is not Spark's
>>>>>>>>>> responsibility; users should assess and assume the risk of
>>>>>>>>>> incompatibility themselves.
>>>>>>>>>>
>>>>>>>>>> [1] https://issues.apache.org/jira/browse/SPARK-54633
>>>>>>>>>> [2] https://github.com/apache/spark/pull/52633
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Cheng Pan
>>>>>>>>>>
>>>>>>>>>>> On Mar 28, 2026, at 06:59, Holden Karau <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Response inline
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Mar 27, 2026, at 12:31 PM, Holden Karau
>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> One possibility would be to make the pinned versions optional
>>>>>>>>>>>>> (e.g. pyspark[pinned]) or publish a separate constraints file
>>>>>>>>>>>>> for people to optionally use with -c?
>>>>>>>>>>>>
>>>>>>>>>>>> Perhaps I am misunderstanding your proposal, Holden, but this is
>>>>>>>>>>>> possible today for people using modern Python packaging
>>>>>>>>>>>> workflows that use lock files. In fact, it happens
>>>>>>>>>>>> automatically; all transitive dependencies are pinned in the
>>>>>>>>>>>> lock file, and this is by design.
>>>>>>>>>>>
>>>>>>>>>>> So for someone installing a fresh venv with uv/pip/conda, where
>>>>>>>>>>> does this come from?
>>>>>>>>>>>
>>>>>>>>>>> The idea here is that we provide the versions we used during the
>>>>>>>>>>> release stage, so if folks want a “known safe” initial starting
>>>>>>>>>>> point for a new env they’ve got one.
>>>>>>>>>>>>
>>>>>>>>>>>> Furthermore, it is straightforward to add additional
>>>>>>>>>>>> restrictions to your project spec (i.e. pyproject.toml) so that
>>>>>>>>>>>> when the packaging tool builds the lock file, it does so with
>>>>>>>>>>>> whatever restrictions you want that are specific to your
>>>>>>>>>>>> project. That could include specific versions or version ranges
>>>>>>>>>>>> of libraries to exclude, for example.
>>>>>>>>>>>
>>>>>>>>>>> Yes, but as it stands we leave it to the end user to pick these
>>>>>>>>>>> versions from scratch. We can make their lives simpler by
>>>>>>>>>>> providing the versions we tested against in a lock file they can
>>>>>>>>>>> choose to use, ignore, or update to their desired versions.
>>>>>>>>>>>
>>>>>>>>>>> Also, for interactive workloads I more often see a bare
>>>>>>>>>>> requirements file or even pip installs in notebook cells (but
>>>>>>>>>>> this could be sample bias).
>>>>>>>>>>>>
>>>>>>>>>>>> I had to do this, for example, on a personal project that used
>>>>>>>>>>>> PySpark Connect but which was pulling in a version of grpc that
>>>>>>>>>>>> was generating a lot of log noise
>>>>>>>>>>>> <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>.
>>>>>>>>>>>> I pinned the version of grpc in my project file and let the
>>>>>>>>>>>> packaging tool resolve all the requirements across PySpark
>>>>>>>>>>>> Connect and my custom restrictions.
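The override pattern Nick describes above - pinning one noisy transitive
dependency and letting the resolver handle everything else - is a single
extra line in the project spec. A hypothetical sketch (project name and
version numbers are invented for illustration):

```toml
[project]
name = "my-spark-app"
dependencies = [
    "pyspark[connect]>=4.0",
    "grpcio==1.62.0",  # project-local pin to work around the log-noise issue
]
```

When the lock file is built from this spec, the pin constrains the resolution
without PySpark itself having to ship pinned dependencies.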
>>>>>>>>>>>>
>>>>>>>>>>>> Nick
