I hear that. How about, as a compromise, we publish (but don’t lock to) the pip freeze outputs of the venvs we use for testing?
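[Editor's note: a minimal sketch of what this opt-in workflow could look like, assuming the frozen output is published as a plain pip constraints file; the file name is illustrative, not a decided artifact.]

```shell
# Release side: capture the exact versions in the venv the release was tested with.
pip freeze > pyspark-tested-constraints.txt

# User side (opt-in): a constraints file caps versions but does not add packages,
# so this installs pyspark's declared dependencies at the tested versions.
pip install pyspark -c pyspark-tested-constraints.txt

# Users who don't opt in are unaffected: plain `pip install pyspark`
# resolves as it does today.
```

Because `-c` only constrains packages that would be installed anyway, this is "publish but don't lock": the tested versions are a starting point, not a hard pin.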
Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her

On Mon, Mar 30, 2026 at 8:04 AM Nicholas Chammas <[email protected]> wrote:

> I think supply chain attacks are a problem, but I don’t think we want to
> be on the hook for a solution here, even if it’s meant just for our project.
>
> There are “good enough” approaches available today for Python that
> mitigate most of the risk by excluding recent releases when resolving which
> package versions to install.
>
> uv offers exclude-newer
> <https://docs.astral.sh/uv/reference/settings/#exclude-newer>. pip offers
> uploaded-prior-to
> <https://pip.pypa.io/en/stable/cli/pip_index/#cmdoption-uploaded-prior-to>.
> Poetry has an open issue
> <https://github.com/python-poetry/poetry/issues/10646> for a similar
> feature, plus at least one open PR to close it.
>
> Users concerned about supply chain attacks would probably get better
> results from using these options than from installing pinned
> dependencies provided by the projects they use.
>
> Nick
>
>
> On Mar 30, 2026, at 3:31 AM, Holden Karau <[email protected]> wrote:
>
> So I think we can ship it as an optional distribution element (it's
> literally just another file folks can choose to download/use if they want).
>
> Asking users is an idea too; I could put together a survey if we want?
>
> On Sun, Mar 29, 2026 at 11:14 PM Tian Gao via dev <[email protected]>
> wrote:
>
>> I believe "foo~=2.0.1" is syntactic sugar for "foo>=2.0.1, foo==2.0.*".
>> Similarly, "foo>=2.0.0, <3.0.0" is "foo~=2.0". This is a nit and we don't
>> need to focus on the syntax.
>>
>> I don't believe we can ship pyspark with an env lock file. That's what
>> users do in their own projects.
>> It's not part of the Python packaging system.
>> What users normally do is install packages, test things out, then lock with
>> either pip or uv - generating a lock file for all dependencies to use
>> across their systems. It's not common for packages to publish a "known
>> working dependency list" for users.
>>
>> However, if we really want to try it out, we can do something like `pip
>> install pyspark[full-pinned]` and install every dependency pyspark requires
>> at a pinned version. If our users need an out-of-the-box solution they can
>> use that. We can also collect feedback and gauge the sentiment from users.
>>
>> Tian
>>
>> On Sun, Mar 29, 2026 at 10:29 PM Cheng Pan <[email protected]> wrote:
>>
>>> > If we consider PySpark the dominant package - meaning that if a user
>>> employs it, it must be the most important element in their project and
>>> everything else must comply with it - pinning versions might be viable.
>>>
>>> This is not always true, but it is definitely a major case.
>>>
>>> > I'm not familiar with Java dependency solutions or how users use spark
>>> with Java
>>>
>>> In Java/Scala, it's rare to use dynamic versions for dependency
>>> management. Projects declare transitive dependencies with pinned versions,
>>> and the package manager (Maven, SBT, Gradle, etc.) picks the most
>>> reasonable version based on resolution rules. The rules differ a little
>>> between Maven, SBT, and Gradle; the Maven docs[1] explain how it works.
>>>
>>> In short, in Java/Scala dependency management, a pinned version is
>>> more like a suggested version; it's easy for users to override.
>>>
>>> As Owen pointed out, things are completely different in the Python
>>> world, and neither pinned versions nor latest versions seem ideal. Of
>>>
>>> 1. pinned version (foo==2.0.0)
>>> 2. allow maintenance releases (foo~=2.0.0)
>>> 3. allow minor feature releases (foo>=2.0.0,<3.0.0)
>>> 4. latest version (foo>=2.0.0, or foo)
>>>
>>> it seems 2 or 3 might be an acceptable solution?
>>> And, I still believe we
>>> should add a disclaimer that this compatibility only holds under the
>>> assumption that 3rd-party packages strictly adhere to semantic versioning.
>>>
>>> > You can totally produce a sort of 'lock' file -- uv.lock,
>>> requirements.txt -- expressing a known good / recommended specific resolved
>>> environment. That is _not_ what Python dependency constraints are for. It's
>>> what env lock files are for.
>>>
>>> We definitely need such a dependency list in the PySpark release. It's
>>> really important for users to be able to set up a reproducible environment
>>> years after the release, and it is also a good reference for users who
>>> encounter 3rd-party package bugs, or battle with dependency conflicts when
>>> they install lots of packages in a single environment.
>>>
>>> [1]
>>> https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>>
>>> On Mar 30, 2026, at 11:13, Sean Owen <[email protected]> wrote:
>>>
>>> TL;DR Tian is more correct, and pinning versions with == does not
>>> achieve the desired outcome. There are other ways to do it; I can't think
>>> of any other Python package that works that way. This thread is conflating
>>> different things.
>>>
>>> While expressing a dependency as "foo>=2.0.0" can indeed be an
>>> overly broad claim -- do you really think it works with 5.x in 10 years? --
>>> expressing "foo==2.0.0" is very likely overly narrow. That says "does not
>>> work with any other version at all", which is likely more incorrect and
>>> more problematic for users.
>>>
>>> You can totally produce a sort of 'lock' file -- uv.lock,
>>> requirements.txt -- expressing a known good / recommended specific resolved
>>> environment. That is _not_ what Python dependency constraints are for. It's
>>> what env lock files are for.
>>>
>>> To be sure, there is an art to figuring out the right dependency bounds.
>>> A reasonable compromise is to allow maintenance releases as a default
>>> when nothing more specific is known. That is, write "foo~=2.0.2" to mean
>>> ">=2.0.2 and <2.1".
>>>
>>> The analogy to Scala/Java/Maven land does not quite work, partly because
>>> Maven resolution is just pretty different, but mostly because the core
>>> Spark distribution is the 'server side' and is necessarily a 'fat jar', a
>>> sort of statically-compiled artifact that simply has some specific versions
>>> in it and can never have different versions because of runtime resolution
>>> differences.
>>>
>>>
>>> On Sun, Mar 29, 2026 at 10:02 PM Tian Gao via dev <[email protected]>
>>> wrote:
>>>
>>>> I agree that a product must be usable first. Pinning versions (to a
>>>> specific number with `==`) would make pyspark unusable.
>>>>
>>>> First of all, I think we can agree that many users use PySpark with
>>>> other Python packages. If we conflict with other packages, `pip install -r
>>>> requirements.txt` won't work. It will complain that the dependencies can't
>>>> be resolved, which completely breaks our users' workflow. Even if the user
>>>> locks the dependency versions, it won't work. So the user has to install
>>>> PySpark first, then the other packages, to override PySpark's dependencies.
>>>> They can't put their dependency list in a single file - that is a horrible
>>>> user experience.
>>>>
>>>> When I look at controversial topics, I always hold a strong belief
>>>> that I can't be the only smart person in the world. If an idea is good,
>>>> others must already be doing it. Can we find any recognized package in the
>>>> market that pins its dependencies to specific versions? The only case
>>>> where it works is when this package is *all* the user needs. That's why we
>>>> pin versions for docker images, HTTP services, or standalone tools - users
>>>> just need something that works out of the box.
>>>> If we consider PySpark the
>>>> dominant package - meaning that if a user employs it, it must be the most
>>>> important element in their project and everything else must comply with it
>>>> - pinning versions might be viable.
>>>>
>>>> I'm not familiar with Java dependency solutions or how users use spark
>>>> with Java, but I'm familiar with the Python ecosystem and community. If we
>>>> pin to specific versions, we will face significant criticism. If we must
>>>> do it, at least don't make it the default. Like I said above, I don't have
>>>> a strong opinion about having a `pyspark[pinned]` - if users only need
>>>> pyspark and no other packages, they could use that. But that's extra
>>>> effort for maintenance, and we need to think about what's pinned. We have
>>>> a lot of pyspark install options.
>>>>
>>>> Tian Gao
>>>>
>>>> On Sun, Mar 29, 2026 at 7:12 PM Cheng Pan <[email protected]> wrote:
>>>>
>>>>> I think the community has already reached consensus on freezing
>>>>> dependencies in minor releases.
>>>>>
>>>>> SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence [1]
>>>>>
>>>>> > Clear rules for changes allowed in minor vs. major releases:
>>>>> > - Dependencies are frozen and behavioral changes are minimized in
>>>>> minor releases.
>>>>>
>>>>> I would interpret the proposed dependency policy as applying to both
>>>>> Java/Scala and Python dependency management for Spark. If so, that means
>>>>> PySpark will always use pinned dependency versions since 4.3.0. But if
>>>>> the intention is to only apply such a dependency policy to Java/Scala,
>>>>> then it creates a very strange situation - an extremely conservative
>>>>> dependency management strategy for Java/Scala, and an extremely liberal
>>>>> one for Python.
>>>>>
>>>>> To Tian Gao,
>>>>>
>>>>> > Pinning versions is a double-edged sword, it doesn't always make us
>>>>> more secure - that's my major point.
>>>>>
>>>>> A product must be usable first, then secure, performant, etc.
>>>>> If it
>>>>> claims to require `foo>=2.0.0`, how do you ensure it is compatible with
>>>>> foo `2.3.4`, `3.x.x`, `4.x.x`? Actually, such incompatibility failures
>>>>> have occurred many times, e.g. [2]. On the contrary, if it claims to
>>>>> require `foo==2.0.0`, that means it was thoroughly tested with
>>>>> `foo==2.0.0`, and users take their own risk using it with other `foo`
>>>>> versions. For example, if `foo` strictly follows semantic versioning, it
>>>>> should work with `foo<3.0.0`, but this is not Spark's responsibility;
>>>>> users should assess and assume the risk of incompatibility themselves.
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/SPARK-54633
>>>>> [2] https://github.com/apache/spark/pull/52633
>>>>>
>>>>> Thanks,
>>>>> Cheng Pan
>>>>>
>>>>>
>>>>>
>>>>> On Mar 28, 2026, at 06:59, Holden Karau <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Response inline
>>>>>
>>>>>
>>>>> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas <
>>>>> [email protected]> wrote:
>>>>>
>>>>>>
>>>>>> On Mar 27, 2026, at 12:31 PM, Holden Karau <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> One possibility would be to make the pinned version optional (e.g.
>>>>>> pyspark[pinned]) or publish a separate constraints file for people to
>>>>>> optionally use with -c?
>>>>>>
>>>>>>
>>>>>> Perhaps I am misunderstanding your proposal, Holden, but this is
>>>>>> possible today for people using modern Python packaging workflows that
>>>>>> use lock files. In fact, it happens automatically; all transitive
>>>>>> dependencies are pinned in the lock file, and this is by design.
>>>>>>
>>>>> So for someone installing a fresh venv with uv/pip/conda, where does
>>>>> this come from?
>>>>>
>>>>> The idea here is that we provide the versions we used during the
>>>>> release stage, so if folks want a “known safe” initial starting point for
>>>>> a new env they’ve got one.
>>>>>
>>>>>>
>>>>>> Furthermore, it is straightforward to add additional restrictions to
>>>>>> your project spec (i.e. pyproject.toml) so that when the packaging tool
>>>>>> builds the lock file, it does so with whatever restrictions you want
>>>>>> that are specific to your project. That could include specific versions
>>>>>> or version ranges of libraries to exclude, for example.
>>>>>>
>>>>> Yes, but as it stands we leave it to the end user to start from
>>>>> scratch picking these versions. We can make their lives simpler by
>>>>> providing the versions we tested against in a lock file they can choose
>>>>> to use, ignore, or update to their desired versions.
>>>>>
>>>>> Also, for interactive workloads I more often see a bare requirements
>>>>> file or even pip installs in notebook cells (but this could be sample
>>>>> bias).
>>>>>
>>>>>>
>>>>>> I had to do this, for example, on a personal project that used
>>>>>> PySpark Connect but which was pulling in a version of grpc that was
>>>>>> generating a lot of log noise
>>>>>> <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>.
>>>>>> I pinned the version of grpc in my project file and let the packaging
>>>>>> tool resolve all the requirements across PySpark Connect and my custom
>>>>>> restrictions.
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>>
>>>>>
>>>
>
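[Editor's note: the compatible-release (`~=`) semantics debated in the thread can be sanity-checked in a few lines. This is a stdlib-only sketch of PEP 440's rule; real resolvers use the `packaging` library, the helper names `vtuple`/`compatible` are hypothetical, and pre-releases/epochs are ignored.]

```python
# Sketch of PEP 440 "compatible release": ~=X.Y.Z means >=X.Y.Z and ==X.Y.*
# (the last release segment may float upward).

def vtuple(v):
    """Parse a simple dotted version string into a comparable tuple."""
    return tuple(int(p) for p in v.split("."))

def compatible(spec, candidate):
    """True if `candidate` satisfies `~=spec` (e.g. spec='2.0.1')."""
    s, c = vtuple(spec), vtuple(candidate)
    # Floor at the spec version, and require all but the last segment to match.
    return c >= s and c[: len(s) - 1] == s[: len(s) - 1]

assert compatible("2.0.1", "2.0.5")      # patch upgrade: allowed
assert not compatible("2.0.1", "2.1.0")  # minor bump: excluded
assert not compatible("2.0.1", "2.0.0")  # below the floor: excluded
# Two-component form: ~=2.0 means >=2.0 and ==2.*
assert compatible("2.0", "2.9")
assert not compatible("2.0", "3.0")
```

This also illustrates the maintenance-release compromise discussed above: `foo~=2.0.2` floors at 2.0.2 and then admits only 2.0.x releases.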
