I was thinking just the publisher artifacts directory we already use.

Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her
On Mon, Mar 30, 2026 at 10:26 AM Tian Gao <[email protected]> wrote:

> Where do you propose to publish? Spark website? Maybe in our github repo
> somewhere? For python packages, users rarely look for artifacts (and
> they're difficult to find).
>
> Tian
>
> On Mon, Mar 30, 2026 at 10:04 AM Holden Karau <[email protected]> wrote:
>
>> I hear that. How about as a compromise, we publish (but don’t lock to)
>> the pip freeze outputs of the venvs we use for testing?
>>
>> Twitter: https://twitter.com/holdenkarau
>> Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email>
>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> Pronouns: she/her
>>
>>
>> On Mon, Mar 30, 2026 at 8:04 AM Nicholas Chammas <[email protected]> wrote:
>>
>>> I think supply chain attacks are a problem, but I don’t think we want to
>>> be on the hook for a solution here, even if it’s meant just for our project.
>>>
>>> There are “good enough” approaches available today for Python that
>>> mitigate most of the risk by excluding recent releases when resolving what
>>> package versions to install.
>>>
>>> uv offers exclude-newer <https://docs.astral.sh/uv/reference/settings/#exclude-newer>.
>>> pip offers uploaded-prior-to <https://pip.pypa.io/en/stable/cli/pip_index/#cmdoption-uploaded-prior-to>.
>>> Poetry has an issue open <https://github.com/python-poetry/poetry/issues/10646>
>>> for a similar feature, plus at least one open PR to close it.
>>>
>>> Users concerned about supply chain attacks would probably get better
>>> results from using these options than from installing pinned
>>> dependencies provided by the projects they use.
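[For readers who want to try the uv option mentioned above, the date cutoff can also be set project-wide in pyproject.toml. A minimal sketch; the timestamp is illustrative, not a recommendation:]

```toml
[tool.uv]
# Resolve dependencies as if nothing published after this instant exists,
# which excludes very recent (potentially compromised) releases.
exclude-newer = "2026-03-01T00:00:00Z"
```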
>>>
>>> Nick
>>>
>>>
>>> On Mar 30, 2026, at 3:31 AM, Holden Karau <[email protected]> wrote:
>>>
>>> So I think we can ship it as an optional distribution element (it's
>>> literally just another file folks can choose to download/use if they want).
>>>
>>> Asking users is an idea too; I could put together a survey if we want?
>>>
>>> On Sun, Mar 29, 2026 at 11:14 PM Tian Gao via dev <[email protected]> wrote:
>>>
>>>> I believe "foo~=2.0.1" is syntax sugar for "foo>=2.0.1, foo==2.0.*".
>>>> Similarly, "foo>=2.0.0, <3.0.0" is "foo~=2.0". This is a nit and we don't
>>>> need to focus on the syntax.
>>>>
>>>> I don't believe we can ship pyspark with an env lock file. That's what
>>>> users do in their own projects; it's not part of the Python packaging
>>>> system. What users normally do is install packages, test them out, then
>>>> lock the versions with either pip or uv - generate a lock file for all
>>>> dependencies and use it across their systems. It's not common for packages
>>>> to publish a "known working dependency list" for users.
>>>>
>>>> However, if we really want to try it out, we can do something like `pip
>>>> install pyspark[full-pinned]` and install every dependency pyspark requires
>>>> with a pinned version. If our users need an out-of-the-box solution they
>>>> can do that. We can also collect feedback and see the sentiment from users.
>>>>
>>>> Tian
>>>>
>>>> On Sun, Mar 29, 2026 at 10:29 PM Cheng Pan <[email protected]> wrote:
>>>>
>>>>> > If we consider PySpark the dominant package - meaning that if a user
>>>>> employs it, it must be the most important element in their project and
>>>>> everything else must comply with it - pinning versions might be viable.
>>>>>
>>>>> This is not always true, but it is definitely a major case.
>>>>>
>>>>> > I'm not familiar with Java dependency solutions or how users use
>>>>> spark with Java
>>>>>
>>>>> In Java/Scala, it's rare to use dynamic versions for dependency
>>>>> management.
>>>>> Projects declare transitive dependencies with pinned versions,
>>>>> and the package manager (Maven, SBT, Gradle, etc.) picks the most
>>>>> reasonable version based on resolution rules. The rules are a little
>>>>> different in Maven, SBT, and Gradle; the Maven docs[1] explain how it
>>>>> works.
>>>>>
>>>>> In short, in Java/Scala dependency management, the pinned version is
>>>>> more like a suggested version; it's easy for users to override.
>>>>>
>>>>> As Owen pointed out, things are completely different in the Python
>>>>> world, where both pinned versions and latest versions seem not ideal.
>>>>> Then, of
>>>>>
>>>>> 1. pinned version (foo==2.0.0)
>>>>> 2. allow maintenance releases (foo~=2.0.0)
>>>>> 3. allow minor feature releases (foo>=2.0.0,<3.0.0)
>>>>> 4. latest version (foo>=2.0.0, or foo)
>>>>>
>>>>> it seems 2 or 3 might be an acceptable solution? And I still believe we
>>>>> should add a disclaimer that this compatibility only holds under the
>>>>> assumption that 3rd-party packages strictly adhere to semantic versioning.
>>>>>
>>>>> > You can totally produce a sort of 'lock' file -- uv.lock,
>>>>> requirements.txt -- expressing a known good / recommended specific
>>>>> resolved environment. That is _not_ what Python dependency constraints
>>>>> are for. It's what env lock files are for.
>>>>>
>>>>> We definitely need such a dependency list in the PySpark release. It's
>>>>> really important for users to be able to set up a reproducible environment
>>>>> several years after the release, and it is also a good reference for users
>>>>> who encounter 3rd-party package bugs, or who battle dependency conflicts
>>>>> when they install lots of packages in a single environment.
>>>>>
>>>>> [1] https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>>>>
>>>>> Thanks,
>>>>> Cheng Pan
>>>>>
>>>>>
>>>>> On Mar 30, 2026, at 11:13, Sean Owen <[email protected]> wrote:
>>>>>
>>>>> TL;DR Tian is more correct, and pinning versions with == does not achieve
>>>>> the desired outcome. There are other ways to do it; I can't think of any
>>>>> other Python package that works that way. This thread is conflating
>>>>> different things.
>>>>>
>>>>> While expressing dependence on "foo>=2.0.0" can indeed be an
>>>>> overly-broad claim -- do you really think it works with 5.x in 10 years? --
>>>>> expressing "foo==2.0.0" is very likely overly narrow. That says "does not
>>>>> work with any other version at all", which is likely more incorrect and
>>>>> more problematic for users.
>>>>>
>>>>> You can totally produce a sort of 'lock' file -- uv.lock,
>>>>> requirements.txt -- expressing a known good / recommended specific
>>>>> resolved environment. That is _not_ what Python dependency constraints
>>>>> are for. It's what env lock files are for.
>>>>>
>>>>> To be sure, there is an art to figuring out the right dependency
>>>>> bounds. A reasonable compromise is to allow maintenance releases as a
>>>>> default when nothing more specific is known. That is, write
>>>>> "foo~=2.0.2" to mean ">=2.0.2 and <2.1".
>>>>>
>>>>> The analogy to Scala/Java/Maven land does not quite work, partly
>>>>> because Maven resolution is just pretty different, but mostly because the
>>>>> core Spark distribution is the 'server side' and is necessarily a 'fat
>>>>> jar': a sort of statically-compiled artifact that simply has some specific
>>>>> versions in it and can never have different versions because of runtime
>>>>> resolution differences.
>>>>>
>>>>>
>>>>> On Sun, Mar 29, 2026 at 10:02 PM Tian Gao via dev <[email protected]> wrote:
>>>>>
>>>>>> I agree that a product must be usable first.
>>>>>> Pinning versions (to a specific number with `==`) would make pyspark
>>>>>> unusable.
>>>>>>
>>>>>> First of all, I think we can agree that many users use PySpark with
>>>>>> other Python packages. If we conflict with other packages, `pip install -r
>>>>>> requirements.txt` won't work. It will complain that the dependencies can't
>>>>>> be resolved, which completely breaks our users' workflow. Even if the user
>>>>>> locks the dependency versions, it won't work. So the user has to install
>>>>>> PySpark first, then the other packages, to override PySpark's
>>>>>> dependencies. They can't put their dependency list in a single file - that
>>>>>> is a horrible user experience.
>>>>>>
>>>>>> When I look at controversial topics, I always hold a strong belief
>>>>>> that I can't be the only smart person in the world. If an idea is good,
>>>>>> others must already be doing it. Can we find any recognized package on the
>>>>>> market that pins its dependencies to specific versions? The only case
>>>>>> where it works is when this package is *all* the user needs. That's why we
>>>>>> pin versions for docker images, HTTP services, or standalone tools - users
>>>>>> just need something that works out of the box. If we consider PySpark the
>>>>>> dominant package - meaning that if a user employs it, it must be the most
>>>>>> important element in their project and everything else must comply with it
>>>>>> - pinning versions might be viable.
>>>>>>
>>>>>> I'm not familiar with Java dependency solutions or how users use
>>>>>> spark with Java, but I'm familiar with the Python ecosystem and community.
>>>>>> If we pin to specific versions, we will face significant criticism. If we
>>>>>> must do it, at least don't make it the default. Like I said above, I don't
>>>>>> have a strong opinion about having a `pyspark[pinned]` - if users only
>>>>>> need pyspark and no other packages they could use that.
>>>>>> But that's extra effort
>>>>>> for maintenance, and we need to think about what gets pinned. We have a
>>>>>> lot of pyspark install variants.
>>>>>>
>>>>>> Tian Gao
>>>>>>
>>>>>> On Sun, Mar 29, 2026 at 7:12 PM Cheng Pan <[email protected]> wrote:
>>>>>>
>>>>>>> I think the community has already reached consensus to freeze
>>>>>>> dependencies in minor releases.
>>>>>>>
>>>>>>> SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence [1]
>>>>>>>
>>>>>>> > Clear rules for changes allowed in minor vs. major releases:
>>>>>>> > - Dependencies are frozen and behavioral changes are minimized in
>>>>>>> minor releases.
>>>>>>>
>>>>>>> I would interpret the proposed dependency policy as applying to both
>>>>>>> Java/Scala and Python dependency management for Spark. If so, that means
>>>>>>> PySpark will always use pinned dependency versions since 4.3.0. But if
>>>>>>> the intention is to only apply such a dependency policy to Java/Scala,
>>>>>>> then it creates a very strange situation - an extremely conservative
>>>>>>> dependency management strategy for Java/Scala, and an extremely liberal
>>>>>>> one for Python.
>>>>>>>
>>>>>>> To Tian Gao,
>>>>>>>
>>>>>>> > Pinning versions is a double-edged sword; it doesn't always make
>>>>>>> us more secure - that's my major point.
>>>>>>>
>>>>>>> A product must be usable first, then secure, performant, etc. If it
>>>>>>> claims to require `foo>=2.0.0`, how do you ensure it is compatible with
>>>>>>> foo `2.3.4`, `3.x.x`, `4.x.x`? Actually, such incompatibility failures
>>>>>>> have occurred many times, e.g., [2].
>>>>>>> On the contrary, if it claims to require `foo==2.0.0`,
>>>>>>> that means it was thoroughly tested with `foo==2.0.0`, and users take
>>>>>>> their own risk when using it with other `foo` versions. For example, if
>>>>>>> `foo` strictly follows semantic versioning, it should work with
>>>>>>> `foo<3.0.0`, but this is not Spark's responsibility; users should assess
>>>>>>> and assume the risk of incompatibility themselves.
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/browse/SPARK-54633
>>>>>>> [2] https://github.com/apache/spark/pull/52633
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Cheng Pan
>>>>>>>
>>>>>>>
>>>>>>> On Mar 28, 2026, at 06:59, Holden Karau <[email protected]> wrote:
>>>>>>>
>>>>>>> Response inline
>>>>>>>
>>>>>>>
>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email>
>>>>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
>>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>> Pronouns: she/her
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas <[email protected]> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Mar 27, 2026, at 12:31 PM, Holden Karau <[email protected]> wrote:
>>>>>>>>
>>>>>>>> One possibility would be to make the pinned versions optional (e.g.
>>>>>>>> pyspark[pinned]) or publish a separate constraints file for people to
>>>>>>>> optionally use with -c?
>>>>>>>>
>>>>>>>>
>>>>>>>> Perhaps I am misunderstanding your proposal, Holden, but this is
>>>>>>>> possible today for people using modern Python packaging workflows that
>>>>>>>> use lock files. In fact, it happens automatically; all transitive
>>>>>>>> dependencies are pinned in the lock file, and this is by design.
>>>>>>>>
>>>>>>> So for someone installing a fresh venv with uv, pip, or conda, where
>>>>>>> does this come from?
>>>>>>>
>>>>>>> The idea here is we provide the versions we used during the release
>>>>>>> stage so that if folks want a “known safe” initial starting point for a
>>>>>>> new env, they’ve got one.
>>>>>>>
>>>>>>>>
>>>>>>>> Furthermore, it is straightforward to add additional restrictions
>>>>>>>> to your project spec (i.e. pyproject.toml) so that when the packaging
>>>>>>>> tool builds the lock file, it does so with whatever restrictions you
>>>>>>>> want that are specific to your project. That could include specific
>>>>>>>> versions or version ranges of libraries to exclude, for example.
>>>>>>>>
>>>>>>> Yes, but as it stands we leave it to the end user to start from
>>>>>>> scratch picking these versions. We can make their lives simpler by
>>>>>>> providing the versions we tested against in a lock file they can choose
>>>>>>> to use, ignore, or update to their desired versions.
>>>>>>>
>>>>>>> Also, for interactive workloads I more often see a bare requirements
>>>>>>> file or even pip installs in notebook cells (but this could be sample
>>>>>>> bias).
>>>>>>>
>>>>>>>>
>>>>>>>> I had to do this, for example, on a personal project that used
>>>>>>>> PySpark Connect but which was pulling in a version of grpc that
>>>>>>>> was generating a lot of log noise
>>>>>>>> <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>.
>>>>>>>> I pinned the version of grpc in my project file and let the packaging
>>>>>>>> tool resolve all the requirements across PySpark Connect and my custom
>>>>>>>> restrictions.
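[To sketch what the opt-in constraints artifact discussed above could look like: file name, contents, and versions are hypothetical, not an actual release artifact. pip's `-c` flag applies such a file as upper-level constraints without installing anything from it:]

```
# pyspark-4.3.0-constraints.txt (hypothetical; versions illustrative),
# generated from the release-testing venv via `pip freeze`.
# Opt-in usage:
#   pip install pyspark -c pyspark-4.3.0-constraints.txt
numpy==1.26.4
pandas==2.2.2
pyarrow==16.1.0
grpcio==1.62.2
```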
>>>>>>>>
>>>>>>>> Nick
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email>
>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>> Pronouns: she/her
