I was thinking just the publisher artifacts directory we already use.

Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her
On Mon, Mar 30, 2026 at 10:26 AM Tian Gao <[email protected]> wrote:

> Where do you propose to publish? Spark website? Maybe in our github repo
> somewhere? For python packages, users rarely look for artifacts (and
> they're difficult to find).
>
> Tian
>
> On Mon, Mar 30, 2026 at 10:04 AM Holden Karau <[email protected]> wrote:
>
>> I hear that. How about as a compromise, we publish (but don’t lock to)
>> the pip freeze outputs of the venvs we use for testing?
>>
>> Twitter: https://twitter.com/holdenkarau
>> Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email>
>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> Pronouns: she/her
>>
>>
>> On Mon, Mar 30, 2026 at 8:04 AM Nicholas Chammas <[email protected]> wrote:
>>
>>> I think supply chain attacks are a problem, but I don’t think we want to
>>> be on the hook for a solution here, even if it’s meant just for our project.
>>>
>>> There are “good enough” approaches available today for Python that
>>> mitigate most of the risk by excluding recent releases when resolving what
>>> package versions to install.
>>>
>>> uv offers exclude-newer <https://docs.astral.sh/uv/reference/settings/#exclude-newer>.
>>> pip offers uploaded-prior-to <https://pip.pypa.io/en/stable/cli/pip_index/#cmdoption-uploaded-prior-to>.
>>> Poetry has an issue open <https://github.com/python-poetry/poetry/issues/10646>
>>> for a similar feature, plus at least one open PR to close it.
>>>
>>> Users concerned about supply chain attacks would probably get better
>>> results from using these options than from installing pinned
>>> dependencies provided by the projects they use.
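[For readers who want to try the uv option mentioned above, the date cutoff can also be set project-wide in pyproject.toml. A minimal sketch; the timestamp is illustrative, not a recommendation:]

```toml
[tool.uv]
# Resolve dependencies as if nothing published after this instant exists,
# which excludes very recent (potentially compromised) releases.
exclude-newer = "2026-03-01T00:00:00Z"
```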
>>>
>>> Nick
>>>
>>>
>>> On Mar 30, 2026, at 3:31 AM, Holden Karau <[email protected]> wrote:
>>>
>>> So I think we can ship it as an optional distribution element (it's
>>> literally just another file folks can choose to download/use if they want).
>>>
>>> Asking users is an idea too; I could put together a survey if we want?
>>>
>>> On Sun, Mar 29, 2026 at 11:14 PM Tian Gao via dev <[email protected]> wrote:
>>>
>>>> I believe "foo~=2.0.1" is syntax sugar for "foo>=2.0.1, foo==2.0.*".
>>>> Similarly, "foo>=2.0.0, <3.0.0" is "foo~=2.0". This is a nit and we don't
>>>> need to focus on the syntax.
>>>>
>>>> I don't believe we can ship pyspark with an env lock file. That's what
>>>> users do in their own projects; it's not part of the Python packaging
>>>> system. What users normally do is install packages, test them out, then
>>>> lock the versions with either pip or uv - generate a lock file for all
>>>> dependencies and use it across their systems. It's not common for packages
>>>> to publish a "known working dependency list" for users.
>>>>
>>>> However, if we really want to try it out, we can do something like `pip
>>>> install pyspark[full-pinned]` and install every dependency pyspark requires
>>>> with a pinned version. If our users need an out-of-the-box solution they
>>>> can do that. We can also collect feedback and see the sentiment from users.
>>>>
>>>> Tian
>>>>
>>>> On Sun, Mar 29, 2026 at 10:29 PM Cheng Pan <[email protected]> wrote:
>>>>
>>>>> > If we consider PySpark the dominant package - meaning that if a user
>>>>> employs it, it must be the most important element in their project and
>>>>> everything else must comply with it - pinning versions might be viable.
>>>>>
>>>>> This is not always true, but it is definitely a major case.
>>>>>
>>>>> > I'm not familiar with Java dependency solutions or how users use
>>>>> spark with Java
>>>>>
>>>>> In Java/Scala, it's rare to use dynamic versions for dependency
>>>>> management.
>>>>> Projects declare transitive dependencies with pinned versions,
>>>>> and the package manager (Maven, SBT, Gradle, etc.) picks the most
>>>>> reasonable version based on resolution rules. The rules are a little
>>>>> different in Maven, SBT, and Gradle; the Maven docs[1] explain how it
>>>>> works.
>>>>>
>>>>> In short, in Java/Scala dependency management, the pinned version is
>>>>> more like a suggested version; it's easy for users to override.
>>>>>
>>>>> As Owen pointed out, things are completely different in the Python
>>>>> world, where both pinned versions and latest versions seem not ideal.
>>>>> Then, of
>>>>>
>>>>> 1. pinned version (foo==2.0.0)
>>>>> 2. allow maintenance releases (foo~=2.0.0)
>>>>> 3. allow minor feature releases (foo>=2.0.0,<3.0.0)
>>>>> 4. latest version (foo>=2.0.0, or foo)
>>>>>
>>>>> it seems 2 or 3 might be an acceptable solution? And I still believe we
>>>>> should add a disclaimer that this compatibility only holds under the
>>>>> assumption that 3rd-party packages strictly adhere to semantic versioning.
>>>>>
>>>>> > You can totally produce a sort of 'lock' file -- uv.lock,
>>>>> requirements.txt -- expressing a known good / recommended specific
>>>>> resolved environment. That is _not_ what Python dependency constraints
>>>>> are for. It's what env lock files are for.
>>>>>
>>>>> We definitely need such a dependency list in the PySpark release. It's
>>>>> really important for users to be able to set up a reproducible environment
>>>>> several years after the release, and it is also a good reference for users
>>>>> who encounter 3rd-party package bugs, or who battle dependency conflicts
>>>>> when they install lots of packages in a single environment.
>>>>>
>>>>> [1] https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>>>>
>>>>> Thanks,
>>>>> Cheng Pan
>>>>>
>>>>>
>>>>> On Mar 30, 2026, at 11:13, Sean Owen <[email protected]> wrote:
>>>>>
>>>>> TL;DR Tian is more correct, and pinning versions with == does not achieve
>>>>> the desired outcome. There are other ways to do it; I can't think of any
>>>>> other Python package that works that way. This thread is conflating
>>>>> different things.
>>>>>
>>>>> While expressing dependence on "foo>=2.0.0" can indeed be an
>>>>> overly-broad claim -- do you really think it works with 5.x in 10 years? --
>>>>> expressing "foo==2.0.0" is very likely overly narrow. That says "does not
>>>>> work with any other version at all", which is likely more incorrect and
>>>>> more problematic for users.
>>>>>
>>>>> You can totally produce a sort of 'lock' file -- uv.lock,
>>>>> requirements.txt -- expressing a known good / recommended specific
>>>>> resolved environment. That is _not_ what Python dependency constraints
>>>>> are for. It's what env lock files are for.
>>>>>
>>>>> To be sure, there is an art to figuring out the right dependency
>>>>> bounds. A reasonable compromise is to allow maintenance releases as a
>>>>> default when nothing more specific is known. That is, write
>>>>> "foo~=2.0.2" to mean ">=2.0.2 and <2.1".
>>>>>
>>>>> The analogy to Scala/Java/Maven land does not quite work, partly
>>>>> because Maven resolution is just pretty different, but mostly because the
>>>>> core Spark distribution is the 'server side' and is necessarily a 'fat
>>>>> jar': a sort of statically-compiled artifact that simply has some specific
>>>>> versions in it and can never have different versions because of runtime
>>>>> resolution differences.
>>>>>
>>>>>
>>>>> On Sun, Mar 29, 2026 at 10:02 PM Tian Gao via dev <[email protected]> wrote:
>>>>>
>>>>>> I agree that a product must be usable first.
>>>>>> Pinning versions (to a specific number with `==`) would make pyspark
>>>>>> unusable.
>>>>>>
>>>>>> First of all, I think we can agree that many users use PySpark with
>>>>>> other Python packages. If we conflict with other packages, `pip install -r
>>>>>> requirements.txt` won't work. It will complain that the dependencies can't
>>>>>> be resolved, which completely breaks our users' workflow. Even if the user
>>>>>> locks the dependency versions, it won't work. So the user has to install
>>>>>> PySpark first, then the other packages, to override PySpark's
>>>>>> dependencies. They can't put their dependency list in a single file - that
>>>>>> is a horrible user experience.
>>>>>>
>>>>>> When I look at controversial topics, I always hold a strong belief
>>>>>> that I can't be the only smart person in the world. If an idea is good,
>>>>>> others must already be doing it. Can we find any recognized package on the
>>>>>> market that pins its dependencies to specific versions? The only case
>>>>>> where it works is when this package is *all* the user needs. That's why we
>>>>>> pin versions for docker images, HTTP services, or standalone tools - users
>>>>>> just need something that works out of the box. If we consider PySpark the
>>>>>> dominant package - meaning that if a user employs it, it must be the most
>>>>>> important element in their project and everything else must comply with it
>>>>>> - pinning versions might be viable.
>>>>>>
>>>>>> I'm not familiar with Java dependency solutions or how users use
>>>>>> spark with Java, but I'm familiar with the Python ecosystem and community.
>>>>>> If we pin to specific versions, we will face significant criticism. If we
>>>>>> must do it, at least don't make it the default. Like I said above, I don't
>>>>>> have a strong opinion about having a `pyspark[pinned]` - if users only
>>>>>> need pyspark and no other packages they could use that.
>>>>>> But that's extra effort
>>>>>> for maintenance, and we need to think about what gets pinned. We have a
>>>>>> lot of pyspark install variants.
>>>>>>
>>>>>> Tian Gao
>>>>>>
>>>>>> On Sun, Mar 29, 2026 at 7:12 PM Cheng Pan <[email protected]> wrote:
>>>>>>
>>>>>>> I think the community has already reached consensus to freeze
>>>>>>> dependencies in minor releases.
>>>>>>>
>>>>>>> SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence [1]
>>>>>>>
>>>>>>> > Clear rules for changes allowed in minor vs. major releases:
>>>>>>> > - Dependencies are frozen and behavioral changes are minimized in
>>>>>>> minor releases.
>>>>>>>
>>>>>>> I would interpret the proposed dependency policy as applying to both
>>>>>>> Java/Scala and Python dependency management for Spark. If so, that means
>>>>>>> PySpark will always use pinned dependency versions since 4.3.0. But if
>>>>>>> the intention is to only apply such a dependency policy to Java/Scala,
>>>>>>> then it creates a very strange situation - an extremely conservative
>>>>>>> dependency management strategy for Java/Scala, and an extremely liberal
>>>>>>> one for Python.
>>>>>>>
>>>>>>> To Tian Gao,
>>>>>>>
>>>>>>> > Pinning versions is a double-edged sword; it doesn't always make
>>>>>>> us more secure - that's my major point.
>>>>>>>
>>>>>>> A product must be usable first, then secure, performant, etc. If it
>>>>>>> claims to require `foo>=2.0.0`, how do you ensure it is compatible with
>>>>>>> foo `2.3.4`, `3.x.x`, `4.x.x`? Actually, such incompatibility failures
>>>>>>> have occurred many times, e.g., [2].
>>>>>>> On the contrary, if it claims to require `foo==2.0.0`,
>>>>>>> that means it was thoroughly tested with `foo==2.0.0`, and users take
>>>>>>> their own risk when using it with other `foo` versions. For example, if
>>>>>>> `foo` strictly follows semantic versioning, it should work with
>>>>>>> `foo<3.0.0`, but this is not Spark's responsibility; users should assess
>>>>>>> and assume the risk of incompatibility themselves.
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/browse/SPARK-54633
>>>>>>> [2] https://github.com/apache/spark/pull/52633
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Cheng Pan
>>>>>>>
>>>>>>>
>>>>>>> On Mar 28, 2026, at 06:59, Holden Karau <[email protected]> wrote:
>>>>>>>
>>>>>>> Response inline
>>>>>>>
>>>>>>>
>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email>
>>>>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
>>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>> Pronouns: she/her
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas <[email protected]> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Mar 27, 2026, at 12:31 PM, Holden Karau <[email protected]> wrote:
>>>>>>>>
>>>>>>>> One possibility would be to make the pinned versions optional (e.g.
>>>>>>>> pyspark[pinned]) or publish a separate constraints file for people to
>>>>>>>> optionally use with -c?
>>>>>>>>
>>>>>>>>
>>>>>>>> Perhaps I am misunderstanding your proposal, Holden, but this is
>>>>>>>> possible today for people using modern Python packaging workflows that
>>>>>>>> use lock files. In fact, it happens automatically; all transitive
>>>>>>>> dependencies are pinned in the lock file, and this is by design.
>>>>>>>>
>>>>>>> So for someone installing a fresh venv with uv, pip, or conda, where
>>>>>>> does this come from?
>>>>>>>
>>>>>>> The idea here is we provide the versions we used during the release
>>>>>>> stage so that if folks want a “known safe” initial starting point for a
>>>>>>> new env, they’ve got one.
>>>>>>>
>>>>>>>>
>>>>>>>> Furthermore, it is straightforward to add additional restrictions
>>>>>>>> to your project spec (i.e. pyproject.toml) so that when the packaging
>>>>>>>> tool builds the lock file, it does so with whatever restrictions you
>>>>>>>> want that are specific to your project. That could include specific
>>>>>>>> versions or version ranges of libraries to exclude, for example.
>>>>>>>>
>>>>>>> Yes, but as it stands we leave it to the end user to start from
>>>>>>> scratch picking these versions. We can make their lives simpler by
>>>>>>> providing the versions we tested against in a lock file they can choose
>>>>>>> to use, ignore, or update to their desired versions.
>>>>>>>
>>>>>>> Also, for interactive workloads I more often see a bare requirements
>>>>>>> file or even pip installs in notebook cells (but this could be sample
>>>>>>> bias).
>>>>>>>
>>>>>>>>
>>>>>>>> I had to do this, for example, on a personal project that used
>>>>>>>> PySpark Connect but which was pulling in a version of grpc that
>>>>>>>> was generating a lot of log noise
>>>>>>>> <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>.
>>>>>>>> I pinned the version of grpc in my project file and let the packaging
>>>>>>>> tool resolve all the requirements across PySpark Connect and my custom
>>>>>>>> restrictions.
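[To sketch what the opt-in constraints artifact discussed above could look like: file name, contents, and versions are hypothetical, not an actual release artifact. pip's `-c` flag applies such a file as upper-level constraints without installing anything from it:]

```
# pyspark-4.3.0-constraints.txt (hypothetical; versions illustrative),
# generated from the release-testing venv via `pip freeze`.
# Opt-in usage:
#   pip install pyspark -c pyspark-4.3.0-constraints.txt
numpy==1.26.4
pandas==2.2.2
pyarrow==16.1.0
grpcio==1.62.2
```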
>>>>>>>>
>>>>>>>> Nick
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email>
>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>> Pronouns: she/her
