I think we should do something in response to the growing wave of supply chain attacks rather than just leaving the problem to users. One alternative we could consider for Python specifically is an install target with upper-bounded dependencies: `pip install "pyspark[deps-upper-bounded]"`. This wouldn't impact regular use, and it seems like it would avoid the problems raised earlier with publishing lock files, etc. As others have mentioned, this wouldn't *guarantee* security, but it would provide meaningful protection against the worst offenders we've seen recently.
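A minimal sketch of how such an extra might be populated, assuming it is derived mechanically from the pinned versions used in release testing. The extra name, package names, and versions below are illustrative, not PySpark's actual dependency list:

```python
# Sketch: deriving upper-bounded requirement strings for a hypothetical
# "deps-upper-bounded" extra. Package names and versions are illustrative.
def upper_bound(requirement: str) -> str:
    """Turn 'name==X.Y.Z' into 'name>=X.Y.Z,<(X+1)' (cap at next major)."""
    name, _, version = requirement.partition("==")
    major = int(version.split(".")[0])
    return f"{name}>={version},<{major + 1}"

pinned = ["grpcio==1.62.1", "pyarrow==15.0.2", "numpy==1.26.4"]
bounded = [upper_bound(r) for r in pinned]
print(bounded)
# ['grpcio>=1.62.1,<2', 'pyarrow>=15.0.2,<16', 'numpy>=1.26.4,<2']
```

A compromised new major release of a dependency would then be excluded at resolution time, while patch and minor updates still flow through.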
On Wed, Apr 1, 2026 at 9:37 AM Cheng Pan <[email protected]> wrote:

> > How about as a compromise, we publish (but don’t lock to) the pip freeze
> > outputs of the venvs we use for testing?
>
> > Where do you propose to publish? Spark website? Maybe in our github repo
> > somewhere?
>
> > I was thinking just in the publisher artifacts directory we already do.
>
> +1, I'm fine with any approach, as long as it provides sufficient info to
> let users know exactly which versions of dependencies were used for testing.
>
> For Java/Scala, we have a script[1]-generated dependency list in the code
> repo, at [2].
>
> [1] https://github.com/apache/spark/blob/branch-4.1/dev/test-dependencies.sh
> [2] https://github.com/apache/spark/blob/branch-4.1/dev/deps/spark-deps-hadoop-3-hive-2.3
>
> Thanks,
> Cheng Pan
>
> On Mar 31, 2026, at 03:12, Holden Karau <[email protected]> wrote:
>
> I was thinking just in the publisher artifacts directory we already do.
>
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>
> On Mon, Mar 30, 2026 at 10:26 AM Tian Gao <[email protected]> wrote:
>
>> Where do you propose to publish? Spark website? Maybe in our github repo
>> somewhere? For python packages, users rarely look for artifacts (and it's
>> difficult to find).
>>
>> Tian
>>
>> On Mon, Mar 30, 2026 at 10:04 AM Holden Karau <[email protected]> wrote:
>>
>>> I hear that. How about as a compromise, we publish (but don’t lock to)
>>> the pip freeze outputs of the venvs we use for testing?
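Mechanically, this compromise could look something like the sketch below: capture the resolved versions of the test venv (in CI this would come from `pip freeze`) and publish them as a constraints file users may opt into. The capture data and file name here are hypothetical:

```python
# Sketch of publishing (but not locking to) the versions from a test venv.
# In CI the mapping would come from `pip freeze`; here it is hard-coded
# illustrative data, and the output file name is hypothetical.
def to_constraints(frozen: dict[str, str]) -> str:
    """Render name->version pairs as a pip constraints file body."""
    return "".join(f"{name}=={version}\n" for name, version in sorted(frozen.items()))

tested = {"pyarrow": "15.0.2", "grpcio": "1.62.1"}
print(to_constraints(tested), end="")
# grpcio==1.62.1
# pyarrow==15.0.2
```

Users who want the tested environment could then run something like `pip install pyspark -c pyspark-tested-constraints.txt`, while everyone else is unaffected because nothing in PySpark's own metadata is pinned.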
>>>
>>> On Mon, Mar 30, 2026 at 8:04 AM Nicholas Chammas <[email protected]> wrote:
>>>
>>>> I think supply chain attacks are a problem, but I don’t think we want
>>>> to be on the hook for a solution here, even if it’s meant just for our
>>>> project.
>>>>
>>>> There are “good enough” approaches available today for Python that
>>>> mitigate most of the risk by excluding recent releases when resolving
>>>> which package versions to install.
>>>>
>>>> uv offers exclude-newer
>>>> <https://docs.astral.sh/uv/reference/settings/#exclude-newer>. pip
>>>> offers uploaded-prior-to
>>>> <https://pip.pypa.io/en/stable/cli/pip_index/#cmdoption-uploaded-prior-to>.
>>>> Poetry has an issue open
>>>> <https://github.com/python-poetry/poetry/issues/10646> for a similar
>>>> feature, plus at least one open PR to close it.
>>>>
>>>> Users concerned about supply chain attacks would probably get better
>>>> results from using these options than from installing pinned
>>>> dependencies provided by the projects they use.
>>>>
>>>> Nick
>>>>
>>>> On Mar 30, 2026, at 3:31 AM, Holden Karau <[email protected]> wrote:
>>>>
>>>> So I think we can ship it as an optional distribution element (it's
>>>> literally just another file folks can choose to download/use if they want).
>>>>
>>>> Asking users is an idea too; I could put together a survey if we want?
>>>>
>>>> On Sun, Mar 29, 2026 at 11:14 PM Tian Gao via dev <[email protected]> wrote:
>>>>
>>>>> I believe "foo~=2.0.1" is syntactic sugar for "foo>=2.0.1, foo==2.0.*".
>>>>> Similarly, "foo>=2.0.0, <3.0.0" is "foo~=2.0".
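Tian's reading of the `~=` operator matches PEP 440's "compatible release" rule: `foo~=2.0.1` expands to `foo>=2.0.1, foo==2.0.*`. A stdlib-only toy check of that equivalence (real resolvers should use `packaging.specifiers.SpecifierSet`; this handles only plain `X.Y.Z` versions):

```python
# Toy check of PEP 440 "compatible release" (~=) semantics for plain
# X.Y.Z versions: >= base, with the last given component wildcarded.
# Illustration only; use the `packaging` library for real specifiers.
def parse(v: str) -> tuple[int, ...]:
    return tuple(int(p) for p in v.split("."))

def compatible(candidate: str, base: str) -> bool:
    """True if candidate satisfies ~=base."""
    c, b = parse(candidate), parse(base)
    return c >= b and c[: len(b) - 1] == b[:-1]

assert compatible("2.0.5", "2.0.1")      # ~=2.0.1 allows 2.0.5
assert not compatible("2.1.0", "2.0.1")  # but not 2.1.0
assert compatible("2.9.0", "2.0")        # ~=2.0 allows any 2.x >= 2.0
assert not compatible("3.0.0", "2.0")    # i.e. >=2.0.0,<3.0.0, per Tian
```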
This is a nit and we don't
>>>>> need to focus on the syntax.
>>>>>
>>>>> I don't believe we can ship pyspark with an env lock file. That's what
>>>>> users do in their own projects; it's not part of the Python packaging
>>>>> system. What users normally do is install packages, test things out,
>>>>> then lock with either pip or uv - generate a lock file for all
>>>>> dependencies and use it across their systems. It's not common for
>>>>> packages to list out a "known working dependency list" for users.
>>>>>
>>>>> However, if we really want to try it out, we can do something like
>>>>> `pip install pyspark[full-pinned]` and install every dependency pyspark
>>>>> requires with a pinned version. If our users need an out-of-the-box
>>>>> solution they can do that. We can also collect feedback and see the
>>>>> sentiment from users.
>>>>>
>>>>> Tian
>>>>>
>>>>> On Sun, Mar 29, 2026 at 10:29 PM Cheng Pan <[email protected]> wrote:
>>>>>
>>>>>> > If we consider PySpark the dominant package - meaning that if a
>>>>>> user employs it, it must be the most important element in their
>>>>>> project and everything else must comply with it - pinning versions
>>>>>> might be viable.
>>>>>>
>>>>>> This is not always true, but definitely a major case.
>>>>>>
>>>>>> > I'm not familiar with Java dependency solutions or how users use
>>>>>> spark with Java
>>>>>>
>>>>>> In Java/Scala, it's rare to use dynamic versions for dependency
>>>>>> management. Projects declare transitive dependencies with pinned
>>>>>> versions, and the package manager (Maven, SBT, Gradle, etc.) picks the
>>>>>> most reasonable version based on resolution rules. The rules are a
>>>>>> little different in Maven, SBT, and Gradle; the Maven docs[1] explain
>>>>>> how it works.
>>>>>>
>>>>>> In short, in Java/Scala dependency management, the pinned version is
>>>>>> more like a suggested version; it's easy for users to override.
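For comparison, the exclude-newer / uploaded-prior-to mitigation Nick mentioned earlier boils down to filtering candidate releases by upload date before picking the newest one, so a freshly compromised upload cannot be selected. A stdlib-only sketch with made-up release data:

```python
# Sketch of the date-cutoff idea behind uv's exclude-newer and pip's
# --uploaded-prior-to. The release data below is made up for illustration.
from datetime import date

releases = {  # version -> upload date
    "1.0.0": date(2025, 6, 1),
    "1.1.0": date(2025, 11, 3),
    "1.1.1": date(2026, 3, 29),  # hypothetical malicious upload
}

def newest_before(releases: dict[str, date], cutoff: date) -> str:
    """Pick the highest version uploaded strictly before the cutoff."""
    eligible = [v for v, d in releases.items() if d < cutoff]
    # naive version ordering: compare numeric components
    return max(eligible, key=lambda v: tuple(int(p) for p in v.split(".")))

print(newest_before(releases, date(2026, 3, 1)))  # -> 1.1.0
```

The trade-off is that users also miss legitimate fixes published after the cutoff, which is why these flags are typically advanced slowly rather than frozen forever.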
>>>>>>
>>>>>> As Owen pointed out, things are completely different in the Python
>>>>>> world; neither pinned versions nor the latest versions seem ideal. Then,
>>>>>> among
>>>>>>
>>>>>> 1. pinned version (foo==2.0.0)
>>>>>> 2. allow maintenance releases (foo~=2.0.0)
>>>>>> 3. allow minor feature releases (foo>=2.0.0,<3.0.0)
>>>>>> 4. latest version (foo>=2.0.0, or foo)
>>>>>>
>>>>>> it seems 2 or 3 might be an acceptable solution? And I still believe we
>>>>>> should add a disclaimer that this compatibility only holds under the
>>>>>> assumption that 3rd-party packages strictly adhere to semantic
>>>>>> versioning.
>>>>>>
>>>>>> > You can totally produce a sort of 'lock' file -- uv.lock,
>>>>>> requirements.txt -- expressing a known good / recommended specific
>>>>>> resolved environment. That is _not_ what Python dependency constraints
>>>>>> are for. It's what env lock files are for.
>>>>>>
>>>>>> We definitely need such a dependency list in the PySpark release. It's
>>>>>> really important for users to be able to set up a reproducible
>>>>>> environment several years after a release, and it is also a good
>>>>>> reference for users who encounter 3rd-party package bugs or battle
>>>>>> dependency conflicts when they install lots of packages in a single
>>>>>> environment.
>>>>>>
>>>>>> [1] https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>>>>>
>>>>>> Thanks,
>>>>>> Cheng Pan
>>>>>>
>>>>>> On Mar 30, 2026, at 11:13, Sean Owen <[email protected]> wrote:
>>>>>>
>>>>>> TL;DR: Tian is more correct, and pinning versions with == does not
>>>>>> achieve the desired outcome. There are other ways to do it; I can't
>>>>>> think of any other Python package that works that way. This thread is
>>>>>> conflating different things.
>>>>>>
>>>>>> While expressing dependence on "foo>=2.0.0" can indeed be an
>>>>>> overly broad claim -- do you really think it works with 5.x in 10 years?
>>>>>> --
>>>>>> expressing "foo==2.0.0" is very likely overly narrow. That says "does
>>>>>> not work with any other version at all", which is likely more incorrect
>>>>>> and more problematic for users.
>>>>>>
>>>>>> You can totally produce a sort of 'lock' file -- uv.lock,
>>>>>> requirements.txt -- expressing a known good / recommended specific
>>>>>> resolved environment. That is _not_ what Python dependency constraints
>>>>>> are for. It's what env lock files are for.
>>>>>>
>>>>>> To be sure, there is an art to figuring out the right dependency
>>>>>> bounds. A reasonable compromise is to allow maintenance releases as a
>>>>>> default when nothing more specific is known. That is, write
>>>>>> "foo~=2.0.2" to mean ">=2.0.2 and <2.1".
>>>>>>
>>>>>> The analogy to Scala/Java/Maven land does not quite work, partly
>>>>>> because Maven resolution is just pretty different, but mostly because
>>>>>> the core Spark distribution is the 'server side' and is necessarily a
>>>>>> 'fat jar': a sort of statically compiled artifact that simply has some
>>>>>> specific versions baked in and can never have different versions
>>>>>> because of runtime resolution differences.
>>>>>>
>>>>>> On Sun, Mar 29, 2026 at 10:02 PM Tian Gao via dev <[email protected]> wrote:
>>>>>>
>>>>>>> I agree that a product must be usable first. Pinning the version (to
>>>>>>> a specific number with `==`) will make pyspark unusable.
>>>>>>>
>>>>>>> First of all, I think we can agree that many users use PySpark with
>>>>>>> other Python packages. If we conflict with other packages, `pip
>>>>>>> install -r requirements.txt` won't work. It will complain that the
>>>>>>> dependencies can't be resolved, which completely breaks our users'
>>>>>>> workflow. Even if the user locks the dependency versions, it won't
>>>>>>> work. So the user has to install PySpark first, then the other
>>>>>>> packages, to override PySpark's dependency.
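The conflict Tian describes can be illustrated with a toy range intersection (made-up package constraints; pip's real resolver is far more involved): an `==` pin leaves a near-zero-width allowed range, so it cannot overlap another project's requirement, while a version range usually can.

```python
# Toy illustration (not pip's real resolver): a constraint is a
# [low, high) interval over version tuples; resolution requires the
# intervals to overlap. All constraints here are made up.
def intersect(a: tuple[tuple[int, ...], tuple[int, ...]],
              b: tuple[tuple[int, ...], tuple[int, ...]]) -> bool:
    """True if the two [low, high) version ranges overlap."""
    low = max(a[0], b[0])
    high = min(a[1], b[1])
    return low < high

pin = ((2, 0, 0), (2, 0, 1))    # foo==2.0.0, roughly [2.0.0, 2.0.1)
ranged = ((2, 0, 0), (3,))      # foo>=2.0.0,<3.0.0
other = ((2, 1), (99,))         # some other package needs foo>=2.1

print(intersect(pin, other))     # False: the == pin makes install fail
print(intersect(ranged, other))  # True: a range leaves room to resolve
```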
>>>>>>> They can't put their dependency list in a single file - that is a
>>>>>>> horrible user experience.
>>>>>>>
>>>>>>> When I look at controversial topics, I always have a strong belief
>>>>>>> that I can't be the only smart person in the world. If an idea is
>>>>>>> good, others must already be doing it. Can we find any recognized
>>>>>>> package in the market that pins its dependencies to a specific
>>>>>>> version? The only case where it works is when this package is *all*
>>>>>>> the user needs. That's why we pin versions for docker images, HTTP
>>>>>>> services, or standalone tools - users just need something that works
>>>>>>> out of the box. If we consider PySpark the dominant package - meaning
>>>>>>> that if a user employs it, it must be the most important element in
>>>>>>> their project and everything else must comply with it - pinning
>>>>>>> versions might be viable.
>>>>>>>
>>>>>>> I'm not familiar with Java dependency solutions or how users use
>>>>>>> spark with Java, but I'm familiar with the Python ecosystem and
>>>>>>> community. If we pin to a specific version, we will face significant
>>>>>>> criticism. If we must do it, at least don't make it the default. Like
>>>>>>> I said above, I don't have a strong opinion about having a
>>>>>>> `pyspark[pinned]` - if users only need pyspark and no other packages,
>>>>>>> they could use that. But that's extra maintenance effort, and we need
>>>>>>> to think about what's pinned. We have a lot of pyspark install
>>>>>>> variants.
>>>>>>>
>>>>>>> Tian Gao
>>>>>>>
>>>>>>> On Sun, Mar 29, 2026 at 7:12 PM Cheng Pan <[email protected]> wrote:
>>>>>>>
>>>>>>>> I think the community has already reached consensus on freezing
>>>>>>>> dependencies in minor releases.
>>>>>>>>
>>>>>>>> SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence [1]
>>>>>>>>
>>>>>>>> > Clear rules for changes allowed in minor vs.
major releases:
>>>>>>>> > - Dependencies are frozen and behavioral changes are minimized in
>>>>>>>> minor releases.
>>>>>>>>
>>>>>>>> I would interpret the proposed dependency policy as applying to both
>>>>>>>> Java/Scala and Python dependency management for Spark. If so, that
>>>>>>>> means PySpark will always use pinned dependency versions starting
>>>>>>>> with 4.3.0. But if the intention is to only apply such a dependency
>>>>>>>> policy to Java/Scala, then it creates a very strange situation: an
>>>>>>>> extremely conservative dependency management strategy for Java/Scala,
>>>>>>>> and an extremely liberal one for Python.
>>>>>>>>
>>>>>>>> To Tian Gao,
>>>>>>>>
>>>>>>>> > Pinning versions is a double-edged sword, it doesn't always make
>>>>>>>> us more secure - that's my major point.
>>>>>>>>
>>>>>>>> A product must be usable first, then secure, performant, etc. If it
>>>>>>>> claims to require `foo>=2.0.0`, how do you ensure it is compatible
>>>>>>>> with foo `2.3.4`, `3.x.x`, `4.x.x`? Actually, such incompatibility
>>>>>>>> failures have occurred many times, e.g. [2]. On the contrary, if it
>>>>>>>> claims to require `foo==2.0.0`, that means it was thoroughly tested
>>>>>>>> with `foo==2.0.0`, and users take their own risk using it with other
>>>>>>>> `foo` versions. For example, if `foo` strictly follows semantic
>>>>>>>> versioning, it should work with `foo<3.0.0`, but this is not Spark's
>>>>>>>> responsibility; users should assess and assume the risk of
>>>>>>>> incompatibility themselves.
>>>>>>>>
>>>>>>>> [1] https://issues.apache.org/jira/browse/SPARK-54633
>>>>>>>> [2] https://github.com/apache/spark/pull/52633
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Cheng Pan
>>>>>>>>
>>>>>>>> On Mar 28, 2026, at 06:59, Holden Karau <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Response inline
>>>>>>>>
>>>>>>>> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> On Mar 27, 2026, at 12:31 PM, Holden Karau <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> One possibility would be to make the pinned version optional (e.g.
>>>>>>>>> pyspark[pinned]) or publish a separate constraints file for people
>>>>>>>>> to optionally use with -c?
>>>>>>>>>
>>>>>>>>> Perhaps I am misunderstanding your proposal, Holden, but this is
>>>>>>>>> possible today for people using modern Python packaging workflows
>>>>>>>>> that use lock files. In fact, it happens automatically; all
>>>>>>>>> transitive dependencies are pinned in the lock file, and this is by
>>>>>>>>> design.
>>>>>>>>>
>>>>>>>> So for someone installing a fresh venv with uv, pip, or conda, where
>>>>>>>> does this come from?
>>>>>>>>
>>>>>>>> The idea here is that we provide the versions we used during the
>>>>>>>> release stage, so if folks want a "known safe" initial starting point
>>>>>>>> for a new env, they've got one.
>>>>>>>>
>>>>>>>>> Furthermore, it is straightforward to add additional restrictions
>>>>>>>>> to your project spec (i.e.
pyproject.toml) so that when the packaging tool builds the lock
>>>>>>>>> file, it does so with whatever restrictions you want that are
>>>>>>>>> specific to your project. That could include specific versions or
>>>>>>>>> version ranges of libraries to exclude, for example.
>>>>>>>>>
>>>>>>>> Yes, but as it stands we leave it to the end user to start from
>>>>>>>> scratch picking these versions. We can make their lives simpler by
>>>>>>>> providing the versions we tested against in a lock file they can
>>>>>>>> choose to use, ignore, or update to their desired versions.
>>>>>>>>
>>>>>>>> Also, for interactive workloads I more often see a bare requirements
>>>>>>>> file or even pip installs in notebook cells (but this could be
>>>>>>>> sample bias).
>>>>>>>>
>>>>>>>>> I had to do this, for example, on a personal project that used
>>>>>>>>> PySpark Connect but which was pulling in a version of grpc that was
>>>>>>>>> generating a lot of log noise
>>>>>>>>> <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>.
>>>>>>>>> I pinned the version of grpc in my project file and let the
>>>>>>>>> packaging tool resolve all the requirements across PySpark Connect
>>>>>>>>> and my custom restrictions.
>>>>>>>>>
>>>>>>>>> Nick
