> How about as a compromise, we publish (but don’t lock to) the pip freeze
> outputs of the venvs we use for testing?
> Where do you propose to publish? Spark website? Maybe in our github repo
> somewhere?
> I was thinking just in the publisher artifacts directory we already do.

+1, I'm fine with any approach, as long as it provides sufficient info to let
users know exactly which versions of the dependencies were used for testing.
For Java/Scala, we have a script[1]-generated dependency list in the code
repo, at [2].

[1] https://github.com/apache/spark/blob/branch-4.1/dev/test-dependencies.sh
[2] https://github.com/apache/spark/blob/branch-4.1/dev/deps/spark-deps-hadoop-3-hive-2.3

Thanks,
Cheng Pan

> On Mar 31, 2026, at 03:12, Holden Karau <[email protected]> wrote:
>
> I was thinking just in the publisher artifacts directory we already do.
>
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>
> On Mon, Mar 30, 2026 at 10:26 AM Tian Gao <[email protected]> wrote:
>> Where do you propose to publish? Spark website? Maybe in our github repo
>> somewhere? For python packages, users rarely look for artifacts (and they
>> are difficult to find).
>>
>> Tian
>>
>> On Mon, Mar 30, 2026 at 10:04 AM Holden Karau <[email protected]> wrote:
>>> I hear that. How about as a compromise, we publish (but don’t lock to)
>>> the pip freeze outputs of the venvs we use for testing?
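For concreteness, the published artifact would presumably just be the raw
`pip freeze` output of the release-testing venv, something like the following
(package set and versions here are purely illustrative, not the real tested
set):

```
grpcio==1.62.0
numpy==2.1.3
pandas==2.2.3
pyarrow==17.0.0
```

Users who want the tested set could then pass it to pip as a constraints
file, e.g. `pip install pyspark -c spark-deps-pyspark.txt` (the filename is
hypothetical). pip's `-c/--constraint` option caps versions without adding
new requirements, so users who don't opt in can simply ignore the file.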
>>>
>>> On Mon, Mar 30, 2026 at 8:04 AM Nicholas Chammas
>>> <[email protected]> wrote:
>>>> I think supply chain attacks are a problem, but I don’t think we want to
>>>> be on the hook for a solution here, even if it’s meant just for our
>>>> project.
>>>>
>>>> There are “good enough” approaches available today for Python that
>>>> mitigate most of the risk by excluding recent releases when resolving
>>>> which package versions to install.
>>>>
>>>> uv offers exclude-newer
>>>> <https://docs.astral.sh/uv/reference/settings/#exclude-newer>. pip offers
>>>> uploaded-prior-to
>>>> <https://pip.pypa.io/en/stable/cli/pip_index/#cmdoption-uploaded-prior-to>.
>>>> Poetry has an issue open
>>>> <https://github.com/python-poetry/poetry/issues/10646> for a similar
>>>> feature, plus at least one open PR to close it.
>>>>
>>>> Users concerned about supply chain attacks would probably get better
>>>> results from using these options than from installing pinned
>>>> dependencies provided by the projects they use.
>>>>
>>>> Nick
>>>>
>>>>> On Mar 30, 2026, at 3:31 AM, Holden Karau <[email protected]> wrote:
>>>>>
>>>>> So I think we can ship it as an optional distribution element (it's
>>>>> literally just another file folks can choose to download/use if they
>>>>> want).
>>>>>
>>>>> Asking users is an idea too, I could put together a survey if we want?
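For reference, the cutoff-date mitigation Nick describes is a one-line
setting. A minimal sketch for uv follows (the date is illustrative, not a
recommendation):

```toml
# pyproject.toml -- tell uv to ignore any distribution uploaded after this date
[tool.uv]
exclude-newer = "2026-03-01T00:00:00Z"
```

pip exposes the same idea as a command-line flag, `--uploaded-prior-to`, per
the docs linked above.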
>>>>>
>>>>> On Sun, Mar 29, 2026 at 11:14 PM Tian Gao via dev
>>>>> <[email protected]> wrote:
>>>>>> I believe "foo~=2.0.1" is syntax sugar for "foo>=2.0.1, foo==2.0.*".
>>>>>> Similarly, "foo>=2.0.0, <3.0.0" is "foo~=2.0". This is a nit and we
>>>>>> don't need to focus on the syntax.
>>>>>>
>>>>>> I don't believe we can ship pyspark with an env lock file. That's what
>>>>>> users do in their own projects; it's not part of the Python packaging
>>>>>> system. What users normally do is install packages, test things out,
>>>>>> then lock with either pip or uv - generate a lock file for all
>>>>>> dependencies and use it across their systems. It's not common for
>>>>>> packages to publish a "known working dependency list" for users.
>>>>>>
>>>>>> However, if we really want to try it out, we can do something like
>>>>>> `pip install pyspark[full-pinned]` and install every dependency
>>>>>> pyspark requires at a pinned version. If our users need an
>>>>>> out-of-the-box solution they can do that. We can also collect feedback
>>>>>> and gauge user sentiment.
>>>>>>
>>>>>> Tian
>>>>>>
>>>>>> On Sun, Mar 29, 2026 at 10:29 PM Cheng Pan <[email protected]> wrote:
>>>>>>> > If we consider PySpark the dominant package - meaning that if a
>>>>>>> > user employs it, it must be the most important element in their
>>>>>>> > project and everything else must comply with it - pinning versions
>>>>>>> > might be viable.
>>>>>>>
>>>>>>> This is not always true, but it's definitely a major case.
>>>>>>>
>>>>>>> > I'm not familiar with Java dependency solutions or how users use
>>>>>>> > spark with Java
>>>>>>>
>>>>>>> In Java/Scala, it's rare to use dynamic versions for dependency
>>>>>>> management. Products declare transitive dependencies with pinned
>>>>>>> versions, and the package manager (Maven, SBT, Gradle, etc.) picks
>>>>>>> the most reasonable version based on resolution rules.
>>>>>>> The rules are a
>>>>>>> little different across Maven, SBT, and Gradle; the Maven docs[1]
>>>>>>> explain how it works.
>>>>>>>
>>>>>>> In short, in Java/Scala dependency management, the pinned version is
>>>>>>> more like a suggested version; it's easy for users to override.
>>>>>>>
>>>>>>> As Owen pointed out, things are completely different in the Python
>>>>>>> world. Both pinned versions and latest versions seem less than ideal.
>>>>>>> Given the options
>>>>>>>
>>>>>>> 1. pinned version (foo==2.0.0)
>>>>>>> 2. allow maintenance releases (foo~=2.0.0)
>>>>>>> 3. allow minor feature releases (foo>=2.0.0,<3.0.0)
>>>>>>> 4. latest version (foo>=2.0.0, or foo)
>>>>>>>
>>>>>>> it seems 2 or 3 might be an acceptable solution? And I still believe
>>>>>>> we should add a disclaimer that this compatibility only holds under
>>>>>>> the assumption that 3rd-party packages strictly adhere to semantic
>>>>>>> versioning.
>>>>>>>
>>>>>>> > You can totally produce a sort of 'lock' file -- uv.lock,
>>>>>>> > requirements.txt -- expressing a known good / recommended specific
>>>>>>> > resolved environment. That is _not_ what Python dependency
>>>>>>> > constraints are for. It's what env lock files are for.
>>>>>>>
>>>>>>> We definitely need such a dependency list in the PySpark release.
>>>>>>> It's really important for users to be able to set up a reproducible
>>>>>>> environment several years after the release, and it's also a good
>>>>>>> reference for users who hit 3rd-party package bugs, or who battle
>>>>>>> dependency conflicts when they install lots of packages in a single
>>>>>>> environment.
>>>>>>>
>>>>>>> [1] https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Cheng Pan
>>>>>>>
>>>>>>>> On Mar 30, 2026, at 11:13, Sean Owen <[email protected]> wrote:
>>>>>>>>
>>>>>>>> TL;DR Tian is more correct, and pinning versions with == does not
>>>>>>>> achieve the desired outcome.
>>>>>>>> There are other ways to do it; I can't think of
>>>>>>>> any other Python package that works that way. This thread is
>>>>>>>> conflating different things.
>>>>>>>>
>>>>>>>> While expressing dependence on "foo>=2.0.0" can indeed be an
>>>>>>>> overly broad claim -- do you really think it works with 5.x in 10
>>>>>>>> years? -- expressing "foo==2.0.0" is very likely overly narrow. That
>>>>>>>> says "does not work with any other version at all", which is likely
>>>>>>>> more incorrect and more problematic for users.
>>>>>>>>
>>>>>>>> You can totally produce a sort of 'lock' file -- uv.lock,
>>>>>>>> requirements.txt -- expressing a known good / recommended specific
>>>>>>>> resolved environment. That is _not_ what Python dependency
>>>>>>>> constraints are for. It's what env lock files are for.
>>>>>>>>
>>>>>>>> To be sure, there is an art to figuring out the right dependency
>>>>>>>> bounds. A reasonable compromise is to allow maintenance releases as
>>>>>>>> a default when nothing more specific is known. That is, write
>>>>>>>> "foo~=2.0.2" to mean ">=2.0.2 and <2.1".
>>>>>>>>
>>>>>>>> The analogy to Scala/Java/Maven land does not quite work, partly
>>>>>>>> because Maven resolution is just pretty different, but mostly
>>>>>>>> because the core Spark distribution is the 'server side' and is
>>>>>>>> necessarily a 'fat jar' -- a sort of statically-compiled artifact
>>>>>>>> that simply has specific versions baked in and can never see
>>>>>>>> different versions due to runtime resolution differences.
>>>>>>>>
>>>>>>>> On Sun, Mar 29, 2026 at 10:02 PM Tian Gao via dev
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> I agree that a product must be usable first. Pinning the version
>>>>>>>>> (to a specific number with `==`) will make pyspark unusable.
>>>>>>>>>
>>>>>>>>> First of all, I think we can agree that many users use PySpark
>>>>>>>>> with other Python packages.
>>>>>>>>> If we conflict with other packages,
>>>>>>>>> `pip install -r requirements.txt` won't work. It will complain that
>>>>>>>>> the dependencies can't be resolved, which completely breaks our
>>>>>>>>> users' workflow. Even if the user locks the dependency versions, it
>>>>>>>>> won't work. So the user has to install PySpark first, then the
>>>>>>>>> other packages, to override PySpark's dependencies. They can't put
>>>>>>>>> their dependency list in a single file - that is a horrible user
>>>>>>>>> experience.
>>>>>>>>>
>>>>>>>>> When I look at controversial topics, I always hold a strong belief
>>>>>>>>> that I can't be the only smart person in the world. If an idea is
>>>>>>>>> good, others must already be doing it. Can we find any recognized
>>>>>>>>> package in the market that pins its dependencies to specific
>>>>>>>>> versions? The only case where it works is when this package is
>>>>>>>>> *all* the user needs. That's why we pin versions for docker images,
>>>>>>>>> HTTP services, or standalone tools - users just need something that
>>>>>>>>> works out of the box. If we consider PySpark the dominant package -
>>>>>>>>> meaning that if a user employs it, it must be the most important
>>>>>>>>> element in their project and everything else must comply with it -
>>>>>>>>> pinning versions might be viable.
>>>>>>>>>
>>>>>>>>> I'm not familiar with Java dependency solutions or how users use
>>>>>>>>> spark with Java, but I'm familiar with the Python ecosystem and
>>>>>>>>> community. If we pin to specific versions, we will face significant
>>>>>>>>> criticism. If we must do it, at least don't make it the default.
>>>>>>>>> Like I said above, I don't have a strong opinion about having a
>>>>>>>>> `pyspark[pinned]` - if users only need pyspark and no other
>>>>>>>>> packages they could use that. But that's extra maintenance effort,
>>>>>>>>> and we need to think about what gets pinned. We have a lot of
>>>>>>>>> pyspark install options.
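As an aside, the semantics of the `~=` specifier debated earlier in the
thread can be sanity-checked with a toy model. The sketch below is pure
stdlib Python; it is not the real PEP 440 / `packaging` implementation and
ignores pre-releases, epochs, and other edge cases:

```python
# Toy model of PEP 440 "compatible release" semantics (the ~= operator),
# just to illustrate why ~= sits between == pinning and a bare >= bound.
# Not a substitute for the `packaging` library.

def parse(version: str) -> tuple:
    """Split a simple dotted version like '2.0.1' into an int tuple."""
    return tuple(int(part) for part in version.split("."))

def compatible(spec: str, candidate: str) -> bool:
    """foo~=X.Y.Z accepts candidate iff candidate >= X.Y.Z and candidate == X.Y.*"""
    base, cand = parse(spec), parse(candidate)
    return cand >= base and cand[: len(base) - 1] == base[:-1]

# foo~=2.0.1 accepts patch releases only:
assert compatible("2.0.1", "2.0.5")       # maintenance release: accepted
assert not compatible("2.0.1", "2.1.0")   # minor bump: rejected
assert not compatible("2.0.1", "3.0.0")   # major bump: rejected

# foo~=2.0 (one fewer component) widens to the whole 2.x series:
assert compatible("2.0", "2.7.3")
assert not compatible("2.0", "3.0.0")
```

This matches Tian's reading: `foo~=2.0.1` is sugar for `foo>=2.0.1,
foo==2.0.*`, while `foo~=2.0` is sugar for `foo>=2.0, <3.0` (option 3 in
Cheng Pan's list).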
>>>>>>>>>
>>>>>>>>> Tian Gao
>>>>>>>>>
>>>>>>>>> On Sun, Mar 29, 2026 at 7:12 PM Cheng Pan <[email protected]> wrote:
>>>>>>>>>> I think the community has already reached consensus on freezing
>>>>>>>>>> dependencies in minor releases.
>>>>>>>>>>
>>>>>>>>>> SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence [1]
>>>>>>>>>>
>>>>>>>>>> > Clear rules for changes allowed in minor vs. major releases:
>>>>>>>>>> > - Dependencies are frozen and behavioral changes are minimized
>>>>>>>>>> >   in minor releases.
>>>>>>>>>>
>>>>>>>>>> I would interpret the proposed dependency policy as applying to
>>>>>>>>>> both Java/Scala and Python dependency management for Spark. If so,
>>>>>>>>>> that means PySpark will always use pinned dependency versions from
>>>>>>>>>> 4.3.0 on. But if the intention is to only apply such a dependency
>>>>>>>>>> policy to Java/Scala, then it creates a very strange situation -
>>>>>>>>>> an extremely conservative dependency management strategy for
>>>>>>>>>> Java/Scala, and an extremely liberal one for Python.
>>>>>>>>>>
>>>>>>>>>> To Tian Gao,
>>>>>>>>>>
>>>>>>>>>> > Pinning versions is a double-edged sword, it doesn't always make
>>>>>>>>>> > us more secure - that's my major point.
>>>>>>>>>>
>>>>>>>>>> A product must be usable first, then come security, performance,
>>>>>>>>>> etc. If it claims to require `foo>=2.0.0`, how do you ensure it is
>>>>>>>>>> compatible with foo `2.3.4`, `3.x.x`, `4.x.x`? Such
>>>>>>>>>> incompatibility failures have actually occurred many times,
>>>>>>>>>> e.g. [2]. On the contrary, if it claims to require `foo==2.0.0`,
>>>>>>>>>> that means it was thoroughly tested with `foo==2.0.0`, and users
>>>>>>>>>> take on their own risk when using it with other `foo` versions.
>>>>>>>>>> For example, if `foo` strictly follows semantic versioning, it
>>>>>>>>>> should work with `foo<3.0.0`, but this is not Spark's
>>>>>>>>>> responsibility; users should assess and assume the risk of
>>>>>>>>>> incompatibility themselves.
>>>>>>>>>>
>>>>>>>>>> [1] https://issues.apache.org/jira/browse/SPARK-54633
>>>>>>>>>> [2] https://github.com/apache/spark/pull/52633
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Cheng Pan
>>>>>>>>>>
>>>>>>>>>>> On Mar 28, 2026, at 06:59, Holden Karau <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Response inline
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Mar 27, 2026, at 12:31 PM, Holden Karau
>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> One possibility would be to make the pinned versions optional
>>>>>>>>>>>>> (e.g. pyspark[pinned]) or publish a separate constraints file
>>>>>>>>>>>>> for people to optionally use with -c?
>>>>>>>>>>>>
>>>>>>>>>>>> Perhaps I am misunderstanding your proposal, Holden, but this is
>>>>>>>>>>>> possible today for people using modern Python packaging
>>>>>>>>>>>> workflows that use lock files. In fact, it happens
>>>>>>>>>>>> automatically; all transitive dependencies are pinned in the
>>>>>>>>>>>> lock file, and this is by design.
>>>>>>>>>>>
>>>>>>>>>>> So for someone installing a fresh venv with uv/pip/conda, where
>>>>>>>>>>> does this come from?
>>>>>>>>>>>
>>>>>>>>>>> The idea here is that we provide the versions we used during the
>>>>>>>>>>> release stage, so if folks want a “known safe” initial starting
>>>>>>>>>>> point for a new env they’ve got one.
>>>>>>>>>>>>
>>>>>>>>>>>> Furthermore, it is straightforward to add additional
>>>>>>>>>>>> restrictions to your project spec (i.e. pyproject.toml) so that
>>>>>>>>>>>> when the packaging tool builds the lock file, it does so with
>>>>>>>>>>>> whatever restrictions you want that are specific to your
>>>>>>>>>>>> project. That could include specific versions or version ranges
>>>>>>>>>>>> of libraries to exclude, for example.
>>>>>>>>>>>
>>>>>>>>>>> Yes, but as it stands we leave it to the end user to pick these
>>>>>>>>>>> versions from scratch. We can make their lives simpler by
>>>>>>>>>>> providing the versions we tested against in a lock file they can
>>>>>>>>>>> choose to use, ignore, or update to their desired versions.
>>>>>>>>>>>
>>>>>>>>>>> Also, for interactive workloads I more often see a bare
>>>>>>>>>>> requirements file or even pip installs in notebook cells (but
>>>>>>>>>>> this could be sample bias).
>>>>>>>>>>>>
>>>>>>>>>>>> I had to do this, for example, on a personal project that used
>>>>>>>>>>>> PySpark Connect but which was pulling in a version of grpc that
>>>>>>>>>>>> was generating a lot of log noise
>>>>>>>>>>>> <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>.
>>>>>>>>>>>> I pinned the version of grpc in my project file and let the
>>>>>>>>>>>> packaging tool resolve all the requirements across PySpark
>>>>>>>>>>>> Connect and my custom restrictions.
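The override pattern Nick describes above - pinning one noisy transitive
dependency and letting the resolver handle everything else - is a single
extra line in the project spec. A hypothetical sketch (project name and
version numbers are invented for illustration):

```toml
[project]
name = "my-spark-app"
dependencies = [
    "pyspark[connect]>=4.0",
    "grpcio==1.62.0",  # project-local pin to work around the log-noise issue
]
```

When the lock file is built from this spec, the pin constrains the resolution
without PySpark itself having to ship pinned dependencies.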
>>>>>>>>>>>>
>>>>>>>>>>>> Nick
