
chore(deps): update dependency pyspark to v4 #13387


Open · renovate-bot wants to merge 1 commit into main

Conversation

renovate-bot (Contributor) commented:

This PR contains the following updates:

| Package | Change |
| --- | --- |
| pyspark (source) | `==3.5.5` -> `==4.0.0` |

Release Notes

apache/spark (pyspark)

v4.0.0

Compare Source


Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Never, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

@renovate-bot requested review from a team as code owners on May 23, 2025 04:28
@trusted-contributions-gcf bot added the kokoro:force-run and owlbot:run labels on May 23, 2025
@product-auto-label bot added the samples and api: pubsublite labels on May 23, 2025
gemini-code-assist bot (Contributor) left a comment:

Hello @renovate-bot, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, providing a summary of this pull request. This PR is an automated dependency update generated by renovate-bot. Its primary purpose is to upgrade the pyspark dependency within the pubsublite/spark-connector module. Specifically, it updates the required version of pyspark[sql] from 3.5.5 to the new major version 4.0.0. This is a straightforward update to keep the project's dependencies current.

Highlights

  • Dependency Update: The core change is updating the pyspark[sql] dependency to version 4.0.0.
  • Automated PR: This pull request was automatically created by renovate-bot.
  • Manual Merge Required: The PR is configured to require manual merging and is not set up for automerge.

Changelog

  • pubsublite/spark-connector/requirements.txt
    • Updated the version constraint for pyspark[sql] from ==3.5.5 to ==4.0.0.
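
For reference, the corresponding diff in pubsublite/spark-connector/requirements.txt is the single-line version bump below, reconstructed from the update this PR describes:

```diff
-pyspark[sql]==3.5.5
+pyspark[sql]==4.0.0
```
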
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Spark version bumps,
From three-five-five to four,
New features await.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@kokoro-team removed the kokoro:force-run label on May 23, 2025
gemini-code-assist bot (Contributor) left a comment:

Code Review

This pull request updates pyspark from version 3.5.5 to 4.0.0. While keeping dependencies up to date is generally good practice, this is a major version upgrade for PySpark, and major upgrades often introduce breaking changes that require careful consideration.

The most significant concern with this upgrade to Spark 4.0 is its move to Scala 2.13 by default (from Scala 2.12 in Spark 3.x). This has direct implications for the binary compatibility of Spark connectors, such as the Pub/Sub Lite Spark connector JARs used in the examples and tests within this repository.

Thorough testing and verification are crucial before merging this change.

Summary of Findings

  • Connector JAR Compatibility with Spark 4.0/Scala 2.13: Spark 4.0 uses Scala 2.13 by default. The existing Pub/Sub Lite Spark connector JARs (e.g., pubsublite-spark-sql-streaming-1.0.0) are likely built for Scala 2.12 and Spark 3.x, which will cause runtime failures. This needs to be verified, and compatible JARs must be used (a quick runtime check is sketched after this list). This is a critical issue.
  • Thorough Testing for PySpark 4.0 Breaking Changes: Major version upgrades like PySpark 3.x to 4.x can introduce subtle breaking changes or behavior modifications beyond connector compatibility (e.g., changes in default SQL configurations). All examples and especially the integration tests in spark_streaming_test.py need to be thoroughly re-validated. This is a high-severity concern.
  • Missing newline at end of file: The file pubsublite/spark-connector/requirements.txt is missing a newline character at the end of the file. This is a minor stylistic issue and has not been commented on directly due to review settings focusing on medium severity and above.
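
As a quick check on the first finding above, the Scala version of the running Spark build can be read from PySpark itself. This is a minimal sketch, not code from this repository; it assumes only a local PySpark installation and uses the py4j gateway that PySpark exposes:

```python
# Minimal sketch: report which Scala version the running Spark was built with.
# A connector JAR compiled for Scala 2.12 is binary-incompatible with a
# Spark 4.0 runtime built on Scala 2.13.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scala-version-check").getOrCreate()

# scala.util.Properties is part of the Scala standard library on Spark's
# classpath; Spark 4.0 should report "version 2.13.x" here.
print(spark.sparkContext._jvm.scala.util.Properties.versionString())

spark.stop()
```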

Merge Readiness

This pull request updates pyspark to a new major version (4.0.0). Due to the critical concern regarding the compatibility of the Pub/Sub Lite Spark connector JARs with Spark 4.0's default Scala 2.13, and the general risk of breaking changes in a major library upgrade, I recommend that this PR not be merged until these issues are thoroughly investigated and addressed.

Specifically, it's essential to:

  1. Confirm the availability and usage of Pub/Sub Lite Spark connector JARs compatible with Spark 4.0/Scala 2.13.
  2. Successfully run all integration tests in spark_streaming_test.py with the new PySpark version and compatible connectors.
  3. Review PySpark 4.0 release notes for other potential breaking changes affecting the examples.

As a reviewer, I am not authorized to approve pull requests. Please ensure these critical points are resolved and further review is conducted before considering merging.

pubsublite/spark-connector/requirements.txt: `pyspark[sql]==4.0.0`

gemini-code-assist bot (Contributor) commented, severity: critical

Upgrading to PySpark 4.0.0 is a significant change. Apache Spark 4.0 now uses Scala 2.13 by default, whereas Spark 3.x versions (like 3.5.5) primarily used Scala 2.12.

This raises a critical concern about the compatibility of the Pub/Sub Lite Spark connector JARs used in your examples and tests (spark_streaming_test.py):

  • gs://pubsublite-spark/pubsublite-spark-sql-streaming-1.0.0-with-dependencies.jar
  • gs://spark-lib/pubsublite/pubsublite-spark-sql-streaming-LATEST-with-dependencies.jar

These JARs, especially version 1.0.0, are likely compiled against Scala 2.12. For instance, the java-pubsublite-spark connector version 1.1.0 (released Feb 2024) specifies Spark 3.3.2 and Scala 2.12.
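
For context, below is a hedged sketch of how a connector JAR like the ones above is typically attached to a PySpark session. The gs:// path is copied from the list above, spark.jars is standard Spark configuration, and whether a Scala 2.13 build is published at such a path is exactly the open question:

```python
# Sketch only: attach the Pub/Sub Lite connector JAR to a session.
# If this JAR is a Scala 2.12 build, using it under Spark 4.0
# (Scala 2.13) is expected to fail with binary-incompatibility errors.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pubsublite-connector-smoke-test")
    # spark.jars takes a comma-separated list of JARs for the driver and
    # executor classpaths; gs:// paths resolve on Dataproc or wherever
    # the GCS connector is available.
    .config(
        "spark.jars",
        "gs://spark-lib/pubsublite/"
        "pubsublite-spark-sql-streaming-LATEST-with-dependencies.jar",
    )
    .getOrCreate()
)
```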

Could you please verify the following points?

  1. Are there versions of the Pub/Sub Lite Spark connector JARs available that are compiled for Spark 4.0 and Scala 2.13?
  2. Have the paths in spark_streaming_test.py been updated to use these compatible JARs if necessary?
  3. Have all tests in spark_streaming_test.py been executed successfully with PySpark 4.0 and the (potentially new) connector JARs? Without compatible connectors, these tests are very likely to fail due to binary incompatibilities between Scala versions.

Additionally, PySpark 4.0 may introduce other breaking changes or behavior modifications (e.g., the default change for spark.sql.legacy.respectNullabilityInTextDatasetConversion). It's important to review the Apache Spark 4.0.0 release notes for any other changes that might affect your examples and ensure they behave as expected.
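
Where such behavior changes matter, the configuration can be audited and, if an example depends on the old behavior, pinned explicitly rather than left to the version default. A minimal sketch, assuming only the config key named above (its Spark 4.0 default should be confirmed against the official release notes):

```python
# Sketch: read the current value of a legacy-behavior config, then pin it
# explicitly if an example depends on the pre-4.0 behavior.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("legacy-config-audit").getOrCreate()

key = "spark.sql.legacy.respectNullabilityInTextDatasetConversion"
# Returns the version's default if the key is defined but unset.
print(key, "=", spark.conf.get(key))

# Pinning the value makes the example independent of the version default.
spark.conf.set(key, "true")
```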

Labels: api: pubsublite, owlbot:run, samples
3 participants