pan3793 commented on code in PR #657:
URL: https://github.com/apache/spark-website/pull/657#discussion_r2670714826
##########
releases/_posts/2025-12-16-spark-release-4.1.0.md:
##########
@@ -11,8 +11,444 @@ meta:
   _wpas_done_all: '1'
 ---
 
-Apache Spark 4.1.0 is a new feature release. It introduces new functionality and improvements. We encourage users to try it and provide feedback.
+Apache Spark 4.1.0 is the second release in the 4.x series. With significant contributions from the open-source community, this release addressed over 1,800 Jira tickets with contributions from more than 230 individuals.
 
-You can find the list of resolved issues and detailed changes in the [JIRA release notes](https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420&version=12355581).
+This release continues the Spark 4.x momentum and focuses on higher-level data engineering, lower-latency streaming, faster and easier PySpark, and a more capable SQL surface.
 
-We would like to acknowledge all community members for contributing patches and features to this release.
+This release adds Spark Declarative Pipelines (SDP): A new declarative framework where you define datasets and queries, and Spark handles the execution graph, dependency ordering, parallelism, checkpoints, and retries.
+
+This release supports Structured Streaming Real-Time Mode (RTM): First official support for Structured Streaming queries running in real-time mode for continuous, sub-second latency processing. For stateless tasks, latency can even drop to single-digit milliseconds.
+
+PySpark UDFs and Data Sources have been improved: New Arrow-native UDF and UDTF decorators for efficient PyArrow execution without Pandas conversion overhead, plus Python Data Source filter pushdown to reduce data movement.
+
+Spark ML on Connect is GA for the Python client, with smarter model caching and memory management. Spark 4.1 also improves stability for large workloads with zstd-compressed protobuf plans, chunked Arrow result streaming, and enhanced support for large local relations.
+
+SQL Scripting is GA and enabled by default, with improved error handling and cleaner declarations. VARIANT is GA with shredding for faster reads on semi-structured data, plus recursive CTE support and new approximate data sketches (KLL and Theta).
+
+To download Apache Spark 4.1.0, please visit the [downloads](https://spark.apache.org/downloads.html) page. For [detailed changes](https://issues.apache.org/jira/projects/SPARK/versions/12355581), you can consult JIRA. We have also curated a list of high-level changes here, grouped by major components.
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+
+### Highlights
+- **[[SPARK-51727]](https://issues.apache.org/jira/browse/SPARK-51727)** SPIP: **Declarative Pipelines**, a new component to define and run data pipelines
+- **[[SPARK-54499]](https://issues.apache.org/jira/browse/SPARK-54499)** Enable SQL scripting by default (SQL scripting GA)

Review Comment:
   ```suggestion
   - **[[SPARK-54499]](https://issues.apache.org/jira/browse/SPARK-54499)** Enable SQL Scripting by default (SQL Scripting GA)
   ```
