Hi,

The recent Spark release, 2.0.0, comes with many changes and improvements, and the same holds for the related Spark MLlib.

-> Spark 2.0.0 release notes <https://spark.apache.org/releases/spark-release-2-0-0.html>
-> Spark 2.0.0 overview <https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html>
-> Announcement: DataFrame-based API is primary API <http://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api>
The changes are quite significant, as they move usage towards Datasets instead of RDDs and bring many performance improvements. The version of Spark that PredictionIO is currently built with is quite old (1.4.0), and soon people will want to use the latest features in new templates. What is more, we could come up with a new kind of workflow, different from DASE, that fits Spark ML pipelines better; it could become part of some future release and set a new direction for the project.

I am aware that simply upgrading Spark would break many existing projects and cause inconvenience to a lot of people. Besides, it requires bumping the Scala version to 2.11. Maintaining a separate branch for the new version is hardly sustainable, so I think we could cross-build the project against two versions of Scala - 2.10 and 2.11 - where the 2.11 build would be specifically for the new version of Spark. Any sources that cannot compile against both versions at the same time could be split between version-specific directories: src/main/scala-2.10/ and src/main/scala-2.11/. Sbt as of 0.13.8 should have no problem with that -> merged PR <https://github.com/sbt/sbt/pull/1799>

Such a setup would make it possible to work on fixes and features available in both Scala versions, but more importantly it would let us add new functionality specific to the latest Spark releases without breaking the old build.

I have already tried updating Spark in my own fork of PredictionIO and managed to get a working version without modifying too much code -> diff here <https://github.com/apache/incubator-predictionio/compare/develop...Ziemin:upgrade#diff-fdc3abdfd754eeb24090dbd90aeec2ce>. Both unit tests and integration tests were successful.

What do you think? Would it affect the current release cycle in a negative way? Maybe someone has a better idea of how to perform this upgrade.
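For illustration, here is a minimal build.sbt sketch of what such a cross-build could look like (the artifact names and version numbers are my assumptions for the example, not the actual PredictionIO build definition):

```scala
// Cross-build against two Scala versions. Since sbt 0.13.8, the
// version-specific directories src/main/scala-2.10/ and
// src/main/scala-2.11/ are picked up automatically per build.
crossScalaVersions := Seq("2.10.6", "2.11.8")
scalaVersion := crossScalaVersions.value.head

// Choose the Spark version per Scala binary version
// (illustrative version numbers only)
libraryDependencies += {
  CrossVersion.partialVersion(scalaVersion.value) match {
    case Some((2, 11)) =>
      "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"
    case _ =>
      "org.apache.spark" %% "spark-core" % "1.4.0" % "provided"
  }
}
```

With a setup along these lines, running `sbt +test` would compile and test against both Scala versions in one go.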
Sticking to Spark 1.x forever is probably not an option, and the sooner we upgrade the better.

Regards,
Marcin
