Hi,

The recent Spark release, 2.0.0, comes with many changes and improvements, and the same holds for the related Spark MLlib.

-> Spark 2.0.0 release notes <https://spark.apache.org/releases/spark-release-2-0-0.html>
-> Spark 2.0.0 overview <https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html>
-> Announcement: DataFrame-based API is primary API <http://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api>
The changes are quite significant, as they move usage towards Datasets instead of RDDs and bring many performance improvements. The version of Spark that PredictionIO is currently built with is quite old (1.4.0), and soon people will want to use the latest features in new templates. What is more, we could come up with a new kind of workflow, different from DASE, that fits Spark ML pipelines better; it could become part of some future release and set a new direction for the project.

I am aware that simply upgrading Spark would break many existing projects and cause inconvenience to a lot of people. Besides, it requires bumping the Scala version to 2.11. Maintaining a separate branch for the new version is hardly sustainable, so I think we could cross-build the project against two versions of Scala - 2.10 and 2.11 - where the 2.11 build would be specifically for the new version of Spark. Any sources that cannot compile against both versions at the same time could be split between version-specific directories: src/main/scala-2.10/ and src/main/scala-2.11/. Sbt as of 0.13.8 should have no problem with that -> merged PR <https://github.com/sbt/sbt/pull/1799>

Such a setup would make it possible to work on fixes and features available in both Scala versions, but more importantly it would let us add new functionality specific to the latest Spark releases without breaking the old build.

I have already tried updating Spark in my own fork of PredictionIO and managed to get a working version without modifying too much code -> diff here <https://github.com/apache/incubator-predictionio/compare/develop...Ziemin:upgrade#diff-fdc3abdfd754eeb24090dbd90aeec2ce>. Both unit tests and integration tests were successful.

What do you think? Would it affect the current release cycle in a negative way? Maybe someone has a better idea of how to perform this upgrade.
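For illustration, here is a minimal build.sbt sketch of what such a cross-build could look like (the artifact names and version numbers are my assumptions for the example, not the actual PredictionIO build definition):

```scala
// Cross-build against two Scala versions. Since sbt 0.13.8, the
// version-specific directories src/main/scala-2.10/ and
// src/main/scala-2.11/ are picked up automatically per build.
crossScalaVersions := Seq("2.10.6", "2.11.8")
scalaVersion := crossScalaVersions.value.head

// Choose the Spark version per Scala binary version
// (illustrative version numbers only)
libraryDependencies += {
  CrossVersion.partialVersion(scalaVersion.value) match {
    case Some((2, 11)) =>
      "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"
    case _ =>
      "org.apache.spark" %% "spark-core" % "1.4.0" % "provided"
  }
}
```

With a setup along these lines, running `sbt +test` would compile and test against both Scala versions in one go.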
Sticking to Spark 1.x forever is probably not an option, and the sooner we upgrade the better.

Regards,
Marcin
