There are two ways to support alternative engines, be they databases, compute engines, search engines, or others.

1) Abstract them behind a single API, as is done with PEvents in PIO, where HBase and JDBC (specifically MySQL and Postgres) are the supported backends.
2) Do not abstract them, but allow native API usage. We have written templates that use VW, for instance, having no need for Spark or any distributed compute engine at all. That template would also more naturally be a kappa-style online learner, but we had to force it into lambda to get it done quickly with PIO.
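To make option 1 concrete, here is a minimal sketch of what a single least-common-denominator storage API could look like, loosely modeled on PIO's PEvents idea. The trait and class names below are illustrative assumptions, not the actual PIO storage interfaces:

```scala
// Sketch of option 1: one shared store API that every backend
// (HBase, JDBC, ...) must implement. Names are hypothetical.
case class Event(entityType: String, entityId: String, properties: Map[String, String])

trait EventStore {
  def insert(event: Event): Unit
  def find(entityType: String, entityId: String): Seq[Event]
}

// A trivial in-memory backend standing in for an HBase or JDBC implementation.
class InMemoryEventStore extends EventStore {
  private var events = Vector.empty[Event]

  def insert(e: Event): Unit = events :+= e

  def find(entityType: String, entityId: String): Seq[Event] =
    events.filter(e => e.entityType == entityType && e.entityId == entityId)
}
```

Option 2 simply skips the trait: an algorithm talks to VW or Cassandra through its native client, trading portability for full access to backend-specific features.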
There are pros and cons to both methods, and using them together is often required to avoid an abstraction that is the least common denominator. I think the key decision that allows either method to be chosen for any particular microservice is refactoring into microservices to begin with. To illustrate: with PIO today you must install with either the JDBC EventStore or the HBase EventStore, and the code to service both is in PIO. If these were microservices, only the code for the desired EventStore would be chosen as a containerized store, and creating a new one would mean creating a new microservice, not adding implementation classes for yet another backing store. This would also allow a particular algorithm to use, say, a Cassandra (micro?)service through its native API if desired, by punting the abstraction question. To get this kind of flexibility we need:

1) PIO as a collection of microservices (a major bit of work)
2) Decoupling PIO workflow commands from Spark while still allowing them to use Spark when required (would require a rethink of the CLI implementation)
3) Container composition and orchestration for setup (major work, but fairly straightforward with #1)

Looking at the wishlist below again, the work could be divided into two phases:

1) Quick, easy things like supporting TLS as opt-in, merging the ActionML fork, and minor cleanup
2) PIO as containerized microservices and all that implies

The first could be packaged as pio-1.0, with a fairly quick release that is supported long term (arguably), and the second as pio-2.x-snapshot, a months-long WIP with relatively ambitious goals. If this discussion is seen as constructive by the PMC, it should probably be moved to dev@ to get users' input.
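As a sketch of what the containerized-store idea could look like, here is a hypothetical docker-compose fragment in which the EventStore is just one swappable service. All image and service names are invented for illustration, not existing PIO artifacts:

```yaml
# Hypothetical composition: swap the eventstore service (Postgres here)
# for an HBase or Cassandra container without touching PIO code.
version: "2"
services:
  eventstore:
    image: postgres:9.5
    environment:
      POSTGRES_DB: pio_event
  eventserver:
    image: pio/eventserver        # invented image name
    depends_on:
      - eventstore
    ports:
      - "7070:7070"
  predictionserver:
    image: pio/predictionserver   # invented image name
    ports:
      - "8000:8000"
```

Creating a new backing store would then mean publishing a new container that speaks the EventServer's protocol, rather than adding implementation classes inside PIO itself.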
On Jul 2, 2016, at 6:06 PM, Suneel Marthi <[email protected] <mailto:[email protected]>> wrote:

On Sat, Jul 2, 2016 at 8:58 PM, Pat Ferrel <[email protected] <mailto:[email protected]>> wrote:

For the last year some of us have had the experience of creating several applications with PIO for users, and it is still our go-to platform for ML apps. However, it has led to several observations and even a code fork. We have a unique opportunity to make some quick changes that will help users now and also, since we are in incubation looking to TLP status, rethink things at a deeper level to refresh it for a new generation of ML algos and libs. One person's wish list (no attempt to prioritize):

- Simplify by removing `pio template get …`, to be replaced with `git clone` or whatever other version control is used.
- Simplify by removing the Scala-coded SBT build for templates; make the build visible to users by providing sample SBT or MVN files.
- Simplify the CLI by splitting `pio build` into `pio register` and `sbt build`. Call `pio register` any time the config of engine.json has changed when any workflow command is executed, so the user need not remember to do this.
- Refactor the CLI code to make it much easier to use with a debugger for those who wish to. Simply having the CLI use an SBT build will likely go a fair way to making this more accessible.
- Create a simple gallery of available templates, even if it is only a page of links on the Apache doc site pointing to GitHub repos with blurb explanations. Allow users to PR against this page to get theirs listed. This has already been discussed on the mailing list.
- Make SSL an opt-in rather than a requirement. I think a PR has been created for this.
- After #6, any of the ActionML fork should be merged. Alex and I will create one for review and possible cherry-picking.
- Support TLS in SDKs and examples ASAP for the opt-in SSL in PIO.
- Remove SparkSubmit as a requirement, period. If PIO is to be a framework for ML we should support non-MLlib algos.
- There are already several templates that do not need Spark for train or deploy. Is there a need to couple PIO with Spark at all? Examples like TensorFlow, Vowpal Wabbit, Flink, and many other perfectly good libs could be used in algos not tied to Spark. Other IPMC members proposed during the incubation vote possibly redoing this using Apache Beam (and thus handling unified batch + streaming). Something to consider.
- Rethink and simplify the build, train, deploy workflow to fit lambda and kappa. For instance, it may be that only something like deploy is needed for kappa. We can go kappa with something like Flink or Beam.
- Make the PredictionServer multi-tenant. ActionML has code to donate here, but with restrictions about Spark contexts. More discussion is needed, and probably a PredictionServer refactoring/re-architecture.
- There is a great deal of unnecessary complexity around multi-tenancy (multiple data and model identification). There are appids, instance-ids, app keys, channels, port numbers, etc. We should unify multi-tenancy around REST resource ids from the EventServer to the PredictionServer. Add authentication with SSL.
- Installation is the single biggest challenge for users. We should consider refactoring/re-architecting into a microservice architecture with associated containerized deployment using container orchestration tools. Examples are Docker + Swarm (containers) and Karamel, Chef, Ansible, or Docker-machine (orchestration). Clearly there are other candidates. The importance of this is hard to overstate IMHO. PIO is far too monolithic.
- Decouple input from output from train/model updates. A large deployment of PIO for a lambda-type template involves input, training, and output, which are potentially running on different machines. Input through the EventServer is separated out, but not documented or containerized for independent deployment.
- Train and Predict are two methods of the same class and so are intimately coupled, even though deployment or workflow execution may occur on separate machines/clusters. Go MicroServices!!!
- Fully support both lambda- and kappa-type algorithms. The need for this is illustrated by the rise of streaming algorithms that use online learning. To simplify this discussion, two examples would help: a recommender, which is usually lambda and for which PredictionIO has several examples, and a multi-armed bandit, which can easily be kappa and is fairly simple to imagine. Another kappa-style algorithm is a streaming online anomaly detector for event patterns.
- Adding kappa-style algos, allowing non-Spark algos, and refactoring templates will allow the project to containerize algos so they come with their own deployment pattern, namely orchestration and composition of microservices. Some algos may share some (micro)services and use unique ones elsewhere, but the orchestration and composition pattern can account for this.
- Refactor the Template API to make Algo.train and Algo.predict separate, since their needs may be very different. In one current PIO example, Algo.train is the only part of the workflow that needs Spark. Deploy does not, and in fact may output CSV or JSON instead of requiring a PredictionServer. If Algo.train needs Spark and the template is based on containers, new machines can be spun up to do the training and taken down afterward, leading to large cloud compute savings for users.

There are endless examples of how reworking PIO to be based on microservices and containers would enable many new applications and ease deployment. I'm interested to see whether the current committers have the energy around this project to think this broadly; I have no wish at all to start a bunch of debates. PIO is the best project of its kind available today, but it will evolve. The question is: how much?
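As a closing sketch of the Algo.train/Algo.predict split suggested above: separating the two behind independent traits would let a Spark-backed trainer and a Spark-free predictor live in different services. Everything below, including the toy popularity model, is illustrative only and not a proposed final API:

```scala
// Hypothetical split: training and prediction as separate traits, so
// each can be deployed (and containerized) independently.
trait Trainer[D, M] {
  def train(data: D): M
}

trait Predictor[M, Q, P] {
  def predict(model: M, query: Q): P
}

// Toy example: a popularity "model" built from raw event counts,
// served without any Spark dependency at all.
object PopTrainer extends Trainer[Seq[String], Map[String, Int]] {
  def train(events: Seq[String]): Map[String, Int] =
    events.groupBy(identity).map { case (item, hits) => item -> hits.size }
}

object PopPredictor extends Predictor[Map[String, Int], Int, Seq[String]] {
  def predict(model: Map[String, Int], topN: Int): Seq[String] =
    model.toSeq.sortBy(-_._2).map(_._1).take(topN)
}
```

In a containerized deployment, only the service hosting the Trainer would need a Spark cluster; the Predictor container could stay small and always-on.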
