There are two ways to support alternative engines, be they DBs, compute engines, 
search engines, or others.
1) Abstract them into a single API, as is done with PEvents in PIO, where HBase 
and JDBC (specifically MySQL and Postgres) are the supported backends.
2) Do not abstract them but allow native API usage. We have written templates 
that use VW (Vowpal Wabbit) for instance, having no need for Spark or a 
distributed compute engine at all. This Template would also more naturally be a 
kappa-type online learner, but we had to force it into lambda to get it done 
quickly with PIO.

There are many pros and cons to both methods, and using them together is often 
required to avoid an abstraction that is the least common denominator.

I think the key decision that allows either method to be chosen for any 
particular service is refactoring PIO into microservices to begin with.

To illustrate: with PIO now you have to install with the JDBC EventStore or the 
HBase EventStore, and the code to service both is in PIO. If these were 
microservices, the code for only the desired EventStore could be chosen as a 
containerized store, and creating a new one would mean creating a new 
microservice, not adding implementation class(es) for yet another backing 
store. This would also allow a particular Algorithm to use a Cassandra 
(micro?)service through its native API if desired, punting on the abstraction 
question.
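As a sketch of option 1 at the microservice level, the EventStore contract could be a narrow interface that each containerized backend (HBase, JDBC, Cassandra, etc.) implements behind its own service. All names here are hypothetical, not an existing PIO API:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Event:
    entity_id: str
    event_type: str
    properties: dict = field(default_factory=dict)

class EventStore(ABC):
    """Hypothetical narrow contract; each backend would live behind
    its own containerized service implementing just this."""
    @abstractmethod
    def insert(self, app_id: str, event: Event) -> None: ...
    @abstractmethod
    def find(self, app_id: str, entity_id: str) -> List[Event]: ...

class InMemoryEventStore(EventStore):
    """Stand-in backend used only to illustrate the contract."""
    def __init__(self) -> None:
        self._events: Dict[str, List[Event]] = {}

    def insert(self, app_id: str, event: Event) -> None:
        self._events.setdefault(app_id, []).append(event)

    def find(self, app_id: str, entity_id: str) -> List[Event]:
        return [e for e in self._events.get(app_id, [])
                if e.entity_id == entity_id]
```

An Algorithm that wants Cassandra's native API would simply bypass this interface and talk to its own service, which is the "punt" described above.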

To get this kind of flexibility we need:
1) PIO as a collection of microservices (a major bit of work)
2) Decoupling PIO workflow commands from Spark but allowing them to use Spark 
when required. (would require a rethink of the CLI implementation)
3) container composition and orchestration for setup (major work but fairly 
straightforward with #1)
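Point 2 could look something like the following seam, where Spark becomes one pluggable compute backend rather than a hard requirement. This is a sketch only; none of these classes exist in PIO today:

```python
from abc import ABC, abstractmethod

class ComputeBackend(ABC):
    """Hypothetical seam: workflow commands depend on this,
    not on SparkSubmit directly."""
    @abstractmethod
    def run(self, job):
        ...

class LocalBackend(ComputeBackend):
    """Runs the job in-process -- enough for templates (e.g. a
    VW-based one) that need no distributed compute engine."""
    def run(self, job):
        return job()

class SparkBackend(ComputeBackend):
    """Would wrap spark-submit; shown as a stub."""
    def run(self, job):
        raise NotImplementedError("spawn spark-submit here")

def train(backend: ComputeBackend, job):
    # A workflow command asks for a backend only when it needs one.
    return backend.run(job)
```

The CLI rethink in #2 would then amount to choosing a backend per workflow command instead of always routing through SparkSubmit.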

Looking at the wishlist below again, it seems the work could be divided into 
two phases:
1) Quick easy things like supporting TLS as opt-in, merging ActionML fork, 
minor cleanup
2) PIO as containerized microservices and all that implies.

This could be packaged as pio-1.0, a fairly quick release that is supported 
long term (arguably), and pio-2.x-SNAPSHOT, a months-long WIP with relatively 
ambitious goals.

If this discussion is seen as constructive by the PMC, it should probably be 
moved to dev@ to get users’ input.


On Jul 2, 2016, at 6:06 PM, Suneel Marthi <[email protected]> wrote:



On Sat, Jul 2, 2016 at 8:58 PM, Pat Ferrel <[email protected]> wrote:
For the last year some of us have had the experience of creating several 
applications with PIO for users, and it is still our go-to platform for ML 
apps. However, that experience has led to several observations and even a code 
fork. We have a unique opportunity to make some quick changes that will help 
users now and also, since we are in incubation looking toward TLP status, to 
rethink things at a deeper level to refresh PIO for a new generation of ML 
algos and libs.

One person’s wish list (no attempt to prioritize):
Simplify by removing `pio template get …` to be replaced with `git clone` or 
whatever other version control is used.
Simplify by removing the Scala-coded SBT build for templates; make the build 
visible to users by providing sample SBT or MVN files.
Simplify the CLI by splitting `pio build` into `pio register` and `sbt build`. 
Call `pio register` any time the config of engine.json has changed when any 
workflow command is executed, so the user need not remember to do this.
Refactor the CLI code to make it much easier to use with a debugger for those 
who wish to. Simply having the CLI's SBT build will likely go a fair way 
toward making this more accessible.
Create a simple gallery of available templates even if it is only a page of 
links on the Apache doc site pointing to Github repos with blurb explanations. 
Allow users to PR against this page to get theirs listed. This has already been 
discussed on the mailing list.
Make SSL an opt-in, rather than a requirement. I think a PR has been created 
for this.
After #6, any applicable parts of the ActionML fork should be merged. Alex and 
I will create a PR for review and possible cherry-picking.
Support TLS in SDKs and examples ASAP for the opt-in SSL in PIO.
Remove SparkSubmit as a requirement, period. If PIO is to be a framework for ML 
we should support non-MLlib algos. There are already several Templates that do 
not need Spark for train or deploy. Is there a need to couple PIO with Spark at 
all? Examples like TensorFlow, Vowpal Wabbit, Flink, and many other perfectly 
good libs could be used in algos not tied to Spark.
Other IPMC members have proposed during the incubation vote possibly redoing 
this using Apache Beam (and thus handling unified Batch + Streaming). 
Something to consider.
Rethink and simplify the build, train, deploy workflow to fit lambda and kappa. 
For instance it may be that only something like deploy is needed for kappa.
We can go Kappa with something like Flink or Beam. 
Make the PredictionServer multi-tenant. ActionML has code to donate here but 
with restrictions about Spark contexts. More discussion is needed and probably 
a PredictionServer refactoring/rearchitecture is needed.
There is a great deal of unnecessary complexity around multi-tenancy (multiple 
data and model identification). There are appids, instance-ids, app keys, 
channels, port numbers, etc. We should unify multi-tenancy around REST resource 
ids from the EventServer to the PredictionServer.
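A minimal sketch of what such a unified resource id might look like; the names and path layout are hypothetical, not an existing PIO convention:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResourceId:
    """Hypothetical unified id: one path names a tenant's data and
    model from EventServer through PredictionServer, replacing the
    mix of appids, instance-ids, app keys, channels, and ports."""
    tenant: str
    engine: str

    def events_path(self) -> str:
        # Where the EventServer accepts input for this tenant/engine.
        return f"/tenants/{self.tenant}/engines/{self.engine}/events"

    def queries_path(self) -> str:
        # Where the PredictionServer answers queries for the same pair.
        return f"/tenants/{self.tenant}/engines/{self.engine}/queries"
```

The point is that one id scheme flows through the whole pipeline, so nothing else (keys, channels, ports) is needed to disambiguate tenants.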
Add authentication with SSL
Installation is the single biggest challenge for users. We should consider 
refactoring/re-architecting into a microservice architecture with associated 
containerized deployment using container orchestration tools. Examples are 
Docker + Swarm (containers) and Karamel, Chef, Ansible, or Docker Machine 
(orchestration). Clearly there are other candidates. The importance of this is 
hard to overstate IMHO.
PIO is far too monolithic. Decouple Input from Output from Train/model updates. 
A large deployment of PIO for a lambda-type template involves input, training, 
and output, which are potentially running on different machines. Input through 
the EventServer is separated out but not documented or containerized for 
independent deployment. Train and Predict are two methods of the same class and 
so are intimately coupled, even though deployment or workflow execution may 
occur on separate machines/clusters.
Go MicroServices !!!
Fully support both lambda and kappa type algorithms. The need for this is 
illustrated by the rise of streaming algorithms that use online learning. To 
simplify this discussion, two examples would help:
A recommender, which is usually lambda and for which PredictionIO has several 
examples.
A multi-armed bandit, which can easily be kappa and is fairly simple to 
imagine. Another kappa-style algorithm is a streaming online anomaly detector 
for event patterns.
Adding kappa-style algos, allowing non-Spark algos, and refactoring Templates 
will allow the project to containerize algos so they come with their own 
deployment pattern, namely orchestration and composition of microservices. Some 
algos may share some (micro)services and use unique ones elsewhere but the 
orchestration and composition pattern can account for this.
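To make the kappa case concrete, here is a toy epsilon-greedy multi-armed bandit as an online learner: the model updates in place on every event, so something like deploy is the only workflow step needed. This is illustrative only; the policy and parameters are not from any PIO template:

```python
import random

class EpsilonGreedyBandit:
    """Sketch of a kappa-style online learner. There is no batch
    'train' phase: update() folds each event into the model."""
    def __init__(self, arms, epsilon=0.1, seed=7):
        self.counts = {a: 0 for a in arms}
        self.values = {a: 0.0 for a in arms}
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def choose(self):
        # Explore with probability epsilon, otherwise exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.values))
        return max(self.values, key=self.values.get)

    def update(self, arm, reward):
        # Incremental mean: the streaming analogue of batch training.
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n
```

Forcing something like this into a build/train/deploy lambda workflow, as we had to do, is exactly the mismatch described above.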
Refactor the Template API to make Algo.train and Algo.predict separate, since 
their needs may be very different. In one current PIO example Algo.train is the 
only part of the workflow that needs Spark. Deploy does not, and in fact may 
output CSV or JSON instead of requiring a PredictionServer. If Algo.train needs 
Spark and the Template is based on containers, new machines can be spun up to 
do the train and taken down after, leading to large cloud compute savings for 
users. There are endless examples of how reworking PIO to be based on 
microservices and containers would enable many new applications and ease 
deployment.
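A sketch of that train/predict split, with a toy popularity model standing in for a real algorithm; the interface names are hypothetical, not the current Template API:

```python
from abc import ABC, abstractmethod

class Trainer(ABC):
    """Hypothetical half of today's single Algorithm class: train may
    need Spark and big machines, so it can run and be torn down on
    its own cluster."""
    @abstractmethod
    def train(self, events) -> dict:
        """Returns a serializable model, e.g. written to shared
        storage for a separately deployed Predictor to load."""

class Predictor(ABC):
    """The other half: serves queries and needs no compute engine."""
    @abstractmethod
    def predict(self, model: dict, query): ...

class PopularityTrainer(Trainer):
    """Toy example: the 'model' is just item frequencies."""
    def train(self, events) -> dict:
        model = {}
        for item in events:
            model[item] = model.get(item, 0) + 1
        return model

class PopularityPredictor(Predictor):
    def predict(self, model: dict, query: int):
        # Return the top-N most frequent items.
        ranked = sorted(model, key=model.get, reverse=True)
        return ranked[:query]
```

Because the two halves share only a serialized model, the training side can be a transient containerized service while the predict side stays up.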
I’m interested to see whether the current committers have the energy around 
this project to think this broadly; I have no wish at all to start a bunch of 
debates. PIO is the best project of its kind available today, but it will 
evolve. The question is: how much?

