Re: [DISCUSS] Apache Dataflow Incubator Proposal

Jean-Baptiste Onofré Sat, 23 Jan 2016 11:56:49 -0800

Hi Seshu,

It does both: streaming and batch data processing.


Regards
JB

On 01/23/2016 03:01 PM, Adunuthula, Seshu wrote:

I have not had a chance to play with it yet. Within Google, is it used more as
an MR replacement or as a stream processing engine? Or does it do both of them
fantastically well?


On 1/22/16, 10:58 AM, "Frances Perry" <f...@google.com.INVALID> wrote:

Crunch started as a clone of FlumeJava, which was Google internal. In the
meantime inside Google, FlumeJava evolved into Dataflow. So all three share
a number of concepts like PCollections, ParDo, DoFn, etc. However, Dataflow
adds a number of new things -- the biggest being unified batch/streaming
semantics using concepts like Windowing and Triggers. Tyler Akidau's
O'Reilly post has a really nice explanation:
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
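
To make the windowing idea concrete, here is a rough, self-contained sketch in
plain Python. It does not use the Dataflow SDK; the names fixed_window and
count_per_window are made up for illustration, and triggers are not modeled at
all. It only shows elements being grouped by event time into fixed windows, so
the same logic gives the same answer regardless of arrival order.

```python
from collections import defaultdict

def fixed_window(timestamp, size):
    """Return the [start, end) fixed window that contains an event timestamp."""
    start = timestamp - (timestamp % size)
    return (start, start + size)

def count_per_window(events, size):
    """Count events per (window, key), grouping by event time, not arrival order."""
    counts = defaultdict(int)
    for timestamp, key in events:
        counts[(fixed_window(timestamp, size), key)] += 1
    return dict(counts)

# Events as (event_time_seconds, key); note that they arrive out of order.
events = [(5, "a"), (62, "a"), (12, "b"), (59, "a"), (61, "b")]
result = count_per_window(events, size=60)

# The same function accepts a lazily produced stream (here, a generator).
streamed = count_per_window((e for e in events), size=60)
```

Because grouping depends only on each element's own timestamp, a bounded list
and a lazily consumed stream produce identical per-window counts in this toy
version.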

On Fri, Jan 22, 2016 at 10:42 AM, Ashish <paliwalash...@gmail.com> wrote:

Crunch has Spark pipelines, but not sure about the runner abstraction.

Maybe Josh Wills or Tom White can provide more insight on this topic.
They are core devs for both projects :)

On Fri, Jan 22, 2016 at 9:47 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

Hi,

I don't know Crunch deeply, but AFAIK, Crunch creates MapReduce pipelines; it
doesn't provide a runner abstraction. It's based on FlumeJava.

The logic is very similar (with DoFns, pipelines, ...). Correct me if I'm
wrong, but Crunch started after Google Dataflow, especially because Dataflow
was not open-sourced at that time.

So, I agree it's very similar/close.

Regards
JB


On 01/22/2016 05:51 PM, Ashish wrote:


Hi JB,

Curious to know how it compares to Apache Crunch. The constructs
look very familiar (I used Crunch long ago).

Thoughts?

- Ashish

On Fri, Jan 22, 2016 at 6:33 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:


Hi Seshu,

I blogged about the Apache Dataflow proposal:
http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/

You can see in the "what's next ?" section that new runners, skins and
sources are on our roadmap. Definitely, a Storm runner could be part of
this.

Regards
JB


On 01/22/2016 03:31 PM, Adunuthula, Seshu wrote:



Awesome to see Cloud Dataflow coming to Apache. The stream processing area
has in general been fragmented with a variety of solutions; hoping the
community galvanizes around Apache Dataflow.

We are still in the "Apache Storm" world. Any chance of folks building a
"Storm Runner"?


On 1/20/16, 9:39 AM, "James Malone" <jamesmal...@google.com.INVALID> wrote:

Great proposal. I like that your proposal includes a well presented
roadmap, but I don't see any goals that directly address building a larger
community. Y'all have any ideas around outreach that will help with
adoption?

Thank you and fair point. We have a few additional ideas which we can put
into the Community section.

As a start, I recommend y'all add a section to the proposal on the wiki
page for "Additional Interested Contributors" so that folks who want to
sign up to participate in the project can do so without requesting
additions to the initial committer list.

This is a great idea and I think it makes a lot of sense to add an
"Additional Interested Contributors" section to the proposal.

On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
jamesmal...@google.com.invalid> wrote:

Hello everyone,

Attached to this message is a proposed new project - Apache Dataflow, a
unified programming model for data processing and integration.

The text of the proposal is included below. Additionally, the proposal is
in draft form on the wiki where we will make any required changes:

https://wiki.apache.org/incubator/DataflowProposal

We look forward to your feedback and input.

Best,

James

----

= Apache Dataflow =

== Abstract ==

Dataflow is an open source, unified model and set of language-specific SDKs
for defining and executing data processing workflows, and also data
ingestion and integration flows, supporting Enterprise Integration Patterns
(EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
the mechanics of large-scale batch and streaming data processing and can
run on a number of runtimes like Apache Flink, Apache Spark, and Google
Cloud Dataflow (a cloud service). Dataflow also brings DSLs in different
languages, allowing users to easily implement their data integration
processes.

== Proposal ==

Dataflow is a simple, flexible, and powerful system for distributed data
processing at any scale. Dataflow provides a unified programming model, a
software development kit to define and construct data processing pipelines,
and runners to execute Dataflow pipelines in several runtime engines, like
Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used
for a variety of streaming or batch data processing goals including ETL,
stream analysis, and aggregate computation. The underlying programming
model for Dataflow provides MapReduce-like parallelism, combined with
support for powerful data windowing, and fine-grained correctness control.




== Background ==

Dataflow started as a set of Google projects focused on making data
processing easier, faster, and less costly. The Dataflow model is a
successor to MapReduce, FlumeJava, and MillWheel inside Google and is
focused on providing a unified solution for batch and stream processing.
These projects on which Dataflow is based have been published in several
papers made available to the public:

* MapReduce - http://research.google.com/archive/mapreduce.html

* Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

* FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf

* MillWheel - http://research.google.com/pubs/pub41378.html

Dataflow was designed from the start to provide a portable programming
layer. When you define a data processing pipeline with the Dataflow model,
you are creating a job which is capable of being processed by any number of
Dataflow processing engines. Several engines have been developed to run
Dataflow pipelines in other open source runtimes, including Dataflow
runners for Apache Flink and Apache Spark. There is also a "direct runner",
for execution on the developer machine (mainly for dev/debug purposes).
Another runner allows a Dataflow program to run on a managed service,
Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is
already available on GitHub, and independent from the Google Cloud Dataflow
service. A Python SDK is currently in active development.
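
The portability idea above (one pipeline definition, several runners) can be
sketched roughly as follows. This is a toy illustration in plain Python, not
the Dataflow SDK: the Runner interface, the two runner classes, and the
representation of a pipeline as a list of element-wise functions are all
assumptions made for this example.

```python
from abc import ABC, abstractmethod

class Runner(ABC):
    """Hypothetical runner interface: executes a list of element-wise steps."""

    @abstractmethod
    def run(self, steps, data):
        ...

class DirectRunner(Runner):
    """Runs every step in-process over the whole input (dev/debug style)."""

    def run(self, steps, data):
        out = list(data)
        for step in steps:
            out = [step(x) for x in out]
        return out

class PartitionedRunner(Runner):
    """Stand-in for a distributed engine: splits the data into partitions first."""

    def __init__(self, partitions=2):
        self.partitions = partitions

    def run(self, steps, data):
        data = list(data)
        # Round-robin split; a real engine would ship partitions to workers.
        chunks = [data[i::self.partitions] for i in range(self.partitions)]
        results = []
        for chunk in chunks:
            for step in steps:
                chunk = [step(x) for x in chunk]
            results.extend(chunk)
        return results

# The "pipeline" is defined once, independently of any runner.
steps = [lambda x: x + 1, lambda x: x * 2]
data = [1, 2, 3, 4]

direct = DirectRunner().run(steps, data)
partitioned = PartitionedRunner(partitions=2).run(steps, data)
```

Both runners compute the same set of results from the same definition; only
the output order differs in the partitioned case, which is why a comparison
should sort first.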

In this proposal, the Dataflow SDKs, model, and a set of runners will be
submitted as an OSS project under the ASF. The runners which are a part of
this proposal include those for Spark (from Cloudera), Flink (from data
Artisans), and local development (from Google); the Google Cloud Dataflow
service runner is not included in this proposal. Further references to
Dataflow will refer to the Dataflow model, SDKs, and runners which are a
part of this proposal (Apache Dataflow) only. The initial submission will
contain the already-released Java SDK; Google intends to submit the Python
SDK later in the incubation process. The Google Cloud Dataflow service will
continue to be one of many runners for Dataflow, built on Google Cloud
Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will
develop against the Apache project additions, updates, and changes. Google
Cloud Dataflow will become one user of Apache Dataflow and will participate
in the project openly and publicly.

The Dataflow programming model has been designed with simplicity,
scalability, and speed as key tenets. In the Dataflow model, you only need
to think about four top-level concepts when constructing your data
processing job:

* Pipelines - The data processing job made of a series of computations
including input, processing, and output

* PCollections - Bounded (or unbounded) datasets which represent the input,
intermediate and output data in pipelines

* PTransforms - A data processing step in a pipeline in which one or more
PCollections are an input and output

* I/O Sources and Sinks - APIs for reading and writing data which are the
roots and endpoints of the pipeline
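
As a rough illustration of how the four concepts above fit together, here is
a toy word count in plain Python. It is only a sketch and does not use the
Dataflow SDK: the PCollection class and the helpers read_from, par_do,
count_per_element, and write_to are hypothetical names invented for this
example, and only bounded, in-memory data is handled.

```python
class PCollection:
    """Toy stand-in for a bounded dataset flowing through a pipeline."""

    def __init__(self, elements):
        self.elements = list(elements)

    def apply(self, transform):
        # A PTransform takes a PCollection and returns a new PCollection.
        return transform(self)

def read_from(source):
    """I/O source: the root of the pipeline."""
    return PCollection(source)

def par_do(fn):
    """PTransform applying fn to each element; fn returns zero or more outputs."""
    def transform(pcoll):
        return PCollection(out for x in pcoll.elements for out in fn(x))
    return transform

def count_per_element():
    """PTransform that counts occurrences of each distinct element."""
    def transform(pcoll):
        counts = {}
        for x in pcoll.elements:
            counts[x] = counts.get(x, 0) + 1
        return PCollection(sorted(counts.items()))
    return transform

def write_to(sink):
    """I/O sink: the endpoint of the pipeline."""
    def transform(pcoll):
        sink.extend(pcoll.elements)
        return pcoll
    return transform

# The pipeline: source -> split lines into words -> count words -> sink.
lines = ["to be or", "not to be"]
output = []
(read_from(lines)
    .apply(par_do(str.split))
    .apply(count_per_element())
    .apply(write_to(output)))
```

After the pipeline runs, output holds (word, count) pairs sorted by word. The
chain of apply calls plays the role of the pipeline, each intermediate value
is a PCollection, and every step between the source and the sink is a
PTransform.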

== Rationale ==

With Dataflow, Google intended to develop a framework which allowed
developers to be maximally productive in defining the processing, and then
be able to execute the program at various levels of
latency/cost/completeness without re-architecting or re-writing it. This
goal was informed by Google's past experience developing several models,
frameworks, and tools useful for large-scale and distributed data
processing. While Google has previously published papers describing some of
its technologies, Google decided to take a different approach with
Dataflow. Google open-sourced the SDK and model alongside commercialization
of the idea and ahead of publishing papers on the topic. As a result, a
number of open source runtimes exist for Dataflow, such as the Apache Flink
and Apache Spark runners.

We believe that submitting Dataflow as an Apache project will provide an
immediate, worthwhile, and substantial contribution to the open source
community. As an incubating project, we believe Dataflow will have a better
opportunity to provide a meaningful contribution to OSS and also integrate
with other Apache projects.

In the long term, we believe Dataflow can be a powerful abstraction layer
for data processing. By providing an abstraction layer for data pipelines
and processing, data workflows can be increasingly portable, resilient to
breaking changes in tooling, and compatible across many execution engines,
runtimes, and open source projects.

== Initial Goals ==

We are breaking our initial goals into immediate (< 2 months), short-term
(2-4 months), and intermediate-term (> 4 months).

Our immediate goals include the following:

* Plan for reconciling the Dataflow Java SDK and various runners into one
project

* Plan for refactoring the existing Java SDK for better extensibility by
SDK and runner writers

* Validating all dependencies are ASL 2.0 or compatible

* Understanding and adapting to the Apache development process

Our short-term goals include:

* Moving the newly-merged lists, and build utilities to Apache

* Start refactoring codebase and move code to Apache Git repo

* Continue development of new features, functions, and fixes in the
Dataflow Java SDK, and Dataflow runners

* Cleaning up the Dataflow SDK sources and crafting a roadmap and plan for
how to include new major ideas, modules, and runtimes

* Establishment of easy and clear build/test framework for Dataflow and
associated runtimes; creation of testing, rollback, and validation policy

* Analysis and design for work needed to make Dataflow a better data
processing abstraction layer for multiple open source frameworks and
environments

Finally, we have a number of intermediate-term goals:

* Roadmapping, planning, and execution of integrations with other OSS and
non-OSS projects/products

* Inclusion of additional SDK for Python, which is under active development




== Current Status ==

=== Meritocracy ===

Dataflow was initially developed based on ideas from many employees within
Google. As an ASL OSS project on GitHub, the Dataflow SDK has received
contributions from data Artisans, Cloudera Labs, and other individual
developers. As a project under incubation, we are committed to expanding
our effort to build an environment which supports a meritocracy. We are
focused on engaging the community and other related projects for support
and contributions. Moreover, we are committed to ensuring contributors and
committers to Dataflow come from a broad mix of organizations through a
merit-based decision process during incubation. We believe strongly in the
Dataflow model and are committed to growing an inclusive community of
Dataflow contributors.

=== Community ===

The core of the Dataflow Java SDK has been developed by Google for use with
Google Cloud Dataflow. Google has active community engagement in the SDK
GitHub repository (https://github.com/GoogleCloudPlatform/DataflowJavaSDK),
on Stack Overflow
(http://stackoverflow.com/questions/tagged/google-cloud-dataflow) and has
had contributions from a number of organizations and individuals.

Every day, Cloud Dataflow is actively used by a number of organizations and
institutions for batch and stream processing of data. We believe acceptance
will allow us to consolidate existing Dataflow-related work, grow the
Dataflow community, and deepen connections between Dataflow and other open
source projects.

=== Core Developers ===

The core developers for Dataflow and the Dataflow runners are:

* Frances Perry

* Tyler Akidau

* Davor Bonaci

* Luke Cwik

* Ben Chambers

* Kenn Knowles

* Dan Halperin

* Daniel Mills

* Mark Shields

* Craig Chambers

* Maximilian Michels

* Tom White

* Josh Wills

=== Alignment ===

The Dataflow SDK can be used to create Dataflow pipelines which can be
executed on Apache Spark or Apache Flink. Dataflow is also related to other
Apache projects, such as Apache Crunch. We plan on expanding functionality
for Dataflow runners, support for additional domain specific languages, and
increased portability so Dataflow is a powerful abstraction layer for data
processing.

== Known Risks ==

=== Orphaned Products ===

The Dataflow SDK is presently used by several organizations, from small
startups to Fortune 100 companies, to construct production pipelines which
are executed in Google Cloud Dataflow. Google has a long-term commitment to
advance the Dataflow SDK; moreover, Dataflow is seeing increasing interest,
development, and adoption from organizations outside of Google.

=== Inexperience with Open Source ===

Google believes strongly in open source and the exchange of information to
advance new ideas and work. Examples of this commitment are active OSS
projects such as Chromium (https://www.chromium.org) and Kubernetes
(http://kubernetes.io/). With Dataflow, we have tried to be increasingly
open and forward-looking; we have published a paper in the VLDB conference
describing the Dataflow model
(http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf) and were quick to release
the Dataflow SDK as open source software with the launch of Cloud Dataflow.
Our submission to the Apache Software Foundation is a logical extension of
our commitment to open source software.

=== Homogeneous Developers ===

The majority of committers in this proposal belong to Google due to the
fact that Dataflow has emerged from several internal Google projects. This
proposal also includes committers outside of Google who are actively
involved with other Apache projects, such as Hadoop, Flink, and Spark. We
expect our entry into incubation will allow us to expand the number of
individuals and organizations participating in Dataflow development.
Additionally, separation of the Dataflow SDK from Google Cloud Dataflow
allows us to focus on the open source SDK and model and do what is best for
this project.

=== Reliance on Salaried Developers ===

The Dataflow SDK and Dataflow runners have been developed primarily by
salaried developers supporting the Google Cloud Dataflow project. While the
Dataflow SDK and Cloud Dataflow have been developed by different teams (and
this proposal would reinforce that separation) we expect our initial set of
developers will still primarily be salaried. Contribution has not been
exclusively from salaried developers, however. For example, the contrib
directory of the Dataflow SDK (
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/tree/master/contri
) contains items from free-time contributors. Moreover, separate projects,
such as ScalaFlow (https://github.com/darkjh/scalaflow) have been created
around the Dataflow model and SDK. We expect our reliance on salaried
developers will decrease over time during incubation.

=== Relationship with other Apache products ===

Dataflow directly interoperates with or utilizes several existing Apache
projects.

* Build

** Apache Maven

* Data I/O, Libraries

** Apache Avro

** Apache Commons

* Dataflow runners

** Apache Flink

** Apache Spark

Dataflow when used in batch mode shares similarities with Apache Crunch;
however, Dataflow is focused on a model, SDK, and abstraction layer beyond
Spark and Hadoop (MapReduce). One key goal of Dataflow is to provide an
intermediate abstraction layer which can easily be implemented and utilized
across several different processing frameworks.

=== An excessive fascination with the Apache brand ===

With this proposal we are not seeking attention or publicity. Rather, we
firmly believe in the Dataflow model, SDK, and the ability to make Dataflow
a powerful yet simple framework for data processing. While the Dataflow SDK
and model have been open source, we believe putting code on GitHub can only
go so far. We see the Apache community, processes, and mission as critical
for ensuring the Dataflow SDK and model are truly community-driven,
positively impactful, and innovative open source software. While Google has
taken a number of steps to advance its various open source projects, we
believe Dataflow is a great fit for the Apache Software Foundation due to
its focus on data processing and its relationships to existing ASF
projects.

== Documentation ==

The following documentation is relevant to this proposal. Relevant portions
of the documentation will be contributed to the Apache Dataflow project.




* Dataflow website: https://cloud.google.com/dataflow

* Dataflow programming model:
https://cloud.google.com/dataflow/model/programming-model

* Codebases

** Dataflow Java SDK:
https://github.com/GoogleCloudPlatform/DataflowJavaSDK

** Flink Dataflow runner:
https://github.com/dataArtisans/flink-dataflow

** Spark Dataflow runner:
https://github.com/cloudera/spark-dataflow


* Dataflow Java SDK issue tracker:
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues

* google-cloud-dataflow tag on Stack Overflow:
http://stackoverflow.com/questions/tagged/google-cloud-dataflow

== Initial Source ==

The initial source for Dataflow which we will submit to the Apache
Foundation will include several related projects which are currently hosted
in the following GitHub repositories:

* Dataflow Java SDK (
https://github.com/GoogleCloudPlatform/DataflowJavaSDK)

* Flink Dataflow runner (
https://github.com/dataArtisans/flink-dataflow)

* Spark Dataflow runner (
https://github.com/cloudera/spark-dataflow)

These projects have always been Apache 2.0 licensed. We intend to bundle
all of these repositories since they are all complementary and should be
maintained in one project. Prior to our submission, we will combine all of
these projects into a new git repository.

== Source and Intellectual Property Submission Plan ==

The source for the Dataflow SDK and the three runners (Spark, Flink, Google
Cloud Dataflow) is already licensed under an Apache 2 license.

* Dataflow SDK -
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/LICENS

* Flink runner -
https://github.com/dataArtisans/flink-dataflow/blob/master/LICENSE

* Spark runner -
https://github.com/cloudera/spark-dataflow/blob/master/LICENSE

Contributors to the Dataflow SDK have also signed the Google Individual
Contributor License Agreement (
https://cla.developers.google.com/about/google-individual) in order to
contribute to the project.

With respect to trademark rights, Google does not hold a trademark on the
phrase "Dataflow." Based on feedback and guidance we receive during the
incubation process, we are open to renaming the project if necessary for
trademark or other concerns.

== External Dependencies ==

All external dependencies are licensed under an Apache 2.0 or
Apache-compatible license. As we grow the Dataflow community we will
configure our build process to require and validate that all contributions
and dependencies are licensed under the Apache 2.0 license or are under an
Apache-compatible license.

== Required Resources ==

=== Mailing Lists ===

We currently use a mix of mailing lists. We will migrate our existing
mailing lists to the following:

* d...@dataflow.incubator.apache.org

* u...@dataflow.incubator.apache.org

* priv...@dataflow.incubator.apache.org

* comm...@dataflow.incubator.apache.org

=== Source Control ===

The Dataflow team currently uses Git and would like to continue to do so.
We request a Git repository for Dataflow with mirroring to GitHub enabled.




=== Issue Tracking ===

We request the creation of an Apache-hosted JIRA. The Dataflow project is
currently using both a public GitHub issue tracker and internal Google
issue tracking. We will migrate and combine from these two sources to the
Apache JIRA.

== Initial Committers ==

* Aljoscha Krettek     [aljos...@apache.org]

* Amit Sela            [amitsel...@gmail.com]

* Ben Chambers         [bchamb...@google.com]

* Craig Chambers       [chamb...@google.com]

* Dan Halperin         [dhalp...@google.com]

* Davor Bonaci         [da...@google.com]

* Frances Perry        [f...@google.com]

* James Malone         [jamesmal...@google.com]

* Jean-Baptiste Onofré [jbono...@apache.org]

* Josh Wills           [jwi...@apache.org]

* Kostas Tzoumas       [kos...@data-artisans.com]

* Kenneth Knowles      [k...@google.com]

* Luke Cwik            [lc...@google.com]

* Maximilian Michels   [m...@apache.org]

* Stephan Ewen         [step...@data-artisans.com]

* Tom White            [t...@cloudera.com]

* Tyler Akidau         [taki...@google.com]

== Affiliations ==

The initial committers are from six organizations. Google developed
Dataflow and the Dataflow SDK, data Artisans developed the Flink runner,
and Cloudera (Labs) developed the Spark runner.

* Cloudera

** Tom White

* Data Artisans

** Aljoscha Krettek

** Kostas Tzoumas

** Maximilian Michels

** Stephan Ewen

* Google

** Ben Chambers

** Dan Halperin

** Davor Bonaci

** Frances Perry

** James Malone

** Kenneth Knowles

** Luke Cwik

** Tyler Akidau

* PayPal

** Amit Sela

* Slack

** Josh Wills

* Talend

** Jean-Baptiste Onofré

== Sponsors ==

=== Champion ===

* Jean-Baptiste Onofré      [jbono...@apache.org]

=== Nominated Mentors ===

* Jim Jagielski           [j...@apache.org]

* Venkatesh Seetharam     [venkat...@apache.org]

* Bertrand Delacretaz     [bdelacre...@apache.org]

* Ted Dunning             [tdunn...@apache.org]

=== Sponsoring Entity ===

The Apache Incubator




--
Sean

---------------------------------------------------------------------

To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal


