Crunch has Spark pipelines, but I'm not sure about the runner abstraction.
Maybe Josh Wills or Tom White can provide more insight on this topic.
They are core devs for both projects :)
On Fri, Jan 22, 2016 at 9:47 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
Hi,
I don't know Crunch deeply but, AFAIK, Crunch creates MapReduce pipelines;
it doesn't provide a runner abstraction. It's based on FlumeJava.
The logic is very similar (with DoFns, pipelines, ...). Correct me if I'm
wrong, but Crunch started after Google Dataflow, especially because
Dataflow was not open-sourced at that time.
So, I agree it's very similar/close.
Regards
JB
On 01/22/2016 05:51 PM, Ashish wrote:
Hi JB,
Curious to know how it compares to Apache Crunch? The constructs look very
familiar (I used Crunch long ago).
Thoughts?
- Ashish
On Fri, Jan 22, 2016 at 6:33 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
Hi Seshu,
I blogged about the Apache Dataflow proposal:
http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/
You can see in the "what's next?" section that new runners, skins, and
sources are on our roadmap. A Storm runner could definitely be part of this.
Regards
JB
On 01/22/2016 03:31 PM, Adunuthula, Seshu wrote:
Awesome to see Cloud Dataflow coming to Apache. The stream processing area
has generally been fragmented across a variety of solutions; I'm hoping the
community galvanizes around Apache Dataflow.
We are still in the "Apache Storm" world. Any chance of folks building a
"Storm Runner"?
On 1/20/16, 9:39 AM, "James Malone" <jamesmal...@google.com.INVALID> wrote:
Great proposal. I like that your proposal includes a well presented
roadmap, but I don't see any goals that directly address building a larger
community. Y'all have any ideas around outreach that will help with adoption?
Thank you and fair point. We have a few additional ideas which we can put
into the Community section.
As a start, I recommend y'all add a section to the proposal on the wiki
page for "Additional Interested Contributors" so that folks who want to
sign up to participate in the project can do so without requesting
additions to the initial committer list.
This is a great idea and I think it makes a lot of sense to add an
"Additional Interested Contributors" section to the proposal.
On Wed, Jan 20, 2016 at 10:32 AM, James Malone <jamesmal...@google.com.invalid> wrote:
Hello everyone,
Attached to this message is a proposed new project - Apache Dataflow, a
unified programming model for data processing and integration.
The text of the proposal is included below. Additionally, the proposal is
in draft form on the wiki where we will make any required changes:
https://wiki.apache.org/incubator/DataflowProposal
We look forward to your feedback and input.
Best,
James
----
= Apache Dataflow =
== Abstract ==
Dataflow is an open source, unified model and set of language-specific SDKs
for defining and executing data processing workflows, and also data
ingestion and integration flows, supporting Enterprise Integration Patterns
(EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
the mechanics of large-scale batch and streaming data processing and can
run on a number of runtimes like Apache Flink, Apache Spark, and Google
Cloud Dataflow (a cloud service). Dataflow also brings DSLs in different
languages, allowing users to easily implement their data integration
processes.
== Proposal ==
Dataflow is a simple, flexible, and powerful system for distributed data
processing at any scale. Dataflow provides a unified programming model, a
software development kit to define and construct data processing pipelines,
and runners to execute Dataflow pipelines in several runtime engines, like
Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used
for a variety of streaming or batch data processing goals including ETL,
stream analysis, and aggregate computation. The underlying programming
model for Dataflow provides MapReduce-like parallelism, combined with
support for powerful data windowing, and fine-grained correctness control.
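As a rough illustration of what "data windowing" means in this context, the sketch below assigns timestamped elements to fixed, non-overlapping windows and sums the values per key and window. It is plain Python rather than the Dataflow SDK; the function name `fixed_window_sums`, the event tuples, and the 60-second window size are all assumptions made up for the example.

```python
from collections import defaultdict


def fixed_window_sums(events, window_size):
    """Sum values per (key, window_start) for fixed, non-overlapping windows.

    events: iterable of (key, timestamp_seconds, value) tuples.
    window_size: window length in seconds (an assumption for this sketch).
    """
    sums = defaultdict(int)
    for key, timestamp, value in events:
        # Every timestamp falls into exactly one window, identified by its start.
        window_start = timestamp - (timestamp % window_size)
        sums[(key, window_start)] += value
    return dict(sums)


# Hypothetical input: key "a" has two events in the first minute and one in the second.
events = [("a", 5, 1), ("a", 59, 2), ("a", 61, 4), ("b", 10, 7)]
result = fixed_window_sums(events, 60)
```

The same grouping logic applies whether the events arrive as a finite batch or as an unbounded stream, which is the sense in which windowing helps unify the two modes.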
== Background ==
Dataflow started as a set of Google projects focused on making data
processing easier, faster, and less costly. The Dataflow model is a
successor to MapReduce, FlumeJava, and MillWheel inside Google and is
focused on providing a unified solution for batch and stream processing.
These projects on which Dataflow is based have been published in several
papers made available to the public:
* MapReduce - http://research.google.com/archive/mapreduce.html
* Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
* FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
* MillWheel - http://research.google.com/pubs/pub41378.html
Dataflow was designed from the start to provide a portable programming
layer. When you define a data processing pipeline with the Dataflow model,
you are creating a job which is capable of being processed by any number of
Dataflow processing engines. Several engines have been developed to run
Dataflow pipelines in other open source runtimes, including Dataflow
runners for Apache Flink and Apache Spark. There is also a "direct runner",
for execution on the developer machine (mainly for dev/debug purposes).
Another runner allows a Dataflow program to run on a managed service,
Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is
already available on GitHub, and independent from the Google Cloud Dataflow
service. A Python SDK is also currently in active development.
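The portability idea can be sketched in a few lines: the pipeline is only a description, and each runner decides how to execute it. The toy below is plain Python and does not use the Dataflow SDK; the `Pipeline`, `DirectRunner`, and `ThreadPoolRunner` classes are hypothetical names for this illustration, and the thread pool merely stands in for a distributed engine.

```python
from concurrent.futures import ThreadPoolExecutor


class Pipeline:
    """A runner-independent description: an ordered list of element-wise functions."""

    def __init__(self):
        self.steps = []

    def apply(self, fn):
        self.steps.append(fn)
        return self  # allow chaining


class DirectRunner:
    """Runs every step in-process, one element at a time."""

    def run(self, pipeline, data):
        for fn in pipeline.steps:
            data = [fn(x) for x in data]
        return data


class ThreadPoolRunner:
    """Runs each step across a thread pool; a stand-in for a distributed engine."""

    def __init__(self, workers=4):
        self.workers = workers

    def run(self, pipeline, data):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            for fn in pipeline.steps:
                data = list(pool.map(fn, data))  # map preserves input order
        return data


# The same pipeline definition is handed to two different runners.
pipeline = Pipeline().apply(lambda x: x + 1).apply(lambda x: x * 2)
direct = DirectRunner().run(pipeline, [1, 2, 3])
pooled = ThreadPoolRunner().run(pipeline, [1, 2, 3])
```

Both runners return the same result for the same pipeline; only the execution strategy differs.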
In this proposal, the Dataflow SDKs, model, and a set of runners will be
submitted as an OSS project under the ASF. The runners which are a part of
this proposal include those for Spark (from Cloudera), Flink (from data
Artisans), and local development (from Google); the Google Cloud Dataflow
service runner is not included in this proposal. Further references to
Dataflow will refer to the Dataflow model, SDKs, and runners which are a
part of this proposal (Apache Dataflow) only. The initial submission will
contain the already-released Java SDK; Google intends to submit the Python
SDK later in the incubation process. The Google Cloud Dataflow service will
continue to be one of many runners for Dataflow, built on Google Cloud
Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will
develop against the Apache project additions, updates, and changes. Google
Cloud Dataflow will become one user of Apache Dataflow and will participate
in the project openly and publicly.
The Dataflow programming model has been designed with simplicity,
scalability, and speed as key tenets. In the Dataflow model, you only need
to think about four top-level concepts when constructing your data
processing job:
* Pipelines - The data processing job made of a series of computations
including input, processing, and output
* PCollections - Bounded (or unbounded) datasets which represent the input,
intermediate and output data in pipelines
* PTransforms - A data processing step in a pipeline in which one or more
PCollections are an input and output
* I/O Sources and Sinks - APIs for reading and writing data which are the
roots and endpoints of the pipeline
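To make the four concepts concrete, here is a minimal word-count sketch in plain Python. It only mirrors the shape of the model and is not the Dataflow SDK API: lists stand in for (bounded) PCollections, ordinary functions stand in for PTransforms, and the helper names `read_lines`, `split_words`, `count_words`, and `write_to` are hypothetical.

```python
from collections import Counter


def read_lines(text):
    """I/O source: the root of the pipeline."""
    return text.splitlines()


def split_words(lines):
    """Transform: a collection of lines in, a collection of words out."""
    return [word.lower() for line in lines for word in line.split()]


def count_words(words):
    """Transform: aggregate the words into sorted (word, count) pairs."""
    return sorted(Counter(words).items())


def write_to(sink, records):
    """I/O sink: the endpoint of the pipeline."""
    sink.extend(f"{word}: {count}" for word, count in records)


def run_pipeline(text, sink):
    """The pipeline: input, processing, and output chained together."""
    lines = read_lines(text)      # input collection
    words = split_words(lines)    # intermediate collection
    counts = count_words(words)   # output collection
    write_to(sink, counts)


output = []
run_pipeline("to be or\nnot to be", output)
```

After the run, `output` holds one formatted line per distinct word, in alphabetical order.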
== Rationale ==
With Dataflow, Google intended to develop a framework which allowed
developers to be maximally productive in defining the processing, and then
be able to execute the program at various levels of
latency/cost/completeness without re-architecting or re-writing it. This
goal was informed by Google's past experience developing several models,
frameworks, and tools useful for large-scale and distributed data
processing. While Google has previously published papers describing some of
its technologies, Google decided to take a different approach with
Dataflow. Google open-sourced the SDK and model alongside commercialization
of the idea and ahead of publishing papers on the topic. As a result, a
number of open source runtimes exist for Dataflow, such as the Apache Flink
and Apache Spark runners.
We believe that submitting Dataflow as an Apache project will provide an
immediate, worthwhile, and substantial contribution to the open source
community. As an incubating project, we believe Dataflow will have a better
opportunity to provide a meaningful contribution to OSS and also integrate
with other Apache projects.
In the long term, we believe Dataflow can be a powerful abstraction layer
for data processing. By providing an abstraction layer for data pipelines
and processing, data workflows can be increasingly portable, resilient to
breaking changes in tooling, and compatible across many execution engines,
runtimes, and open source projects.
== Initial Goals ==
We are breaking our initial goals into immediate (< 2 months), short-term
(2-4 months), and intermediate-term (> 4 months).
Our immediate goals include the following:
* Plan for reconciling the Dataflow Java SDK and various runners into one
project
* Plan for refactoring the existing Java SDK for better extensibility by
SDK and runner writers
* Validating all dependencies are ASL 2.0 or compatible
* Understanding and adapting to the Apache development process
Our short-term goals include:
* Moving the newly-merged lists and build utilities to Apache
* Start refactoring codebase and move code to Apache Git repo
* Continue development of new features, functions, and fixes in the
Dataflow Java SDK and Dataflow runners
* Cleaning up the Dataflow SDK sources and crafting a roadmap and plan for
how to include new major ideas, modules, and runtimes
* Establishment of easy and clear build/test framework for Dataflow and
associated runtimes; creation of testing, rollback, and validation policy
* Analysis and design for work needed to make Dataflow a better data
processing abstraction layer for multiple open source frameworks and
environments
Finally, we have a number of intermediate-term goals:
* Roadmapping, planning, and execution of integrations with other OSS and
non-OSS projects/products
* Inclusion of additional SDK for Python, which is under active development
== Current Status ==
=== Meritocracy ===
Dataflow was initially developed based on ideas from many employees within
Google. As an ASL OSS project on GitHub, the Dataflow SDK has received
contributions from data Artisans, Cloudera Labs, and other individual
developers. As a project under incubation, we are committed to expanding
our effort to build an environment which supports a meritocracy. We are
focused on engaging the community and other related projects for support
and contributions. Moreover, we are committed to ensuring that contributors
and committers to Dataflow come from a broad mix of organizations through a
merit-based decision process during incubation. We believe strongly in the
Dataflow model and are committed to growing an inclusive community of
Dataflow contributors.
=== Community ===
The core of the Dataflow Java SDK has been developed by Google for use with
Google Cloud Dataflow. Google has active community engagement in the SDK
GitHub repository (https://github.com/GoogleCloudPlatform/DataflowJavaSDK),
on Stack Overflow
(http://stackoverflow.com/questions/tagged/google-cloud-dataflow) and has
had contributions from a number of organizations and individuals.
Every day, Cloud Dataflow is actively used by a number of organizations and
institutions for batch and stream processing of data. We believe acceptance
will allow us to consolidate existing Dataflow-related work, grow the
Dataflow community, and deepen connections between Dataflow and other open
source projects.
=== Core Developers ===
The core developers for Dataflow and the Dataflow runners are:
* Frances Perry
* Tyler Akidau
* Davor Bonaci
* Luke Cwik
* Ben Chambers
* Kenn Knowles
* Dan Halperin
* Daniel Mills
* Mark Shields
* Craig Chambers
* Maximilian Michels
* Tom White
* Josh Wills
=== Alignment ===
The Dataflow SDK can be used to create Dataflow pipelines which can be
executed on Apache Spark or Apache Flink. Dataflow is also related to other
Apache projects, such as Apache Crunch. We plan on expanding functionality
for Dataflow runners, support for additional domain-specific languages, and
increased portability so Dataflow is a powerful abstraction layer for data
processing.
== Known Risks ==
=== Orphaned Products ===
The Dataflow SDK is presently used by several organizations, from small
startups to Fortune 100 companies, to construct production pipelines which
are executed in Google Cloud Dataflow. Google has a long-term commitment to
advance the Dataflow SDK; moreover, Dataflow is seeing increasing interest,
development, and adoption from organizations outside of Google.
=== Inexperience with Open Source ===
Google believes strongly in open source and the exchange of information to
advance new ideas and work. Examples of this commitment are active OSS
projects such as Chromium (https://www.chromium.org) and Kubernetes
(http://kubernetes.io/). With Dataflow, we have tried to be increasingly
open and forward-looking; we have published a paper in the VLDB conference
describing the Dataflow model
(http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf) and were quick to release
the Dataflow SDK as open source software with the launch of Cloud Dataflow.
Our submission to the Apache Software Foundation is a logical extension of
our commitment to open source software.
=== Homogeneous Developers ===
The majority of committers in this proposal belong to Google due to the
fact that Dataflow has emerged from several internal Google projects. This
proposal also includes committers outside of Google who are actively
involved with other Apache projects, such as Hadoop, Flink, and Spark. We
expect our entry into incubation will allow us to expand the number of
individuals and organizations participating in Dataflow development.
Additionally, separation of the Dataflow SDK from Google Cloud Dataflow
allows us to focus on the open source SDK and model and do what is best for
this project.
=== Reliance on Salaried Developers ===
The Dataflow SDK and Dataflow runners have been developed primarily by
salaried developers supporting the Google Cloud Dataflow project. While the
Dataflow SDK and Cloud Dataflow have been developed by different teams (and
this proposal would reinforce that separation) we expect our initial set of
developers will still primarily be salaried. Contribution has not been
exclusively from salaried developers, however. For example, the contrib
directory of the Dataflow SDK
(https://github.com/GoogleCloudPlatform/DataflowJavaSDK/tree/master/contrib)
contains items from free-time contributors. Moreover, separate projects,
such as ScalaFlow (https://github.com/darkjh/scalaflow) have been created
around the Dataflow model and SDK. We expect our reliance on salaried
developers will decrease over time during incubation.
=== Relationship with other Apache products ===
Dataflow directly interoperates with or utilizes several existing Apache
projects.
* Build
** Apache Maven
* Data I/O, Libraries
** Apache Avro
** Apache Commons
* Dataflow runners
** Apache Flink
** Apache Spark
Dataflow, when used in batch mode, shares similarities with Apache Crunch;
however, Dataflow is focused on a model, SDK, and abstraction layer beyond
Spark and Hadoop (MapReduce). One key goal of Dataflow is to provide an
intermediate abstraction layer which can easily be implemented and utilized
across several different processing frameworks.
=== An excessive fascination with the Apache brand ===
With this proposal we are not seeking attention or publicity. Rather, we
firmly believe in the Dataflow model, SDK, and the ability to make Dataflow
a powerful yet simple framework for data processing. While the Dataflow SDK
and model have been open source, we believe putting code on GitHub can only
go so far. We see the Apache community, processes, and mission as critical
for ensuring the Dataflow SDK and model are truly community-driven,
positively impactful, and innovative open source software. While Google has
taken a number of steps to advance its various open source projects, we
believe Dataflow is a great fit for the Apache Software Foundation due to
its focus on data processing and its relationships to existing ASF
projects.
== Documentation ==
The following documentation is relevant to this proposal. Relevant portions
of the documentation will be contributed to the Apache Dataflow project.
* Dataflow website: https://cloud.google.com/dataflow
* Dataflow programming model: https://cloud.google.com/dataflow/model/programming-model
* Codebases
** Dataflow Java SDK: https://github.com/GoogleCloudPlatform/DataflowJavaSDK
** Flink Dataflow runner: https://github.com/dataArtisans/flink-dataflow
** Spark Dataflow runner: https://github.com/cloudera/spark-dataflow
* Dataflow Java SDK issue tracker: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues
* google-cloud-dataflow tag on Stack Overflow: http://stackoverflow.com/questions/tagged/google-cloud-dataflow
== Initial Source ==
The initial source for Dataflow which we will submit to the Apache
Foundation will include several related projects which are currently hosted
in the following GitHub repositories:
* Dataflow Java SDK (https://github.com/GoogleCloudPlatform/DataflowJavaSDK)
* Flink Dataflow runner (https://github.com/dataArtisans/flink-dataflow)
* Spark Dataflow runner (https://github.com/cloudera/spark-dataflow)
These projects have always been Apache 2.0 licensed. We intend to bundle
all of these repositories since they are all complementary and should be
maintained in one project. Prior to our submission, we will combine all of
these projects into a new git repository.
== Source and Intellectual Property Submission Plan ==
The source for the Dataflow SDK and the three runners (Spark, Flink, Google
Cloud Dataflow) is already licensed under the Apache 2.0 license.
* Dataflow SDK - https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/LICENSE
* Flink runner - https://github.com/dataArtisans/flink-dataflow/blob/master/LICENSE
* Spark runner - https://github.com/cloudera/spark-dataflow/blob/master/LICENSE
Contributors to the Dataflow SDK have also signed the Google Individual
Contributor License Agreement
(https://cla.developers.google.com/about/google-individual) in order to
contribute to the project.
With respect to trademark rights, Google does not hold a trademark on the
phrase "Dataflow." Based on feedback and guidance we receive during the
incubation process, we are open to renaming the project if necessary for
trademark or other concerns.
== External Dependencies ==
All external dependencies are licensed under an Apache 2.0 or
Apache-compatible license. As we grow the Dataflow community we will
configure our build process to require and validate that all contributions
and dependencies are licensed under the Apache 2.0 license or an
Apache-compatible license.
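A build-time license check of this kind could look roughly like the sketch below. It is only an illustration in plain Python: the allowed set is a small subset of licenses commonly treated as Apache-compatible, and the dependency names, the `find_license_violations` helper, and the input format are all hypothetical rather than taken from the actual Dataflow build.

```python
# Licenses treated as Apache-compatible in this sketch (an illustrative subset).
ALLOWED_LICENSES = {"Apache-2.0", "MIT", "BSD-3-Clause"}


def find_license_violations(dependencies):
    """Return the names of dependencies whose license is not in the allowed set.

    dependencies: dict mapping a dependency name to its license identifier.
    """
    return sorted(
        name
        for name, license_id in dependencies.items()
        if license_id not in ALLOWED_LICENSES
    )


# Hypothetical dependency list; not taken from the actual Dataflow build.
deps = {
    "example-avro": "Apache-2.0",
    "example-json": "MIT",
    "example-gpl-lib": "GPL-3.0",
}
violations = find_license_violations(deps)
```

A build would typically fail when `violations` is non-empty, so that an incompatible dependency is caught before it is merged.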
== Required Resources ==
=== Mailing Lists ===
We currently use a mix of mailing lists. We will migrate our existing
mailing lists to the following:
* d...@dataflow.incubator.apache.org
* u...@dataflow.incubator.apache.org
* priv...@dataflow.incubator.apache.org
* comm...@dataflow.incubator.apache.org
=== Source Control ===
The Dataflow team currently uses Git and would like to continue to do so.
We request a Git repository for Dataflow with mirroring to GitHub enabled.
=== Issue Tracking ===
We request the creation of an Apache-hosted JIRA. The Dataflow project is
currently using both a public GitHub issue tracker and internal Google
issue tracking. We will migrate and combine issues from these two sources
into the Apache JIRA.
== Initial Committers ==
* Aljoscha Krettek [aljos...@apache.org]
* Amit Sela [amitsel...@gmail.com]
* Ben Chambers [bchamb...@google.com]
* Craig Chambers [chamb...@google.com]
* Dan Halperin [dhalp...@google.com]
* Davor Bonaci [da...@google.com]
* Frances Perry [f...@google.com]
* James Malone [jamesmal...@google.com]
* Jean-Baptiste Onofré [jbono...@apache.org]
* Josh Wills [jwi...@apache.org]
* Kostas Tzoumas [kos...@data-artisans.com]
* Kenneth Knowles [k...@google.com]
* Luke Cwik [lc...@google.com]
* Maximilian Michels [m...@apache.org]
* Stephan Ewen [step...@data-artisans.com]
* Tom White [t...@cloudera.com]
* Tyler Akidau [taki...@google.com]
== Affiliations ==
The initial committers are from six organizations. Google developed
Dataflow and the Dataflow SDK, data Artisans developed the Flink runner,
and Cloudera (Labs) developed the Spark runner.
* Cloudera
** Tom White
* data Artisans
** Aljoscha Krettek
** Kostas Tzoumas
** Maximilian Michels
** Stephan Ewen
* Google
** Ben Chambers
** Dan Halperin
** Davor Bonaci
** Frances Perry
** James Malone
** Kenneth Knowles
** Luke Cwik
** Tyler Akidau
* PayPal
** Amit Sela
* Slack
** Josh Wills
* Talend
** Jean-Baptiste Onofré
== Sponsors ==
=== Champion ===
* Jean-Baptiste Onofré [jbono...@apache.org]
=== Nominated Mentors ===
* Jim Jagielski [j...@apache.org]
* Venkatesh Seetharam [venkat...@apache.org]
* Bertrand Delacretaz [bdelacre...@apache.org]
* Ted Dunning [tdunn...@apache.org]
=== Sponsoring Entity ===
The Apache Incubator
--
Sean
---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
--
thanks
ashish
Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal