Re: [PROPOSAL] Apache Spark for the Incubator

Reynold Xin Fri, 31 May 2013 13:10:42 -0700

Spark is an execution framework, but it also provides some high-level
APIs which make it much easier to do data analytics.


For example, to run grep-like queries:

val docs = sparkContext.textFile("hdfs://...")
docs.filter(doc => doc.contains("Berkeley")).count

Another example, word count (using the Scala API):

val docs = sparkContext.textFile("hdfs://...")
val counts = docs.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

The high-level APIs are similar to many of the relational operators,
including aggregations, group-bys, joins, etc.
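
To make the relational analogy concrete, here is a minimal sketch of an
aggregation and a join. It is only an illustration: it uses plain Scala
collections in place of Spark datasets so that it runs without a cluster,
and the names (visits, teams, visitCounts, joined) and the sample data are
made up for this example. On Spark's key-value datasets the corresponding
operations would be ones such as reduceByKey, groupByKey and join.

```scala
// Plain Scala collections stand in for Spark datasets here (an assumption,
// so the sketch runs locally); all names and data are hypothetical.
val visits = Seq(("alice", "/home"), ("bob", "/docs"), ("alice", "/docs"))
val teams  = Seq(("alice", "team-a"), ("bob", "team-b"))

// Aggregation: count visits per user,
// like SELECT user, COUNT(*) FROM visits GROUP BY user.
val visitCounts: Map[String, Int] =
  visits.groupBy { case (user, _) => user }
        .map { case (user, rows) => (user, rows.size) }

// Join: pair each visit with the user's team on the shared key,
// like SELECT ... FROM visits JOIN teams USING (user).
val joined: Seq[(String, (String, String))] =
  for {
    (user, page)  <- visits
    (user2, team) <- teams
    if user == user2
  } yield (user, (page, team))
```

The point of the sketch is the shape of the operations, not performance:
the nested loop join above is fine for a handful of rows, whereas Spark
distributes the same kind of key-based operations across a cluster.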

Shark uses Spark as the execution engine but provides a Hive-compatible SQL
interface. This proposal, however, is only about moving Spark to the ASF
Incubator, not Shark.

--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org


On Fri, May 31, 2013 at 1:03 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:

> I believe it is more of a framework, but you can take a look at Shark,
> which uses Spark to do data warehousing and supports Hive queries
> (http://shark.cs.berkeley.edu)
>
> - Henry
>
> On Friday, May 31, 2013, Chen, Pei wrote:
>
> > +1 (non-binding)
> > This seems like a really interesting project.
> > Q- Is Spark just a framework/API or does it also have some tools
> > implemented for data analytics?
> > --Pei
> >
> > > -----Original Message-----
> > > From: Mattmann, Chris A (398J) [mailto:chris.a.mattm...@jpl.nasa.gov]
> > > Sent: Friday, May 31, 2013 2:04 PM
> > > To: general@incubator.apache.org
> > > Subject: [PROPOSAL] Apache Spark for the Incubator
> > >
> > > Hi Folks,
> > >
> > > > I'm pleased to bring you a proposal to the Apache Incubator for the
> > > > Apache Spark project: https://wiki.apache.org/incubator/SparkProposal
> > >
> > > > The work originates from the Berkeley AMPLab, a number of industry
> > > > participants, and other institutions. Spark is a framework for
> > > > large-scale data analysis on clusters, with a particular focus on
> > > > low-latency operations. The source code is written in Scala, and the
> > > > project provides a number of APIs and bindings in various programming
> > > > languages.
> > >
> > > > The proposal text is copied to the bottom of this email. I'm going to
> > > > leave this thread open for the next week for discussion. Once it's
> > > > died down, I'll call an official VOTE.
> > >
> > > > Suresh, Ross G. -- heads up -- this project may be of interest to you
> > > > both, and we would welcome you guys as additional mentors. We
> > > > currently have 3 mentors committed to the project, but would love to
> > > > have more. People interested in contributing should declare their
> > > > interest here on the general@incubator thread and those potential
> > > > contributors will be discussed by the incoming Spark community.
> > >
> > > > Questions -- let's hear 'em! :)
> > >
> > > Cheers,
> > > Chris
> > > ("Champion", incoming Apache Spark)
> > >
> > > === Abstract ===
> > > > Spark is an open source system for large-scale data analysis on
> > > > clusters.
> > >
> > > === Proposal ===
> > > > Spark is an open source system for fast and flexible large-scale data
> > > > analysis. Spark provides a general purpose runtime that supports
> > > > low-latency execution in several forms. These include interactive
> > > > exploration of very large datasets, near real-time stream processing,
> > > > and ad-hoc SQL analytics (through higher layer extensions). Spark
> > > > interfaces with HDFS, HBase, Cassandra and several other storage
> > > > layers, and exposes APIs in Scala, Java and Python.
> > > >
> > > > === Background ===
> > > > Spark started as a U.C. Berkeley research project, designed to
> > > > efficiently run machine learning algorithms on large datasets. Over
> > > > time, it has evolved into a general computing engine as outlined
> > > > above. Spark's developer community has also grown to include
> > > > additional institutions, such as universities, research labs, and
> > > > corporations. Funding has been provided by various institutions
> > > > including the U.S. National Science Foundation, DARPA, and a number
> > > > of industry sponsors. See: https://amplab.cs.berkeley.edu/sponsors/
> > > > for full details.
> > >
> > > === Rationale ===
> > > > As the number of contributors to Spark has grown, we have sought a
> > > > long-term home for the project, and we believe the Apache foundation
> > > > would be a natural fit: Spark already interoperates with several
> > > > existing Apache projects (HDFS, HBase, Hive, Cassandra, Avro and
> > > > Flume to name a few). The Spark team is familiar with the Apache
> > > > process and subscribes to the Apache mission - the team includes
> > > > multiple Apache committers already. Finally, joining Apache will help
> > > > coordinate the development effort of the growing number of
> > > > organizations which contribute to Spark.
> > >
> > > == Initial Goals ==
> > > > The initial goals will most likely be to move the existing codebase
> > > > to Apache and integrate with the Apache development process.
> > > > Furthermore, we plan for incremental development and releases in line
> > > > with the Apache guidelines.
> > >
> > > === Current Status ===
> > > == Meritocracy ==
> > > > The Spark project already operates on meritocratic principles. Today,
> > > > Spark has several developers and has accepted multiple major patches
> > > > from outside of U.C. Berkeley. While this process has remained mostly
> > > > informal (we do not have an official committer list), an implicit
> > > > organization exists in which individuals who contribute major
> > > > components act as maintainers for those modules. If accepted, the
> > > > Spark project would include several of these participants as
> > > > committers from the outset. We will work to identify all committers
> > > > and PPMC members for the project and to operate under the ASF
> > > > meritocratic principles.
> > >
> > > === Community ===
> > > > Acceptance into the Apache foundation would bolster the already strong
> > > > user and developer community around Spark. That community includes
> > > > dozens of contributors from several institutions, a meetup group with
> > > > several hundred members, and an active mailing list composed of
> > > > hundreds of users.
> > > >
> > > > == Core Developers ==
> > > > The core developers of our project are listed in our contributors and
> > > > initial PPMC below. Though many are at UC Berkeley, there is a
> > > > representative cross-section of other organizations including
> > > > Quantifind, Microsoft, Yahoo!, ClearStory Data, Bizo, Intel, Tagged
> > > > and Webtrends.
> > >
> > >
> > > === Alignment ===
> > > Our proposed ef
>
