Spark is an execution framework, but it also provides high-level APIs
that make data analytics much easier.

For example, to run a grep-like query:

val docs = sparkContext.textFile("hdfs://...")
docs.filter(doc => doc.contains("Berkeley")).count

Another example, a word count (using the Scala API):

val docs = sparkContext.textFile("hdfs://...")
val counts = docs.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

The high-level APIs resemble many relational operators, including
aggregations, group-bys, and joins.
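
As a rough illustration of that correspondence, here is a sketch on plain
Scala collections rather than Spark RDDs, so it runs without a cluster. The
RelationalSketch object, the Visit class, and the sample data are all made up
for this example. On a Spark pair RDD the aggregation would typically be
written with reduceByKey, as in the word count above; the local groupBy is
only a stand-in for it.

```scala
// Local analogue using plain Scala collections (not Spark RDDs).
// All names and data here are hypothetical, for illustration only.
object RelationalSketch {
  case class Visit(user: String, page: String)

  val visits: Seq[Visit] = Seq(
    Visit("ann", "/home"),
    Visit("bob", "/docs"),
    Visit("ann", "/docs")
  )

  // A small lookup "table" keyed by user.
  val countries: Map[String, String] = Map("ann" -> "US", "bob" -> "DE")

  // Selection: roughly WHERE page = '/docs'
  val docsVisits: Seq[Visit] = visits.filter(_.page == "/docs")

  // Aggregation: roughly GROUP BY user with COUNT(*)
  val visitsPerUser: Map[String, Int] =
    visits.groupBy(_.user).map { case (user, vs) => (user, vs.size) }

  // Inner join on user: visits without a matching country are dropped.
  val visitsWithCountry: Seq[(String, String, String)] =
    for {
      v <- visits
      c <- countries.get(v.user).toSeq
    } yield (v.user, v.page, c)

  def main(args: Array[String]): Unit = {
    println(docsVisits)
    println(visitsPerUser)
    println(visitsWithCountry)
  }
}
```

The shape of each step (filter, group and aggregate, join on a key) is the
part that carries over; the exact method names on RDDs differ in places.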

Shark uses Spark as the execution engine but provides a Hive-compatible SQL
interface. This proposal, however, is only about moving Spark to the ASF
Incubator, not Shark.

--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org


On Fri, May 31, 2013 at 1:03 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:

> I believe it is more of a framework, but you can take a look at Shark, which
> uses Spark to do data warehousing and supports Hive queries (
> http://shark.cs.berkeley.edu)
>
> - Henry
>
> On Friday, May 31, 2013, Chen, Pei wrote:
>
> > +1 (non-binding)
> > This seems like a really interesting project.
> > Q- Is Spark just a framework/API or does it also have some tools
> > implemented for data analytics?
> > --Pei
> >
> > > -----Original Message-----
> > > From: Mattmann, Chris A (398J) [mailto:chris.a.mattm...@jpl.nasa.gov]
> > > Sent: Friday, May 31, 2013 2:04 PM
> > > To: general@incubator.apache.org
> > > Subject: [PROPOSAL] Apache Spark for the Incubator
> > >
> > > Hi Folks,
> > >
> > > I'm pleased to bring you a proposal to the Apache Incubator for the Apache
> > > Spark project: https://wiki.apache.org/incubator/SparkProposal
> > >
> > > The work originates from the Berkeley AMPLab, with contributions from a
> > > number of industry participants and other institutions. Spark is a
> > > framework for large-scale data analysis on clusters, with a particular
> > > focus on low-latency operations. The source code is written in Scala and
> > > provides a number of APIs and bindings in various programming languages.
> > >
> > > The proposal text is copied to the bottom of this email. I'm going to
> > > leave this thread open for the next week for discussion. Once it's died
> > > down, I'll call an official VOTE.
> > >
> > > Suresh, Ross G. -- heads up -- this project may be of interest to you
> > > both, and we would welcome you as additional mentors. We currently have 3
> > > mentors committed to the project, but would love to have more. People
> > > interested in contributing should declare their interest here on the
> > > general@incubator thread, and those potential contributors will be
> > > discussed by the incoming Spark community.
> > >
> > > Questions -- let's hear 'em! :)
> > >
> > > Cheers,
> > > Chris
> > > ("Champion", incoming Apache Spark)
> > >
> > > === Abstract ===
> > > Spark is an open source system for large-scale data analysis on clusters.
> > >
> > > === Proposal ===
> > > Spark is an open source system for fast and flexible large-scale data
> > > analysis. Spark provides a general purpose runtime that supports
> > > low-latency execution in several forms. These include interactive
> > > exploration of very large datasets, near real-time stream processing, and
> > > ad-hoc SQL analytics (through higher layer extensions). Spark interfaces
> > > with HDFS, HBase, Cassandra and several other storage layers, and exposes
> > > APIs in Scala, Java and Python.
> > > === Background ===
> > > Spark started as a U.C. Berkeley research project, designed to
> > > efficiently run machine learning algorithms on large datasets. Over time,
> > > it has evolved into a general computing engine as outlined above. Spark's
> > > developer community has also grown to include additional institutions,
> > > such as universities, research labs, and corporations. Funding has been
> > > provided by various institutions including the U.S. National Science
> > > Foundation, DARPA, and a number of industry sponsors. See
> > > https://amplab.cs.berkeley.edu/sponsors/ for full details.
> > >
> > > === Rationale ===
> > > As the number of contributors to Spark has grown, we have sought a
> > > long-term home for the project, and we believe the Apache foundation
> > > would be a great fit. Spark already interoperates with several existing
> > > Apache projects (HDFS, HBase, Hive, Cassandra, Avro and Flume, to name a
> > > few). The Spark team is familiar with the Apache process and subscribes
> > > to the Apache mission; the team already includes multiple Apache
> > > committers. Finally, joining Apache will help coordinate the development
> > > effort of the growing number of organizations that contribute to Spark.
> > >
> > > == Initial Goals ==
> > > The initial goals will most likely be to move the existing codebase to
> > > Apache and integrate with the Apache development process. Furthermore, we
> > > plan for incremental development and releases in line with the Apache
> > > guidelines.
> > >
> > > === Current Status ===
> > > == Meritocracy ==
> > > The Spark project already operates on meritocratic principles. Today,
> > > Spark has several developers and has accepted multiple major patches from
> > > outside of U.C. Berkeley. While this process has remained mostly informal
> > > (we do not have an official committer list), an implicit organization
> > > exists in which individuals who contribute major components act as
> > > maintainers for those modules. If accepted, the Spark project would
> > > include several of these participants as committers from the outset. We
> > > will work to identify all committers and PPMC members for the project and
> > > to operate under the ASF meritocratic principles.
> > >
> > > === Community ===
> > > Acceptance into the Apache foundation would bolster the already strong
> > > user and developer community around Spark. That community includes
> > > dozens of contributors from several institutions, a meetup group with
> > > several hundred members, and an active mailing list composed of hundreds
> > > of users.
> > > == Core Developers ==
> > > The core developers of our project are listed in our contributors and
> > > initial PPMC below. Though many are at UC Berkeley, there is a
> > > representative cross-section of other organizations, including
> > > Quantifind, Microsoft, Yahoo!, ClearStory Data, Bizo, Intel, Tagged and
> > > Webtrends.
> > >
> > >
> > > === Alignment ===
> > > Our proposed ef
>
