Spark is an execution framework, but it also provides high-level APIs that make data analytics much easier.
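
To make "high-level" a little more concrete: the operators have the same functional shape as Scala's own collection methods. The sketch below is only a local stand-in that runs on in-memory Seq values, not on a cluster and not through Spark's API. The object name ApiStyleSketch, the sample records, and the helper names totalsByKey and join are all invented for illustration. It shows a group-by aggregation and an inner join written as ordinary collection transformations.

```scala
// Local stand-in only: plain Scala collections, NOT Spark's API.
// All names and sample data here are hypothetical.
object ApiStyleSketch {
  // Sample records: (user, pagesViewed) and (user, country).
  val views: Seq[(String, Int)] = Seq(("ann", 3), ("bob", 5), ("ann", 4))
  val countries: Seq[(String, String)] = Seq(("ann", "US"), ("bob", "DE"))

  // Group-by aggregation: sum the values for each key.
  def totalsByKey(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (key, group) => (key, group.map(_._2).sum) }

  // Inner join on the key: keep only keys present on both sides.
  def join[A, B](left: Seq[(String, A)], right: Seq[(String, B)]): Seq[(String, (A, B))] =
    for {
      (k1, a) <- left
      (k2, b) <- right
      if k1 == k2
    } yield (k1, (a, b))

  def main(args: Array[String]): Unit = {
    val totals = totalsByKey(views)
    println(totals)
    println(join(totals.toSeq, countries))
  }
}
```

The nested-loop join is quadratic and is fine only for a toy example; a distributed engine would partition both sides by key instead.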
For example, a grep-like query in Spark:

    val docs = sparkContext.textFile("hdfs://...")
    // Count the documents that contain the search term.
    docs.filter(doc => doc.contains("Berkeley")).count

Another example, word count (using the Scala API):

    val docs = sparkContext.textFile("hdfs://...")
    val counts = docs.flatMap(line => line.split("\\s+"))  // split lines into words
                     .map(word => (word, 1))               // pair each word with 1
                     .reduceByKey(_ + _)                   // sum the counts per word
    counts.saveAsTextFile("hdfs://...")

The high-level APIs resemble many of the relational operators, including aggregations, group-bys, and joins.

Shark uses Spark as its execution engine but provides a Hive-compatible SQL interface. This proposal, however, is only about moving Spark to the ASF Incubator, not Shark.

--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org

On Fri, May 31, 2013 at 1:03 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:

> I believe it is more of a framework, but you can take a look at Shark, which
> uses Spark to do data warehousing and supports Hive queries
> (http://shark.cs.berkeley.edu).
>
> - Henry
>
> On Friday, May 31, 2013, Chen, Pei wrote:
>
> > +1 (non-binding)
> > This seems like a really interesting project.
> > Q- Is Spark just a framework/API or does it also have some tools
> > implemented for data analytics?
> > --Pei
> >
> > > -----Original Message-----
> > > From: Mattmann, Chris A (398J) [mailto:chris.a.mattm...@jpl.nasa.gov]
> > > Sent: Friday, May 31, 2013 2:04 PM
> > > To: general@incubator.apache.org
> > > Subject: [PROPOSAL] Apache Spark for the Incubator
> > >
> > > Hi Folks,
> > >
> > > I'm pleased to bring you a proposal to the Apache Incubator for the
> > > Apache Spark project: https://wiki.apache.org/incubator/SparkProposal
> > >
> > > The work originates from the Berkeley AMPLab and through a number of
> > > industry participants, and other institutions. Spark is a framework for
> > > large-scale data analysis on clusters, with a particular focus on low
> > > latency operations.
> > > The source code is written in Scala and provides a number of APIs and
> > > bindings in various programming languages.
> > >
> > > The proposal text is copied to the bottom of this email. I'm going to
> > > leave this thread open for the next week for discussion. Once it's died
> > > down, I'll call an official VOTE.
> > >
> > > Suresh, Ross G. -- heads up -- this project may be of interest to you
> > > both, and we would welcome you guys as additional mentors. We currently
> > > have 3 mentors committed to the project, but would love to have more.
> > > People interested in contributing should declare their interest here on
> > > the general@incubator thread, and those potential contributors will be
> > > discussed by the incoming Spark community.
> > >
> > > Questions -- let's hear 'em! :)
> > >
> > > Cheers,
> > > Chris
> > > ("Champion", incoming Apache Spark)
> > >
> > > === Abstract ===
> > > Spark is an open source system for large-scale data analysis on
> > > clusters.
> > >
> > > === Proposal ===
> > > Spark is an open source system for fast and flexible large-scale data
> > > analysis. Spark provides a general purpose runtime that supports
> > > low-latency execution in several forms. These include interactive
> > > exploration of very large datasets, near real-time stream processing,
> > > and ad-hoc SQL analytics (through higher layer extensions). Spark
> > > interfaces with HDFS, HBase, Cassandra and several other storage
> > > layers, and exposes APIs in Scala, Java and Python.
> > >
> > > === Background ===
> > > Spark started as a U.C. Berkeley research project, designed to
> > > efficiently run machine learning algorithms on large datasets. Over
> > > time, it has evolved into a general computing engine as outlined above.
> > > Spark's developer community has also grown to include additional
> > > institutions, such as universities, research labs, and corporations.
> > > Funding has been provided by various institutions including the U.S.
> > > National Science Foundation, DARPA, and a number of industry sponsors.
> > > See https://amplab.cs.berkeley.edu/sponsors/ for full details.
> > >
> > > === Rationale ===
> > > As the number of contributors to Spark has grown, we have sought a
> > > long-term home for the project, and we believe the Apache foundation
> > > would be a great fit. Spark is a natural fit for the Apache foundation:
> > > Spark already interoperates with several existing Apache projects
> > > (HDFS, HBase, Hive, Cassandra, Avro and Flume to name a few). The Spark
> > > team is familiar with the Apache process and subscribes to the Apache
> > > mission - the team includes multiple Apache committers already.
> > > Finally, joining Apache will help coordinate the development effort of
> > > the growing number of organizations which contribute to Spark.
> > >
> > > == Initial Goals ==
> > > The initial goals will most likely be to move the existing codebase to
> > > Apache and integrate with the Apache development process. Furthermore,
> > > we plan for incremental development and releases in line with the
> > > Apache guidelines.
> > >
> > > === Current Status ===
> > > == Meritocracy ==
> > > The Spark project already operates on meritocratic principles. Today,
> > > Spark has several developers and has accepted multiple major patches
> > > from outside of U.C. Berkeley. While this process has remained mostly
> > > informal (we do not have an official committer list), an implicit
> > > organization exists in which individuals who contribute major
> > > components act as maintainers for those modules. If accepted, the Spark
> > > project would include several of these participants as committers from
> > > the outset. We will work to identify all committers and PPMC members
> > > for the project and to operate under the ASF meritocratic principles.
> > >
> > > === Community ===
> > > Acceptance into the Apache foundation would bolster the already strong
> > > user and developer community around Spark. That community includes
> > > dozens of contributors from several institutions, a meetup group with
> > > several hundred members, and an active mailing list composed of
> > > hundreds of users.
> > >
> > > == Core Developers ==
> > > The core developers of our project are listed in our contributors and
> > > initial PPMC below. Though many are at UC Berkeley, there is a
> > > representative cross-section of other organizations including
> > > Quantifind, Microsoft, Yahoo!, ClearStory Data, Bizo, Intel, Tagged and
> > > Webtrends.
> > >
> > >
> > > === Alignment ===
> > > Our proposed ef >