Hey Priyank,

Replies inlined.

On Wed, May 16, 2012 at 9:25 PM, Priyank Rastogi <priyank.rast...@huawei.com> wrote:
> Hi Josh,
> A couple of queries I have.
> 1) How is this different from Oozie or Cascading?

Re: Oozie, the one context in which Crunch and Oozie overlap is when a developer is writing a series of MapReduce jobs in Java and wants to tie the output of one job to the input of another. In Crunch, you would express that dependency in Java code; in Oozie, you would express it in XML. In my opinion, expressing those dependencies in Java has some advantages for certain types of problems, such as the iterative computations that are performed in systems like Mahout. It is also nice to be able to construct a data pipeline within a single language, without having to move back and forth between Java and XML.

Oozie has cron-like job scheduling functionality, which Crunch does not. It also supports chaining different types of MapReduce-based systems together, like Pig jobs and Hive queries. Crunch provides a library of common MapReduce design patterns, like counts, aggregations, joins, and sorts, that Oozie does not. I think it is easy to imagine a case where a developer writes a Crunch pipeline whose output is consumed by one or more Pig jobs, and Oozie is used to schedule the execution of those jobs and specify their dependencies.

Re: Cascading, Cascading has the same data model as Pig and Hive, i.e., all of the operations they define are based on a single, serializable type (SST), which is usually referred to as a "Tuple." Crunch does not have a single, serializable type; developers can work with data in whatever format makes sense for the data they are processing, such as time series, images, or serializable object formats like Apache Avro, Apache Thrift, or JSON.
The initial reason that we developed Crunch at Cloudera was a customer project that required large-scale data pipelines over time series, and we felt that none of the existing pipeline languages, including Cascading, were designed for this type of problem. Additionally, Cascading is not an Apache project, either top-level or in the incubator.

> 2) Are there any patents filed/planned for any part of work within Crunch?

Google has a patent related to FlumeJava, which is the basis for Crunch's design:

http://www.google.com/patents?id=vEz9AQAAEBAJ&printsec=frontcover&dq=craig+chambers+google&hl=en&sa=X&ei=pIO0T67wD-7YiQLDmsXFAg&ved=0CDQQ6AEwAA

Of course, Google has also patented MapReduce and GFS, the basis of the core of Apache Hadoop. Cloudera has not filed any patents on the work done on Crunch and has no intention of doing so.

> -Priyank
>
> -----Original Message-----
> From: Josh Wills [mailto:jwi...@cloudera.com]
> Sent: Thursday, May 17, 2012 8:44 AM
> To: general@incubator.apache.org
> Subject: [DISCUSS] Crunch joining the Apache Incubator
>
> Hi all,
>
> I would like to propose Crunch, a library for writing MapReduce
> pipelines in Java and Scala, as an Apache Incubator project. The
> proposal is here:
>
> http://wiki.apache.org/incubator/CrunchProposal
>
> We would gladly welcome additional volunteers to act as mentors on the
> project, so if this sounds like your cup of tea, please feel free to
> sign up or let us know.
>
> Thanks!
> Josh
>
> --
> Director of Data Science
> Cloudera
> Twitter: @josh_wills
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org

--
Director of Data Science
Cloudera
Twitter: @josh_wills