I certainly understand the exploration and learning motivation -- I did
much the same thing. At this point, I wouldn't consider either of our
efforts to be a complete or fully usable Clojure API for Spark, but there
are definitely ideas worth looking at in both if anyone gets to the point
of attempting to write a complete and robust API -- which I won't be doing
in the immediate future.
I'm not sure that I am following you on Cascading and Spark. Are you
saying that you want to use the Cascading API to express workflows which
will then be transformed into a DAG of Spark stages and run as a Spark job?
I don't think that I agree with that strategy. While I can get behind
various higher-level abstractions to express Spark jobs (which is what
Shark is doing, after all), I don't find Cascading's API to be terribly
elegant: When writing a Spark job in Scala, I just don't find myself thinking
that it would be a whole lot easier if I could write the job in Cascading.
Part of that is because I'm not fluent in Cascading, but from what I have
seen and done with it, I don't lust after Cascading. The other problem I
have with the Cascading-to-Spark strategy is that Cascading has been
designed and implemented very much with Hadoop in mind, but Spark can do
quite a bit that Hadoop cannot. I don't think that Cascading itself
would be a good fit for expressing Spark jobs that can really leverage the
advantages that Spark has over Hadoop. None of that is meant to say that
Cascading isn't a step forward over writing jobs using Hadoop's Java API;
but at this point I just don't see Cascading as a step forward for writing
Spark jobs.
Anyway, here's one of the early problems I ran into when trying to follow
your README:
$ lein --version
Leiningen 2.0.0 on Java 1.7.0_09 OpenJDK 64-Bit Server VM
$ lein deps
$ lein compile
Compiling clj-spark.spark.functions
Compiling clj-spark.api
Compiling clj-spark.util
Compiling clj-spark.examples.query
$ lein run
2013-01-23 11:17:10,436 WARN api:1 - JavaSparkContext local Simple Job
/home/mark/Desktop/Scala/Spark/0.6 [] {}
Exception in thread "main" java.lang.ClassCastException:
clojure.lang.PersistentVector cannot be cast to java.lang.CharSequence
at clojure.string$split.invoke(string.clj:174)
at clj_spark.api$spark_context.doInvoke(api.clj:18)
at clojure.lang.RestFn.invoke(RestFn.java:805)
at clj_spark.examples.query$_main.doInvoke(query.clj:33)
at clojure.lang.RestFn.invoke(RestFn.java:397)
at clojure.lang.Var.invoke(Var.java:411)
at user$eval22.invoke(NO_SOURCE_FILE:1)
at clojure.lang.Compiler.eval(Compiler.java:6511)
at clojure.lang.Compiler.eval(Compiler.java:6501)
at clojure.lang.Compiler.eval(Compiler.java:6477)
at clojure.core$eval.invoke(core.clj:2797)
at clojure.main$eval_opt.invoke(main.clj:297)
at clojure.main$initialize.invoke(main.clj:316)
at clojure.main$null_opt.invoke(main.clj:349)
at clojure.main$main.doInvoke(main.clj:427)
at clojure.lang.RestFn.invoke(RestFn.java:421)
at clojure.lang.Var.invoke(Var.java:419)
at clojure.lang.AFn.applyToHelper(AFn.java:163)
at clojure.lang.Var.applyTo(Var.java:532)
at clojure.main.main(main.java:37)
zsh: exit 1 lein run
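
For what it's worth, the trace points at clj_spark.api/spark-context
(api.clj line 18) calling clojure.string/split on something that isn't a
string -- judging by the `[] {}` in the log line, my guess is that the
option parsing hands it a vector. A minimal sketch of the same failure
(the vector argument below is just an illustration, not the actual value
inside clj-spark):

```clojure
(require '[clojure.string :as string])

;; clojure.string/split expects a CharSequence (e.g. a String) as its
;; first argument; handing it a Clojure vector reproduces the exception.
(string/split "a,b,c" #",")        ;; fine: ["a" "b" "c"]
(string/split ["a" "b" "c"] #",")  ;; throws ClassCastException:
                                   ;; PersistentVector cannot be cast
                                   ;; to java.lang.CharSequence
```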
On Wednesday, January 23, 2013 7:02:43 AM UTC-8, Marc Limotte wrote:
>
> Hi Mark.
>
> This was very much exploratory work, and a lot of it was just about
> learning the Spark paradigms. That being said, merging for future work
> seems appropriate, but it's not clear yet if I will be pursuing this work
> further. Might wind up using Shark instead [would love to use Cascading
> over Spark as well, if it existed].
>
> I'd like to know what issue you had in getting the code/examples to work?
> I had a couple of people try this out from scratch on clean systems and it
> did work for them.
>
> Serializing the functions is necessary as far as I can tell. It would not
> work for me without this. As far as I can tell (this is largely
> guesswork), the problem is that each time the anonymous function is
> evaluated on a different JVM it gets a different class name (e.g. fn_123).
> There is high likelihood that the name assigned on the master is not the
> same as the name on the task JVMs, so you wind up with a
> ClassNotFoundException.
>
> I don't know why this would work for you. If you have any insight on
> this, I would love to hear it?
>
> Marc
>
> On Tue, Jan 22, 2013 at 8:09 AM, Mark Hamstra <[email protected]> wrote:
>
>> Hmmm... a lot of duplicated work. Sorry I didn't get my stuff in a more
>> usable form for you, but I wasn't aware that anybody was even interested in
>> it. I've got some stuff that I want to rework a little, and I'm still
>> thinking through the best way to integrate with the new reducers code in
>> Clojure, but I haven't had the right combination of time and motivation to
>> finish off what I started and document it. At any rate, we should work at
>> merging the two efforts, since I don't see any need for duplicate APIs.
>>
>> In taking a quick first pass at it, I wasn't able to get your code and
>> examples to work, but I'm curious what your reasoning is for
>> using serializable.fn and avoiding use of
>> clojure.core/fn or #(). I'm not sure that is strictly necessary. For
>> example, the following works just fine with my API:
>>
>> (require 'spark.api.clojure.core)
>>
>> (wrappers!) ; one of the pieces I want to re-work, but allows functions
>> like map to work with either Clojure collections or RDDs
>>
>> (set-spark-context! "local[4]" "cljspark")
>>
>> (def rdd (parallelize [1 2 3 4]))
>>
>> (def mrdd1 (map #(+ 2 %) rdd))
>>
>> (def result1 (collect mrdd1))
>>
>> (def offset1 4)
>>
>> (def mrdd2 (map #(+ offset1 %) rdd))
>>
>> (def result2 (collect mrdd2))
>>
>> (def mrdd3 (map (let [offset2 5] #(+ offset2 %)) rdd))
>>
>> (def result3 (collect mrdd3))
>>
>>
>> That will result in result1, result2, and result3 being [3 4 5 6], [5 6 7
>> 8], and [6 7 8 9] respectively, without any need for serializable-fn.
>>
>>
>> On Tuesday, January 22, 2013 6:55:53 AM UTC-8, Marc Limotte wrote:
>>
>>> A Clojure API for the Spark project. I am aware that there is another
>>> Clojure Spark wrapper project which looks very interesting; this project
>>> has similar goals. Also like that project, it is not absolutely
>>> complete, but it does have some documentation and examples, and it is
>>> usable and should be easy enough to extend as needed. This is the result
>>> of about three weeks of work. It handles many of the initial problems like
>>> serializing anonymous functions, converting back and forth between Scala
>>> Tuples and Clojure seqs, and converting RDDs to PairRDDs.
>>>
>>> The project is available here:
>>>
>>> https://github.com/TheClimateCorporation/clj-spark
>>>
>>> Thanks to The Climate Corporation for allowing me to release it. At
>>> Climate, we do the majority of our Big Data work with Cascalog (on top of
>>> Cascading). I was looking into Spark for some of the benefits that it
>>> provides. I suspect we will explore Shark next, and may work it in to our
>>> processes for some of our more ad hoc/exploratory queries.
>>>
>>> I think it would be interesting to see a Cascading planner on top of
>>> Spark, which would enable Cascalog queries (mostly) for free. I suspect
>>> that might be a superior method of using Clojure on Spark.
>>>
>>> Marc Limotte
>>>
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to [email protected]
Note that posts from new members are moderated - please be patient with your
first post.
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en