[ https://issues.apache.org/jira/browse/TINKERPOP-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15086467#comment-15086467 ]

ASF GitHub Bot commented on TINKERPOP-1033:
-------------------------------------------

GitHub user okram opened a pull request:

    https://github.com/apache/incubator-tinkerpop/pull/192

    TINKERPOP-1033: Store sideEffects as a persisted RDD

    https://issues.apache.org/jira/browse/TINKERPOP-1033
    
    This is a massive amount of work. Simply storing sideEffects as 
persisted RDDs led to a swath of other updates. Here is the list of changes:
    
    * It is now possible for Spark users to completely avoid using HDFS -- they 
simply use `PersistedInputRDD` and `PersistedOutputRDD` for everything.
    * Added a significant amount of testing to ensure that persisted RDDs work 
as expected in all situations.
    * `InputRDD`s now have a `readMemoryRDD()` method which handles reading 
sideEffects (i.e. memory).
    * `OutputRDD`s now have a `writeMemoryRDD()` method which handles writing 
sideEffects (i.e. memory).
    * There is a `Storage` interface in gremlin-core which providers can 
implement to have "file-system semantics" for their data source. HDFS and Spark 
both implement it. No more Groovy meta-programming for HDFS! Sweeeeet.
    * With `Storage`, all the file management in both Spark and Giraph is much 
simpler, as the methods in `Storage` allowed me to gut a lot of (error-prone) 
code.
    * There is a general test suite which makes sure both HDFS and Spark 
storage behave "the same."
    * Updated documentation, upgrade docs, and added JavaDoc to `Storage`.
    * The docs for `BulkLoaderVertexProgram` and Spark/Giraph use a `data/` 
directory. They weren't consistent with our other examples, so I cleaned them up.
    * Fixed a minor bug in `ClusterCountMapReduce`.
    * Cleaned up how HDFS data is streamed -- it's pure now, based solely on 
`InputFormat` behavior (I learned something new in Hadoop).
    * There are a few minor "breaking changes" around `hdfs.methods()`. They 
are "ok" as HDFS interaction prior to this moment has always been manual via 
the Gremlin Console.
    
    I updated the "upgrade" docs:
    
    http://tinkerpop.apache.org/docs/3.1.1-SNAPSHOT/upgrade/#_storage_i_o
    
    I updated the "reference" docs:
    
    http://tinkerpop.apache.org/docs/3.1.1-SNAPSHOT/reference/#_storage_systems
    
    You can see the JavaDoc for the new `Storage` interface:
    
    http://tinkerpop.apache.org/javadocs/3.1.1-SNAPSHOT/core/org/apache/tinkerpop/gremlin/structure/io/Storage.html
    
    I ran integration tests and built and deployed docs successfully. 
      
    VOTE +1.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/incubator-tinkerpop TINKERPOP-1033

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-tinkerpop/pull/192.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #192
    
----
commit f3ebed0bde6ac889640cb136b50b362c5cd2d2ea
Author: Marko A. Rodriguez <[email protected]>
Date:   2015-12-09T17:41:09Z

    InputRDD now has readMemoryRDD(). OutputRDD now has writeMemoryRDD(). 
InputFormatRDD and OutputFormatRDD took the code from SparkExecutor that uses 
SequenceFiles for output. As such, memory reading/writing has been generalized. 
Graph system providers that ONLY want to provide Spark support are not required 
to have HDFS, as SparkServer can maintain all persisted data via graphRDD and 
memoryRDD. There is still more work to do. More test cases are next.

commit 58d9240764cd6e1f3779097966c53058264e00e6
Author: Marko A. Rodriguez <[email protected]>
Date:   2015-12-09T20:46:43Z

    added Storage to gremlin-core. Storage is an interface that OLAP systems can 
implement. It provides ls(), rmr(), rm(), etc. type methods that make it easy 
for users to interact (via a common interface) with the underlying persistence 
system. Now both HDFS and Spark provide their own Storage implementations and 
TADA. Really pretty.
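    To give a feel for the "file-system semantics" this commit describes, here 
is a toy, self-contained stand-in. The real interface is 
org.apache.tinkerpop.gremlin.structure.io.Storage; the class, method names, and 
signatures below are illustrative only, not the actual TinkerPop API.

```java
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Toy key/value "file system" sketching the ls()/rm()/rmr() method surface
// mentioned in the commit above. Purely illustrative -- not the real Storage.
public class ToyStorage {
    private final NavigableMap<String, String> files = new TreeMap<>();

    public void write(final String location, final String data) {
        files.put(location, data);
    }

    // ls: list all entries under a location prefix
    public List<String> ls(final String prefix) {
        return files.keySet().stream()
                .filter(k -> k.startsWith(prefix))
                .collect(Collectors.toList());
    }

    // rm: remove a single entry; true if something was removed
    public boolean rm(final String location) {
        return files.remove(location) != null;
    }

    // rmr: recursively remove everything under a prefix
    public boolean rmr(final String prefix) {
        final List<String> doomed = ls(prefix);
        doomed.forEach(files::remove);
        return !doomed.isEmpty();
    }

    public static void main(final String[] args) {
        final ToyStorage storage = new ToyStorage();
        storage.write("output/~g", "graph data");
        storage.write("output/clusterCount", "memory data");
        System.out.println(storage.ls("output/").size());    // 2
        storage.rmr("output/");
        System.out.println(storage.ls("output/").isEmpty()); // true
    }
}
```

    The point of the real interface is that HDFS and a live SparkContext can 
both sit behind this one method surface, which is what lets the same test 
suite drive both.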

commit 2c0d327c04219de9fdf20444a100d3cb3dd1d221
Author: Marko A. Rodriguez <[email protected]>
Date:   2015-12-09T20:48:49Z

    merged master and merged conflicts from @spmallette's changes to 
SparkGremlinPlugin and HadoopGremlinPlugin.

commit b4d8e9608d4eca3ae177b28fe588518a9d77506c
Author: Marko A. Rodriguez <[email protected]>
Date:   2015-12-09T22:58:50Z

    Greatly greatly simplified Hadoop OLTP and interactions with HDFS and 
SparkContext. The trend -- dir/~g for graphs and dir/x for memory. A consistent 
persistence schema makes everything so much simpler. I always assumed this 
would be all generalized/blah/blah. Never actually did it so, hell, stick with 
a consistent schema and watch the code just fall away.

commit 3fff8f546501d10a4c1d34762a626a2493e758be
Author: Marko A. Rodriguez <[email protected]>
Date:   2015-12-09T23:57:28Z

    lots more clean up, tests, and organization. She is a real beauty.

commit 74b9c8ecfe787ead99d79c127fd85a4fccd926ec
Author: Marko A. Rodriguez <[email protected]>
Date:   2015-12-10T01:27:29Z

    migrated GiraphGraphComputer over to the new Storage model via 
FileSystemStorage for HDFS.

commit 55165a572f5d07e1ca20be13b064843da18fc8e6
Author: Marko A. Rodriguez <[email protected]>
Date:   2015-12-10T02:11:33Z

    cleanup HDFS if Persist.NOTHING.

commit dbd4a5360a75d562df64eecd91cc8c12550adb10
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-01-05T22:54:14Z

    merged master into branch. Minor tweaks given @spmallette's new work on 
TestDirectory stuff.

commit 53e57a73aa5316b44d5ef4917347a6ba8934a102
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-01-06T15:02:33Z

    breaking commit. ignore.

commit b0f3e4a96ced7f45f5e823b9060eac9dd0be1f7e
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-01-06T17:26:46Z

    Storage is complete and has a really cool TestSuite. There are two types of 
Storage: FileSystemStorage (HDFS) and SparkContextStorage (persisted RDDs). You 
can ls(), cp(), rm(), rmr(), head(), etc. There is a single abstract test suite 
called AbstractStorageCheck that confirms that both Spark and HDFS behave the 
same. Moved around and organized Hadoop test cases given the new developments.

commit 5c9e81b0cebd8c3841e2442a8ef13b3d23d44295
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-01-06T22:58:18Z

    added documentation, upgrade docs, JavaDoc, more test cases, and fixed up 
some random inconsistencies in BulkLoaderVertexProgram documentation examples.

commit a7db52bda732810fc8d5d3a8279a4f7095285d3d
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-01-06T23:03:59Z

    Merge branch 'master' into TINKERPOP-1033

----


> Store sideEffects as a persisted RDD
> ------------------------------------
>
>                 Key: TINKERPOP-1033
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-1033
>             Project: TinkerPop
>          Issue Type: Improvement
>          Components: hadoop
>    Affects Versions: 3.1.0-incubating
>            Reporter: Marko A. Rodriguez
>            Assignee: Marko A. Rodriguez
>             Fix For: 3.1.1-incubating
>
>
> I think we can completely get away from HDFS for {{SparkGraphComputer}}. We 
> will need something like {{PersistedSideEffectsRDD}}. Once we do that, if the 
> user wants to use Spark without Hadoop, it's possible.
> This would beg the question -- do we go all the way and support 
> {{SparkGraph}} ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
