[
https://issues.apache.org/jira/browse/TINKERPOP-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272581#comment-15272581
]
ASF GitHub Bot commented on TINKERPOP-1288:
-------------------------------------------
GitHub user okram opened a pull request:
https://github.com/apache/incubator-tinkerpop/pull/305
TINKERPOP-1288 & TINKERPOP-1290: VertexProgramInterceptor and GraphComputer
TraversalStrategies
https://issues.apache.org/jira/browse/TINKERPOP-1288
https://issues.apache.org/jira/browse/TINKERPOP-1290
We now have the concept of `TraversalStrategies` for `GraphComputers`. A
clear pattern has been established for their use with `SparkGraphComputer`.
Extending these concepts to `TinkerGraphComputer` will be easy. Moreover, we
have `SparkInterceptorStrategy` which we can continue to extend with traversal
patterns that we can map to use Spark DSL under the hood instead of
`GraphComputer` semantics for performance boosting. However, after some
intelligent refactoring around partitioning and caching (thanks
@RussellSpitzer), `SparkGraphComputer` is very close in speed to Spark DSL
computations. This is good as if that was sufficiently large, that would
destroy the purpose of writing Gremlin traversals for analytics. Either way, we
are both fast with non-intercepted traversals and intercepted traversal. In
short, this is an epic PR.
CHANGELOG (master)
```
* Added `VertexProgramInterceptor` interface as a general pattern for
`GraphComputer` providers to use for bypassing `GraphComputer` semantics where
appropriate.
* Added `SparkStarBarrierInterceptor` that uses Spark DSL for local star
graph traversals that end with a `ReducingBarrierStep`.
* Added `SparkInterceptorStrategy` which identifies which interceptor to
use (if any) given the submitted `VertexProgram`.
* Added `SparkSingleIterationStrategy` that does not partition nor cache
the graph RDD if the traversal does not message pass.
* Added more helper methods to `TraversalHelper` for handling scoped
traversal children.
```
I'm going to cherry pick the bug fix to
`TraversalHelper.isLocalStarGraph()` to tp31/.
CHANGELOG (tp31)
```
* Fixed a severe bug in `TraversalHelper.isLocalStarGraph()`.
```
---
BENCHMARKS (Friendster on the Blades):
```
g.V().count()
before: 4.5 minutes
no-partition: 2.9 minutes
interceptor: 2.5 minutes
g.V().out().count()
before: 10 minutes
no-partitioner: 3.1 minutes
interceptor: 2.6 minutes
g.V().groupCount().by(outE().count())
before: 12 minutes
no-partitioner: 3.2 minutes
interceptor: 2.6 minutes
```
---
VOTE +1 (`mvn clean install` + integration tests -- Giraph still running)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/incubator-tinkerpop TINKERPOP-1288
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-tinkerpop/pull/305.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #305
----
commit 77480a2c7875e284b063ecca434bf2165116689d
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-05-03T20:14:29Z
Created the first GraphComputer Provider TraversalStrategy --
SparkPartitionAwareStratgegy. This strategy analyzes the traversal to determine
if partitioning should be skipped -- e.g. if no message pass is happening. I
think this strategy will start to set the stage for how future GraphComputer
provider-specific strategies will work.
commit e8bba359be7a773a138b023929c0a2c02a9d6de8
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-05-03T20:48:27Z
syncing with 1120.
commit 40b3931f843dcd3fe22da4ec91b1b252e41ffe7e
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-05-03T20:48:31Z
Merge branch 'TINKERPOP-1120' into TINKERPOP-1288
commit 6e92c4e115599a77381a34fdd94e19ce48866f37
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-05-03T20:53:14Z
merged in TINKERPOP-1120 as that has all the new optimizations. going to
test on the blades.
commit 7c103c8b0bf218c5eb6ec83ccfe5d416fd671e3d
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-05-03T23:10:29Z
DefaultTraversal.hashCode() was bad. Fixed. Added NativeInterceptor which
right now is tied to SparkGraphComputer, but should be generalizable to all
GraphComputer providers. In short, instead of executing the VertexProgram, an
interceptor can take the Graph/Memory and pop out a Graph/Memory. Thus,
skipping the VertexProgram execution. Right now I have VertexCountInterceptor
which does inputRDD.count(). BAM. Works. Added more test cases and cleanup.
Going to probably just roost on this branch for a bit until I get the concept
clear and simple. Pushing so I can benchmark on the Blades.
commit d5e7c78628d89c464418f1329a797fc192880e8b
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-05-04T02:55:03Z
lots of cleanup. NativeInterceptor is now called VertexProgramInterceptor.
Its generalized to be used by any GraphComputer implementation. There exists
SparkVertexProgramInterceptor which extends VertexProgramInterceptor. Cleaned
up naming conventions in the test cases. Found a bug in TraversalSource that
was causing RemoteGraph computer tests to fail. SparkGraphComputer is cleaner
with how interception happens -- really clear delineation which will be used
for all GraphComptuers that use the Interceptor-model. Running integration
tests over night.
commit f0acfd7512330f82d14e0c17d90f46487008b6e1
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-05-04T14:40:34Z
Generalized SparkVertexCountInterceptor to SparkCountInterceptor and it can
recognize and natively process various g.V.name.count(), g.V.out.count, etc.
type of traversals. Added a few more test cases to AdjacentToIncidentStrategy.
Came up with a nice test infrastructure for SparkInterceptorStrategyTest.
commit 2717b281ed085a365d59edc718db9a33a9266343
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-05-04T15:44:03Z
added test cases around persisted of rdds. if the interceptor does not
'touch' the loadedGraphRDD then its still the loadedGraphRDD and thus,
persistence rules for loadedGraphRDD should apply there.
commit 6f9e5fa0c7ffa01b7fbd3c3fed9539f6ad12bd4c
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-05-04T17:26:00Z
just generalized SparkCountInterceptor to work with ANY star graph
computation that ends with count(). The concept is simple, just do
flatMapToPair(traversal).count(). tada. This is so clean and pretty.... next is
going to be generalizing to work with any ReducingBarrierStep, not just
count(). Also, need to add some test cases that use sideEffects and ensure that
Memory is updated accordingly. However, besides that, this is such a dead
simple concept that I actually think this is how we should redesign
SparkGraphComputer. Break up the traverasl into starGraphTraversals and then
concatenate the jobs via Spark join()s.
commit 71f66c5114ae5e90aad8dc5f5cf77b200f6e1f30
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-05-04T18:51:35Z
Got sideEffects working with SparkCountInterceptor.
g.V.groupCount(m).count() works as expected. Classy. Found a bug in
TraversalHelper.isLocalStarGraph(), fixed.
commit 01fda96707335f2e33b073cef980c7844ca6d80c
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-05-04T19:48:39Z
merged master into this branch. had to fix up SparkGraphComputer conflicts
cause of ThreadInterrupt work from @spmallette. Added some Scope-based
recurssive methods to TraversalHelper. SparkCountItercepter is now smart about
sideEffects. Added more tests to ensure proper interception.
commit 6a256f8fcbe6a87a8d794466a7b559ff9ba8d2d6
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-05-04T21:45:27Z
SparkCountInterceptor is now called SparkStarBarrierInterceptor and works
with count(), max(), sum(), min(), and mean(). groupCount(), group(), and
tree() are still left to do.
commit 6e122e42b9c49648ce42a295b94d780c3eb11aae
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-05-04T22:56:04Z
got GroupCount working with SparkStarBarrierInterceptor. Going to test on
the Blades.
commit 854b259cb0645d1d50e5365392b49a373e2131a5
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-05-05T15:54:41Z
found severe issues with TraversalHelper.isLocalStarGraph(). Given that
this is used extensively by GraphComputer implementations, this was a good
find. Crazy enough, no test cases for that method. Added a host of them for
traverals of varying complexity. Added spark.skipPartitioner and
spark.skipGraphCache which are both current modulated by
SparkSingleIterationStrategy. Opened up some IgnoreEngines from @pietermartin
that we due to bad TraversalHelper.isLocalStarGraph(). More tests cases for the
new SparkXXXStrategy work. Lots of logic cleanup in SparkGraphComputer given
the distinction between skipPartiioner and skipGraphCache.
commit 1e6aa16b9dfbd8a07c0fab8bb0ad7ee88a3cd630
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-05-05T16:15:02Z
forgot to get rid of this class which was created in this branch -- not
needed.
----
> Support gremlin.spark.skipPartitioning configuration.
> -----------------------------------------------------
>
> Key: TINKERPOP-1288
> URL: https://issues.apache.org/jira/browse/TINKERPOP-1288
> Project: TinkerPop
> Issue Type: Improvement
> Components: hadoop, process
> Affects Versions: 3.2.0-incubating
> Reporter: Marko A. Rodriguez
> Assignee: Marko A. Rodriguez
> Fix For: 3.2.1
>
>
> If a {{VertexProgram}} does not use message passing, then its best to NOT
> partition after load as its pointless to do so.
> In particular, for {{TraversalVertexProgram}}, if the submitted traversal
> does not contain a {{VertexStep}} then partitioning can be avoided. This can
> be reasoned via a {{SparkPartitionStrategy}}, but for now, simply making the
> configuration and having it do its job is sufficient.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)