[
https://issues.apache.org/jira/browse/TINKERPOP-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169370#comment-15169370
]
ASF GitHub Bot commented on TINKERPOP-1166:
-------------------------------------------
GitHub user okram opened a pull request:
https://github.com/apache/incubator-tinkerpop/pull/243
TINKERPOP-1166, TINKERPOP-1164, TINKERPOP-1057, TINKERPOP-1162
https://issues.apache.org/jira/browse/TINKERPOP-1166
https://issues.apache.org/jira/browse/TINKERPOP-1164
https://issues.apache.org/jira/browse/TINKERPOP-1057
https://issues.apache.org/jira/browse/TINKERPOP-1162
This PR has all these tickets combined because in order to solve any I
needed to make sure all these ticket's respective solutions played well with
one another. In summary, GraphComputer `Memory` has come to the center stage as
a replacement for `MapReduce`. The benefit is that `Memory` can compute/reduce
in parallel with a VertexProgram's execution run. This has the following
benefits:
* Reductions happen within the vertex program and thus, no subsequent
rescan of the full graph needed to generate sideEffect. For example --
`g.V().count()` is one pass through the graph, not two.
* Because reductions happen within the vertex program, barrier steps are no
longer the cut-off point for an OLAP traversal. An OLAP traversal can now have
multiple reducing barrier steps within it (e.g. `max()`, `groupCount()`,
`min()`, `group()`, etc.). Its all one job.
* Vertex compute keys can be marked transient and are automagically removed
from the resultant graph. This is good for removing "scratch data." (e.g.
PageRankVertexProgram.EDGE_COUNTS).
* Memory compute keys can be marked transient and are automagically remove
from the result memory. This is good for removing "scratch data." (e.g.
VOTE_TO_HALT).
* Memory compute keys can be declared to NOT broadcast and thus, no be sent
to the workers on each iteration. Workers can still send data, but just can not
read data.
Finally, one of the major side-effect benefits of this work is that
numerous traversals that were considered "illegal" by
`ComputerVerificationStrategy` are no longer illegal. The only types of
traversals that are illegal in OLAP are those that have `by()`-modulators that
go beyond the local star graph or are path-based (`select()` and `path()`) and
go beyond the element id in their `by()`-modulations.
This work creates breaking changes for both users (trivial) and providers
(intense). However, for providers, its only those providers that have their own
custom `GraphComputer` implementation. If they use `SparkGraphComputer` or
`GiraphGraphComputer`, no work is required of them.
CHANGELOG
```
* Added `MemoryComputeKey` for specification of `Memory` keys in
`VertexProgram`. (*breaking*)
* Added `VertexComputeKey` for specification of vertex compute properties
in `VertexProgram`. (*breaking*)
* Added `and`, `or`, and `addAll` to `Operator`.
* `Memory` API changed to support setting and adding values for reduction.
(*breaking*)
* `Memory` keys can be marked as broadcast and only those values are sent
to workers on each iterator.
* `Memory` keys can be marked transient and thus deleted at the end of the
OLAP job.
* Vertex compute keys can be marked transient and thus deleted at the end
of the OLAP job.
* `VertexProgram` API changed to support `MemoryComputeKey` and
`VertexComputeKey`. (*breaking*)
* `TraversalVertexProgram` able to execute OLAP and OLTP traversal sections
dynamically within the same job.
* Removed `FinalGet` interface as all post processing of reductions should
be handled by the reducing step explicitly. (*breaking*)
* Greatly simplified all `ReducingBarrierStep` implementations as they no
longer require `MapReduce` in OLAP.
* Nearly all steps in OLAP that used `MapReduce` now use `Memory` to do
their reductions which expands the list of legal traversals.
* `GroupStep` simplified with `GroupHelper.GroupMap` no longer being
needed. Related to the removal of `FinalGet`.
* OLAP side-effects that are no longer generated by `MapReduce` are simply
stored in `ComputerResult.Memory` w/ no disk persistence needed. (*breaking*)
```
UPGRADE-PROVIDERS
```
GraphComputer updates to semantics and API
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Providers that have a custom `GraphComputer` implementation will have a lot
to handle. Note that if the graph system simply uses `SparkGraphComputer` or
`GiraphGraphComputer` provided by TinkerPop, then no updates are required.
`Memory` updates:
* Any `BinaryOperator` can be used for reduction and is made explicit in
the `MemoryComputeKey`.
* `MemoryComputeKeys` can be marked transient and must be removed from the
resultant `ComputerResult.memory()`.
* `MemoryComputeKeys` can be specified to not broadcast and thus, must not
be available to workers to read in `VertexProgram.execute()`.
* The `Memory` API has been changed. No more `incr()`, `and()`, etc. Now
its just `set()` (setup/terminate) and `add()` (execute).
See TINKERPOP-1166:https://issues.apache.org/jira/browse/TINKERPOP-1166
Operational semantic test cases have been added to `GraphComputerTest` to
ensure that all the above behaviors are implemented correctly.
Providers that have a custom `ReducingBarrierStep` implementation will need
to adjust their implementation slightly to accommodate a new API that reflects
the `Memory` updates above. This should be a simple change. Note that
`FinalGet` no longer exists and such post-reduction processing is handled by
the reducing step.
See TINKERPOP-1164:https://issues.apache.org/jira/browse/TINKERPOP-1164
```
*UPGRADE-USERS*
```
VertexProgram and MemoryComputeKey and VertexComputeKey
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Users that have custom `VertexProgram` implementations will need to change
their implementations to support the new `VertexComputeKey` and
`MemoryComputeKey` classes. In the `VertexPrograms` provided by TinkerPop,
these changes were trivial, taking less than 5 minutes to make the updates.
* `VertexProgram.getVertexComputeKeys()` returns a `Set<VertexComputeKey>`.
No longer a `Set<String>`. Use `VertexComputeKey.of(String key,boolean
transient)` to generate a `VertexComputeKey`. Transient keys were not supported
in the past, so to make the implementation semantically equivalent, the boolean
transient should be false.
* `VertexProgram.getMemoryComputeKeys()` returns a `Set<MemoryComputeKey>`.
No longer a `Set<String>`. Use `MemoryComputeKey.of(String key, BinaryOperator
reducer, boolean broadcast, boolean transient)` to generate a
`MemoryComputeKey`. Broadcasting and transients were not supported in the past
so to make the implementation semantically equivalent, to boolean broadcast
should be true and the boolean transient should be false.
See TINKERPOP-1162:https://issues.apache.org/jira/browse/TINKERPOP-1162
*SparkGraphComputer and GiraphGraphComputer Persistence*
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Most of the `MapReduce`-based steps in `TraversalVertexProgram` have been
removed and replaced using the new `Memory`-reduction model. `MapReduce` jobs
always created a persistence footprint (e.g. in HDFS). Memory data was never
persisted to HDFS. As such, there will not be data on the disk that is
accessible. For instance, no more `~reducing`, `~traversers`, and specially
named side-effects such as those from `groupCount('m')`. The data is still
accessible via `ComputerResult.memory()`, its just simply does not have a
corresponding on-disk representation.
```
*CHANGELOG*
```
* Added `MemoryComputeKey` for specification of `Memory` keys in
`VertexProgram`. (*breaking*)
* Added `VertexComputeKey` for specification of vertex compute properties
in `VertexProgram`. (*breaking*)
* Added `and`, `or`, and `addAll` to `Operator`.
* `Memory` API changed to support setting and adding values for reduction.
(*breaking*)
* `Memory` keys can be marked as broadcast and only those values are sent
to workers on each iterator.
* `Memory` keys can be marked transient and thus deleted at the end of the
OLAP job.
* Vertex compute keys can be marked transient and thus deleted at the end
of the OLAP job.
* `VertexProgram` API changed to support `MemoryComputeKey` and
`VertexComputeKey`. (*breaking*)
* `TraversalVertexProgram` able to execute OLAP and OLTP traversal sections
dynamically within the same job.
* Removed `FinalGet` interface as all post processing of reductions should
be handled by the reducing step explicitly. (*breaking*)
* Greatly simplified all `ReducingBarrierStep` implementations as they no
longer require `MapReduce` in OLAP.
* Nearly all steps in OLAP that used `MapReduce` now use `Memory` to do
their reductions which expands the list of legal traversals.
* `GroupStep` simplified with `GroupHelper.GroupMap` no longer being
needed. Related to the removal of `FinalGet`.
* OLAP side-effects that are no longer generated by `MapReduce` are simply
stored in `ComputerResult.Memory` w/ no disk persistence needed. (*breaking*)
```
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/incubator-tinkerpop TINKERPOP-1166
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-tinkerpop/pull/243.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #243
----
commit b50a43ce781572a1610fa3e31b5132205796af67
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-19T22:48:50Z
Migrated over to the proposed Memory model of using registered
BinaryOperator reducers. It was really easy to change so thats good. All test
cases pass for TinkerGraphComputer, one fails in SparkGraphComputer, and I have
some NullPointer serialiation issue with GiraphGraphComputer that I will fix
later.
commit 0ae584cae51ab15eef7de776cf8c049b64ace852
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-19T23:45:18Z
couldn't help myself. once I walked away from the computer I realized how
to make GiraphMemory work. Got it working and test cases passing. Only one test
fails in SparkGraphComputer. I will handle THAT next week.
commit cf79c2255028d3a2a7dabb4198030b50bfb65417
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-20T02:35:52Z
minor tweak.
commit 7bf8213a5e27026f9a378a0eb166f3a67038321f
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-20T17:03:25Z
fixed the SparkMemory issue from last night. Wow. I was really expecting
this ticket to take me all of next week. Knocked it out before the week even
began. Bow to me all you peons.
commit 706759c1dbf9df6bf9210913001658ca9f9ff513
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-22T16:50:06Z
added GraphComputerTest.shouldSupportTransientKeys(). Ensured that only
Memory.set() is allowed in setup()/terminate() and Memory.add() in execute().
Fixed up SparkGraphComputer, TinkerGraphComputer, and GiraphSparkComputer to
respect the new MemoryComputeKey semantics and transient key semantics.
commit b03c1adc466fa48fffcfeb7e17b71a243f66b76f
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-22T17:38:50Z
EDGE_COUNT and VOTE_STRENGTH in PageRankVertexProgram and
PeerPressureVertexProgram are not transient and the respective property keys
are private static. Extended GraphComputerTest.shouldSupportTransientKeys()
with a MapReduce job that ensure that the transient vertex properties are not
accessible during MapReduce.
commit 65bbdaa336b4034421f7e2599bb3f2c307aa4773
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-22T21:41:02Z
Have all the ReducingBarrierSteps no longer implement MapReduce and using
MemoryComputeKeys for their reduction. GroupStep, FoldStep, and GroupStepV3d0
are not complete yet. Having the darndest time with GroupStep -- once I get it,
then the others will follow from it. Pushing to save work.
commit 3389122591a788ceb566f9bdf5a1ba6b72789faa
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-22T22:15:21Z
Got GroupStep working --- buts its a hack unfortunately. Pushing to save
work.
commit 9f8252b816432249a0667e654b682293b77ec3c1
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-23T00:21:48Z
Fixed a bug in FoldStep, got GroupStep working perfectly (both OLTP and
OLAP -- but there is still one awakward hack). Need to spend some more cycles
on GroupStep and then once I get that clean, clean, clean, I will map that
pattern over to GroupStepV3d0 and that will be that.
commit 90debb4fed3bffa5833543566620bcf0a30ae7f4
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-23T01:15:09Z
some more work on GroupStep ---- converging on final solution.
commit 9416e9498a363e8ca1ef69df6cc0d046762e43e3
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-23T17:25:36Z
New Reducing-based Memory-model implemented. A few kooky things emerged
because of this and will discuss in UPGRADE docs. However, all-in-all this is a
much nicer model which will lead to significant perofmrance improvements (still
need to benchmark and test). I don't think we will ever deprecate the MapReduce
infrastructure. The Memory model is not as flexible and efficient (for certain
jobs) as MapReduce.
commit d77bf58c8e60f52f180af21ef18091c3cb143c3b
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-23T20:40:04Z
Solved the GroupStep-hack. The whole notion of a FinalGet is bad. Every
ReducingBarrier step now implements generateFinalReduction() with the default
being the identity function. For GroupStep, MeanStep, and GroupStepV3d0, they
do the final respective reduction. Also, GroupStep is a bit more organized --
only one BinaryOperator needed.
commit 14d0d48435640572523d942d56b9a7a6dd128969
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-23T23:21:02Z
The next big thing in the new MemoryComputeKey model -- broadcasting. It is
possible to state that a MemoryKey will NOT be read by workers and thus, no
need to send the data to the workers on each iteration. Added
GraphComputerTest.shouldSupportBroadcastKeys(). SparkGraphComputer supports
this natively, TinkerGraphComputer simply hides the data when trying to be
accessed by workers, and GiraphGraphComputer (like TinkerGraphComputer) but I
will be able to at least clear the data immediately once its sent (future
work). Cleaned up GroupStep a bit. Have a consistent naming convention for
workers vs. master --- inExecute. All XXXMemory implementations use it so its
easier to see how they all relate to each other. Added a GraphComputerTest to
make sure exceptions are correct around get(), add(), set() for various
situations -- found a couple of inconsistencies that are now fixed up.
commit 90e862d76ee23d8e03f1b954be4276cd09f707ed
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-23T23:21:35Z
Merge branch 'master' into TINKERPOP-1166
commit e6adbecfc401c4f41705216fecfac74ff04b5c8e
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-24T00:37:23Z
Needed a no-args constructor for FoldBiOperator in GiraphGraphComputer.
commit aa79390f7afd78091755784753ed67a311f1d7da
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-24T20:52:29Z
Okay. So MapReduce in TraversalVertexProgram is going away fast.
GroupSideEffectStep and GroupStep are now one step -- GroupStep. Likewise for
GroupCountStep. groupCount() is simply groupCount().cap(). This makes the code
alot simpler and easier to optimize everything in one spot. With this
direction, TraversalVertexProgram will be able to do OLAP ... then in
terminate(), do reductions. If those reductions yield traversers, termiate() ==
false, and we distributed messages again. Thus, we will have
OLTP->OLAP->OLTP->OLAP all possible within in a single TraveraslVertexProgram.
The idea is that the master worker (termiante()) will do OLTP processing until
it needs to go back to OLAP (if necessary). There is still lots more work to
do. This push is to save work.
commit 980525cd2d7ee656858f68703635972c740486fa
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-24T22:13:09Z
StoreStep no longer uses MapReduce. All that is left is AggregateStep and
ProfileStep.
commit 2f3b2ac971ade400b4f2bfb794b451a612ed0f69
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-25T15:29:23Z
out of my 24 hole. @dkuppitz -- I was wrong, you can't have XXXStep and
XXXSideEffectStep as one in the same. I learned exactly what I learned about a
year ago by trying. However, note that all GroupCountXXXStep and GroupXXXStep
are no longer MapReduce-based, but GraphComputer.Memory-based and because of
their alignment they share lots of the same code. All that is left is TreeStep,
ProfileStep, and AggregateStep to convert to the GraphComputer.Memory model.
commit aae8910388873f7a749b3df330fca05ce7b5d2eb
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-25T15:53:09Z
Got rid of the FinalGet concept. This won't work moving forward. Each step
is responsible for its final reduction via
GraphComputing.generateFinalReduction(). TreeStep and TreeSideEffectStep now
both use the GraphComputer.Memory model. All that is left is AggregateStep and
ProfileStep.
commit d1812f806cd360ce5716a620413a28938ef80ab6
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-25T17:40:40Z
RangeGlobalStep is now computing in TraversalVertexProgram. Will slowly
dissect away the end-steps of TraverserMapReduce such that we have a pure
OLTP/OLAP model within TraversalVertexProgram.
commit e6d9d6d2581dd3ef4c32d336df0eb7a8989bbaba
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-25T17:53:18Z
TailGlobalStep is no longer an OLAP end step. CollectingBarriers are up
next -- Aggregate, Order, Dedup.
commit c4b9e741421d2af895bdf3c37166f952209d72d4
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-25T18:37:44Z
All the ReducingBarrierSteps inherent their MemoryComputeKey from abstract
ReducingBarrierStep. There key is now step.getId(), not longer
ReducingBarrierStep.REDUCING as you can now have multiple reducers in the same
OLAP job. Updated ComputerResultStep accordingly. Next up will be such that
ComputerResultStep does not do ANY introspection into
TraversalVertexProgramStep -- it simply pulls ~traversers from
GraphComputer.Memory. And thats all there is to it.
commit 6950b198aff0d1fb4842d972bc145a6eeae539b2
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-25T21:46:23Z
Gremlin OLAP can now do all the same traversals as Gremlin OLTP. In fact, a
Gremlin OLAP job is a undulation between distributed traversers and localized
traversers. Its really neat -- it like 'breathes'. Fan out, reduce, fan out,
reduce.... I still have to convert over OrderGlobalStep so its not done done,
but yea. Saving work.
commit 32dd068291e45179648340745c882df459a8115a
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-26T00:04:55Z
Okay. Here is the mother load. OrderGlobalStep, DedupGlobalStep, etc. can
now exist anywhere in an OLAP traversal. There are some loose ends that still
need to be cleanup (as well as some major code reorg and compression), but this
is the stuff. This has been a long time coming. This new GraphComputer.Memory
model is sooooo much more efficient and gives us much more expressivity. Stoked.
commit 066cc218bb90a27f1ee016b4db3016a630f00f63
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-26T01:17:16Z
limit(1) is now compiled into the TraversalVertexProgramStep job. Forgot to
update this test.
commit 42b7254505563eaa1925639446cbb55ad708c7b7
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-26T15:45:14Z
I totally forgot about @dkuppitz Operator.java work. I gutted lots of
rewritten operators and now just uses Operator. This makes less classes to
register with GryoMapper -- phew. It was getting insane. Also, in the future,
if we want to make more optimal implementations, we can just add stuff like
Operator.sumLong() and it doesn't effect serialization because Operator is an
enum.
commit efc1739f183ff8f30be54b53c393e3ba8d03ced9
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-26T16:33:33Z
tada -- fixed my groupCount()...groupCount()... bug. Added a crazy
asynchronous traversal to groupCount() that demonstrates that the timing is
correct for OLAP to OLTP to OLAP conversion. I went through all the test case
that have Ignore over a test and remove lots of them. Some didnt even make
sense why we had them Ignored. There are only a few Ignore(COMPUTER) and they
make sense... (not related to this PR, but in general issues with serialization
or whatnot). This new OLAP work is sooooo slammin.
commit 958791792b3e48b7c983fb5a54ae21671e1154f6
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-26T17:11:38Z
renamed Operator.add() to Operator.addAll() as its more clear especially
since Operator.sum() exists.
----
> Add Memory.reduce() as option to Memory implementations.
> --------------------------------------------------------
>
> Key: TINKERPOP-1166
> URL: https://issues.apache.org/jira/browse/TINKERPOP-1166
> Project: TinkerPop
> Issue Type: Improvement
> Components: hadoop, process, tinkergraph
> Affects Versions: 3.1.2-incubating
> Reporter: Marko A. Rodriguez
> Labels: breaking
>
> Currently {{Memory}} supports {{incr}}, {{and}}, {{or}}, ... These are great
> and what people will typically use. However, we should also provide the
> generalization which is simply {{Memory.reduce}}. In this situation,
> {{incr}}, {{or}}, {{and}}, etc. are just specifications of {{Memory.reduce}}.
> How would it work?
> When memory is initialized in a {{VertexProgram}}, it would be like this:
> {code}
> memory.set("myReduction", new MyReducingFunction(0))
> {code}
> Then {{ReducingFunction}} would look like this:
> {code}
> public class ReducingFunction implements UnaryOperator<A> {
> public A getInitialValue();
> public A apply(A first, A second);
> }
> {code}
> Easy peasy. Note that both Spark and Giraph support such types of
> function-based reduction in their respective "memory engines."
> TinkerGraphComputer will, of course, be easy to add this functionality too.
> Why do this? For two reasons:
> 1. We get extra flexibility in {{Memory}}.
> 2. https://issues.apache.org/jira/browse/TINKERPOP-1164
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)