[jira] [Commented] (TINKERPOP-1166) Add Memory.reduce() as option to Memory implementations.

ASF GitHub Bot (JIRA) Fri, 26 Feb 2016 09:28:45 -0800

    [ 
https://issues.apache.org/jira/browse/TINKERPOP-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169370#comment-15169370
 ]


ASF GitHub Bot commented on TINKERPOP-1166:
-------------------------------------------

GitHub user okram opened a pull request:

    https://github.com/apache/incubator-tinkerpop/pull/243

    TINKERPOP-1166, TINKERPOP-1164, TINKERPOP-1057, TINKERPOP-1162

    https://issues.apache.org/jira/browse/TINKERPOP-1166
    https://issues.apache.org/jira/browse/TINKERPOP-1164
    https://issues.apache.org/jira/browse/TINKERPOP-1057
    https://issues.apache.org/jira/browse/TINKERPOP-1162
    
    This PR has all these tickets combined because in order to solve any I 
needed to make sure all these ticket's respective solutions played well with 
one another. In summary, GraphComputer `Memory` has come to the center stage as 
a replacement for `MapReduce`. The benefit is that `Memory` can compute/reduce 
in parallel with a VertexProgram's execution run. This has the following 
benefits:
    
    * Reductions happen within the vertex program and thus, no subsequent 
rescan of the full graph needed to generate sideEffect. For example -- 
`g.V().count()` is one pass through the graph, not two.
    * Because reductions happen within the vertex program, barrier steps are no 
longer the cut-off point for an OLAP traversal. An OLAP traversal can now have 
multiple reducing barrier steps within it (e.g. `max()`, `groupCount()`, 
`min()`, `group()`, etc.). Its all one job.
    * Vertex compute keys can be marked transient and are automagically removed 
from the resultant graph. This is good for removing "scratch data." (e.g. 
PageRankVertexProgram.EDGE_COUNTS).
    * Memory compute keys can be marked transient and are automagically remove 
from the result memory. This is good for removing "scratch data." (e.g. 
VOTE_TO_HALT).
    * Memory compute keys can be declared to NOT broadcast and thus, no be sent 
to the workers on each iteration. Workers can still send data, but just can not 
read data.
    
    Finally, one of the major side-effect benefits of this work is that 
numerous traversals that were considered "illegal" by 
`ComputerVerificationStrategy` are no longer illegal. The only types of 
traversals that are illegal in OLAP are those that have `by()`-modulators that 
go beyond the local star graph or are path-based (`select()` and `path()`) and 
go beyond the element id in their `by()`-modulations.
    
    This work creates breaking changes for both users (trivial) and providers 
(intense). However, for providers, its only those providers that have their own 
custom `GraphComputer` implementation. If they use `SparkGraphComputer` or 
`GiraphGraphComputer`, no work is required of them.
    
    CHANGELOG
    
    ```
    * Added `MemoryComputeKey` for specification of `Memory` keys in 
`VertexProgram`. (*breaking*)
    * Added `VertexComputeKey` for specification of vertex compute properties 
in `VertexProgram`. (*breaking*)
    * Added `and`, `or`, and `addAll` to `Operator`.
    * `Memory` API changed to support setting and adding values for reduction. 
(*breaking*)
    * `Memory` keys can be marked as broadcast and only those values are sent 
to workers on each iterator. 
    * `Memory` keys can be marked transient and thus deleted at the end of the 
OLAP job.
    * Vertex compute keys can be marked transient and thus deleted at the end 
of the OLAP job.
    * `VertexProgram` API changed to support `MemoryComputeKey` and 
`VertexComputeKey`. (*breaking*)
    * `TraversalVertexProgram` able to execute OLAP and OLTP traversal sections 
dynamically within the same job.
    * Removed `FinalGet` interface as all post processing of reductions should 
be handled by the reducing step explicitly. (*breaking*)
    * Greatly simplified all `ReducingBarrierStep` implementations as they no 
longer require `MapReduce` in OLAP.
    * Nearly all steps in OLAP that used `MapReduce` now use `Memory` to do 
their reductions which expands the list of legal traversals.
    * `GroupStep` simplified with `GroupHelper.GroupMap` no longer being 
needed. Related to the removal of `FinalGet`.
    * OLAP side-effects that are no longer generated by `MapReduce` are simply 
stored in `ComputerResult.Memory` w/ no disk persistence needed. (*breaking*)
    ```
    
    UPGRADE-PROVIDERS
    
    ```
    GraphComputer updates to semantics and API
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    
    Providers that have a custom `GraphComputer` implementation will have a lot 
to handle. Note that if the graph system simply uses `SparkGraphComputer` or 
`GiraphGraphComputer` provided by TinkerPop, then no updates are required.
    
    `Memory` updates:
    
    * Any `BinaryOperator` can be used for reduction and is made explicit in 
the `MemoryComputeKey`.
    * `MemoryComputeKeys` can be marked transient and must be removed from the 
resultant `ComputerResult.memory()`.
    * `MemoryComputeKeys` can be specified to not broadcast and thus, must not 
be available to workers to read in `VertexProgram.execute()`.
    * The `Memory` API has been changed. No more `incr()`, `and()`, etc. Now 
its just `set()` (setup/terminate) and `add()` (execute).
    
    See TINKERPOP-1166:https://issues.apache.org/jira/browse/TINKERPOP-1166
    
    Operational semantic test cases have been added to `GraphComputerTest` to 
ensure that all the above behaviors are implemented correctly.
    
    Providers that have a custom `ReducingBarrierStep` implementation will need 
to adjust their implementation slightly to accommodate a new API that reflects 
the `Memory` updates above. This should be a simple change. Note that 
`FinalGet` no longer exists and such post-reduction processing is handled by 
the reducing step.
    
    See TINKERPOP-1164:https://issues.apache.org/jira/browse/TINKERPOP-1164
    
    ```
    
    *UPGRADE-USERS*
    
    ```
    VertexProgram and MemoryComputeKey and VertexComputeKey
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    
    Users that have custom `VertexProgram` implementations will need to change 
their implementations to support the new `VertexComputeKey` and 
`MemoryComputeKey` classes. In the `VertexPrograms` provided by TinkerPop, 
these changes were trivial, taking less than 5 minutes to make the updates.
    
    * `VertexProgram.getVertexComputeKeys()` returns a `Set<VertexComputeKey>`. 
No longer a `Set<String>`. Use `VertexComputeKey.of(String key,boolean 
transient)` to generate a `VertexComputeKey`. Transient keys were not supported 
in the past, so to make the implementation semantically equivalent, the boolean 
transient should be false.
    * `VertexProgram.getMemoryComputeKeys()` returns a `Set<MemoryComputeKey>`. 
No longer a `Set<String>`. Use `MemoryComputeKey.of(String key, BinaryOperator 
reducer, boolean broadcast, boolean transient)` to generate a 
`MemoryComputeKey`. Broadcasting and transients were not supported in the past 
so to make the implementation semantically equivalent, to boolean broadcast 
should be true and the boolean transient should be false.
    
    See TINKERPOP-1162:https://issues.apache.org/jira/browse/TINKERPOP-1162
    
    *SparkGraphComputer and GiraphGraphComputer Persistence*
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    
    Most of the `MapReduce`-based steps in `TraversalVertexProgram` have been 
removed and replaced using the new `Memory`-reduction model. `MapReduce` jobs 
always created a persistence footprint (e.g. in HDFS). Memory data was never 
persisted to HDFS. As such, there will not be data on the disk that is 
accessible. For instance, no more `~reducing`, `~traversers`, and specially 
named side-effects such as those from `groupCount('m')`. The data is still 
accessible via `ComputerResult.memory()`, its just simply does not have a 
corresponding on-disk representation.
    ```
    
    *CHANGELOG*
    
    ```
    * Added `MemoryComputeKey` for specification of `Memory` keys in 
`VertexProgram`. (*breaking*)
    * Added `VertexComputeKey` for specification of vertex compute properties 
in `VertexProgram`. (*breaking*)
    * Added `and`, `or`, and `addAll` to `Operator`.
    * `Memory` API changed to support setting and adding values for reduction. 
(*breaking*)
    * `Memory` keys can be marked as broadcast and only those values are sent 
to workers on each iterator. 
    * `Memory` keys can be marked transient and thus deleted at the end of the 
OLAP job.
    * Vertex compute keys can be marked transient and thus deleted at the end 
of the OLAP job.
    * `VertexProgram` API changed to support `MemoryComputeKey` and 
`VertexComputeKey`. (*breaking*)
    * `TraversalVertexProgram` able to execute OLAP and OLTP traversal sections 
dynamically within the same job.
    * Removed `FinalGet` interface as all post processing of reductions should 
be handled by the reducing step explicitly. (*breaking*)
    * Greatly simplified all `ReducingBarrierStep` implementations as they no 
longer require `MapReduce` in OLAP.
    * Nearly all steps in OLAP that used `MapReduce` now use `Memory` to do 
their reductions which expands the list of legal traversals.
    * `GroupStep` simplified with `GroupHelper.GroupMap` no longer being 
needed. Related to the removal of `FinalGet`.
    * OLAP side-effects that are no longer generated by `MapReduce` are simply 
stored in `ComputerResult.Memory` w/ no disk persistence needed. (*breaking*)
    ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/incubator-tinkerpop TINKERPOP-1166

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-tinkerpop/pull/243.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #243
    
----
commit b50a43ce781572a1610fa3e31b5132205796af67
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-19T22:48:50Z

    Migrated over to the proposed Memory model of using registered 
BinaryOperator reducers. It was really easy to change so thats good. All test 
cases pass for TinkerGraphComputer, one fails in SparkGraphComputer, and I have 
some NullPointer serialiation issue with GiraphGraphComputer that I will fix 
later.

commit 0ae584cae51ab15eef7de776cf8c049b64ace852
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-19T23:45:18Z

    couldn't help myself. once I walked away from the computer I realized how 
to make GiraphMemory work. Got it working and test cases passing. Only one test 
fails in SparkGraphComputer. I will handle THAT next week.

commit cf79c2255028d3a2a7dabb4198030b50bfb65417
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-20T02:35:52Z

    minor tweak.

commit 7bf8213a5e27026f9a378a0eb166f3a67038321f
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-20T17:03:25Z

    fixed the SparkMemory issue from last night. Wow. I was really expecting 
this ticket to take me all of next week. Knocked it out before the week even 
began. Bow to me all you peons.

commit 706759c1dbf9df6bf9210913001658ca9f9ff513
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-22T16:50:06Z

    added GraphComputerTest.shouldSupportTransientKeys(). Ensured that only 
Memory.set() is allowed in setup()/terminate() and Memory.add() in execute(). 
Fixed up SparkGraphComputer, TinkerGraphComputer, and GiraphSparkComputer to 
respect the new MemoryComputeKey semantics and transient key semantics.

commit b03c1adc466fa48fffcfeb7e17b71a243f66b76f
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-22T17:38:50Z

    EDGE_COUNT and VOTE_STRENGTH in PageRankVertexProgram and 
PeerPressureVertexProgram are not transient and the respective property keys 
are private static. Extended GraphComputerTest.shouldSupportTransientKeys() 
with a MapReduce job that ensure that the transient vertex properties are not 
accessible during MapReduce.

commit 65bbdaa336b4034421f7e2599bb3f2c307aa4773
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-22T21:41:02Z

    Have all the ReducingBarrierSteps no longer implement MapReduce and using 
MemoryComputeKeys for their reduction. GroupStep, FoldStep, and GroupStepV3d0 
are not complete yet. Having the darndest time with GroupStep -- once I get it, 
then the others will follow from it. Pushing to save work.

commit 3389122591a788ceb566f9bdf5a1ba6b72789faa
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-22T22:15:21Z

    Got GroupStep working --- buts its a hack unfortunately. Pushing to save 
work.

commit 9f8252b816432249a0667e654b682293b77ec3c1
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-23T00:21:48Z

    Fixed a bug in FoldStep, got GroupStep working perfectly (both OLTP and 
OLAP -- but there is still one awakward hack). Need to spend some more cycles 
on GroupStep and then once I get that clean, clean, clean, I will map that 
pattern over to GroupStepV3d0 and that will be that.

commit 90debb4fed3bffa5833543566620bcf0a30ae7f4
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-23T01:15:09Z

    some more work on GroupStep ---- converging on final solution.

commit 9416e9498a363e8ca1ef69df6cc0d046762e43e3
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-23T17:25:36Z

    New Reducing-based Memory-model implemented. A few kooky things emerged 
because of this and will discuss in UPGRADE docs. However, all-in-all this is a 
much nicer model which will lead to significant perofmrance improvements (still 
need to benchmark and test). I don't think we will ever deprecate the MapReduce 
infrastructure. The Memory model is not as flexible and efficient (for certain 
jobs) as MapReduce.

commit d77bf58c8e60f52f180af21ef18091c3cb143c3b
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-23T20:40:04Z

    Solved the GroupStep-hack. The whole notion of a FinalGet is bad. Every 
ReducingBarrier step now implements generateFinalReduction() with the default 
being the identity function. For GroupStep, MeanStep, and GroupStepV3d0, they 
do the final respective reduction. Also, GroupStep is a bit more organized -- 
only one BinaryOperator needed.

commit 14d0d48435640572523d942d56b9a7a6dd128969
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-23T23:21:02Z

    The next big thing in the new MemoryComputeKey model -- broadcasting. It is 
possible to state that a MemoryKey will NOT be read by workers and thus, no 
need to send the data to the workers on each iteration. Added 
GraphComputerTest.shouldSupportBroadcastKeys(). SparkGraphComputer supports 
this natively, TinkerGraphComputer simply hides the data when trying to be 
accessed by workers, and GiraphGraphComputer (like TinkerGraphComputer) but I 
will be able to at least clear the data immediately once its sent (future 
work). Cleaned up GroupStep a bit. Have a consistent naming convention for 
workers vs. master --- inExecute. All XXXMemory implementations use it so its 
easier to see how they all relate to each other. Added a GraphComputerTest to 
make sure exceptions are correct around get(), add(), set() for various 
situations -- found a couple of inconsistencies that are now fixed up.

commit 90e862d76ee23d8e03f1b954be4276cd09f707ed
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-23T23:21:35Z

    Merge branch 'master' into TINKERPOP-1166

commit e6adbecfc401c4f41705216fecfac74ff04b5c8e
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-24T00:37:23Z

    Needed a no-args constructor for FoldBiOperator in GiraphGraphComputer.

commit aa79390f7afd78091755784753ed67a311f1d7da
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-24T20:52:29Z

    Okay. So MapReduce in TraversalVertexProgram is going away fast. 
GroupSideEffectStep and GroupStep are now one step -- GroupStep. Likewise for 
GroupCountStep. groupCount() is simply groupCount().cap(). This makes the code 
alot simpler and easier to optimize everything in one spot. With this 
direction, TraversalVertexProgram will be able to do OLAP ... then in 
terminate(), do reductions. If those reductions yield traversers, termiate() == 
false, and we distributed messages again. Thus, we will have 
OLTP->OLAP->OLTP->OLAP all possible within in a single TraveraslVertexProgram. 
The idea is that the master worker (termiante()) will do OLTP processing until 
it needs to go back to OLAP (if necessary). There is still lots more work to 
do. This push is to save work.

commit 980525cd2d7ee656858f68703635972c740486fa
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-24T22:13:09Z

    StoreStep no longer uses MapReduce. All that is left is AggregateStep and 
ProfileStep.

commit 2f3b2ac971ade400b4f2bfb794b451a612ed0f69
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-25T15:29:23Z

    out of my 24 hole. @dkuppitz -- I was wrong, you can't have XXXStep and 
XXXSideEffectStep as one in the same. I learned exactly what I learned about a 
year ago by trying. However, note that all GroupCountXXXStep and GroupXXXStep 
are no longer MapReduce-based, but GraphComputer.Memory-based and because of 
their alignment they share lots of the same code. All that is left is TreeStep, 
ProfileStep, and AggregateStep to convert to the GraphComputer.Memory model.

commit aae8910388873f7a749b3df330fca05ce7b5d2eb
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-25T15:53:09Z

    Got rid of the FinalGet concept. This won't work moving forward. Each step 
is responsible for its final reduction via 
GraphComputing.generateFinalReduction(). TreeStep and TreeSideEffectStep now 
both use the GraphComputer.Memory model. All that is left is AggregateStep and 
ProfileStep.

commit d1812f806cd360ce5716a620413a28938ef80ab6
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-25T17:40:40Z

    RangeGlobalStep is now computing in TraversalVertexProgram. Will slowly 
dissect away the end-steps of TraverserMapReduce such that we have a pure 
OLTP/OLAP model within TraversalVertexProgram.

commit e6d9d6d2581dd3ef4c32d336df0eb7a8989bbaba
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-25T17:53:18Z

    TailGlobalStep is no longer an OLAP end step. CollectingBarriers are up 
next -- Aggregate, Order, Dedup.

commit c4b9e741421d2af895bdf3c37166f952209d72d4
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-25T18:37:44Z

    All the ReducingBarrierSteps inherent their MemoryComputeKey from abstract 
ReducingBarrierStep. There key is now step.getId(), not longer 
ReducingBarrierStep.REDUCING as you can now have multiple reducers in the same 
OLAP job. Updated ComputerResultStep accordingly. Next up will be such that 
ComputerResultStep does not do ANY introspection into 
TraversalVertexProgramStep -- it simply pulls ~traversers from 
GraphComputer.Memory. And thats all there is to it.

commit 6950b198aff0d1fb4842d972bc145a6eeae539b2
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-25T21:46:23Z

    Gremlin OLAP can now do all the same traversals as Gremlin OLTP. In fact, a 
Gremlin OLAP job is a undulation between distributed traversers and localized 
traversers. Its really neat -- it like 'breathes'. Fan out, reduce, fan out, 
reduce.... I still have to convert over OrderGlobalStep so its not done done, 
but yea. Saving work.

commit 32dd068291e45179648340745c882df459a8115a
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-26T00:04:55Z

    Okay. Here is the mother load. OrderGlobalStep, DedupGlobalStep, etc. can 
now exist anywhere in an OLAP traversal. There are some loose ends that still 
need to be cleanup (as well as some major code reorg and compression), but this 
is the stuff. This has been a long time coming. This new GraphComputer.Memory 
model is sooooo much more efficient and gives us much more expressivity. Stoked.

commit 066cc218bb90a27f1ee016b4db3016a630f00f63
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-26T01:17:16Z

    limit(1) is now compiled into the TraversalVertexProgramStep job. Forgot to 
update this test.

commit 42b7254505563eaa1925639446cbb55ad708c7b7
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-26T15:45:14Z

    I totally forgot about @dkuppitz Operator.java work. I gutted lots of 
rewritten operators and now just uses Operator. This makes less classes to 
register with GryoMapper -- phew. It was getting insane. Also, in the future, 
if we want to make more optimal implementations, we can just add stuff like 
Operator.sumLong() and it doesn't effect serialization because Operator is an 
enum.

commit efc1739f183ff8f30be54b53c393e3ba8d03ced9
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-26T16:33:33Z

    tada -- fixed my groupCount()...groupCount()... bug. Added a crazy 
asynchronous traversal to groupCount() that demonstrates that the timing is 
correct for OLAP to OLTP to OLAP conversion. I went through all the test case 
that have Ignore over a test and remove lots of them. Some didnt even make 
sense why we had them Ignored. There are only a few Ignore(COMPUTER) and they 
make sense... (not related to this PR, but in general issues with serialization 
or whatnot). This new OLAP work is sooooo slammin.

commit 958791792b3e48b7c983fb5a54ae21671e1154f6
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-26T17:11:38Z

    renamed Operator.add() to Operator.addAll() as its more clear especially 
since Operator.sum() exists.

----


> Add Memory.reduce() as option to Memory implementations.
> --------------------------------------------------------
>
>                 Key: TINKERPOP-1166
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-1166
>             Project: TinkerPop
>          Issue Type: Improvement
>          Components: hadoop, process, tinkergraph
>    Affects Versions: 3.1.2-incubating
>            Reporter: Marko A. Rodriguez
>              Labels: breaking
>
> Currently {{Memory}} supports {{incr}}, {{and}}, {{or}}, ... These are great 
> and what people will typically  use. However, we should also provide the 
> generalization which is simply {{Memory.reduce}}. In this situation, 
> {{incr}}, {{or}}, {{and}}, etc. are just specifications of {{Memory.reduce}}.
> How would it work?
> When memory is initialized in a {{VertexProgram}}, it would be like this:
> {code}
> memory.set("myReduction", new MyReducingFunction(0))
> {code}
> Then {{ReducingFunction}} would look like this:
> {code}
> public class ReducingFunction implements UnaryOperator<A> {
>   public A getInitialValue();
>   public A apply(A first, A second);
> }
> {code}
> Easy peasy. Note that both Spark and Giraph support such types of 
> function-based reduction in their respective "memory engines." 
> TinkerGraphComputer will, of course, be easy to add this functionality too.
> Why do this? For two reasons:
> 1. We get extra flexibility in {{Memory}}.
> 2. https://issues.apache.org/jira/browse/TINKERPOP-1164



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (TINKERPOP-1166) Add Memory.reduce() as option to Memory implementations.

Reply via email to