Marko A. Rodriguez created TINKERPOP-1250:
---------------------------------------------
Summary: Support Subgraph-Centric GraphComputer
Key: TINKERPOP-1250
URL: https://issues.apache.org/jira/browse/TINKERPOP-1250
Project: TinkerPop
Issue Type: New Feature
Components: process
Affects Versions: 3.2.0-incubating
Reporter: Marko A. Rodriguez
Right now, {{GraphComputer}} and {{VertexPrograms}} are "vertex-centric." That
is, the boundary of an atomic unit of computation is a single vertex, its
properties, and its incident edges. We should support "subgraph-centric"
computing where each worker's partition is loaded in memory (RAM) as a
connected subgraph. For those vertices that are not in its partition, a special
"reference vertex" is use to reference it. What this would mean is that when a
{{Traverser}} is processing, it can continue to evaluate as long as its within
the subgraph partition. The moment it references a vertex not in the partition
(a "reference vertex"), it serializes itself as a message.
This would greatly increase the speed of Gremlin OLAP at the cost of requiring
a large amount of memory to store all the worker partition subgraphs in RAM.
There might even be a way to have a hybrid-model where some of the partition is
held in RAM and the other is (even though still in the same partition) is
stored as "star vertices."
How would this be added to {{GraphComputer}} in a backwards compatible way?
{code}
GraphComputer.supportsVertexCentricComputing)() // currently true for all
implementations
GraphComputer.supportsSubgraphCentricComputing()
{code}
A {{VertexProgram}} could then have the following method:
{code}
boolean VertexProgram.withinCentricity(final M message)
{code}
If the message is NOT within "centricity", then it is serialized and
distributed, else it continues to execute.
I haven't thought through all the API and implementation considerations. Though
it would be good to make this backwards compatible and, moving forward, able to
support "edge-centric computing" and thus, have a very memory limited OLAP
system.
How to think of the different models:
1. Vertex-centric: medium speed, medium expressivity, medium memory costs.
2. Subgraph-centric: high speed, high expressivity, high memory costs.
3. Edge-centric: high speed, low expressivity, low memory costs.
What is "expressivity"? Well, subgraph-centric computations can have "local
traversals" that move beyond the "star graph." Edge-centric would not support
any `by()`-modulators as the computation is bound to a single edge. Thus, low
expressivity. It really depends on how you represent the edge-centric edge
list. Do you have the vertices on each side of an edge duplicated with all
their properties? This wold still be low memory costs, but you could get more
expressivity.
Anywho, in general it would be nice if the underlying execution engine can
handle the three common distributed graph computing paradigms. Subgraph-centric
seems the easiest to support at this point in time.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)