GitHub user okram opened a pull request:
https://github.com/apache/incubator-tinkerpop/pull/268
TINKERPOP-1082 & TINKERPOP-1222: Hadoop Configuration Updates
https://issues.apache.org/jira/browse/TINKERPOP-1082
https://issues.apache.org/jira/browse/TINKERPOP-1222
We had a very confusing situation with `gremlin.hadoop.graphInputFormat`
and `gremlin.spark.graphInputRDD`. Not only did it cause a mess of `[WARN]`
messages, it was also awkward because users had to know that one overrode the
other. To make this cleaner, I created two new configurations,
`gremlin.hadoop.graphReader` and `gremlin.hadoop.graphWriter`, each of which
can take either an `XXXFormat` or an `XXXRDD`. Internally, Spark/Giraph/etc.
know how to reason about which is which.
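To illustrate the backwards-compatible mapping from the deprecated keys to the new ones, here is a minimal standalone sketch. It is not the actual `HadoopConfiguration` code: the class `LegacyKeyMapper`, its `normalize` method, and the precedence rules (an explicitly set new key wins, otherwise the first matching deprecated key in the table wins) are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of TinkerPop.
final class LegacyKeyMapper {

    // Deprecated key -> replacement key, as listed in this PR.
    private static final Map<String, String> LEGACY_TO_NEW = new LinkedHashMap<>();

    static {
        LEGACY_TO_NEW.put("gremlin.hadoop.graphInputFormat", "gremlin.hadoop.graphReader");
        LEGACY_TO_NEW.put("gremlin.spark.graphInputRDD", "gremlin.hadoop.graphReader");
        LEGACY_TO_NEW.put("gremlin.hadoop.graphOutputFormat", "gremlin.hadoop.graphWriter");
        LEGACY_TO_NEW.put("gremlin.spark.graphOutputRDD", "gremlin.hadoop.graphWriter");
    }

    private LegacyKeyMapper() {
    }

    // Returns a copy of the input in which deprecated keys are rewritten to
    // their replacements. Assumption: a new key that is already set is never
    // overwritten by a deprecated one.
    static Properties normalize(final Properties input) {
        final Properties result = new Properties();
        // Copy every key that is not deprecated, including any new-style keys.
        for (final String name : input.stringPropertyNames()) {
            if (!LEGACY_TO_NEW.containsKey(name)) {
                result.setProperty(name, input.getProperty(name));
            }
        }
        // Map deprecated keys onto the new keys where no value is set yet.
        for (final Map.Entry<String, String> entry : LEGACY_TO_NEW.entrySet()) {
            final String value = input.getProperty(entry.getKey());
            if (value != null && result.getProperty(entry.getValue()) == null) {
                result.setProperty(entry.getValue(), value);
            }
        }
        return result;
    }
}
```

With this sketch, a properties file that still sets only `gremlin.hadoop.graphInputFormat` ends up with the same value under `gremlin.hadoop.graphReader`, and unrelated keys pass through unchanged.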
I also added `gremlin.hadoop.defaultGraphComputer`, which lets users
specify a default `GraphComputer` in their properties file. If it is set,
`graph.compute()` will no longer throw an exception saying to use
`graph.compute(class)`.
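The lookup behind a no-argument `graph.compute()` can be pictured roughly as follows. This is a standalone sketch, not the `HadoopGraph` implementation: `DefaultComputerLookup`, `resolve`, and the use of `IllegalArgumentException` are assumptions for illustration, and the method returns the configured class name instead of loading the class so that it runs without TinkerPop on the classpath.

```java
import java.util.Properties;

// Hypothetical helper for illustration only; not part of TinkerPop.
final class DefaultComputerLookup {

    // Configuration key introduced by this PR.
    static final String KEY = "gremlin.hadoop.defaultGraphComputer";

    private DefaultComputerLookup() {
    }

    // Returns the configured GraphComputer class name, or fails with a hint
    // to use graph.compute(Class) when no default is configured.
    static String resolve(final Properties conf) {
        final String name = conf.getProperty(KEY);
        if (name == null || name.trim().isEmpty()) {
            // Stand-in for whatever exception the real graph.compute() throws.
            throw new IllegalArgumentException(
                    "No " + KEY + " configured; use graph.compute(Class) instead");
        }
        return name.trim();
    }
}
```

A provider that supports only one Hadoop-based OLAP engine would set the key once in its properties file, after which callers never need to name the computer class.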
Both of these changes are backwards compatible. Backwards compatibility is
tested via `SparkHadoopGraphProvider`, where a coin-flip decides whether the
old model or the new model is used for a given run.
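The coin-flip idea can be sketched as below. This is not the code in `SparkHadoopGraphProvider`: the class `CoinFlipConfig` and the method `readerConfig` are hypothetical, and only the reader key is shown.

```java
import java.util.Properties;
import java.util.Random;

// Hypothetical test helper for illustration only; not part of TinkerPop.
final class CoinFlipConfig {

    static final String OLD_READER = "gremlin.hadoop.graphInputFormat";
    static final String NEW_READER = "gremlin.hadoop.graphReader";

    private CoinFlipConfig() {
    }

    // Sets the reader class under either the deprecated key or the new key,
    // chosen at random, so repeated test runs exercise both code paths.
    static Properties readerConfig(final Random random, final String readerClass) {
        final Properties conf = new Properties();
        conf.setProperty(random.nextBoolean() ? OLD_READER : NEW_READER, readerClass);
        return conf;
    }
}
```

Any single run covers only one of the two models, so coverage of both relies on the suite being run repeatedly.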
Finally, I forgot to add docs on `GraphFilter` and they have been added to
this PR.
CHANGELOG
```
* Added `gremlin.hadoop.defaultGraphComputer` so users can use
`graph.compute()` with `HadoopGraph`.
* Added `gremlin.hadoop.graphReader` and `gremlin.hadoop.graphWriter` which
can handle `XXXFormats` and `XXXRDDs`.
* Deprecated `gremlin.hadoop.graphInputFormat`,
`gremlin.hadoop.graphOutputFormat`, `gremlin.spark.graphInputRDD`, and
`gremlin.spark.graphOutputRDD`.
```
UPDATE
```
Hadoop Configurations
++++++++++++++++++
Note that `gremlin.hadoop.graphInputFormat`,
`gremlin.hadoop.graphOutputFormat`, `gremlin.spark.graphInputRDD`, and
`gremlin.spark.graphOutputRDD` have all been deprecated. Using them still works,
but moving forward, users only need to leverage `gremlin.hadoop.graphReader`
and `gremlin.hadoop.graphWriter`. An example properties file snippet is
provided below.
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.defaultGraphComputer=org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer
```
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/incubator-tinkerpop TINKERPOP-1082
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-tinkerpop/pull/268.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #268
----
commit 6411d0d4142770f93fb1a188d7e991ed1b4355f3
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-03-16T22:01:37Z
gremlin.hadoop.graphReader and gremlin.hadoop.graphWriter are the new
configurations replacing gremlin.hadoop.graphInputFormat and
spark.graphInputRDD. Now HadoopGraph can handle either RDD or XXXFormats.
Cleaner configurations. Backwards compatible. The older keys just map to the
new keys inside HadoopConfiguration.
commit b7f617b383700390128fca53de48f60cda3211fe
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-03-16T22:26:22Z
fixed up the conf/.properties to use graphReader/graphWriter. Found more
areas where inputFormat/outputFormat was still being used. Tested Giraph and
it's passing completely now. Need a helper utility that converts any
Reader/Writer into an InputFormat or OutputFormat automagically.
commit 13561b81aa8287c696b8d79befce42f84792f793
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-03-16T22:49:47Z
ConfUtil does the dirty work of InputRDD or InputFormat conversion to an
InputFormat.
commit 5f53589b487ab918719315db6047233fb13971ae
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-03-17T14:42:57Z
added gremlin.hadoop.defaultGraphComputer which allows users to specify in
their properties file which GraphComputer to use by default. This allows
providers that only support one Hadoop-based OLAP engine to 'hard set' the
implementation so the syntax is cleaner -- graph.compute() vs.
graph.compute(GiraphGraphComputer.class). This is backwards compatible. The
SparkHadoopGraphProvider has been updated to sometimes use compute() and
sometimes use compute(class).
commit 4a130d9092bc37dac252536280d60158fe75f74c
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-03-17T15:09:16Z
updated docs on GraphFilter and graphReader/graphWriter.
commit 5a9f56d53741c985982d2bb13d3d8f31ffb6dd85
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-03-17T15:32:04Z
gremlin.hadoop.graphInputFormat.hasEdges is now
gremlin.hadoop.graphReader.hasEdges. Likewise for graphOutputFormat. Backwards
compatible.
----