Hey, I'm a little torn here. On one hand, it's good to have the option of returning a ReferenceVertex, which is currently really complicated to do. On the other hand, this new behavior is far from intuitive and has some issues that will be hard to overcome.
If I'm understanding you correctly, both behaviors would still be available and we would just switch the default mode, right? I would like to debate whether or not this new behavior should be the default (I don't really know where I stand, but just for the sake of being thorough).

Setting aside the actual issues this introduces (I'm pretty sure they will only concern very few people, and those people can use whichever configuration they need), people coming from the SQL world who already have trouble adjusting to Gremlin will find this counter-intuitive. After all, these people couldn't care less about ReferenceVertex; on the other hand, it's very natural to query a vertex and get its info. Not to mention that the ways of getting the properties differ depending on whether you handle a vertex directly or use a traversal, and they are not very consistent.

Again, I don't really know where I stand on this, I just wanted to be thorough. Thoughts?

On Wed, May 18, 2016 at 4:04 PM, Stephen Mallette <[email protected]> wrote:

> I'll try to keep this simple, as serialization tends to be anything but
> simple....
>
> Forgetting GraphML which has its own rules, GraphSON and Gryo are the two
> key serialization modules that we have in IO. We use these for both
> serialization to disk as well as serialization over the network in Gremlin
> Server. If you issue a request like:
>
> g.V()
>
> it returns vertices obviously. For both Gryo and GraphSON, those vertices
> are converted to DetachedVertex which includes the properties of the
> Vertex. This can be tremendously expensive, especially if the graph
> supports multi-properties.
>
> I think that Gremlin Server should take a hint from OLAP in relation to
> this issue. With OLAP, a Vertex is converted to a ReferenceVertex where we
> only get the element identifier passed around.
>
> gremlin> graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
> ==>hadoopgraph[gryoinputformat->gryooutputformat]
> gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
> ==>graphtraversalsource[hadoopgraph[gryoinputformat->gryooutputformat], sparkgraphcomputer]
> gremlin> l = g.V().toList();[]
> gremlin> l[0].class
> ==>class org.apache.tinkerpop.gremlin.structure.util.reference.ReferenceVertex
>
> If you want more information, it is up to you to issue your query to
> request that information - for example:
>
> g.V().valueMap(true)
>
> I think Gremlin Server should work in the same fashion (i.e. return a
> ReferenceVertex when a Vertex is serialized over the network). It would
> ease up on serialization overhead and force users to be more explicit about
> the data that they want which would prevent unnecessary performance
> surprises. This change might also be nice for the efficiency of
> RemoteGraph/Connection implementations.
>
> This has bothered me for a while, but we carried over the pattern from
> TinkerPop 2.x of sending back properties and I've been concerned about
> introducing a break in trying to improve that. I dug into it more today
> and my analysis seems to indicate that this change can occur without
> breaking all the code that's currently out there. I think that we could
> keep the existing serialization model and simply add in the ReferenceVertex
> approach as a configuration option for 3.2.1 and then make it the default
> for 3.3.x.
>
> If there are no objections in the next 72 hours (Saturday, May 21, 2016,
> 4pm EST) I'll assume lazy consensus and move forward.
>
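
P.S. To make the trade-off a bit more concrete, here is a rough plain-Java sketch of what each mode would put on the wire. This is not TinkerPop code: the class and method names (VertexPayloadSketch, detachedPayload, referencePayload) are made up for illustration, and it assumes, as Stephen describes, that the detached form carries the properties while the reference form carries only the element identifier.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class VertexPayloadSketch {

    // "Detached" style (hypothetical helper): id, label and a copy of every property.
    static Map<String, Object> detachedPayload(Object id, String label,
                                               Map<String, Object> properties) {
        Map<String, Object> payload = new LinkedHashMap<>();
        payload.put("id", id);
        payload.put("label", label);
        payload.put("properties", new LinkedHashMap<>(properties));
        return payload;
    }

    // "Reference" style (hypothetical helper): only the element identifier.
    static Map<String, Object> referencePayload(Object id) {
        Map<String, Object> payload = new LinkedHashMap<>();
        payload.put("id", id);
        return payload;
    }

    public static void main(String[] args) {
        Map<String, Object> props = new LinkedHashMap<>();
        props.put("name", "marko");
        props.put("age", 29);

        // The detached payload grows with the number of properties;
        // the reference payload stays the same size.
        System.out.println("detached:  " + detachedPayload(1, "person", props));
        System.out.println("reference: " + referencePayload(1));
    }
}
```

With the reference style, anyone who wants the properties has to ask for them explicitly, e.g. with g.V().valueMap(true) as in Stephen's example, which is exactly the step I expect SQL people to trip over.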
