Hi Everyone,
The topic of Graph Schema has been discussed extensively in recent TinkerPop
Gatherings, and the following proposal has emerged from those discussions. I
believe it is now ready for broad consideration and discussion. I’ve done my
best to incorporate initial feedback from Josh, Pieter, Valentyn, Stephen, Kris
and others into this proposal; however, I won’t claim that it accurately
represents the views of anyone other than myself at this time. This is a broad
topic and I’m deliberately excluding critical topics to focus this thread on
standardizing interfaces for Gremlin users and providers to interact with
schema (see Assumptions for more details).
## Overview
This proposal introduces graph schema interfaces for TinkerPop: a way to define
vertex types, edge types, and property types as a meta-graph that is itself
traversable with Gremlin. The schema describes the structure of a data graph:
what kinds of vertices and edges exist, what properties they carry, and how
they connect.
## Assumptions
- Type keys are element labels: there is a 1-to-1 mapping between a label and a
type definition. A vertex labeled "person" corresponds to exactly one
VertexType, and an edge labeled "knows" corresponds to exactly one EdgeType.
- Java classes are used as a type system: This proposal uses Java classes to
define property type constraints. This is intended as a placeholder to be
replaced by a proper type system to be defined via a later discussion.
- This proposal gives very little consideration to if, when, where, or how
validation and enforcement of schema take place. I believe it is important for
us to ship something which is flexible and useful to providers out of the box,
while leaving space for providers to plug in existing implementations or build
their own if they desire. I’ve left this out of scope for this proposal to focus
first on interfaces which give providers the appropriate access to schema.
## Design Points
### 1. Schema-as-Graph
`GraphSchema extends Graph`. Providers implement a familiar interface, and
users traverse the schema with schema.traversal(). This avoids inventing a
parallel API surface. The schema is just another graph.
A data graph exposes its schema via Graph.schema(), which returns the
GraphSchema instance. For providers that don't support schema, the default
implementation throws UnsupportedOperationException.
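As a rough sketch of how that default could look, the following uses simplified stand-in interfaces rather than the real TinkerPop types. The class names SchemalessGraph and SchemaAwareGraph are hypothetical and exist only for illustration.

```java
// Simplified stand-ins for the proposed interfaces; NOT the real TinkerPop types.
interface Graph {
    // Assumed default for providers without schema support.
    default GraphSchema schema() {
        throw new UnsupportedOperationException("This graph does not support schema");
    }
}

// The schema is itself a graph, so it inherits the same default.
interface GraphSchema extends Graph {
}

// Hypothetical provider with no schema support: inherits the throwing default.
class SchemalessGraph implements Graph {
}

// Hypothetical provider with schema support: overrides schema().
class SchemaAwareGraph implements Graph {
    private final GraphSchema schema = new GraphSchema() { };

    @Override
    public GraphSchema schema() {
        return schema;
    }
}
```

Under these assumptions, calling schema() on a SchemalessGraph throws, while a SchemaAwareGraph returns its GraphSchema instance.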
### 2. All type definitions are vertices
VertexType, EdgeType, and PropertyType are all vertices in the schema
meta-graph.
- A VertexType vertex represents a vertex label definition (e.g. "person",
"software").
- An EdgeType vertex represents an edge label definition (e.g. "knows",
"created"). Even though it describes edges in the data graph, it is itself a
vertex in the schema graph, connected to its endpoint VertexType vertices via
from/to edges.
- A PropertyType vertex represents a property on a type, connected to its
parent type vertex via a "hasProperty" edge.
Property definitions are independent per type; they are not shared across types.
Schema graph example for the classic TinkerPop modern graph:
```
(person:vertexType) --hasProperty--> (name:propertyType)
(person:vertexType) --hasProperty--> (age:propertyType)
(software:vertexType) --hasProperty--> (name:propertyType)
(software:vertexType) --hasProperty--> (lang:propertyType)
(knows:edgeType) --from--> (person:vertexType)
(knows:edgeType) --to--> (person:vertexType)
(knows:edgeType) --hasProperty--> (weight:propertyType)
(created:edgeType) --from--> (person:vertexType)
(created:edgeType) --to--> (software:vertexType)
(created:edgeType) --hasProperty--> (weight:propertyType)
```
### 3. Constraints are properties on type vertices
Rather than a fixed constraint taxonomy, constraints are regular properties on
type vertices, keyed by string via constraint(key, value). This keeps the model
extensible such that providers can define their own constraints without changes
to the core API.
Constraints can be added to VertexType, EdgeType, and PropertyType vertices
directly. The most common constraints such as property types and required
properties would apply to PropertyTypes, while edge multiplicity constraints
(e.g. one-to-many, one-to-one) are naturally expressed as constraints on the
EdgeType itself rather than on any property.
While constraint keys are arbitrary strings and providers are free to implement
any constraints they like, TinkerPop should standardize a set of core
constraint keys representing the most common constraints. Examples include
"type", "required", "unique", "minValue", "maxValue", etc. Providers that
support equivalent constraints are encouraged to follow these conventional
names for interoperability.
Non-core constraints (custom to a provider) are encouraged to follow a
namespaced key convention to avoid collisions, e.g. "tinkergraph:notNull". Core
constraint keys are unnamespaced.
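To illustrate the key convention, here is a minimal sketch of a helper that separates core keys from namespaced provider keys. The class ConstraintKeys and its methods are hypothetical, and the core key set only repeats the examples listed above; the final set is still to be standardized.

```java
import java.util.Set;

// Hypothetical helper; neither the class nor its methods are part of the proposal.
final class ConstraintKeys {
    // Example core keys from the text above; the standardized set is not yet decided.
    static final Set<String> CORE = Set.of("type", "required", "unique", "minValue", "maxValue");

    private ConstraintKeys() { }

    // Core keys are unnamespaced and come from the standardized set.
    static boolean isCore(String key) {
        return CORE.contains(key);
    }

    // A provider-specific key has a non-empty namespace and name, e.g. "tinkergraph:notNull".
    static boolean isNamespaced(String key) {
        int idx = key.indexOf(':');
        return idx > 0 && idx < key.length() - 1;
    }

    // Returns the namespace portion, or an empty string for unnamespaced keys.
    static String namespaceOf(String key) {
        return isNamespaced(key) ? key.substring(0, key.indexOf(':')) : "";
    }
}
```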
### 4. Schema traversal steps in core Gremlin
New steps for schema manipulation live directly in
GraphTraversal/GraphTraversalSource, not in a separate DSL:
- addVType(label) — creates a VertexType vertex
- addEType(label) — creates an EdgeType vertex
- propertyType(name) — creates a PropertyType vertex and connects it via
hasProperty
- constraint(key, value) — adds a constraint property to the current type vertex
Example: defining a vertex type with properties:
```
schema.traversal().addVType("person")
.propertyType("name").constraint("type", String.class)
.constraint("required", true).constraint("unique", true)
.propertyType("age").constraint("type", Integer.class)
```
Example: defining an edge type with endpoint types and a property:
```
schema.traversal().addEType("knows")
.from("person").to("person")
.propertyType("weight").constraint("type", Double.class)
```
This mirrors the addE().from().to() pattern from the data-graph. Here from()
and to() take vertex type labels (strings) and create from/to edges in the
schema graph connecting the EdgeType to the referenced VertexType vertices.
### 5. Convenience methods for direct access
The schema-as-graph model is the source of truth, but traversing it for simple
lookups isn’t always convenient. Direct methods provide compact access:
GraphSchema methods:
- vertexTypes() → Collection<VertexType>
- vertexType(String label) → Optional<VertexType>
- edgeTypes() → Collection<EdgeType>
- edgeType(String label) → Optional<EdgeType>
- addVertexType(String label) → VertexType
- addEdgeType(String label) → EdgeType
- store(OutputStream): serialize the schema to a compact JSON representation
- load(InputStream): deserialize and merge a schema from JSON into this schema
graph
EdgeType methods:
- fromVertexTypes() → Collection<VertexType>
- toVertexTypes() → Collection<VertexType>
Example:
```
GraphSchema schema = graph.schema();
// Look up a vertex type
VertexType person = schema.vertexType("person").orElseThrow();
// Inspect its properties
for (PropertyType pd : person.propertyTypes()) {
System.out.println(pd.name() + " : " + pd.constraint("type"));
}
// Look up an edge type and its connectivity
EdgeType knows = schema.edgeType("knows").orElseThrow();
Collection<VertexType> fromTypes = knows.fromVertexTypes();
Collection<VertexType> toTypes = knows.toVertexTypes();
```
### 6. Cross-graph jumps
Two steps bridge the data graph and schema graph:
- type(): from a data traversal, jump to the element's type definition in the
schema graph.
- instances(): from a schema traversal, jump to all matching elements in the
data graph.
These compose for round-trip traversals:
```
// Get the type definition for "person" vertices
g.V().hasLabel("person").type()
// Get all instances of a schema type
schema.traversal().vertexType("person").instances()
// Round-trip: find marko's type, then get all instances of that type
g.V().has("person", "name", "marko").type().instances()
```
### 7. Schema restriction strategy
There are some steps we will want to restrict in both the data graph and the
schema-graph. addVType() wouldn’t make sense in the data-graph, nor would
addV() be sensible in the schema-graph. A TraversalStrategy can restrict schema
traversals to a safe subset of Gremlin steps (allowlist-based). This prevents
accidentally running data element insertions, OLAP computations, complex
control flow, or side-effect steps against the schema graph. The strategy
should be auto-registered when traversing a GraphSchema instance.
The exact allowlist should be a topic for later discussion.
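The allowlist check itself could be as simple as the sketch below. It models steps as plain names instead of real TinkerPop Step objects, and the class SchemaStepAllowlist is hypothetical; a real version would presumably be a TraversalStrategy that inspects the traversal's steps.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch: steps are modeled as plain names, not real TinkerPop Step objects.
final class SchemaStepAllowlist {
    // The permitted step names; the actual allowlist is left for later discussion.
    private final Set<String> allowed;

    SchemaStepAllowlist(Set<String> allowed) {
        this.allowed = Set.copyOf(allowed);
    }

    // Throws on the first step that is not on the allowlist.
    void verify(List<String> stepNames) {
        for (String step : stepNames) {
            if (!allowed.contains(step)) {
                throw new IllegalStateException("Step not permitted on a schema graph: " + step);
            }
        }
    }
}
```

With an allowlist containing, say, addVType, propertyType, and constraint, a schema traversal using only those steps passes, while one containing addV is rejected.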
### 8. Instance counts on type vertices
VertexType.instanceCount() and EdgeType.instanceCount() return the count of
data graph elements matching each type. This is a method rather than a property
on the type vertex, keeping the schema graph definitional (not statistical) and
giving providers full implementation flexibility.
Approximate counts are likely acceptable and preferable for performance in most
cases. However, TinkerPop should not stand in the way of providers that prefer
exact counts, and should ensure that appropriate hooks are in place in
reference implementations so that providers can maintain exact counts if they
so desire.
Transactional implications need additional consideration. Maintaining accurate
counts across concurrent writes, rollbacks, and transaction isolation levels
adds significant complexity. This interacts with the broader schema
transactions question (see transactions below) and should be addressed
alongside it.
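One possible shape for such a hook, ignoring transactions entirely, is sketched below. The class InstanceCounter and its method names are assumptions for illustration; a provider would call the hooks from its own write path and back instanceCount() with the result.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical hook a provider might call on data-graph writes; not part of the proposal.
final class InstanceCounter {
    private final ConcurrentMap<String, LongAdder> counts = new ConcurrentHashMap<>();

    // Called when an element with the given label is added to the data graph.
    void onElementAdded(String label) {
        counts.computeIfAbsent(label, k -> new LongAdder()).increment();
    }

    // Called when an element with the given label is removed from the data graph.
    void onElementRemoved(String label) {
        counts.computeIfAbsent(label, k -> new LongAdder()).decrement();
    }

    // Could back something like VertexType.instanceCount(); does not account for rollbacks.
    long instanceCount(String label) {
        LongAdder adder = counts.get(label);
        return adder == null ? 0L : adder.sum();
    }
}
```

This keeps counts correct only for committed, non-rolled-back writes, which is exactly the gap the transactional discussion needs to close.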
### 9. GLV Support
Each GLV (Python, JavaScript, .NET, Go) needs:
- Schema data classes: Parallel classes to the 4 core Java interfaces,
following the same pattern as existing Vertex and Edge classes. These are data
containers representing schema objects returned from the server:
  - GraphSchema: holds collections of VertexTypes and EdgeTypes
  - VertexType: label, full constraints map, and collection of PropertyTypes
  - EdgeType: label, full constraints map, from/to VertexType references (same
  pattern as Edge.outV/Edge.inV), and collection of PropertyTypes
  - PropertyType: name and full constraints map (including data type as a
  constraint)
- All new gremlin steps are supported from each GLV
## Future Questions
### Schema validation
Providers will need lots of flexibility regarding validation modes. Some
providers may choose to have write-time validation for all inserts, others may
choose to validate an entire graph against a schema as a batch job, while others
may choose to validate on-commit. For our purposes, we need to provide a viable
reference implementation, as well as ensuring sufficient extension points exist
for providers to fulfill their needs.
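For a sense of what a minimal write-time check might involve, the sketch below validates one element's properties against the "type" and "required" constraints. SimpleValidator is a hypothetical name, and property types are modeled as a plain map from property name to constraint map rather than as PropertyType vertices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical validator; property types are modeled as property name -> constraint map.
final class SimpleValidator {
    private SimpleValidator() { }

    // Returns the violations for one element's properties; an empty list means valid.
    static List<String> validate(Map<String, Map<String, Object>> propertyTypes,
                                 Map<String, Object> properties) {
        List<String> violations = new ArrayList<>();
        for (Map.Entry<String, Map<String, Object>> entry : propertyTypes.entrySet()) {
            String name = entry.getKey();
            Map<String, Object> constraints = entry.getValue();
            Object value = properties.get(name);
            if (value == null) {
                // Absent values only matter when the property is marked required.
                if (Boolean.TRUE.equals(constraints.get("required"))) {
                    violations.add("missing required property: " + name);
                }
                continue;
            }
            // The "type" constraint holds a Java class, per the placeholder type system.
            Object type = constraints.get("type");
            if (type instanceof Class && !((Class<?>) type).isInstance(value)) {
                violations.add("wrong type for property: " + name);
            }
        }
        return violations;
    }
}
```

The same function could run per write, on commit, or over a whole graph in a batch job; only the caller changes.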
### Dynamic schema updates from data writes
It would be useful to auto-update the schema graph when data writes introduce
new labels or properties (e.g. addV("newLabel") automatically creates a
VertexType). Keeping the schema exactly in-sync with such operations may
introduce too much overhead for many purposes. We should provide appropriate
hooks for providers to implement such behaviour if desired, or to help
providers aggregate changes and perform incremental batch updates to the schema.
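A batching hook along those lines might look like the following sketch. PendingSchemaChanges and its methods are hypothetical names; the idea is only that the write path records unseen labels cheaply and a periodic flush turns them into schema updates.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical hook: collects labels seen in data writes and hands them over in batches.
final class PendingSchemaChanges {
    private final Set<String> knownLabels = ConcurrentHashMap.newKeySet();
    private final Set<String> pendingLabels = ConcurrentHashMap.newKeySet();

    // Called by the provider on each vertex insertion; cheap when the label is already known.
    void onVertexAdded(String label) {
        if (!knownLabels.contains(label)) {
            pendingLabels.add(label);
        }
    }

    // Called periodically; returns the labels for which a VertexType should be created.
    synchronized Set<String> flush() {
        Set<String> batch = new TreeSet<>(pendingLabels);
        pendingLabels.removeAll(batch);
        knownLabels.addAll(batch);
        return batch;
    }
}
```

The schema lags the data by at most one flush interval, which trades exact synchronization for lower write overhead.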
### Transactions
The schema graph will need to be transactional if the data
### File IO
It is often useful to persist and load schemas to/from files. This capability
should be built into the GraphSchema class via simple store() and load()
methods, using a custom compact JSON representation of the schema. The
specifics of this format are deferred to later discussion.
GraphSchema exposes file IO directly:
- store(OutputStream): serialize the schema to a compact JSON representation
- load(InputStream): deserialize a schema from JSON and merge it into the
current schema graph
Schema file IO should be implemented across all GLVs.
## Reference Implementation
TinkerGraph serves as the reference implementation:
- TinkerGraphSchema extends TinkerGraph implements GraphSchema
- TinkerVertexType extends TinkerVertex implements VertexType
- TinkerPropertyType extends TinkerVertex implements PropertyType
- TinkerEdgeType extends TinkerVertex implements EdgeType
- Recursion guard prevents schema-of-schema (TinkerGraphSchema overrides
initSchema())
Please let me know any thoughts you may have on the approach. I intend to move
this into a proposal PR soon, unless there are any major disagreements over the
design.
Thanks,
Cole