Hi Everyone,

The topic of Graph Schema has been discussed extensively in recent TinkerPop 
Gatherings, and the following proposal has emerged from these gatherings. I 
believe it is now ready for broad consideration and discussions. I’ve done my 
best to incorporate initial feedback from Josh, Pieter, Valentyn, Stephen, Kris 
and others into this proposal; however, I won’t claim that it accurately 
represents the views of anyone other than myself at this time. This is a broad 
topic and I’m deliberately excluding critical topics to focus this thread on 
standardizing interfaces for gremlin users and providers to interact with 
schema (see assumptions for more details).

## Overview

This proposal introduces graph schema interfaces for TinkerPop: a way to define 
vertex types, edge types, and property types as a meta-graph that is itself 
traversable with Gremlin. The schema describes the structure of a data graph: 
what kinds of vertices and edges exist, what properties they carry, and how 
they connect.

## Assumptions

- Type keys are element labels: there is a 1-to-1 mapping between a label and a 
type definition. A vertex labeled "person" corresponds to exactly one 
VertexType, and an edge labeled "knows" corresponds to exactly one EdgeType.
- Java classes are used as a type system: This proposal uses Java classes to 
define property type constraints. This is intended as a placeholder to be 
replaced by a proper type system to be defined via a later discussion.
- Validation and enforcement are out of scope: This proposal gives very little 
consideration to if/when/where/how validation and enforcement of schema take 
place. I believe it is important for us to ship something which is flexible and 
useful to providers out of the box, while leaving space for providers to plug 
in existing implementations or build their own if they desire. I’ve left this 
out of scope for this proposal to focus first on interfaces which give 
providers the appropriate access to schema.

## Design Points

### 1. Schema-as-Graph

`GraphSchema extends Graph`. Providers implement a familiar interface, and 
users traverse the schema with schema.traversal(). This avoids inventing a 
parallel API surface. The schema is just another graph.

A data graph exposes its schema via Graph.schema(), which returns the 
GraphSchema instance. Providers that don't support schema throw 
UnsupportedOperationException by default.
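
As a rough illustration of that default, here is a minimal sketch. It assumes 
schema() would be added as a default method on Graph. The interfaces are 
simplified stand-ins rather than the real TinkerPop types, and SchemalessGraph 
and the exception message are hypothetical.

```java
// Simplified stand-ins for illustration only; these are not the real
// TinkerPop interfaces, and the exception message is an assumption.
interface Graph {
    // Providers without schema support inherit this default and throw.
    default GraphSchema schema() {
        throw new UnsupportedOperationException("This graph does not support schema");
    }
}

// The schema is itself a Graph, as described above.
interface GraphSchema extends Graph {
}

// Hypothetical provider that opts out simply by not overriding schema().
class SchemalessGraph implements Graph {
}
```

A provider that does support schema would override schema() and return its own 
GraphSchema implementation.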

### 2. All type definitions are vertices

VertexType, EdgeType, and PropertyType are all vertices in the schema 
meta-graph.

- A VertexType vertex represents a vertex label definition (e.g. "person", 
"software").
- An EdgeType vertex represents an edge label definition (e.g. "knows", 
"created"). Even though it describes edges in the data graph, it is itself a 
vertex in the schema graph, connected to its endpoint VertexType vertices via 
from/to edges.
- A PropertyType vertex represents a property on a type, connected to its 
parent type vertex via a "hasProperty" edge.

Property definitions are independent per type; they are not shared across types.

Schema graph example for the classic TinkerPop modern graph:
```
(person:vertexType) --hasProperty--> (name:propertyType)
(person:vertexType) --hasProperty--> (age:propertyType)
(software:vertexType) --hasProperty--> (name:propertyType)
(software:vertexType) --hasProperty--> (lang:propertyType)
(knows:edgeType) --from--> (person:vertexType)
(knows:edgeType) --to-->   (person:vertexType)
(knows:edgeType) --hasProperty--> (weight:propertyType)
(created:edgeType) --from--> (person:vertexType)
(created:edgeType) --to-->   (software:vertexType)
(created:edgeType) --hasProperty--> (weight:propertyType)
```
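
To make the meta-graph shape concrete, the following is a small in-memory 
sketch in plain Java. It is not TinkerPop API: SchemaVertex, SchemaEdge and 
SchemaGraphSketch are hypothetical names used only to show that type 
definitions are vertices and that each type owns its own property vertices.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of the schema meta-graph; not TinkerPop API.
final class SchemaVertex {
    final String name; // e.g. "person", "knows", "weight"
    final String kind; // "vertexType", "edgeType" or "propertyType"

    SchemaVertex(String name, String kind) {
        this.name = name;
        this.kind = kind;
    }
}

final class SchemaEdge {
    final SchemaVertex out;
    final String label; // "hasProperty", "from" or "to"
    final SchemaVertex in;

    SchemaEdge(SchemaVertex out, String label, SchemaVertex in) {
        this.out = out;
        this.label = label;
        this.in = in;
    }
}

final class SchemaGraphSketch {
    private final List<SchemaEdge> edges = new ArrayList<>();

    void link(SchemaVertex out, String label, SchemaVertex in) {
        edges.add(new SchemaEdge(out, label, in));
    }

    // Names of the vertices reached from source over edges with the given label.
    List<String> out(SchemaVertex source, String label) {
        List<String> names = new ArrayList<>();
        for (SchemaEdge e : edges) {
            // Identity comparison: the "name" vertex of person is a different
            // vertex from the "name" vertex of software.
            if (e.out == source && e.label.equals(label)) {
                names.add(e.in.name);
            }
        }
        return names;
    }
}
```

With this model, linking person to separate "name" and "age" property vertices 
and then calling out(person, "hasProperty") yields those two names, while 
software keeps its own independent "name" vertex.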

### 3. Constraints are properties on type vertices

Rather than a fixed constraint taxonomy, constraints are regular properties on 
type vertices, keyed by string via constraint(key, value). This keeps the model 
extensible such that providers can define their own constraints without changes 
to the core API.

Constraints can be added to VertexType, EdgeType, and PropertyType vertices 
directly. The most common constraints such as property types and required 
properties would apply to PropertyTypes, while edge multiplicity constraints 
(e.g. one-to-many, one-to-one) are naturally expressed as constraints on the 
EdgeType itself rather than on any property.

While constraint keys are arbitrary strings and providers are free to implement 
any constraints they like, TinkerPop should standardize a set of core 
constraint keys representing the most common constraints. Examples include 
"type", "required", "unique", "minValue", "maxValue", etc. Providers that 
support equivalent constraints are encouraged to follow these conventional 
names for interoperability.

Non-core constraints (custom to a provider) are encouraged to follow a 
namespaced key convention to avoid collisions, e.g. "tinkergraph:notNull". Core 
constraint keys are unnamespaced.
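
A minimal sketch of constraints as a string-keyed map on a type vertex is shown 
below. TypeVertexSketch and isNamespaced() are hypothetical, and the 
"namespace:name" check is only an assumed reading of the convention above, not 
a settled rule.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a type vertex; not the proposed TinkerPop API.
final class TypeVertexSketch {
    private final String name;
    private final Map<String, Object> constraints = new LinkedHashMap<>();

    TypeVertexSketch(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }

    // Fluent setter, mirroring the constraint(key, value) shape in the proposal.
    TypeVertexSketch constraint(String key, Object value) {
        constraints.put(key, value);
        return this;
    }

    // Returns the constraint value, or null when the key is not set.
    Object constraint(String key) {
        return constraints.get(key);
    }

    // Read-only view of every constraint on this type vertex.
    Map<String, Object> constraints() {
        return Collections.unmodifiableMap(constraints);
    }

    // Assumed convention: a provider-specific key has the form "namespace:name".
    static boolean isNamespaced(String key) {
        int colon = key.indexOf(':');
        return colon > 0 && colon < key.length() - 1;
    }
}
```

Because keys are plain strings, a provider can attach "tinkergraph:notNull" 
next to the core "type" and "required" keys without any change to the class.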

### 4. Schema traversal steps in core Gremlin

New steps for schema manipulation live directly in 
GraphTraversal/GraphTraversalSource, not in a separate DSL:

- addVType(label) — creates a VertexType vertex
- addEType(label) — creates an EdgeType vertex
- propertyType(name) — creates a PropertyType vertex and connects it via 
hasProperty
- constraint(key, value) — adds a constraint property to the current type vertex

Example: defining a vertex type with properties:
```
schema.traversal().addVType("person")
    .propertyType("name")
        .constraint("type", String.class)
        .constraint("required", true)
        .constraint("unique", true)
    .propertyType("age").constraint("type", Integer.class)
```

Example: defining an edge type with endpoint types and a property:
```
schema.traversal().addEType("knows")
    .from("person").to("person")
    .propertyType("weight").constraint("type", Double.class)
```

This mirrors the addE().from().to() pattern from the data graph. Here from() 
and to() take vertex type labels (strings) and create from/to edges in the 
schema graph connecting the EdgeType to the referenced VertexType vertices.

### 5. Convenience methods for direct access

The schema-as-graph model is the source of truth, but traversing it for simple 
lookups isn’t always convenient. Direct methods provide compact access:

GraphSchema methods:
- vertexTypes() → Collection<VertexType>
- vertexType(String label) → Optional<VertexType>
- edgeTypes() → Collection<EdgeType>
- edgeType(String label) → Optional<EdgeType>
- addVertexType(String label) → VertexType
- addEdgeType(String label) → EdgeType
- store(OutputStream):  serialize the schema to a compact JSON representation
- load(InputStream): deserialize and merge a schema from JSON into this schema 
graph

EdgeType methods:
- fromVertexTypes() → Collection<VertexType>
- toVertexTypes() → Collection<VertexType>

Example:
```
GraphSchema schema = graph.schema();

// Look up a vertex type
VertexType person = schema.vertexType("person").orElseThrow();

// Inspect its properties
for (PropertyType pd : person.propertyTypes()) {
    System.out.println(pd.name() + " : " + pd.constraint("type"));
}

// Look up an edge type and its connectivity
EdgeType knows = schema.edgeType("knows").orElseThrow();
Collection<VertexType> fromTypes = knows.fromVertexTypes();
Collection<VertexType> toTypes = knows.toVertexTypes();
```

### 6. Cross-graph jumps

Two steps bridge the data graph and schema graph:

- type(): from a data traversal, jump to the element's type definition in the 
schema graph.
- instances(): from a schema traversal, jump to all matching elements in the 
data graph.

These compose for round-trip traversals:
```
// Get the type definition for "person" vertices
g.V().hasLabel("person").type()

// Get all instances of a schema type
schema.traversal().vertexType("person").instances()

// Round-trip: find marko's type, then get all instances of that type
g.V().has("person", "name", "marko").type().instances()
```

### 7. Schema restriction strategy

There are some steps we will want to restrict in both the data graph and the 
schema graph. addVType() wouldn’t make sense in the data graph, nor would 
addV() be sensible in the schema graph. A TraversalStrategy can restrict schema 
traversals to a safe subset of Gremlin steps (allowlist-based). This prevents 
accidentally running data element insertions, OLAP computations, complex 
control flow, or side-effect steps against the schema graph. The strategy 
should be auto-registered when traversing a GraphSchema instance.

The exact allowlist should be a topic for later discussion.
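
The allowlist idea can be sketched in plain Java as follows. This works on step 
names as strings purely for illustration; a real TraversalStrategy would 
inspect the steps of the traversal itself. SchemaStepAllowlist is a 
hypothetical name, and any step list passed to it is a placeholder, not a 
proposed allowlist.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical allowlist check over step names; a real TraversalStrategy
// would inspect Step objects on the traversal rather than strings.
final class SchemaStepAllowlist {
    private final Set<String> allowed;

    SchemaStepAllowlist(String... allowedSteps) {
        this.allowed = Collections.unmodifiableSet(
                new HashSet<>(Arrays.asList(allowedSteps)));
    }

    // Throws on the first step that is not on the allowlist.
    void verify(List<String> stepNames) {
        for (String step : stepNames) {
            if (!allowed.contains(step)) {
                throw new IllegalStateException(
                        "Step not permitted in schema traversals: " + step);
            }
        }
    }
}
```

An allowlist built with, say, "addVType", "propertyType" and "constraint" 
accepts a schema definition traversal and rejects one that contains "addV".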

### 8. Instance counts on type vertices

VertexType.instanceCount() and EdgeType.instanceCount() return the count of 
data graph elements matching each type. This is a method rather than a property 
on the type vertex, keeping the schema graph definitional (not statistical) and 
giving providers full implementation flexibility.

Approximate counts are likely acceptable and preferable for performance in most 
cases. However, TinkerPop should not stand in the way of providers that prefer 
exact counts, and should ensure that appropriate hooks are in place in 
reference implementations so that providers can maintain exact counts if they 
so desire.

Transactional implications need additional consideration. Maintaining accurate 
counts across concurrent writes, rollbacks, and transaction isolation levels 
adds significant complexity. This interacts with the broader schema 
transactions question (see transactions below) and should be addressed 
alongside it.
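
One possible shape for such a hook, sketched under the assumption that the 
provider calls it after each successful write, is shown below. 
InstanceCounterSketch and its method names are hypothetical, and the sketch 
deliberately ignores rollbacks and isolation levels.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical provider-side counter; the hook names are assumptions.
final class InstanceCounterSketch {
    private final ConcurrentMap<String, LongAdder> counts = new ConcurrentHashMap<>();

    // Called by the provider after an element with this label is added.
    void onElementAdded(String label) {
        counts.computeIfAbsent(label, k -> new LongAdder()).increment();
    }

    // Called by the provider after an element with this label is removed.
    void onElementRemoved(String label) {
        counts.computeIfAbsent(label, k -> new LongAdder()).decrement();
    }

    // What an instanceCount() implementation could delegate to.
    long instanceCount(String label) {
        LongAdder adder = counts.get(label);
        return adder == null ? 0L : adder.sum();
    }
}
```

LongAdder keeps concurrent increments cheap, but its sum() is not an atomic 
snapshot while writes are in flight, which fits the approximate-count case 
better than the exact-count case.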

### 9. GLV Support

Each GLV (Python, JavaScript, .NET, Go) needs:

- Schema data classes: Parallel classes to the 4 core Java interfaces, 
following the same pattern as existing Vertex and Edge classes. These are data 
containers representing schema objects returned from the server:
  - GraphSchema: holds collections of VertexTypes and EdgeTypes
  - VertexType: label, full constraints map, and collection of PropertyTypes
  - EdgeType: label, full constraints map, from/to VertexType references (same 
pattern as Edge.outV/Edge.inV), and collection of PropertyTypes
  - PropertyType: name and full constraints map (including data type as a 
constraint)
- All new gremlin steps are supported from each GLV

## Future Questions

### Schema validation

Providers will need lots of flexibility regarding validation modes. Some 
providers may choose to have write-time validation for all inserts, others may 
choose to validate an entire graph against a schema as a batch job, while others 
may choose to validate on-commit. For our purposes, we need to provide a viable 
reference implementation, as well as ensuring sufficient extension points exist 
for providers to fulfill their needs.

### Dynamic schema updates from data writes

It would be useful to auto-update the schema graph when data writes introduce 
new labels or properties (e.g. addV("newLabel") automatically creates a 
VertexType). Keeping the schema exactly in-sync with such operations may 
introduce too much overhead for many purposes. We should provide appropriate 
hooks for providers to implement such behaviour if desired, or to help 
providers aggregate changes and perform incremental batch updates to the schema.

### Transactions

The schema graph will need to be transactional if the data graph is 
transactional.

### File IO

It is often useful to persist and load schemas to/from files. This capability 
should be built into the GraphSchema class via simple store() and load() 
methods, using a custom compact JSON representation of the schema. The 
specifics of this format are deferred to later discussion.

GraphSchema exposes file IO directly:
- store(OutputStream): serialize the schema to a compact JSON representation
- load(InputStream): deserialize a schema from JSON and merge it into the 
current schema graph

Schema file IO should be implemented across all GLVs.

## Reference Implementation

TinkerGraph serves as the reference implementation:

- TinkerGraphSchema extends TinkerGraph implements GraphSchema
- TinkerVertexType extends TinkerVertex implements VertexType
- TinkerPropertyType extends TinkerVertex implements PropertyType
- TinkerEdgeType extends TinkerVertex implements EdgeType
- Recursion guard prevents schema-of-schema (TinkerGraphSchema overrides 
initSchema())


Please let me know any thoughts you may have on the approach. I intend to move 
this into a proposal PR soon, unless there are any major disagreements over the 
design.

Thanks,
Cole
