Hello all, for a research project we currently have to make some decisions regarding ontology modeling.
We would like to invite you to discuss some general issues and are interested in your experiences and ideas.

To give you an idea of the kind and size of the ontology to be modeled: until now we have been working with the DBpedia ontology, with linked data added from Freebase, Geonames, and Yago concerning resources of rdf:type Person, Place, Event, Work and Organisation. We are thus dealing with about 100 million triples corresponding to approx. 7.6 million resources.

We are trying to devise a system for incrementally adding facts to the target ontology using the sources mentioned above plus additional linked data. We are also going to enable end users of our system to add facts. Moreover, we would like to annotate certain facts (i.e., triples, or groups of triples) with the following pieces of metadata:

- Source (e.g., "DBpedia", "Freebase", "<username>")
- Temporal information, e.g.
  <Albert_Einstein> <spouse> <Elsa_Einstein> [start: 1919, end: 1936]

Eventually, this might be extended to additionally comprise the following:

- Timestamp (when the triple/s was/were added)
- Confidence value, e.g. in case a fact was extracted from full text by a text-mining algorithm that provides a confidence measure

Our non-functional requirements mainly focus on high-performance querying. This implies that the amount of data should not become too big - ideally, the whole ontology can be loaded into RAM. It is furthermore preferable to stay with standard RDF, but performance generally has the higher priority for us.

We would then like to query for data within time ranges, e.g. "all facts valid between X and Y", optionally in combination with other metadata ("source = z, confidence > 60") or type filtering (e.g., only relations between Person and Place), and so on.
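As a toy illustration of the time-range semantics we have in mind ("all facts valid between X and Y"), here is a sketch in plain Python (dates and function name are ours, for illustration only):

```python
from datetime import date

def valid_within(fact_start, fact_end, range_start, range_end):
    """A fact counts as 'valid between X and Y' when its whole
    validity interval lies inside the queried range."""
    return range_start <= fact_start and fact_end <= range_end

# The Einstein marriage (1919-1936) queried against the 20th century:
assert valid_within(date(1919, 1, 1), date(1936, 12, 31),
                    date(1900, 1, 1), date(1999, 12, 31))
```

Whatever storage approach we choose would have to evaluate this kind of interval check efficiently over ~100 million facts.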
We are currently evaluating the following approaches to metadata:

1) Annotating metadata via NAMED GRAPHS

1.1) Create a graph FOR EACH TRIPLE, containing all individual metadata

Example: graph <g_Albert_Einstein_spouse_Elsa_Einstein> containing just the triple

  <Albert_Einstein> <spouse> <Elsa_Einstein> .

with the metadata

  <g_Albert_Einstein_spouse_Elsa_Einstein> <source> <dbpedia> .
  <g_Albert_Einstein_spouse_Elsa_Einstein> <startdate> "1919-01-01"^^xsd:date .
  <g_Albert_Einstein_spouse_Elsa_Einstein> <enddate> "1936-12-31"^^xsd:date .
  [more metadata possible]

Querying for a time interval (all triples valid within the 20th century) would then be:

  select ?graph ?s ?p ?o
  where {
    ?graph <startdate> ?start .
    ?graph <enddate> ?end .
    filter (?start > "1900-01-01"^^xsd:date && ?start < "2000-01-01"^^xsd:date &&
            ?end > "1900-01-01"^^xsd:date && ?end < "2000-01-01"^^xsd:date)
    graph ?graph { ?s ?p ?o }
  }

pro: Each triple can have arbitrary metadata, so it is possible to define many optional values (like confidence, which we will have for only a few triples).
con: Huge number of metadata triples (about 400-500 million metadata triples for 100 million fact triples). Is it possible to query this performantly, assuming that everything fits into main memory?

1.2) Create a graph FOR SEVERAL TRIPLES sharing the same metadata

Triples with the same combination of metadata values share the same RDF graph. The paper "Efficient Temporal Querying of RDF Data with SPARQL" (http://www.ifi.uzh.ch/pax/uploads/pdf/publication/1004/tappolet09applied.pdf) explains how to annotate triples with time intervals using named graphs, so that approach also falls under 1.2. One could also combine metadata properties, e.g. by creating a graph containing all triples that come from source dbpedia and have the exact time interval 1919-01-01 to 1936-12-31.

pro: Fewer metadata triples compared to 1.1.
con: Clumsy with many types of metadata. Also, when inserting data, we need an efficient way of detecting whether a graph for a certain combination of metadata already exists (hashing, querying, ...).
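Regarding the insertion problem in 1.2 (detecting whether a graph for a given metadata combination already exists), the hashing option could look roughly like this: derive the graph IRI deterministically from the metadata combination, so a repeated combination maps to the existing graph without querying the store first. A minimal sketch (base IRI and key names are made up for illustration):

```python
import hashlib

def graph_iri_for_metadata(metadata, base="http://example.org/graph/"):
    """Derive a deterministic graph IRI from a metadata combination, so
    a fact with an already-seen combination lands in the existing graph
    without a lookup query."""
    # Canonical form: sort by key so insertion order does not matter.
    canonical = "|".join(f"{k}={v}" for k, v in sorted(metadata.items()))
    return base + hashlib.sha1(canonical.encode("utf-8")).hexdigest()

# The same combination always yields the same graph IRI, in any order:
a = graph_iri_for_metadata({"source": "dbpedia",
                            "startdate": "1919-01-01",
                            "enddate": "1936-12-31"})
b = graph_iri_for_metadata({"enddate": "1936-12-31",
                            "startdate": "1919-01-01",
                            "source": "dbpedia"})
assert a == b
```

Inserting would then be a blind write into the computed graph; only the graph's own metadata triples need a "create if absent" check.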
2) Annotating metadata via an N-TUPLE STORE

On the mailing list there was once a rumor that it would be (easily) possible to extend Virtuoso's quad store by further columns. Under these circumstances one could create a new column for each metadata property, at least for frequently used properties. Thinking about this approach, further questions arise:

- Is it in general a good idea to do this (would you recommend it)?
- Has anybody done this before?
- Is it possible to extend the SPARQL syntax so that we can continue using SPARQL?
- What about performance?

pro: We assume that performance will be good (with additional indices). The amount of data will hopefully be acceptable: on the one hand there is no aggregation, since each triple carries its own metadata; on the other hand, metadata values are not stored as "expensive" types or even as triples.
con: We would definitely leave the standards, so the data is no longer interchangeable independently of the store (which would be acceptable for us, because we want to provide our data through our own Web service). And we would need adaptations to Virtuoso - it is not clear to what extent.

3) Annotating metadata WITHIN THE ONTOLOGY

3.1) Classical n-ary approach: inserting intermediate entities

Ref.: http://www.w3.org/TR/swbp-n-aryRelations/

In practice, n-ary relations can be modeled in different ways. Regarding our example: since the <spouse> property is symmetric (a prop b ==> b prop a), we could write the following:

  <Spouse_123> <member> <Albert_Einstein> .
  <Spouse_123> <member> <Elsa_Einstein> .
  <Spouse_123> <source> <dbpedia> .
  <Spouse_123> <startdate> "1919-01-01"^^xsd:date .
  <Spouse_123> <enddate> "1936-12-31"^^xsd:date .

In this case <Albert_Einstein> and <Elsa_Einstein> are equal members of the relation. Instead of <Spouse_123> we could also use a blank node.

pro: It seems to be a state-of-the-art approach. This way one can model any metadata and even express symmetric and inverse facts quite easily.
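To make the reshaping in 3.1 concrete, here is a small sketch (plain Python tuples standing in for RDF terms; the function and counter are ours, for illustration) of turning an annotated symmetric fact into the n-ary form:

```python
import itertools

_ids = itertools.count(123)  # running counter for fresh relation entities

def to_nary(subject, predicate, obj, metadata):
    """Rewrite a symmetric binary fact plus its metadata into the n-ary
    pattern: a fresh intermediate entity with one <member> triple per
    participant and one triple per metadata value."""
    rel = f"{predicate}_{next(_ids)}"  # e.g. "spouse_123"
    triples = [(rel, "member", subject), (rel, "member", obj)]
    triples += [(rel, prop, value) for prop, value in metadata.items()]
    return triples

triples = to_nary("Albert_Einstein", "spouse", "Elsa_Einstein",
                  {"source": "dbpedia",
                   "startdate": "1919-01-01",
                   "enddate": "1936-12-31"})
assert len(triples) == 5  # 2 member triples + 3 metadata triples
```

So a fully annotated fact costs 5 triples here, versus 7 with classical reification (4 reification triples plus the original triple, minus shared metadata).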
In comparison with reification, fewer triples are needed.
con: The original triple is not kept, so the structure changes for annotated triples - one has to account for this while querying. General queries like "How many persons are directly connected to other persons?" can no longer be answered easily. One could of course keep the original triple, at the price of increasing the overall number of triples.

3.2) Our own approach: inserting sub-properties

Let us just show an example:

  <Albert_Einstein> <spouse_123> <Elsa_Einstein> .
  <spouse_123> rdfs:subPropertyOf <spouse> .
  <spouse_123> <source> <dbpedia> .
  <spouse_123> <startdate> "1919-01-01"^^xsd:date .
  <spouse_123> <enddate> "1936-12-31"^^xsd:date .

So for each relationship between two entities we introduce a new property, which we connect to the original property via rdfs:subPropertyOf.

pro: Resources (like <Albert_Einstein> and <Elsa_Einstein>) stay directly connected, so it is easy to query them. Only if we are also interested in a specific property (or some of the metadata values) do we additionally have to query for the corresponding sub-property relation. Under some circumstances we need fewer triples than with the classical n-ary approach.
con: For every A-Box relationship we define a new property, which is actually part of the T-Box. In this way the approach violates the separation of A-Box and T-Box (a conceptual problem, not a technical one).

3.3) Reification / annotation properties

Ref.: http://www.w3.org/TR/2004/REC-rdf-primer-20040210/#reification
Ref.: http://www.w3.org/TR/owl-ref/#Annotations
Ref.: http://www.w3.org/TR/owl2-primer/#Annotating_Axioms_and_Entities

With reification, our example would look like this:

  <Albert_Einstein> <spouse> <Elsa_Einstein> .
  <statement_123> rdf:type rdf:Statement .
  <statement_123> rdf:subject <Albert_Einstein> .
  <statement_123> rdf:predicate <spouse> .
  <statement_123> rdf:object <Elsa_Einstein> .
  <statement_123> <source> <dbpedia> .
  <statement_123> <startdate> "1919-01-01"^^xsd:date .
  <statement_123> <enddate> "1936-12-31"^^xsd:date .

This approach does not seem to be a good option, because 4 triples are needed just to define the new statement entity. If every triple were to be annotated, we would end up with about 400 million additional triples before any actual metadata! Annotation properties seem to be a very similar approach using the OWL namespace.

pro: We keep the original triple.
con: It blows up the amount of data unacceptably and, along with this, requires more complex queries than the n-ary approaches.

We think it would be a good idea to combine some of the approaches, e.g. by using named graphs for annotating the source of facts and n-ary approaches for the rest of the metadata.

To sum up, we would like to know:

- What about the n-tuple approach? Can Virtuoso be extended to handle n-tuples within graphs, rather than triples? Which adaptations would have to be made, and what query performance could be expected compared to SPARQL on triples/quads?
- About named graphs: when and how would you use this approach, or how have you used it?
- Do you have experience with n-ary approaches - are there problems or disadvantages (e.g. performance issues)?
- Do you have any other ideas for storing metadata for RDF triples?

Thanks in advance,

--
--------------------------------
Martin Gerlach
Softwareentwicklung

neofonie
Technologieentwicklung und
Informationsmanagement GmbH
Robert-Koch-Platz 4
10115 Berlin

fon: +49.30 24627 413
fax: +49.30 24627 120

martin.gerl...@neofonie.de
http://www.neofonie.de

Handelsregister Berlin-Charlottenburg: HRB 67460
Geschaeftsfuehrung
Helmut Hoffer von Ankershoffen (Sprecher der Geschaeftsfuehrung)
Nurhan Yildirim
--------------------------------