My standard question for such a situation: how are you expecting your users
to query this data? Are they expecting simple English/natural-language text,
or are they expecting structured "identifiers" that can be keys into other
data sources?
For example, are your "entities" simple text literal names, or might they be
Dublin Core (DC) "Agent" URI identifiers?
Ditto for topics - free text vs. some SKOS vocabulary or other form of
taxonomy.
In other words, clue us in as to your client "requirements".
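To make the distinction concrete, here is a rough Python sketch of the same
entities serialized both ways. The <entity> element, the "uri" attribute, the
sample names, and the example.org URIs are all made up for illustration; none
of them come from DC, SKOS, or Solr.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotator output: each entity has a surface form and,
# optionally, a URI identifying it in some external vocabulary.
ENTITIES = [
    {"text": "Apache Solr", "uri": "http://example.org/agent/apache-solr"},
    {"text": "Jane Doe", "uri": None},
]


def entities_as_literals(entities):
    """Free-text style: one <entity> per name, with nothing to join on."""
    root = ET.Element("entities")
    for entity in entities:
        ET.SubElement(root, "entity").text = entity["text"]
    return root


def entities_as_identifiers(entities):
    """Identifier style: keep the name, and add a URI when one is known."""
    root = ET.Element("entities")
    for entity in entities:
        element = ET.SubElement(root, "entity")
        element.text = entity["text"]
        if entity["uri"]:
            element.set("uri", entity["uri"])
    return root


if __name__ == "__main__":
    print(ET.tostring(entities_as_literals(ENTITIES), encoding="unicode"))
    print(ET.tostring(entities_as_identifiers(ENTITIES), encoding="unicode"))
```

The first form is enough if users will only type names into a search box; the
second is what you would want if clients need to look the entity up somewhere
else.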
-- Jack Krupansky
-----Original Message-----
From: Otis Gospodnetic
Sent: Tuesday, September 11, 2012 11:51 AM
To: solr-user@lucene.apache.org
Subject: Semantic document format... standards?
Hello,
If I'm extracting named entities, topics, key phrases/tags, etc. from
documents and I want to have a representation of this document, what format
should I use? Are there any standard or at least common formats or
approaches people use in such situations?
For example, the most straightforward format might be something like this:
<document>
  <title>doc title</title>
  <keywords>meta keywords coming from the web page</keywords>
  <content>page meat</content>
  <entities>named entities recognized in the document</entities>
  <topics>topics extracted by the annotator</topics>
  <tags>tags extracted by the annotator</tags>
  <relations>relations extracted by the annotator</relations>
</document>
But this is a made-up format - the XML tags above are just what somebody
happened to pick.
Are there any standard or at least common formats for this?
Thanks,
Otis