My standard question for such a situation: How are you expecting your users to query this data? Are they expecting simple English/natural language text, or are they expecting structured "identifiers" that can be keys into other data sources?

For example, are your "entities" simple text literal names, or might they be Dublin Core (DC) "Agent" URI identifiers?

Ditto for topics - free text vs. some SKOS vocabulary or other form of taxonomy.
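
To make the distinction concrete, here is a minimal sketch in Python of the same entity carried two ways: as a bare text literal, and as a literal plus a URI that a client could use as a key into another data source. The nested <entity> element, the "uri" attribute, and the example.org URI are all assumptions made up for illustration; they are not part of Dublin Core, SKOS, or Solr.

```python
import xml.etree.ElementTree as ET


def build_doc(title, entities):
    """Build a <document> element; each entity is a (label, uri_or_None) pair."""
    doc = ET.Element("document")
    ET.SubElement(doc, "title").text = title
    ents = ET.SubElement(doc, "entities")
    for label, uri in entities:
        ent = ET.SubElement(ents, "entity")  # hypothetical element name
        ent.text = label
        if uri is not None:
            # Hypothetical attribute: the identifier a client would join on.
            ent.set("uri", uri)
    return doc


# Free-text entity: only useful for text matching.
literal_doc = build_doc("doc title", [("Apache Solr", None)])

# Keyed entity: same label, plus a placeholder URI identifier.
keyed_doc = build_doc(
    "doc title",
    [("Apache Solr", "http://example.org/agent/apache-solr")],
)

print(ET.tostring(literal_doc, encoding="unicode"))
print(ET.tostring(keyed_doc, encoding="unicode"))
```

Which of the two shapes is appropriate depends on the answer to the question above: the first is enough if clients only search by name, while the second matters if they need to look the entity up elsewhere.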

In other words, clue us in as to your client "requirements".

-- Jack Krupansky

-----Original Message-----
From: Otis Gospodnetic
Sent: Tuesday, September 11, 2012 11:51 AM
To: solr-user@lucene.apache.org
Subject: Semantic document format... standards?

Hello,

If I'm extracting named entities, topics, key phrases/tags, etc. from documents and I want to have a representation of this document, what format should I use? Are there any standard or at least common formats or approaches people use in such situations?

For example, the most straightforward format might be something like this:


<document>
 <title>doc title</title>
 <keywords>meta keywords coming from the web page</keywords>
 <content>page meat</content>
 <entities>named entities recognized in the document</entities>
 <topics>topics extracted by the annotator</topics>
 <tags>tags extracted by the annotator</tags>
 <relations>relations extracted by the annotator</relations>
</document>

But this is a made-up format - the XML tags above are just what somebody happened to pick.

Are there any standard or at least common formats for this?


Thanks,
Otis
----
Performance Monitoring - Solr - ElasticSearch - HBase - http://sematext.com/spm

Search Analytics - http://sematext.com/search-analytics/index.html
