Otis,

if you have a bit of time to research, I think your document may look a lot 
like the documents processed by:
        http://langtech.jrc.it/
which is a flagship "multilingual technology" implementation and includes a 
fair amount of entity disambiguation, as far as I could tell from Ralph's talk.
I do not have a more concrete pointer, sorry, and I would love to read 
something more concrete, and closer to Solr, about them.

Paul


On 12 Sep 2012, at 00:46, Jack Krupansky wrote:

> My standard question for such a situation: How are you expecting your users 
> to query this data? Are they expecting simple English/natural language text, 
> or are they expecting structured "identifiers" that can be keys into other 
> data sources?
> 
> For example, are your "entities" simple text literal names, or might they be 
> Dublin Core (DC) "Agent" URI identifiers?
> 
> Ditto for topics - free text vs. some SKOS vocabulary or other form of 
> taxonomy.
> 
> In other words, clue us in as to your client "requirements".
> 
> -- Jack Krupansky
> 
> -----Original Message----- From: Otis Gospodnetic
> Sent: Tuesday, September 11, 2012 11:51 AM
> To: solr-user@lucene.apache.org
> Subject: Semantic document format... standards?
> 
> Hello,
> 
> If I'm extracting named entities, topics, key phrases/tags, etc. from 
> documents and I want to have a representation of this document, what format 
> should I use? Are there any standard or at least common formats or approaches 
> people use in such situations?
> 
> For example, the most straightforward format might be something like this:
> 
> 
> <document>
> <title>doc title</title>
> <keywords>meta keywords coming from the web page</keywords>
> <content>page meat</content>
> <entities>named entities recognized in the document</entities>
> <topics>topics extracted by the annotator</topics>
> <tags>tags extracted by the annotator</tags>
> <relations>relations extracted by the annotator</relations>
> </document>
> 
> But this is a made-up format - the XML tags above are just what somebody 
> happened to pick.
> 
> Are there any standard or at least common formats for this?
> 
> 
> Thanks,
> Otis
> ----
> Performance Monitoring - Solr - ElasticSearch - HBase - 
> http://sematext.com/spm
> 
> Search Analytics - http://sematext.com/search-analytics/index.html 
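
For anyone who wants to experiment with the ad hoc format quoted above before settling on a standard, here is a minimal sketch of how such a document could be generated. It assumes Python and its standard xml.etree.ElementTree module; the names FIELDS, build_document and to_xml are hypothetical and exist only for this illustration, and the element names are copied from the made-up example, not from any standard vocabulary.

```python
import xml.etree.ElementTree as ET

# Element names copied from the made-up format in the quoted message.
# They are illustrative only and do not come from any standard.
FIELDS = ("title", "keywords", "content", "entities", "topics", "tags", "relations")


def build_document(values):
    """Return a <document> element with one child per known field.

    `values` maps field names to plain-text strings; missing fields
    are emitted as empty elements so every document has the same shape.
    """
    doc = ET.Element("document")
    for name in FIELDS:
        child = ET.SubElement(doc, name)
        child.text = values.get(name, "")
    return doc


def to_xml(values):
    """Serialize the document to a Unicode XML string."""
    return ET.tostring(build_document(values), encoding="unicode")


if __name__ == "__main__":
    print(to_xml({"title": "doc title", "content": "page meat"}))
```

ElementTree escapes characters such as & and < in the text values, so extracted entity or topic strings do not need manual escaping. Swapping the plain-text values for URI identifiers (as Jack suggests) would only change what is stored in each element, not the structure of this sketch.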
