Otis, if you have a bit of time to research, I think your document may look a lot like the documents processed by http://langtech.jrc.it/, which is a flagship "multilingual technology" implementation and includes a fair amount of entity disambiguation, as far as I could tell from Ralph's talk. I do not have a more concrete pointer, sorry, and I would love to read something more concrete and closer to Solr about them.

Paul

On 12 Sept. 2012 at 00:46, Jack Krupansky wrote:

> My standard question for such a situation: How are you expecting your users
> to query this data? Are they expecting simple English/natural language text,
> or are they expecting structured "identifiers" that can be keys into other
> data sources.
>
> For example, are your "entities" simple text literal names, or might they be
> Dublin Core (DC) "Agent" URI identifiers?
>
> Ditto for topics - free text vs. some SKOS vocabulary or other form of
> taxonomy.
>
> In other words, clue us in as to your client "requirements".
>
> -- Jack Krupansky
>
> -----Original Message----- From: Otis Gospodnetic
> Sent: Tuesday, September 11, 2012 11:51 AM
> To: solr-user@lucene.apache.org
> Subject: Semantic document format... standards?
>
> Hello,
>
> If I'm extracting named entities, topics, key phrases/tags, etc. from
> documents and I want to have a representation of this document, what format
> should I use? Are there any standard or at least common formats or approaches
> people use in such situations?
>
> For example, the most straight forward format might be something like this:
>
>
> <document>
>   <title>doc title</title>
>   <keywords>meta keywords coming from the web page</keywords>
>   <content>page meat</content>
>   <entities>name entities recognized in the document</entities>
>   <topics>topics extracted by the annotator</topics>
>   <tags>tags extracted by the annotator</tags>
>   <relations>relations extracted by the annotator</relations>
> </document>
>
> But this is a made up format - the XML tags above are just what somebody
> happened to pick.
>
> Are there any standard or at least common formats for this?
>
>
> Thanks,
> Otis
> ----
> Performance Monitoring - Solr - ElasticSearch - HBase -
> http://sematext.com/spm
>
> Search Analytics - http://sematext.com/search-analytics/index.html