Otis,

If you are doing Named Entity Recognition, you may want to look at the research area concerned with Named Entity Recognition. :-)

In general, there is inline markup and standoff markup. You seem to be going for standoff (stand-alone) markup. I am not clear, though, whether you want just a 'discovery' format (which entities were found) or an actual annotation format (which also records where in the sentence each entity occurs, via offsets or token IDs).
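
To make the inline/standoff distinction concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the Annotation class, the to_inline helper, the sample sentence and the PERSON/PRODUCT labels are my assumptions, not a standard format or the output of any particular tool. It keeps the text untouched, stores entities as character offsets next to it (standoff), and then renders the same information as inline tags.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass(frozen=True)
class Annotation:
    """Hypothetical standoff record: a labeled span over the original text."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def to_inline(text, annotations):
    """Render non-overlapping standoff annotations as inline XML-like tags."""
    pieces = []
    cursor = 0
    for ann in sorted(annotations, key=lambda a: a.start):
        if ann.start < cursor:
            # Inline tags have to nest properly; overlapping spans cannot
            # be expressed this way, which is one reason to prefer standoff.
            raise ValueError("overlapping annotations cannot be rendered inline")
        pieces.append(escape(text[cursor:ann.start]))
        pieces.append(f"<{ann.label}>{escape(text[ann.start:ann.end])}</{ann.label}>")
        cursor = ann.end
    pieces.append(escape(text[cursor:]))
    return "".join(pieces)


text = "Otis Gospodnetic works on Solr."

# Standoff: the text is stored once; entities point into it by offset.
standoff = [
    Annotation(0, 16, "PERSON"),
    Annotation(26, 30, "PRODUCT"),
]

# Inline: the same entities embedded directly in the text.
inline = to_inline(text, standoff)
print(inline)
```

The ValueError branch is the interesting design point: once two entities overlap, the inline form stops working, while the offset list still represents them without trouble.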

UIMA (which Solr already integrates with, right?) does NER, so it must be using some sort of format.

Also, TREC is one of the competitions in this area, and it provides marked-up datasets you might be able to learn something from:
http://ilps.science.uva.nl/trec-entity/

If you are not sure where to start with NER, you can look at my collection of papers, though most of them are probably too specific:
http://www.citeulike.org/user/arafalov

Finally, if you have to deal with overlapping entities, there was an article about a month ago about some sort of general format. I can't seem to find the article right now, but I could try digging if you are still stuck.

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Tue, Sep 11, 2012 at 11:51 AM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> Hello,
>
> If I'm extracting named entities, topics, key phrases/tags, etc. from
> documents and I want to have a representation of this document, what format
> should I use? Are there any standard or at least common formats or approaches
> people use in such situations?
>
> For example, the most straightforward format might be something like this:
>
> <document>
>   <title>doc title</title>
>   <keywords>meta keywords coming from the web page</keywords>
>   <content>page meat</content>
>   <entities>named entities recognized in the document</entities>
>   <topics>topics extracted by the annotator</topics>
>   <tags>tags extracted by the annotator</tags>
>   <relations>relations extracted by the annotator</relations>
> </document>
>
> But this is a made-up format - the XML tags above are just what somebody
> happened to pick.
>
> Are there any standard or at least common formats for this?
>
> Thanks,
> Otis
> ----
> Performance Monitoring - Solr - ElasticSearch - HBase -
> http://sematext.com/spm
>
> Search Analytics - http://sematext.com/search-analytics/index.html