Otis,

If you are doing Named Entity Recognition, you may want to look at the
research area concerned with Named Entity Recognition. :-) In general,
there is inline markup (tags embedded in the text itself) and standoff
markup (annotations kept separately from the text). You seem to be
going for standoff/stand-alone markup. I am not clear, though, whether
it is just a 'discovery' format or an actual annotation format (one
that records where in the sentence each entity is, using offsets or
token ids).
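
To make the inline/standoff distinction concrete, here is a minimal
sketch in Python. Everything in it is made up for illustration: the
sample sentence, the "person"/"product" labels and the to_inline helper
are assumptions, not the format of UIMA or any other toolkit. It keeps
entities as standoff (start, end, label) character offsets and renders
them as inline tags:

```python
from xml.sax.saxutils import escape


def to_inline(text, annotations):
    """Render standoff (start, end, label) spans as inline XML-style tags.

    Hypothetical helper for illustration only. For simplicity it rejects
    spans that overlap or nest.
    """
    out = []
    pos = 0  # end offset of the last span written so far
    for start, end, label in sorted(annotations):
        if start < pos:
            raise ValueError("overlapping spans cannot be rendered inline")
        out.append(escape(text[pos:start]))  # plain text before the entity
        out.append("<%s>%s</%s>" % (label, escape(text[start:end]), label))
        pos = end
    out.append(escape(text[pos:]))  # trailing plain text
    return "".join(out)


text = "Otis Gospodnetic asked about Solr."

# Standoff form: the text stays untouched, annotations live beside it.
standoff = [(0, 16, "person"), (29, 33, "product")]

inline = to_inline(text, standoff)
print(inline)
# <person>Otis Gospodnetic</person> asked about <product>Solr</product>.
```

Note that the sketch refuses overlapping spans: simple inline tags
cannot express two entities that partially overlap, while the standoff
list can hold them without any trouble.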

UIMA (which Solr integrates with already, right?) does NER, so it must
be using some sort of format.

Also, TREC is one of the competitions and they provide marked-up
datasets you might be able to learn something from:
http://ilps.science.uva.nl/trec-entity/

If you are not sure where to start with NER, you can look at my
collection of papers, though most of them are probably too specific:
http://www.citeulike.org/user/arafalov

Finally, if you have to deal with overlapping entities, there was an
article about a month ago about some sort of general format. I can't seem
to find the article right now, but I could try digging if you are
still stuck.

Regards,
    Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Tue, Sep 11, 2012 at 11:51 AM, Otis Gospodnetic
<otis_gospodne...@yahoo.com> wrote:
> Hello,
>
> If I'm extracting named entities, topics, key phrases/tags, etc. from 
> documents and I want to have a representation of this document, what format 
> should I use? Are there any standard or at least common formats or approaches 
> people use in such situations?
>
> For example, the most straightforward format might be something like this:
>
>
> <document>
>   <title>doc title</title>
>   <keywords>meta keywords coming from the web page</keywords>
>   <content>page meat</content>
>   <entities>named entities recognized in the document</entities>
>   <topics>topics extracted by the annotator</topics>
>   <tags>tags extracted by the annotator</tags>
>   <relations>relations extracted by the annotator</relations>
> </document>
>
> But this is a made up format - the XML tags above are just what somebody 
> happened to pick.
>
> Are there any standard or at least common formats for this?
>
>
> Thanks,
> Otis
> ----
> Performance Monitoring - Solr - ElasticSearch - HBase - 
> http://sematext.com/spm
>
> Search Analytics - http://sematext.com/search-analytics/index.html
