I'm not entirely sure how broad your requirements are, but I would recommend Dublin Core (DC) at the extremely lightweight end and TEI (Text Encoding Initiative) at the very heavyweight end. You could also come up with a mashup of DC and your own fields in RDF.
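
To make the DC-plus-your-own-fields idea a bit more concrete, here is a rough sketch in Python that builds an RDF/XML description using only the standard library. The `dc:` namespace is the real Dublin Core element set; the `ex:` namespace, the `build_document` helper, and the `entity` and `topic` property names are all made up for illustration, so treat this as one possible shape rather than an established schema.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
# Hypothetical namespace for annotator output; replace with your own.
EX = "http://example.org/annotations#"

# Ask ElementTree to serialize with readable prefixes.
ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)
ET.register_namespace("ex", EX)


def build_document(uri, title, keywords, entities, topics):
    """Return an rdf:RDF element describing one annotated document."""
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": uri})
    # Standard Dublin Core fields for the generic metadata.
    ET.SubElement(desc, f"{{{DC}}}title").text = title
    for keyword in keywords:
        ET.SubElement(desc, f"{{{DC}}}subject").text = keyword
    # Custom (assumed) properties for what the annotator extracted.
    for entity in entities:
        ET.SubElement(desc, f"{{{EX}}}entity").text = entity
    for topic in topics:
        ET.SubElement(desc, f"{{{EX}}}topic").text = topic
    return root


if __name__ == "__main__":
    doc = build_document(
        "http://example.org/doc/1",
        "doc title",
        ["meta keyword"],
        ["some named entity"],
        ["some topic"],
    )
    print(ET.tostring(doc, encoding="unicode"))
```

The generic fields (title, keywords) map onto DC terms that other tools already understand, while the extracted annotations live in a separate namespace you control, so you can add things like relations later without touching the DC part.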

Michael Della Bitta

------------------------------------------------
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game


On Tue, Sep 11, 2012 at 11:51 AM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> Hello,
>
> If I'm extracting named entities, topics, key phrases/tags, etc. from
> documents and I want to have a representation of this document, what format
> should I use? Are there any standard or at least common formats or approaches
> people use in such situations?
>
> For example, the most straight forward format might be something like this:
>
>
> <document>
>   <title>doc title</title>
>   <keywords>meta keywords coming from the web page</keywords>
>   <content>page meat</content>
>   <entities>name entities recognized in the document</entities>
>   <topics>topics extracted by the annotator</topics>
>   <tags>tags extracted by the annotator</tags>
>   <relations>relations extracted by the annotator</relations>
> </document>
>
> But this is a made up format - the XML tags above are just what somebody
> happened to pick.
>
> Are there any standard or at least common formats for this?
>
>
> Thanks,
> Otis
> ----
> Performance Monitoring - Solr - ElasticSearch - HBase -
> http://sematext.com/spm
>
> Search Analytics - http://sematext.com/search-analytics/index.html