As Michael hinted, I believe RDF would be the de facto answer. Within it, OWL and SKOS are certainly the classical choices for expressing ontologies and taxonomies. Processors such as OWLAPI can go pretty far there.
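To make the RDF suggestion a bit more concrete, here is a minimal sketch of how extracted annotations could be written out as N-Triples with nothing but the Python standard library. It is only an illustration, not a recommendation of a particular vocabulary: the choice of Dublin Core terms (title, subject, references) plus skos:prefLabel is my assumption, the function names are made up, and the example.org URIs are placeholders.

```python
# Hypothetical sketch: serialise extracted annotations as N-Triples.
# The vocabulary choice (Dublin Core terms + SKOS) is an assumption
# made for illustration; the helper names are invented.

DCT = "http://purl.org/dc/terms/"
SKOS = "http://www.w3.org/2004/02/skos/core#"


def literal(text):
    """Return text as a plain N-Triples string literal with basic escaping."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"' + escaped + '"'


def to_ntriples(doc_uri, title, topics, entities):
    """Build N-Triples lines for one document.

    topics   -- list of topic URIs
    entities -- list of (entity URI, human-readable label) pairs
    """
    lines = [f"<{doc_uri}> <{DCT}title> {literal(title)} ."]
    for topic_uri in topics:
        lines.append(f"<{doc_uri}> <{DCT}subject> <{topic_uri}> .")
    for entity_uri, label in entities:
        lines.append(f"<{doc_uri}> <{DCT}references> <{entity_uri}> .")
        lines.append(f"<{entity_uri}> <{SKOS}prefLabel> {literal(label)} .")
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        to_ntriples(
            "http://example.org/doc/1",
            "doc title",
            ["http://example.org/topic/search"],
            [("http://example.org/entity/solr", "Solr")],
        )
    )
```

Each annotation becomes one or two triples, so any RDF-aware tool should be able to load the output, and the same data could be flattened into fields if the final target is an index.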
As Péter hinted, schema.org might provide a way to complement an existing XML format with semantic information. I have not yet seen much of the broad support everyone talks about (apparently because the big names have spoken out), in particular in terms of shared tooling.

There are many, many alternatives. We have recently been working with a format called DITA, which is an XML format for annotated documents; it also claims to provide semantic support (e.g. with taxonomies).

Is your goal to serve these as food for Solr to index?

Paul

On 11 Sept. 2012 at 17:51, Otis Gospodnetic wrote:

> Hello,
>
> If I'm extracting named entities, topics, key phrases/tags, etc. from
> documents and I want to have a representation of this document, what format
> should I use? Are there any standard or at least common formats or approaches
> people use in such situations?
>
> For example, the most straight forward format might be something like this:
>
>
> <document>
>   <title>doc title</title>
>   <keywords>meta keywords coming from the web page</keywords>
>   <content>page meat</content>
>   <entities>name entities recognized in the document</entities>
>   <topics>topics extracted by the annotator</topics>
>   <tags>tags extracted by the annotator</tags>
>   <relations>relations extracted by the annotator</relations>
> </document>
>
> But this is a made up format - the XML tags above are just what somebody
> happened to pick.
>
> Are there any standard or at least common formats for this?