Indexing of annotated corpora

Emmanuel CARTIER Wed, 09 Dec 2015 22:12:07 -0800

Hi,

I am new to Solr and I would like to know:

1. The most efficient way(s?) to index annotated corpora with linguistic information at the token and chunk levels. My documents are in XML and have the following structure:

<corpus id="FFF">
<text id="XX" ...>
<s> <!-- sentence level with no attributes -->
<token pos="Pro" lemma="I">I</token>
<token pos="V" lemma="be">am</token>
<token pos="DET" lemma="a">a</token>
<NP head="newbie" struct="ADJ-N">
<token pos="ADJ" lemma="weak">weak</token>
<token pos="N" lemma="newbie">newbie</token>
</NP>
</s>
</text>
...

My main use case is to be able to search for tokens or lemmas and to facet by pos, or to search for a combination of a word and a specific pos tag. I cannot figure out how to index the token level so as to "link" each token to its pos (part of speech) and lemma. I haven't found any documentation on that. At the moment, as my XML is not Solr-conformant, I use the DataImportHandler.
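
To make the "link" between a token, its pos and its lemma concrete, here is a minimal sketch of one possible flattening, done in Python outside Solr. It is only an illustration under assumptions: the field names (words, lemmas, pos, word_pos) and the underscore separator are hypothetical and are not something Solr prescribes. The lists stay aligned by position, and the combined word_pos values are one way a "word + specific pos tag" query could be expressed.

```python
import xml.etree.ElementTree as ET

# One sentence taken from the sample structure above.
SAMPLE = """<s>
<token pos="Pro" lemma="I">I</token>
<token pos="V" lemma="be">am</token>
<token pos="DET" lemma="a">a</token>
<NP head="newbie" struct="ADJ-N">
<token pos="ADJ" lemma="weak">weak</token>
<token pos="N" lemma="newbie">newbie</token>
</NP>
</s>"""


def sentence_to_fields(sentence_xml):
    """Flatten one <s> element into parallel, position-aligned lists.

    The field names are hypothetical; they only illustrate the idea.
    """
    root = ET.fromstring(sentence_xml)
    # iter() walks in document order and also reaches tokens nested in chunks.
    tokens = list(root.iter("token"))
    return {
        "words": [t.text for t in tokens],
        "lemmas": [t.get("lemma") for t in tokens],
        "pos": [t.get("pos") for t in tokens],
        # Combined value so that a word can be matched together with its tag.
        "word_pos": [f"{t.text}_{t.get('pos')}" for t in tokens],
    }


fields = sentence_to_fields(SAMPLE)
print(fields["word_pos"])
```

The chunk level (the NP element with its head and struct attributes) is ignored in this sketch.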

2. Whether it is possible to use an existing Tokenizer or Filter to do a dictionary lookup for each token (the external dictionary will contain lemma and pos information for each word). This is for the use case where no token annotation has been done on the source document.
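
The kind of lookup meant here can be sketched as follows. This is plain Python, not a Solr Tokenizer or Filter, and the LEXICON contents, the whitespace tokenization and the "UNK" fallback tag are all assumptions made for illustration.

```python
# Hypothetical external dictionary: lowercased surface form -> (lemma, pos).
LEXICON = {
    "i": ("I", "Pro"),
    "am": ("be", "V"),
    "a": ("a", "DET"),
    "newbie": ("newbie", "N"),
}


def annotate(text):
    """Return (word, lemma, pos) for each whitespace-separated word.

    Words missing from the dictionary keep their surface form as lemma
    and receive the placeholder tag "UNK".
    """
    annotated = []
    for word in text.split():
        lemma, pos = LEXICON.get(word.lower(), (word, "UNK"))
        annotated.append((word, lemma, pos))
    return annotated


result = annotate("I am a weak newbie")
print(result)
```

A real dictionary would also have to deal with ambiguous forms (one word with several possible pos tags), which this sketch does not attempt.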


Any suggestions and pointers will be much appreciated!
Thanks in advance,

Emmanuel


--
Emmanuel Cartier
Enseignant-Chercheur en Linguistique Informatique
LIPN CNRS UMR 7030 - équipe RCLN
http://lipn.univ-paris13.fr/fr/rcln
Université Paris 13 Sorbonne Paris Cité
99 avenue Jean-Baptiste Clement
93430 Villetaneuse
tél. : (+33) 06 46 79 12 86
email : emmanuel.cart...@univ-paris13.fr
