Hi,

I am new to Solr and would like to know:

1. The most efficient way(s) to index annotated corpora with linguistic information at the token and chunk levels. My documents are in XML and have the following structure:
<corpus id="FFF">
<text id="XX" ...>
<s> <!-- sentence level with no attributes -->
<token pos="Pro" lemma="I">I</token>
<token pos="V" lemma="be">am</token>
<token pos="DET" lemma="a">a</token>
<NP head="newbie" struct="ADJ-N">
<token pos="ADJ" lemma="weak">weak</token>
<token pos="N" lemma="newbie">newbie</token>
</NP>
</s>
</text>
...

My main use case is to search for tokens or lemmas and to facet by POS, or to search for a combination of a word and a specific POS tag. I cannot figure out how to index the token level so as to "link" each token to its POS (part of speech) and lemma. I haven't found any documentation on that. At the moment, since my XML is not in the format Solr expects, I use the DataImportHandler.
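
To make the "link" I am after concrete, here is a minimal Python sketch, done outside Solr and purely for illustration. It reads the structure above and keeps each token's form, POS and lemma together in one record. The function name token_records and the record keys are my own hypothetical choices, and the sample is the snippet above with the elided parts closed so that it parses.

```python
import xml.etree.ElementTree as ET

# Sample based on the structure above; elided parts are closed so it parses.
SAMPLE = """<corpus id="FFF">
<text id="XX">
<s>
<token pos="Pro" lemma="I">I</token>
<token pos="V" lemma="be">am</token>
<token pos="DET" lemma="a">a</token>
<NP head="newbie" struct="ADJ-N">
<token pos="ADJ" lemma="weak">weak</token>
<token pos="N" lemma="newbie">newbie</token>
</NP>
</s>
</text>
</corpus>"""


def token_records(xml_string):
    """Return one dict per <token>, in document order (hypothetical helper)."""
    root = ET.fromstring(xml_string)
    records = []
    for text in root.iter("text"):
        # iter() also descends into chunks such as <NP>, so nested tokens are kept.
        for position, tok in enumerate(text.iter("token")):
            records.append({
                "text_id": text.get("id"),
                "position": position,
                "form": tok.text,
                "pos": tok.get("pos"),
                "lemma": tok.get("lemma"),
            })
    return records


records = token_records(SAMPLE)
for record in records:
    print(record)
```

What I do not know is how to carry this per-token association into a Solr index so that a query on the form or lemma can be restricted or faceted by POS.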

2. Whether it is possible to use an existing Tokenizer or Filter to do a dictionary lookup for each token (the external dictionary would contain lemma and POS information for each word). This is for the use case where no token annotation has been done on the source document.
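
The lookup I have in mind for point 2 can be sketched as follows, again in plain Python rather than as a Solr component. The LEXICON contents, the annotate function and the whitespace tokenization are all assumptions made for the example; a real dictionary would be an external resource.

```python
# Hypothetical external dictionary: lowercased surface form -> (lemma, pos).
LEXICON = {
    "i": ("I", "Pro"),
    "am": ("be", "V"),
    "a": ("a", "DET"),
    "weak": ("weak", "ADJ"),
    "newbie": ("newbie", "N"),
}


def annotate(sentence, lexicon):
    """Split on whitespace and attach (lemma, pos) to each token.

    Words missing from the lexicon get (None, None).
    """
    annotated = []
    for form in sentence.split():
        lemma, pos = lexicon.get(form.lower(), (None, None))
        annotated.append((form, lemma, pos))
    return annotated


result = annotate("I am a weak newbie", LEXICON)
for form, lemma, pos in result:
    print(form, lemma, pos)
```

Ideally this step would happen inside the analysis chain at index time, which is why I am asking about an existing Tokenizer or Filter.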

Any suggestions and pointers will be much appreciated!
Thanks in advance,

Emmanuel


--
Emmanuel Cartier
Lecturer and Researcher in Computational Linguistics
LIPN CNRS UMR 7030 - RCLN team
http://lipn.univ-paris13.fr/fr/rcln
Université Paris 13 Sorbonne Paris Cité
99 avenue Jean-Baptiste Clement
93430 Villetaneuse
tel.: (+33) 06 46 79 12 86
email : emmanuel.cart...@univ-paris13.fr
