Let me answer inline:

On 10 December 2015 at 06:11, Emmanuel CARTIER <
emmanuel.cart...@lipn.univ-paris13.fr> wrote:

> Hi,
>
> I am a newbie in Solr and I would like to know
>
> 1. The most efficient way(s?) to index annotated corpora with Linguistic
> information at the token and chunk levels. My documents are in XML and has
> the following structure:
> <corpus id="FFF">
> <text id="XX" ...>
> <s> <!-- sentence level with no attributes -->
> <token pos="Pro" lemma="I">I</token>
> <token pos="V" lemma="be">am</token>
> <token pos="DET" lemma="a">a</token>
> <NP head="newbie" struct="ADJ-N">
> <token pos="ADJ" lemma="weak">weak</token>
> <token pos="N" lemma="newbie">newbie</token>
> </NP>
> </text>
> ...
>
> My main use case is to be able to search for tokens or lemma and faceting
> with pos. Or to search for a combination word + specific pos-tag.
> I cannot figure out how to index the token level, so as to "link" to each
> token its pos (part-of-speech) and lemma. I haven't find any documentation
> on that. At the moment, as my xml is not solr-conformant, I use the
> DataImportHandler.
>

1) For requirement 1 I cannot see any particular problem. Just model your
Solr document as a "token".
Your fields will be:
surface_form
lemma
pos
...
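Index all the fields and choose the field properties with care (for
example, keep pos as a plain, untokenized string field so that it facets
on the exact tag values).
Then run your boolean queries with whatever facets you want.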
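
To make the "one Solr document per token" idea concrete, here is a minimal
sketch in Python. It is only an illustration under assumptions: the sample
XML is a well-formed version of the snippet quoted above, the "id" and
"text_id" fields and both function names are made up for the example, and
posting the documents to Solr is left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Well-formed version of the corpus snippet from the question (assumed shape).
SAMPLE = """<corpus id="FFF">
  <text id="XX">
    <s>
      <token pos="Pro" lemma="I">I</token>
      <token pos="V" lemma="be">am</token>
      <token pos="DET" lemma="a">a</token>
      <NP head="newbie" struct="ADJ-N">
        <token pos="ADJ" lemma="weak">weak</token>
        <token pos="N" lemma="newbie">newbie</token>
      </NP>
    </s>
  </text>
</corpus>"""


def token_documents(xml_string):
    """Flatten the corpus into one dict per token (one Solr document each)."""
    corpus = ET.fromstring(xml_string)
    docs = []
    for text in corpus.iter("text"):
        for s_idx, sentence in enumerate(text.iter("s")):
            # iter() walks descendants in document order, so tokens nested
            # inside chunk elements such as <NP> are picked up as well.
            for t_idx, token in enumerate(sentence.iter("token")):
                docs.append({
                    # Hypothetical unique key: corpus, text, sentence, token.
                    "id": f"{corpus.get('id')}-{text.get('id')}-{s_idx}-{t_idx}",
                    "text_id": text.get("id"),
                    "surface_form": token.text,
                    "lemma": token.get("lemma"),
                    "pos": token.get("pos"),
                })
    return docs


def lemma_pos_query(lemma, pos):
    """Build query parameters: match lemma AND pos, and facet on pos."""
    return urlencode({
        "q": f"lemma:{lemma} AND pos:{pos}",
        "facet": "true",
        "facet.field": "pos",
    })


if __name__ == "__main__":
    for doc in token_documents(SAMPLE):
        print(doc)
    print(lemma_pos_query("be", "V"))
```

The resulting list of dicts can be serialised as JSON and sent to the
update handler, and the query string appended to the select handler URL of
your core.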




>
> 2. if it is possible to use an existing Tokenizer or Filter to do a
> dictionary lookup for each token (the external dictionary will contain
> lemma and pos information for each word) - this is for a use case when no
> token annotation has been done on the source document.
>

Hmm, do you want to index the lemma as a synonym of the original token,
and then skip lemmatisation at query time?
How do you want to use lemmas and surface forms together in this use case?
To store the pos you can use the token payload, set either by a custom
token filter or by a custom tokenizer if that fits better.
Take inspiration from this:

https://wiki.apache.org/solr/OpenNLP
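
If writing a custom token filter is too much to start with, one alternative
is to do the dictionary lookup before indexing and let Solr read the pos
from a delimited payload. The sketch below is an assumption-laden example,
not a Solr component: the lexicon contents, the function name and the "UNK"
tag are hypothetical. It produces "word|POS" text, which is the input shape
that solr.DelimitedPayloadTokenFilterFactory expects when its delimiter is
"|" and it runs after a whitespace tokenizer.

```python
# Hypothetical external dictionary: lowercased surface form -> (lemma, pos).
LEXICON = {
    "i": ("I", "Pro"),
    "am": ("be", "V"),
    "a": ("a", "DET"),
    "weak": ("weak", "ADJ"),
    "newbie": ("newbie", "N"),
}


def annotate(text, lexicon, delimiter="|", unknown_pos="UNK"):
    """Look up each whitespace-separated word in the lexicon.

    Returns a pair of strings:
      - the surface forms with the pos attached as a delimited payload,
      - the lemmas, in the same order, for a separate lemma field.
    Unknown words keep their surface form as lemma and get `unknown_pos`.
    """
    with_payload = []
    lemmas = []
    for word in text.split():
        lemma, pos = lexicon.get(word.lower(), (word, unknown_pos))
        with_payload.append(f"{word}{delimiter}{pos}")
        lemmas.append(lemma)
    return " ".join(with_payload), " ".join(lemmas)


if __name__ == "__main__":
    payload_text, lemma_text = annotate("I am a weak newbie", LEXICON)
    print(payload_text)
    print(lemma_text)
```

The first string would go into a field whose analyzer strips the "|POS"
part into the payload, the second into a plain lemma field. Note that
payloads are not searchable or facetable by themselves, so for faceting on
pos the token-per-document model from point 1 is still the simpler route.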

Cheers

>
> Any suggestion and pointers will be much appreciated!
> Thanks in advance,
>
> Emmanuel
>
>
> --
> Emmanuel Cartier
> Enseignant-Chercheur en Linguistique Informatique
> LIPN CNRS UMR 7030 - équipe RCLN
> http://lipn.univ-paris13.fr/fr/rcln
> Université Paris 13 Sorbonne Paris Cité
> 99 avenue Jean-Baptiste Clement
> 93430 Villetaneuse
> tél. : (+33) 06 46 79 12 86
> email : emmanuel.cart...@univ-paris13.fr
>
>


-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
