Let me answer inline:

On 10 December 2015 at 06:11, Emmanuel CARTIER <emmanuel.cart...@lipn.univ-paris13.fr> wrote:
> Hi,
>
> I am a newbie in Solr and I would like to know
>
> 1. The most efficient way(s?) to index annotated corpora with Linguistic
> information at the token and chunk levels. My documents are in XML and has
> the following structure:
> <corpus id="FFF">
> <text id="XX" ...>
> <s> <!-- sentence level with no attributes -->
> <token pos="Pro" lemma="I">I</token>
> <token pos="V" lemma="be">am</token>
> <token pos="DET" lemma="a">a</token>
> <NP head="newbie" struct="ADJ-N">
> <token pos="ADJ" lemma="weak">weak</token>
> <token pos="N" lemma="newbie">newbie</token>
> </NP>
> </text>
> ...
>
> My main use case is to be able to search for tokens or lemma and faceting
> with pos. Or to search for a combination word + specific pos-tag.
> I cannot figure out how to index the token level, so as to "link" to each
> token its pos (part-of-speech) and lemma. I haven't find any documentation
> on that. At the moment, as my xml is not solr-conformant, I use the
> DataImportHandler.
>

1) For requirement 1 I cannot see any particular problem.
Just model your Solr document as a "token". Your fields will be:

surface_form
lemma
pos
...

Index all the fields and tune the field properties.
Then run your boolean queries with all the facets you want.

>
> 2. if it is possible to use an existing Tokenizer or Filter to do a
> dictionary lookup for each token (the external dictionary will contain
> lemma and pos information for each word) - this is for a use case when no
> token annotation has been done on the source document.
>

Hmm, do you want to index the lemmas as synonyms of the original token, and then skip lemmatisation at query time?
How do you want to use lemmas and surface forms together in this use case?
For storing the pos you can use the payload of the token, set by a custom token filter, or by a custom tokenizer if that fits better.
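To make the token-per-document idea from point 1 a bit more concrete, here is a rough Python sketch that flattens the XML from your mail into one dictionary per token, using the field names listed above. It is only an illustration under a few assumptions of mine: the function names `token_documents` and `payload_line` are hypothetical, the `id` scheme (corpus, text, sentence and token position) is just one possible choice, and I closed the `<s>` element in the sample so that it parses. The second helper writes each token as `surface|pos`, which I believe is the input shape Solr's delimited payload token filter expects, in case you go the payload route instead.

```python
import xml.etree.ElementTree as ET

# Sample taken from the question; the closing </s> is added so the XML parses.
SAMPLE = """<corpus id="FFF">
  <text id="XX">
    <s>
      <token pos="Pro" lemma="I">I</token>
      <token pos="V" lemma="be">am</token>
      <token pos="DET" lemma="a">a</token>
      <NP head="newbie" struct="ADJ-N">
        <token pos="ADJ" lemma="weak">weak</token>
        <token pos="N" lemma="newbie">newbie</token>
      </NP>
    </s>
  </text>
</corpus>"""


def token_documents(xml_string):
    """Return one flat dict per <token>; each dict is a candidate Solr document."""
    corpus = ET.fromstring(xml_string)
    docs = []
    for text in corpus.iter("text"):
        for sent_no, sentence in enumerate(text.iter("s")):
            # iter() also descends into chunks such as <NP>, in document order.
            for tok_no, token in enumerate(sentence.iter("token")):
                docs.append({
                    # Hypothetical id scheme: corpus-text-sentence-token.
                    "id": f"{corpus.get('id')}-{text.get('id')}-{sent_no}-{tok_no}",
                    "text_id": text.get("id"),
                    "surface_form": token.text,
                    "lemma": token.get("lemma"),
                    "pos": token.get("pos"),
                })
    return docs


def payload_line(xml_string, delimiter="|"):
    """Render all tokens as 'surface|pos' pairs separated by spaces."""
    corpus = ET.fromstring(xml_string)
    return " ".join(
        f"{token.text}{delimiter}{token.get('pos')}"
        for token in corpus.iter("token")
    )
```

With documents shaped like this, a query such as `lemma:be AND pos:V` combined with `facet.field=pos` should cover the "word + specific pos-tag" case. The chunk level (`NP`, `head`, `struct`) is left out of the sketch; it could be carried as extra fields on each token document if you need it.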
Take inspiration from this: https://wiki.apache.org/solr/OpenNLP

Cheers

>
> Any suggestion and pointers will be much appreciated!
> Thanks in advance,
>
> Emmanuel
>
> --
> Emmanuel Cartier
> Enseignant-Chercheur en Linguistique Informatique
> LIPN CNRS UMR 7030 - équipe RCLN
> http://lipn.univ-paris13.fr/fr/rcln
> Université Paris 13 Sorbonne Paris Cité
> 99 avenue Jean-Baptiste Clement
> 93430 Villetaneuse
> tél. : (+33) 06 46 79 12 86
> email : emmanuel.cart...@univ-paris13.fr
>

--
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England