Re: My new lemmatizer interfers with the highlighter

Michael Sokolov Mon, 15 Dec 2014 05:13:00 -0800

I'm not sure, but is it necessary to set positionIncAttr to 1 when there are *not* any lemmas found? I think the usual pattern is to call clearAttributes() at the start of incrementToken.


-Mike
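
To make the position-increment question above concrete, here is a minimal plain-Java sketch (no Lucene dependency) of how token positions follow from position increments: each token's position is the previous position plus its increment. Counting starts at -1 so that a first token with increment 1 lands on position 0, which is how Lucene numbers positions as far as I know. The class name, the `positions` method and the four-token sequence are made up for illustration and are not taken from the lemmatizer source. The sketch only shows why an increment left at 0 where 1 was intended shifts every later token, which is one way a highlighter could mark the wrong words. Whether that is what happens here is not confirmed.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionIncrementSketch {

    /** Absolute position of each token: start at -1, then add each increment. */
    static List<Integer> positions(int[] increments) {
        List<Integer> out = new ArrayList<>();
        int pos = -1;
        for (int inc : increments) {
            pos += inc;
            out.add(pos);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical stream: word, its lemma stacked on the same position
        // (increment 0), then two ordinary words.
        int[] intended = {1, 0, 1, 1};
        // Same stream, but the increment of the third token was left at 0.
        int[] stale = {1, 0, 0, 1};

        System.out.println(positions(intended)); // [0, 0, 1, 2]
        System.out.println(positions(stale));    // [0, 0, 0, 1]
    }
}
```

In the second run the third token collapses onto position 0 and the fourth moves to position 1, so anything that maps positions back to text would be off by one from there on.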


On 12/15/14 7:38 AM, Erlend Garåsen wrote:

I have written a dictionary-based lemmatizer for University of Oslo which I also want to donate back to Apache. Before I do that, I need some help to figure out why it interferes with the highlighter. Some totally irrelevant words get highlighted, so there is something strange going on. It does not happen frequently, but I'm not able to reproduce the problem if I change back to my default NorwegianMinimalStemmer.
Can someone take a look at the source code I have temporarily placed here?
http://folk.uio.no/erlendfg/solr/
Please ignore the bad parameter names "articles" and "articlePos". They will be changed to wordClass and wordClassPos respectively.
As you can see (browse.png), the words "eller" (en: or) and "utenfor" (en: outside) get highlighted if I search for the word "grønnest" (en: greenest). Otherwise all the other documents in the search result have correct highlighted words.
This behaviour has nothing to do with Norwegian special characters like æ, ø and å. I have seen other examples without these characters as well. If I enter the word "grønnest" in the Analyzer, everything seems to work as it should, also the words which sometimes get wrongly highlighted.
Some basic information about my lemmatizer:
- It is not bound to any specific language (it works with German dictionaries as well (tested)).
- It needs a comma-separated dictionary with at least two columns (word and its stem).
- A third column about the word class (verb, noun etc.) is preferable, but not mandatory.
- POS-tags may be stored optionally
- A hashmap, kept as small as possible, is loaded into memory at startup with entries from the dictionary
My config in schema.xml:
<filter class="no.uio.webapps.sok.analysis.DictionaryLemmatizerFilterFactory"
        charset="iso-8859-1" // charset of the dic file
        storePosTag="false"
        articles="subst,verb,adj" // which word class to add (note: bad parameter name, will be changed)
        reduceTo="subst,verb" // words with several stems get reduced to one in this order. Optionally
        stemPos="1" // Where to find the stems
        wordPos="2" // Where to find the words
        articlePos="3" // Where to find the word classes. Note: Bad parameter name, will be changed. Optionally
        dictionaries="fullform_bm.txt.gz,fullform_nn.txt.gz,custom_dic.txt"/>
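
For readers who want a picture of what such a dictionary lookup might look like, here is a rough sketch in plain Java. It is not the code behind DictionaryLemmatizerFilterFactory; the class `DictionarySketch`, the `load` method, the sample lines and the word class tag "konj" are all hypothetical. It assumes 1-based column positions as in stemPos, wordPos and articlePos above, treats the word class column as optional, and keeps only entries whose word class is in the configured list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DictionarySketch {

    /** Builds a word -> stems map from comma-separated lines (1-based column positions). */
    static Map<String, List<String>> load(List<String> lines, int stemPos, int wordPos,
                                          int classPos, Set<String> wordClasses) {
        Map<String, List<String>> map = new HashMap<>();
        for (String line : lines) {
            String[] cols = line.split(",", -1);
            if (cols.length < Math.max(stemPos, wordPos)) {
                continue; // skip lines that lack the word or stem column
            }
            // The word class column is optional: filter only when it is present.
            boolean hasClass = classPos >= 1 && classPos <= cols.length;
            if (hasClass && !wordClasses.contains(cols[classPos - 1].trim())) {
                continue;
            }
            String word = cols[wordPos - 1].trim();
            String stem = cols[stemPos - 1].trim();
            List<String> stems = map.computeIfAbsent(word, k -> new ArrayList<>());
            if (!stems.contains(stem)) {
                stems.add(stem); // a word may have several stems
            }
        }
        return map;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary lines in the order stem,word,wordclass.
        List<String> lines = Arrays.asList("bil,biler,subst", "bil,bilen,subst", "eller,eller,konj");
        Set<String> classes = new HashSet<>(Arrays.asList("subst", "verb", "adj"));
        Map<String, List<String>> map = load(lines, 1, 2, 3, classes);

        System.out.println(map.get("biler"));         // [bil]
        System.out.println(map.containsKey("eller")); // false
    }
}
```

Dropping entries whose word class is not listed keeps the in-memory map small, which matches the stated goal of loading as small a hashmap as possible.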

Our environment:
- Solr 4.4.0

Our highlighter config:
http://folk.uio.no/erlendfg/solr/highlighter.txt
