I'm not sure, but is it necessary to set positionIncAttr to 1 when *no*
lemmas are found? I think the usual pattern is to call
clearAttributes() at the start of incrementToken().
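
To make the position-increment question concrete, here is a minimal
plain-Java model of how increments add up to token positions. It does
not use the Lucene API, the class and method names are made up for
illustration, and the token stream in the comments is an invented
example, not output from the filter under discussion.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not Lucene code: a token's position is the running
// sum of the position increments seen so far, starting from -1.
public class PositionIncrementSketch {

    /** Turns per-token position increments into absolute positions. */
    static List<Integer> positions(int[] increments) {
        List<Integer> result = new ArrayList<>();
        int position = -1;
        for (int increment : increments) {
            position += increment;
            result.add(position);
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented stream: word, word, lemma stacked on the second word, word.
        // Increment 0 keeps the lemma at the same position as its word.
        System.out.println(positions(new int[] {1, 1, 0, 1})); // [0, 1, 1, 2]

        // If the lemma were emitted with increment 1 instead, it would take
        // a position of its own and shift every later token by one.
        System.out.println(positions(new int[] {1, 1, 1, 1})); // [0, 1, 2, 3]
    }
}
```

Whether this has anything to do with the highlighting problem is a
guess; it only shows what changes when the increment is 0 versus 1.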
-Mike
On 12/15/14 7:38 AM, Erlend Garåsen wrote:
I have written a dictionary-based lemmatizer for the University of Oslo
which I also want to donate back to Apache. Before I do that, I need
some help figuring out why it interferes with the highlighter. Some
totally irrelevant words get highlighted, so there is something
strange going on. It does not happen often, and I cannot reproduce the
problem when I switch back to my default NorwegianMinimalStemmer.
Can someone take a look at the source code I have temporarily placed
here?
http://folk.uio.no/erlendfg/solr/
Please ignore the bad parameter names "articles" and "articlePos".
They will be changed to wordClass and wordClassPos respectively.
As you can see (browse.png), the words "eller" (en: or) and "utenfor"
(en: outside) get highlighted if I search for the word "grønnest" (en:
greenest). All the other documents in the search result are
highlighted correctly.
This behaviour has nothing to do with Norwegian special characters
like æ, ø and å. I have seen other examples without these characters
as well. If I enter the word "grønnest" in the Analyzer, everything
seems to work as it should, including for the words that are sometimes
wrongly highlighted.
Some basic information about my lemmatizer:
- It is not bound to any specific language (I have tested it with
German dictionaries as well).
- It needs a comma-separated dictionary with at least two columns
(the word and its stem).
- A third column with the word class (verb, noun etc.) is preferable,
but not mandatory.
- POS tags can optionally be stored.
- At startup, a hashmap that is as small as possible is loaded into
memory with entries from the dictionary.
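
To illustrate the points above, here is a minimal plain-Java sketch of
such a dictionary lookup. It is not the code from the lemmatizer
itself: the class, method and field names, the sample rows and the
word class "konj" are all made up for illustration, and filtering by
word class at load time is only one assumed way of keeping the map
small.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the actual lemmatizer source.
public class LemmaDictionarySketch {

    /** One dictionary reading of a word: its stem and (possibly empty) word class. */
    static final class Entry {
        final String stem;
        final String wordClass;

        Entry(String stem, String wordClass) {
            this.stem = stem;
            this.wordClass = wordClass;
        }
    }

    /**
     * Builds the word -> readings map from comma-separated rows.
     * Column positions are 1-based; classPos = 0 means "no word class column".
     * Rows that carry a word class not listed in wantedClasses are skipped.
     */
    static Map<String, List<Entry>> load(List<String> lines, int stemPos, int wordPos,
                                         int classPos, List<String> wantedClasses) {
        Map<String, List<Entry>> dict = new HashMap<>();
        for (String line : lines) {
            String[] cols = line.split(",");
            if (cols.length < Math.max(stemPos, wordPos)) {
                continue; // row has no stem or no word column
            }
            String wordClass = (classPos > 0 && cols.length >= classPos)
                    ? cols[classPos - 1].trim() : "";
            if (!wordClass.isEmpty() && !wantedClasses.contains(wordClass)) {
                continue; // unwanted word class, keep the map small
            }
            dict.computeIfAbsent(cols[wordPos - 1].trim(), k -> new ArrayList<>())
                .add(new Entry(cols[stemPos - 1].trim(), wordClass));
        }
        return dict;
    }

    /**
     * Returns the stems of a word. If the word has several readings and one of
     * them has a class listed in reduceTo, only the first match (in reduceTo
     * order) is returned; otherwise all distinct stems are returned.
     */
    static List<String> lemmas(Map<String, List<Entry>> dict, String word,
                               List<String> reduceTo) {
        List<String> stems = new ArrayList<>();
        List<Entry> entries = dict.get(word);
        if (entries == null) {
            return stems; // unknown word: no lemmas
        }
        if (entries.size() > 1) {
            for (String preferred : reduceTo) {
                for (Entry e : entries) {
                    if (e.wordClass.equals(preferred)) {
                        stems.add(e.stem);
                        return stems;
                    }
                }
            }
        }
        for (Entry e : entries) {
            if (!stems.contains(e.stem)) {
                stems.add(e.stem);
            }
        }
        return stems;
    }

    public static void main(String[] args) {
        // Invented rows in the order stem,word,class (stemPos=1, wordPos=2, classPos=3).
        List<String> lines = Arrays.asList(
                "green,greenest,adj", "see,saw,verb", "saw,saw,subst", "or,or,konj");
        Map<String, List<Entry>> dict =
                load(lines, 1, 2, 3, Arrays.asList("subst", "verb", "adj"));
        List<String> reduceTo = Arrays.asList("subst", "verb");
        System.out.println(lemmas(dict, "greenest", reduceTo)); // [green]
        System.out.println(lemmas(dict, "saw", reduceTo));      // [saw]
        System.out.println(lemmas(dict, "or", reduceTo));       // []
    }
}
```

Reading the dictionary files and handling their charset is left out on
purpose; the sketch only covers the column layout and the reduction
order.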
My config in schema.xml:
<!-- Notes on the attributes below:
     charset      charset of the dictionary files
     articles     word classes to include (bad parameter name, will be
                  changed)
     reduceTo     words with several stems are reduced to one, in this
                  order (optional)
     stemPos      column holding the stems
     wordPos      column holding the words
     articlePos   column holding the word classes (optional; bad
                  parameter name, will be changed)
-->
<filter
  class="no.uio.webapps.sok.analysis.DictionaryLemmatizerFilterFactory"
  charset="iso-8859-1"
  storePosTag="false"
  articles="subst,verb,adj"
  reduceTo="subst,verb"
  stemPos="1"
  wordPos="2"
  articlePos="3"
  dictionaries="fullform_bm.txt.gz,fullform_nn.txt.gz,custom_dic.txt"/>
Our environment:
- Solr 4.4.0
Our highlighter config:
http://folk.uio.no/erlendfg/solr/highlighter.txt