I have written a dictionary-based lemmatizer for the University of Oslo
which I also want to donate back to Apache. Before I do that, I need
some help figuring out why it interferes with the highlighter. Some
totally irrelevant words get highlighted, so something strange is
going on. It does not happen frequently, and I cannot reproduce the
problem when I switch back to my default NorwegianMinimalStemmer.
Can someone take a look at the source code I have temporarily placed here?
http://folk.uio.no/erlendfg/solr/
Please ignore the bad parameter names "articles" and "articlePos". They
will be changed to wordClass and wordClassPos respectively.
As you can see (browse.png), the words "eller" (en: or) and "utenfor"
(en: outside) get highlighted if I search for the word "grønnest" (en:
greenest). All the other documents in the search result are
highlighted correctly.
This behaviour has nothing to do with Norwegian special characters like
æ, ø and å. I have seen other examples without these characters as well.
If I enter the word "grønnest" in the Analyzer, everything seems to work
as it should, including for the words that are sometimes wrongly
highlighted.
Some basic information about my lemmatizer:
- It is not bound to any specific language (I have also tested it with
German dictionaries).
- It needs a comma-separated dictionary with at least two columns (word
and its stem).
- A third column about the word class (verb, noun etc.) is preferable,
but not mandatory.
- POS tags can optionally be stored.
- A hash map, kept as small as possible, is loaded into memory at
startup with entries from the dictionary.
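To make the dictionary format concrete, here is a minimal sketch of how such a comma-separated file could be read into a word-to-stem map. It is not the code of the actual lemmatizer: the class name DictionarySketch, the method load, the 1-based column positions, and the rule of skipping entries whose word already equals its stem (one possible way to keep the map small) are all assumptions made for illustration. The sample lines are invented.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the code of the actual lemmatizer.
final class DictionarySketch {

    // Builds a word -> stem map from comma-separated lines.
    // Column positions are assumed to be 1-based.
    // If classPos is 0 or keepClasses is empty, no word-class filtering is done.
    static Map<String, String> load(List<String> lines, int stemPos, int wordPos,
                                    int classPos, Set<String> keepClasses) {
        Map<String, String> wordToStem = new HashMap<String, String>();
        for (String line : lines) {
            String[] cols = line.split(",", -1);
            if (cols.length < Math.max(stemPos, wordPos)) {
                continue; // skip lines that lack the word or stem column
            }
            if (classPos > 0 && !keepClasses.isEmpty()) {
                if (cols.length < classPos
                        || !keepClasses.contains(cols[classPos - 1].trim())) {
                    continue; // word class missing or not requested
                }
            }
            String word = cols[wordPos - 1].trim();
            String stem = cols[stemPos - 1].trim();
            // Assumption: entries where the word equals its stem are skipped,
            // since a lookup miss can simply return the word unchanged.
            if (!word.isEmpty() && !word.equals(stem) && !wordToStem.containsKey(word)) {
                wordToStem.put(word, stem); // keep the first stem seen for a word
            }
        }
        return wordToStem;
    }

    public static void main(String[] args) {
        // Invented sample data: stem, word, word class.
        List<String> lines = Arrays.asList(
            "hus,husene,subst",
            "hus,hus,subst",
            "eller,eller,konj");
        Set<String> keep = new HashSet<String>(Arrays.asList("subst", "verb", "adj"));
        System.out.println(load(lines, 1, 2, 3, keep));
    }
}
```

With this sample, only "husene" ends up in the map: "hus" equals its own stem, and "eller" belongs to a word class that was not requested.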
My config in schema.xml:
<!-- charset:    charset of the dictionary files
     articles:   which word classes to add (note: bad parameter name,
                 will be changed)
     reduceTo:   optional; words with several stems get reduced to one,
                 in this order
     stemPos:    where to find the stems
     wordPos:    where to find the words
     articlePos: optional; where to find the word classes (note: bad
                 parameter name, will be changed) -->
<filter
  class="no.uio.webapps.sok.analysis.DictionaryLemmatizerFilterFactory"
  charset="iso-8859-1"
  storePosTag="false"
  articles="subst,verb,adj"
  reduceTo="subst,verb"
  stemPos="1"
  wordPos="2"
  articlePos="3"
  dictionaries="fullform_bm.txt.gz,fullform_nn.txt.gz,custom_dic.txt"/>
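The reduceTo option can be illustrated with a small sketch. This is only one reading of "words with several stems get reduced to one in this order", not the actual implementation: the class ReduceToSketch, the method pickStem, and the fallback to the first candidate when no listed word class matches are assumptions. The sample word is invented for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the reduceTo idea, not the actual filter code.
final class ReduceToSketch {

    // Chooses one stem among several candidates (word class -> stem) by
    // walking the reduceTo list in order and returning the first match.
    static String pickStem(Map<String, String> stemByClass, List<String> reduceTo) {
        for (String wordClass : reduceTo) {
            String stem = stemByClass.get(wordClass);
            if (stem != null) {
                return stem;
            }
        }
        // Assumption: if no listed word class matches, use the first candidate.
        return stemByClass.isEmpty() ? null : stemByClass.values().iterator().next();
    }

    public static void main(String[] args) {
        // Invented example: "bygget" could be a verb form or a noun form.
        Map<String, String> candidates = new LinkedHashMap<String, String>();
        candidates.put("verb", "bygge");
        candidates.put("subst", "bygg");
        // With reduceTo="subst,verb" the noun stem is preferred.
        System.out.println(pickStem(candidates, Arrays.asList("subst", "verb")));
    }
}
```

Reducing to a single stem means the filter emits one token per input word, at the cost of occasionally picking the wrong word class.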
Our environment:
- Solr 4.4.0
Our highlighter config:
http://folk.uio.no/erlendfg/solr/highlighter.txt