From a quick look, I think you have unreachable code in the NorwegianLemmatizerFilter class (attaching a debugger would certainly be your best bet):
[code]
@Override
public boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
        if (!keywordAttr.isKeyword()) {
            final String[] values = stemmer.stem(termAtt.buffer());
            if (values == null || values.length == 0) {
                return false;
            } else {
                termAtt.setEmpty().append(values[0]);
                if (values.length > 1) {
                    for (int i = 1; i < values.length; i++) {
                        terms.add(values[i]);
                    }
                }
                return true;
            }
        }
        return false;
    } else if (!terms.isEmpty()) {
        termAtt.setEmpty().append(terms.poll());
        // I don't think this will fully exhaust the terms queue for this token,
        // because on the next call to incrementToken()
        // input.incrementToken() is called again
        return true;
    } else {
        return false;
    }
}
[/code]

Instead I would do something like this:

[code]
private Iterator<String> iterator;

@Override
public boolean incrementToken() throws IOException {
    String nextStem = next();
    if (nextStem == null)
        return false;
    // chain the stems;
    // if this is undesired, you can put them into the same position by restoring the previous state
    termAtt.setEmpty();
    termAtt.append(nextStem);
    termAtt.setLength(nextStem.length());
    return true;
}

public String next() throws IOException {
    if ((iterator == null) || (!iterator.hasNext())) {
        if (!input.incrementToken())
            return null;
        char[] buffer = termAtt.buffer();
        if (buffer == null || buffer.length == 0)
            return null;
        final String tokenTerm = new String(buffer, 0, termAtt.length());
        final String lcTokenTerm = tokenTerm.toLowerCase();
        final String[] values = stemmer.stem(lcTokenTerm);
        if (values == null || values.length == 0)
            return null;
        Collection<String> stems = new ArrayList<String>();
        Collections.addAll(stems, values);
        iterator = stems.iterator();
    }
    if (iterator.hasNext()) {
        String next = iterator.next();
        if (next != null) {
            return next;
        }
    }
    return null;
}
[/code]
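To see what the filter actually emits, you can also run it outside Solr with a tiny harness like the sketch below (rough and untested; the commented-out wiring is a guess, since I don't know your filter's exact constructor -- adjust it to whatever NorwegianLemmatizerFilter really takes, and use whichever Version constant matches your Lucene):

[code]
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenStreamDebug {

    // Consume a token stream and print every term it emits, so you can compare
    // the index-time output against what the Solr admin analysis page shows.
    static void dumpTokens(TokenStream ts) throws IOException {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }

    public static void main(String[] args) throws IOException {
        // Plain whitespace tokenization of a small sample, bypassing Solr entirely.
        Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_47,
                new StringReader("studenter student"));

        // Assumed wiring: I'm guessing the filter takes the upstream stream plus
        // your dictionary-based stemmer; replace with your real constructor call.
        // dumpTokens(new NorwegianLemmatizerFilter(source, stemmer));
        dumpTokens(source); // without the filter, just to see the raw tokens
    }
}
[/code]

Running it with and without your filter in the chain makes it easy to compare what actually ends up in the index against what the analysis page shows.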
On Tue, Jun 24, 2014 at 3:00 PM, Erlend Garåsen <e.f.gara...@usit.uio.no> wrote:

> I'm trying to create a Norwegian Lemmatizer based on a dictionary, but for
> some odd reason I don't get any search results even though the Analyzer in
> Solr Admin shows that it does the right thing. It works at query time if I
> have reindexed everything based on another stemmer, e.g.
> NorwegianMinimalStemmer.
>
> Here's a screenshot of how it lemmatizes the Norwegian word "studenter"
> (masculine indefinite noun, plural - English: "students"). The stem is
> "student". So far so good:
> http://folk.uio.no/erlendfg/solr/lemmatizer.png
>
> But I get no/few results if I search for "studenter" compared to "student".
> If I switch to solr.NorwegianMinimalStemFilterFactory in schema.xml at
> index time and reindex everything, it works as it should:
>
> <analyzer type="index">
>   <filter class="solr.NorwegianMinimalStemFilterFactory" variant="no"/>
>
> What is wrong with my TokenFilter and/or how can I debug this further? I
> have tried a lot of different things without any luck, for example decoding
> everything explicitly to UTF-8 (the wordlist is in iso-8859-1, but I'm
> reading it properly by setting the correct character set) and trimming all
> the words. The byte sequence also seems to be correct for the stemmed word.
> My lemmatizer shows [73 74 75 64 65 6e 74], exactly the same as when I have
> configured NorwegianMinimalStemFilterFactory in schema.xml.
>
> Here's the source code of my lemmatizer. Please note that it is not
> finished:
> http://folk.uio.no/erlendfg/solr/
>
> Here's the line in my wordlist which contains the word "studenter":
> 66235 student studenter subst mask appell fl ub normert 700 3
>
> The following line returns the stem (input is "studenter"):
> final String[] values = stemmer.stem(termAtt.buffer());
>
> The rest of the code is in NorwegianLemmatizerFilter. If several stems are
> returned, they are all added.
>
> Erlend

--
Dmitry Kan
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info