Re: TokenFilter not working at index time

Dmitry Kan Tue, 24 Jun 2014 09:02:24 -0700

From a quick look, I think you have unreachable code in the
NorwegianLemmatizerFilter class (attaching a debugger would certainly be
your best bet):


    @Override
    public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            if (!keywordAttr.isKeyword()) {
                final String[] values = stemmer.stem(termAtt.buffer());
                if (values == null || values.length == 0) {
                    return false;
                } else {
                    termAtt.setEmpty().append(values[0]);
                    if (values.length > 1) {
                        for (int i = 1; i < values.length; i++) {
                            terms.add(values[i]);
                        }
                    }
                    return true;
                }
            }
            return false;
        } else if (!terms.isEmpty()) {
            // I don't think this will exhaust the terms queue in full for
            // this token, because on the next call to the incrementToken()
            // method input.incrementToken() is called
            termAtt.setEmpty().append(terms.poll());
            return true;
        } else {
            return false;
        }
    }
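
To make the ordering problem concrete, here is a minimal sketch that mirrors
only the control flow of the filter above, without Lucene. Everything in it is
a hypothetical stand-in: the class QueueOrderDemo, the DICT map (playing the
role of stemmer.stem()) and the placeholder words "a" and "b". It assumes the
extra stems are meant to follow their own token; the keyword check is left out
for brevity.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class QueueOrderDemo {
    // Hypothetical stand-in for the dictionary stemmer: word -> stems.
    static final Map<String, String[]> DICT = new HashMap<>();
    static {
        DICT.put("a", new String[] {"a1", "a2"});
        DICT.put("b", new String[] {"b1"});
    }

    // Same branch order as the filter above: the input is consulted first,
    // so queued stems are only emitted once the input is exhausted.
    static List<String> emitLikeOriginal(List<String> input) {
        Iterator<String> in = input.iterator();
        Deque<String> pending = new ArrayDeque<>();
        List<String> out = new ArrayList<>();
        while (true) {
            if (in.hasNext()) {
                String[] values = DICT.get(in.next());
                if (values == null || values.length == 0) {
                    break; // corresponds to "return false": the stream ends here
                }
                out.add(values[0]);
                for (int i = 1; i < values.length; i++) {
                    pending.add(values[i]);
                }
            } else if (!pending.isEmpty()) {
                out.add(pending.poll());
            } else {
                break;
            }
        }
        return out;
    }

    // Alternative: drain the queue before pulling the next input token.
    // Passing unknown words through unchanged is an assumption of this sketch.
    static List<String> emitDrainFirst(List<String> input) {
        Iterator<String> in = input.iterator();
        Deque<String> pending = new ArrayDeque<>();
        List<String> out = new ArrayList<>();
        while (true) {
            if (!pending.isEmpty()) {
                out.add(pending.poll());
                continue;
            }
            if (!in.hasNext()) {
                break;
            }
            String word = in.next();
            String[] values = DICT.get(word);
            if (values == null || values.length == 0) {
                out.add(word);
                continue;
            }
            out.add(values[0]);
            for (int i = 1; i < values.length; i++) {
                pending.add(values[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "b");
        System.out.println(emitLikeOriginal(tokens)); // [a1, b1, a2]
        System.out.println(emitDrainFirst(tokens));   // [a1, a2, b1]
    }
}
```

In the first variant the second stem of "a" only appears after "b" has been
processed; in the second it stays next to its own token.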


Instead I would do something like this:

[code]
// needs: java.io.IOException, java.util.ArrayList, java.util.Collection,
//        java.util.Collections, java.util.Iterator

private Iterator<String> iterator;

@Override
public boolean incrementToken() throws IOException {
  String nextStem = next();
  if (nextStem == null)
    return false;

  // chain the stems;
  // if this is undesired, you can put them into the same position
  // by restoring previous state
  termAtt.setEmpty();
  termAtt.append(nextStem);
  termAtt.setLength(nextStem.length());
  return true;
}

public String next() throws IOException
{
  // pull a new token from the input only when the stems of the
  // previous one have all been returned
  if ((iterator == null) || (!iterator.hasNext())) {
    if (!input.incrementToken())
      return null;

    char[] buffer = termAtt.buffer();
    if (buffer == null || buffer.length == 0)
      return null;

    // use only the first termAtt.length() characters of the buffer
    final String tokenTerm = new String(buffer, 0, termAtt.length());
    final String lcTokenTerm = tokenTerm.toLowerCase();

    Collection<String> stems = new ArrayList<String>();
    Collections.addAll(stems, stemmer.stem(lcTokenTerm));

    iterator = stems.iterator();
  }

  if (iterator.hasNext()) {
    String next = iterator.next();
    if (next != null) {
      return next;
    }
  }
  return null;
}

[/code]
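
Note that the code above builds the term with
new String(buffer, 0, termAtt.length()) instead of handing termAtt.buffer()
to the stemmer directly. As far as the CharTermAttribute javadoc goes,
buffer() returns the internal array, which may be longer than the current
term, and only the first length() characters are valid. Whether this is what
breaks indexing here depends on how stemmer.stem() reads the array, which is
not shown, so treat the following as a Lucene-free illustration only. The
class TermBufferDemo and its fields are hypothetical stand-ins for a reused
term buffer.

```java
public class TermBufferDemo {
    // Hypothetical stand-in for a term buffer that is reused between tokens.
    static final char[] buffer = new char[16];
    static int length = 0;

    // Overwrite the start of the buffer with the new term and record its
    // length; characters beyond 'length' are left as they were.
    static void setTerm(String term) {
        term.getChars(0, term.length(), buffer, 0);
        length = term.length();
    }

    // Reads the whole array, including stale or unused characters.
    static String wholeBuffer() {
        return new String(buffer);
    }

    // Reads only the valid part of the array.
    static String validPart() {
        return new String(buffer, 0, length);
    }

    public static void main(String[] args) {
        setTerm("universitetet");
        setTerm("student");
        System.out.println(validPart().equals("student"));   // true
        System.out.println(wholeBuffer().equals("student")); // false
    }
}
```

A dictionary lookup on the whole array would miss "student" in this sketch,
while a lookup on the valid part would find it.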


On Tue, Jun 24, 2014 at 3:00 PM, Erlend Garåsen <e.f.gara...@usit.uio.no>
wrote:

>
> I'm trying to create a Norwegian Lemmatizer based on a dictionary, but for
> some odd reason I don't get any search results even though the Analyzer in
> Solr Admin shows that it does the right thing. It works at query time if I
> have reindexed everything based on another stemmer, e.g.
> NorwegianMinimalStemmer.
>
> Here's a screenshot of how it lemmatizes the Norwegian word "studenter"
> (masculine indefinite noun, plural - English: "students"). The stem is
> "student". So far so good:
> http://folk.uio.no/erlendfg/solr/lemmatizer.png
>
> But I get no/few results if I search for "studenter" compared to
> "student". If I switch to solr.NorwegianMinimalStemFilterFactory in
> schema.xml at index time and reindex everything, it works as it should:
> <analyzer type="index">
>   <filter class="solr.NorwegianMinimalStemFilterFactory" variant="no"/>
>
> What is wrong with my TokenFilter and/or how can I debug this further? I
> have tried a lot of different things without any luck, for example decoding
> everything explicitly to UTF-8 (the wordlist is in iso-8859-1, but I'm
> reading it properly by setting the correct character set) and trimming all
> the words, but nothing helped. The byte sequence also seems to be correct for the
> stemmed word. My lemmatizer shows [73 74 75 64 65 6e 74], exactly the same
> as when I have configured NorwegianMinimalStemFilterFactory in schema.xml.
>
> Here's the source code of my lemmatizer. Please note that it is not
> finished:
> http://folk.uio.no/erlendfg/solr/
>
> Here's the line in my wordlist which contains the word "studenter":
> 66235   student studenter       subst mask appell fl ub normert 700     3
>
> The following line returns the stem (input is "studenter"):
> final String[] values = stemmer.stem(termAtt.buffer());
>
> The rest of the code is in NorwegianLemmatizerFilter. If several stems are
> returned, they are all added.
>
> Erlend
>



-- 
Dmitry Kan
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info
