I am new to Solr and I need to build a filter that lemmatizes text, both when indexing documents and when analyzing queries.
I created a custom TokenizerFactory that lemmatizes the text before passing it to the StandardTokenizer. Testing it in the Solr Analysis screen works fairly well (the index side is fine, but the query side sometimes analyzes the text twice). However, when indexing documents it only analyzes the first document, and on queries it analyzes seemingly at random (only the first query is analyzed, and to get another one analyzed I have to wait a while). It is not a performance problem, because I tried simply modifying the text instead of lemmatizing it. Here is the code:

```java
package test.solr.analysis;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeSource.AttributeFactory;

//import test.solr.analysis.TestLemmatizer;

public class TestLemmatizerTokenizerFactory extends TokenizerFactory {

  //private TestLemmatizer lemmatizer = new TestLemmatizer();
  private final int maxTokenLength;

  public TestLemmatizerTokenizerFactory(Map<String,String> args) {
    super(args);
    assureMatchVersion();
    maxTokenLength = getInt(args, "maxTokenLength", StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  public String readFully(Reader reader) {
    char[] arr = new char[8 * 1024]; // read 8K at a time
    StringBuffer buf = new StringBuffer();
    int numChars;
    try {
      while ((numChars = reader.read(arr, 0, arr.length)) > 0) {
        buf.append(arr, 0, numChars);
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
    System.out.println("### READFULLY ### => " + buf.toString());
    /* The original return with lemmatized text would be this:
       return lemmatizer.getLemma(buf.toString());
       To test it, I only change the text by appending the word "lemmatized" */
    return buf.toString() + " lemmatized";
  }

  @Override
  public StandardTokenizer create(AttributeFactory factory, Reader input) {
    // I print this to see when the tokenizer is entered
    System.out.println("### Standard tokenizer ###");
    StandardTokenizer tokenizer = new StandardTokenizer(luceneMatchVersion, factory, new StringReader(readFully(input)));
    tokenizer.setMaxTokenLength(maxTokenLength);
    return tokenizer;
  }
}
```

In schema.xml:

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="test.solr.analysis.TestLemmatizerTokenizerFactory"/>
    <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="test.solr.analysis.TestLemmatizerTokenizerFactory"/>
    <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <!-- <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this, only the first text is indexed with the word "lemmatized" appended. Then, on the first query, if I search for the word "example" it looks for "example" and "lemmatized", so it returns the first document.
On subsequent searches it does not modify the query. To get another query with the word "lemmatized" added to it, I have to wait a few minutes. What is happening? Thank you all.
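
For reference, the commented-out `TestLemmatizer` is only assumed to expose the `getLemma(String)` method used in the commented line of `readFully` above; a minimal placeholder stub of that shape would look like this:

```java
package test.solr.analysis;

// Minimal placeholder sketch of the lemmatizer referenced (commented out) in the
// factory above. The only assumption is the getLemma(String) signature used there;
// a real implementation would delegate to an actual lemmatization library.
public class TestLemmatizer {

    public String getLemma(String text) {
        // Placeholder: returns the input unchanged instead of its lemmatized form.
        return text;
    }
}
```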