I am new to Solr and I need to build a filter that lemmatizes text, both when indexing documents and when analyzing queries.
I created a custom TokenizerFactory that lemmatizes the text before passing it to the StandardTokenizer. Testing it in the Solr Analysis screen works fairly well (the index side is fine, but the query side sometimes analyzes the text twice). However, when indexing documents it only analyzes the first document, and on queries it analyzes seemingly at random (only the first query is analyzed, and to get another one analyzed I have to wait a while). It is not a performance problem, because I tried simply modifying the text instead of lemmatizing it. Here is the code:

```java
package test.solr.analysis;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeSource.AttributeFactory;

//import test.solr.analysis.TestLemmatizer;

public class TestLemmatizerTokenizerFactory extends TokenizerFactory {

  //private TestLemmatizer lemmatizer = new TestLemmatizer();
  private final int maxTokenLength;

  public TestLemmatizerTokenizerFactory(Map<String,String> args) {
    super(args);
    assureMatchVersion();
    maxTokenLength = getInt(args, "maxTokenLength", StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  public String readFully(Reader reader) {
    char[] arr = new char[8 * 1024]; // read 8K at a time
    StringBuffer buf = new StringBuffer();
    int numChars;
    try {
      while ((numChars = reader.read(arr, 0, arr.length)) > 0) {
        buf.append(arr, 0, numChars);
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
    System.out.println("### READFULLY ### => " + buf.toString());
    /* The original return with lemmatized text would be this:
       return lemmatizer.getLemma(buf.toString());
       To test it, I only change the text by appending the word "lemmatized" */
    return buf.toString() + " lemmatized";
  }

  @Override
  public StandardTokenizer create(AttributeFactory factory, Reader input) {
    // I print this to see when the tokenizer is entered
    System.out.println("### Standard tokenizer ###");
    StandardTokenizer tokenizer = new StandardTokenizer(luceneMatchVersion, factory, new StringReader(readFully(input)));
    tokenizer.setMaxTokenLength(maxTokenLength);
    return tokenizer;
  }
}
```

In schema.xml:

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="test.solr.analysis.TestLemmatizerTokenizerFactory"/>
    <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="test.solr.analysis.TestLemmatizerTokenizerFactory"/>
    <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <!-- <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this, only the first text is indexed with the word "lemmatized" appended. Then, on the first query, if I search for the word "example" it looks for "example" and "lemmatized", so it returns the first document.
On subsequent searches it does not modify the query. To get another query with the word "lemmatized" added to it, I have to wait a few minutes. What is happening? Thank you all.
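
For reference, the commented-out `TestLemmatizer` is only assumed to expose the `getLemma(String)` method used in the commented line of `readFully` above; a minimal placeholder stub of that shape would look like this:

```java
package test.solr.analysis;

// Minimal placeholder sketch of the lemmatizer referenced (commented out) in the
// factory above. The only assumption is the getLemma(String) signature used there;
// a real implementation would delegate to an actual lemmatization library.
public class TestLemmatizer {

    public String getLemma(String text) {
        // Placeholder: returns the input unchanged instead of its lemmatized form.
        return text;
    }
}
```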