Hi,
I've been trying to use the NGramTokenizer and I ran into a problem.
It seems like Solr tries to match documents against all of the tokens the
analyzer produces for the query term, taken together as a single phrase. For
example, if I index a document whose title field has the value "nice dog"
and then search for "dog" (with the NGramTokenizer configured for a minimum
and maximum gram size of 2), I get no results.
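Just to be explicit about what I expect: with minGram=2 and maxGram=2, the
term "dog" should be split into the tokens "do" and "og". Here is a small
sketch that prints them (the class name NGramCheck is just for illustration;
it uses the Lucene 2.x contrib NGramTokenizer and the old TokenStream API):

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenizer;

public class NGramCheck {
    public static void main(String[] args) throws Exception {
        // Tokenize "dog" with minGram=2 and maxGram=2, as in my field definition.
        TokenStream ts = new NGramTokenizer(new StringReader("dog"), 2, 2);
        Token t;
        while ((t = ts.next()) != null) {
            System.out.println(t.termText()); // expect "do", then "og"
        }
    }
}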
I can see in the Analysis tool that the tokenizer generates the right
tokens, but when Solr runs the search it tries to match the exact phrase
instead of the individual tokens.
I tried the same thing in Lucene directly and it works as expected, so it
seems to be a Solr issue. Any hint as to where I should look in order to fix it?
Here is the Lucene code I used to test the behavior of the NGramTokenizer:
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.LockObtainFailedException;

public class NGramTest {
    public static void main(String[] args) throws ParseException,
            CorruptIndexException, LockObtainFailedException, IOException {
        // Analyzer that splits the input into 2-grams and lowercases them.
        Analyzer n = new Analyzer() {
            @Override
            public TokenStream tokenStream(String fieldName, Reader reader) {
                TokenStream result = new NGramTokenizer(reader, 2, 2);
                result = new LowerCaseFilter(result);
                return result;
            }
        };
        // Index a single document whose title field contains "nice dog".
        IndexWriter writer = new IndexWriter("sample_index", n);
        Document doc = new Document();
        Field f = new Field("title", new StringReader("nice dog"));
        doc.add(f);
        writer.addDocument(doc);
        writer.close();
        // Search for "dog"; the query term goes through the same analyzer.
        IndexSearcher is = new IndexSearcher("sample_index");
        QueryParser qp = new QueryParser("", n); // default field unused; the query names "title"
        Query parse = qp.parse("title:dog");
        Hits hits = is.search(parse);
        System.out.println(hits.length());    // number of hits
        System.out.println(parse.toString()); // the query Lucene actually ran
    }
}
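(For reference, this is written against the Lucene 2.x API; NGramTokenizer
comes from the contrib analyzers jar, so that needs to be on the classpath.)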
Thanks!!!
Jonathan