Ok. I played a bit more with this and found a difference between my unit test and Solr. In Solr I'm actually using a solr.RemoveDuplicatesTokenFilterFactory when querying. When I added that to the test, it failed. So in my case I think the error comes from using solr.RemoveDuplicatesTokenFilterFactory together with solr.NGramTokenizerFactory. I don't know why adding solr.RemoveDuplicatesTokenFilterFactory generates "do og dog" for "dog" when leaving it out just generates "do og". Either way, I think that when using ngrams I shouldn't use RemoveDuplicatesTokenFilterFactory, since removing duplicates might change the structure of the word.
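For reference, the conclusion above amounts to a field type like the following sketch, based on the schema quoted further down in this thread, with the duplicate-removal filter simply left out of the query analyzer (the fieldtype name comes from that schema):

```xml
<fieldtype name="ngram_field" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- no solr.RemoveDuplicatesTokenFilterFactory here: dropping
         duplicate grams can alter the token stream being matched -->
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
```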
On Thu, Jun 26, 2008 at 12:25 AM, Jonathan Ariel <[EMAIL PROTECTED]> wrote:
> Well, it is working if I search just two letters, but that just tells me
> that something is wrong somewhere.
> The Analysis tool is showing me how "dog" is being tokenized to "do og",
> so if I'm using the same tokenizer/filters when indexing and querying
> (which is my case) I should get results even when searching "dog".
>
> I've just created a small unit test in solr to try that out.
>
> public void testNGram() throws IOException, Exception {
>     assertU("adding doc with ngram field", adoc("id", "42",
>             "text_ngram", "nice dog"));
>     assertU("committing", commit());
>
>     assertQ("test query, expect one document",
>             req("text_ngram:dog"),
>             "//[EMAIL PROTECTED]'1']"
>     );
> }
>
> As you can see, I'm adding a document with the field text_ngram set to
> the value "nice dog". Then I commit it and query for "text_ngram:dog".
>
> text_ngram is defined in the schema as:
>
> <fieldtype name="ngram_field" class="solr.TextField">
>     <analyzer type="index">
>         <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2"
>                    maxGramSize="2"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>     <analyzer type="query">
>         <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2"
>                    maxGramSize="2"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
> </fieldtype>
>
> This test passes. That means I am able to get results when searching
> "dog" on an ngram field where min and max are set to 2 and where the
> value of that field is "nice dog".
> So it doesn't seem to be an issue in solr, although I am having this
> error when using solr outside the unit test. An environment issue seems
> very improbable.
>
> Maybe I am doing something wrong. Any thoughts on that?
>
> Thanks!
>
> Jonathan
>
> On Wed, Jun 25, 2008 at 9:44 PM, Norberto Meijome <[EMAIL PROTECTED]>
> wrote:
>> On Wed, 25 Jun 2008 15:37:09 -0300
>> "Jonathan Ariel" <[EMAIL PROTECTED]> wrote:
>>
>> > I've been trying to use the NGramTokenizer and I ran into a problem.
>> > It seems like solr is trying to match documents with all the tokens
>> > that the analyzer returns from the query term. So if I index a
>> > document with a title field with the value "nice dog" and search for
>> > "dog" (where the NGramTokenizer is defined to generate tokens of min
>> > 2 and max 2) I won't get any results.
>>
>> Hi Jonathan,
>> I don't have the expertise yet to have gone straight into testing code
>> with lucene, but my 'black box' testing with the ngram tokenizer seems
>> to agree with what you found - see my latest posts over the last couple
>> of days about this.
>>
>> Have you tried searching for 'do' or 'ni' or any search term with
>> size = minGramSize? I've found that Solr matches results just fine then.
>>
>> > I can see in the Analysis tool that the tokenizer generates the right
>> > tokens, but then when solr searches it tries to match the exact
>> > phrase instead of the tokens.
>>
>> +1
>>
>> B
>>
>> _________________________
>> {Beto|Norberto|Numard} Meijome
>>
>> "Some cause happiness wherever they go; others, whenever they go."
>>    Oscar Wilde
>>
>> I speak for myself, not my employer. Contents may be hot. Slippery when
>> wet. Reading disclaimers makes you go blind. Writing them is worse. You
>> have been Warned.
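Norberto's observation that queries of exactly minGramSize match fine can be illustrated with a tiny standalone sketch. This is plain Java, not Lucene's actual NGramTokenizer (which in this era also emitted grams spanning whitespace); it just shows, per word, which bigrams the index and the query would each contain:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal per-word bigram sketch (min = max = 2), NOT Lucene's
// real NGramTokenizer -- just enough to see which grams overlap.
public class BigramDemo {
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            // slide a 2-character window over each word
            for (int i = 0; i + 2 <= word.length(); i++) {
                out.add(word.substring(i, i + 2));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Indexed value "nice dog" contributes: ni, ic, ce, do, og
        System.out.println(bigrams("nice dog"));
        // Query "dog" produces two grams: do, og -- both exist in the
        // index, but a phrase-style query would also require adjacency.
        System.out.println(bigrams("dog"));
        // A query of exactly minGramSize ("do") is a single gram, which
        // is why two-letter searches match without trouble.
        System.out.println(bigrams("do"));
    }
}
```

The point of the sketch: every query longer than maxGramSize turns into multiple grams, so how Solr combines those grams (all-must-match, phrase, or any-match) decides whether "dog" finds "nice dog".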