Ok. I played a bit more with this and found a difference between my unit test and Solr. In Solr I'm actually using a solr.RemoveDuplicatesTokenFilterFactory when querying. When I added that to the test, it failed. So in my case I think the error comes from using solr.RemoveDuplicatesTokenFilterFactory together with solr.NGramTokenizerFactory. I don't know why adding solr.RemoveDuplicatesTokenFilterFactory generates "do og dog" for "dog" when leaving it out just generates "do og". Either way, I think that when using ngrams I shouldn't use RemoveDuplicatesTokenFilterFactory, since removing duplicates might change the structure of the word.
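For reference, the conclusion above amounts to a field type like the following sketch, based on the schema quoted further down in this thread, with the duplicate-removal filter simply left out of the query analyzer (the fieldtype name comes from that schema):

```xml
<fieldtype name="ngram_field" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- no solr.RemoveDuplicatesTokenFilterFactory here: dropping
         duplicate grams can alter the token stream being matched -->
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
```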
On Thu, Jun 26, 2008 at 12:25 AM, Jonathan Ariel <[EMAIL PROTECTED]> wrote:
> Well, it is working if I search just two letters, but that just tells me
> that something is wrong somewhere.
> The Analysis tool is showing me how "dog" is being tokenized to "do og",
> so if I'm using the same tokenizer/filters when indexing and querying
> (which is my case) I should get results even when searching "dog".
>
> I've just created a small unit test in solr to try that out.
>
> public void testNGram() throws IOException, Exception {
>     assertU("adding doc with ngram field", adoc("id", "42",
>             "text_ngram", "nice dog"));
>     assertU("committing", commit());
>
>     assertQ("test query, expect one document",
>             req("text_ngram:dog"),
>             "//[EMAIL PROTECTED]'1']"
>     );
> }
>
> As you can see, I'm adding a document with the field text_ngram set to
> the value "nice dog". Then I commit it and query for "text_ngram:dog".
>
> text_ngram is defined in the schema as:
>
> <fieldtype name="ngram_field" class="solr.TextField">
>     <analyzer type="index">
>         <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2"
>                    maxGramSize="2"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>     <analyzer type="query">
>         <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2"
>                    maxGramSize="2"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
> </fieldtype>
>
> This test passes. That means I am able to get results when searching
> "dog" on an ngram field where min and max are set to 2 and where the
> value of that field is "nice dog".
> So it doesn't seem to be an issue in solr, although I am having this
> error when using solr outside the unit test. An environment issue seems
> very improbable.
>
> Maybe I am doing something wrong. Any thoughts on that?
>
> Thanks!
>
> Jonathan
>
> On Wed, Jun 25, 2008 at 9:44 PM, Norberto Meijome <[EMAIL PROTECTED]>
> wrote:
>> On Wed, 25 Jun 2008 15:37:09 -0300
>> "Jonathan Ariel" <[EMAIL PROTECTED]> wrote:
>>
>> > I've been trying to use the NGramTokenizer and I ran into a problem.
>> > It seems like solr is trying to match documents with all the tokens
>> > that the analyzer returns from the query term. So if I index a
>> > document with a title field with the value "nice dog" and search for
>> > "dog" (where the NGramTokenizer is defined to generate tokens of min
>> > 2 and max 2) I won't get any results.
>>
>> Hi Jonathan,
>> I don't have the expertise yet to have gone straight into testing code
>> with lucene, but my 'black box' testing with the ngram tokenizer seems
>> to agree with what you found - see my latest posts over the last couple
>> of days about this.
>>
>> Have you tried searching for 'do' or 'ni' or any search term with
>> size = minGramSize? I've found that Solr matches results just fine then.
>>
>> > I can see in the Analysis tool that the tokenizer generates the right
>> > tokens, but then when solr searches it tries to match the exact
>> > phrase instead of the tokens.
>>
>> +1
>>
>> B
>>
>> _________________________
>> {Beto|Norberto|Numard} Meijome
>>
>> "Some cause happiness wherever they go; others, whenever they go."
>>    Oscar Wilde
>>
>> I speak for myself, not my employer. Contents may be hot. Slippery when
>> wet. Reading disclaimers makes you go blind. Writing them is worse. You
>> have been Warned.
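Norberto's observation that queries of exactly minGramSize match fine can be illustrated with a tiny standalone sketch. This is plain Java, not Lucene's actual NGramTokenizer (which in this era also emitted grams spanning whitespace); it just shows, per word, which bigrams the index and the query would each contain:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal per-word bigram sketch (min = max = 2), NOT Lucene's
// real NGramTokenizer -- just enough to see which grams overlap.
public class BigramDemo {
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            // slide a 2-character window over each word
            for (int i = 0; i + 2 <= word.length(); i++) {
                out.add(word.substring(i, i + 2));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Indexed value "nice dog" contributes: ni, ic, ce, do, og
        System.out.println(bigrams("nice dog"));
        // Query "dog" produces two grams: do, og -- both exist in the
        // index, but a phrase-style query would also require adjacency.
        System.out.println(bigrams("dog"));
        // A query of exactly minGramSize ("do") is a single gram, which
        // is why two-letter searches match without trouble.
        System.out.println(bigrams("do"));
    }
}
```

The point of the sketch: every query longer than maxGramSize turns into multiple grams, so how Solr combines those grams (all-must-match, phrase, or any-match) decides whether "dog" finds "nice dog".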