Re: Edgengram

Erick Erickson Wed, 01 Jun 2011 08:44:57 -0700

Be a little careful here. LowerCaseTokenizerFactory is different than
KeywordTokenizerFactory.


LowerCaseTokenizerFactory splits on non-letters and lowercases, so it
will give you more than one term. E.g. the string
"Intelligence can't be MeaSurEd" will give you 5 terms,
any of which may match:
"intelligence", "can", "t", "be", "measured".
KeywordTokenizerFactory followed by, say, LowerCaseFilterFactory
would instead give you exactly one token:
"intelligence can't be measured".

So searching for "measured" would get a hit in the first case but
not in the second. Searching for "intellig*" would hit both.
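
For illustration, the two analysis chains compared above might look
like this in schema.xml. This is a minimal sketch, not a tested
configuration: the fieldType names text_lc_tokens and text_lc_keyword
are made up for the example, and each type declares a single analyzer,
so the same chain applies at both index and query time.

```xml
<!-- Hypothetical name. LowerCaseTokenizerFactory splits on non-letters and
     lowercases, so "Intelligence can't be MeaSurEd" becomes the five terms
     intelligence / can / t / be / measured. -->
<fieldType name="text_lc_tokens" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory" />
  </analyzer>
</fieldType>

<!-- Hypothetical name. KeywordTokenizerFactory keeps the whole input as one
     token, and LowerCaseFilterFactory then lowercases it, giving the single
     term "intelligence can't be measured". -->
<fieldType name="text_lc_keyword" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
```

Pasting the sample string into the admin/analysis page for each type
is a quick way to confirm which terms actually come out.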

Neither is better; just make sure each does what you want!

This page will help a lot:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseTokenizerFactory
as will the admin/analysis page.

Best
Erick

On Wed, Jun 1, 2011 at 10:43 AM, Brian Lamb
<brian.l...@journalexperts.com> wrote:
> Hi Tomás,
>
> Thank you very much for your suggestion. I took another crack at it using
> your recommendation and it worked ideally. The only thing I had to change
> was
>
> <analyzer type="query">
>  <tokenizer class="solr.KeywordTokenizerFactory" />
> </analyzer>
>
> to
>
> <analyzer type="query">
>  <tokenizer class="solr.LowerCaseTokenizerFactory" />
> </analyzer>
>
> The first did not produce any results but the second worked beautifully.
>
> Thanks!
>
> Brian Lamb
>
> 2011/5/31 Tomás Fernández Löbbe <tomasflo...@gmail.com>
>
>> ...or also use the LowerCaseTokenizerFactory at query time for consistency,
>> but not the edge ngram filter.
>>
>> 2011/5/31 Tomás Fernández Löbbe <tomasflo...@gmail.com>
>>
>> > Hi Brian, I don't know if I understand what you are trying to achieve. You
>> > want the term query "abcdefg" to have an idf of 1 instead of 7? I think using
>> > the KeywordTokenizerFactory at query time should work. It would be
>> > something like:
>> >
>> > <fieldType name="edgengram" class="solr.TextField"
>> > positionIncrementGap="1000">
>> >   <analyzer type="index">
>> >
>> >     <tokenizer class="solr.LowerCaseTokenizerFactory" />
>> >     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
>> > maxGramSize="25" side="front" />
>> >   </analyzer>
>> >   <analyzer type="query">
>> >   <tokenizer class="solr.KeywordTokenizerFactory" />
>> >   </analyzer>
>> > </fieldType>
>> >
>> > this way, at query time "abcdefg" won't be turned to "a ab abc abcd abcde
>> > abcdef abcdefg". At index time it will.
>> >
>> > Regards,
>> > Tomás
>> >
>> >
>> > On Tue, May 31, 2011 at 1:07 PM, Brian Lamb
>> > <brian.l...@journalexperts.com> wrote:
>> >
>> >> <fieldType name="edgengram" class="solr.TextField"
>> >> positionIncrementGap="1000">
>> >>   <analyzer>
>> >>     <tokenizer class="solr.LowerCaseTokenizerFactory" />
>> >>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
>> >> maxGramSize="25" side="front" />
>> >>   </analyzer>
>> >> </fieldType>
>> >>
>> >> I believe I used that link when I initially set up the field and it worked
>> >> great (and I'm still using it in other places). In this particular example
>> >> however it does not appear to be practical for me. I mentioned that I have a
>> >> similarity class that returns 1 for the idf and in the case of an edgengram,
>> >> it returns 1 * length of the search string.
>> >>
>> >> Thanks,
>> >>
>> >> Brian Lamb
>> >>
>> >> On Tue, May 31, 2011 at 11:34 AM, bmdakshinamur...@gmail.com <
>> >> bmdakshinamur...@gmail.com> wrote:
>> >>
>> >> > Can you specify the analyzer you are using for your queries?
>> >> >
>> >> > Maybe you could use a KeywordAnalyzer for your queries so you don't end up
>> >> > matching parts of your query.
>> >> >
>> >> > http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
>> >> > This should help you.
>> >> >
>> >> > On Tue, May 31, 2011 at 8:24 PM, Brian Lamb
>> >> > <brian.l...@journalexperts.com> wrote:
>> >> >
>> >> > > In this particular case, I will be doing a solr search based on user
>> >> > > preferences. So I will not be depending on the user to type "abcdefg". That
>> >> > > will be automatically generated based on user selections.
>> >> > >
>> >> > > The contents of the field do not contain spaces and since I am creating the
>> >> > > search parameters, case isn't important either.
>> >> > >
>> >> > > Thanks,
>> >> > >
>> >> > > Brian Lamb
>> >> > >
>> >> > > On Tue, May 31, 2011 at 9:44 AM, Erick Erickson
>> >> > > <erickerick...@gmail.com> wrote:
>> >> > >
>> >> > > > That'll work for your case, although be aware that string types aren't
>> >> > > > analyzed at all, so case matters, as do spaces etc.
>> >> > > >
>> >> > > > What is the use-case here? If you explain it a bit there might be
>> >> > > > better answers....
>> >> > > >
>> >> > > > Best
>> >> > > > Erick
>> >> > > >
>> >> > > > On Fri, May 27, 2011 at 9:17 AM, Brian Lamb
>> >> > > > <brian.l...@journalexperts.com> wrote:
>> >> > > > > For this, I ended up just changing it to string and using "abcdefg*" to
>> >> > > > > match. That seems to work so far.
>> >> > > > >
>> >> > > > > Thanks,
>> >> > > > >
>> >> > > > > Brian Lamb
>> >> > > > >
>> >> > > > > On Wed, May 25, 2011 at 4:53 PM, Brian Lamb
>> >> > > > > <brian.l...@journalexperts.com> wrote:
>> >> > > > >
>> >> > > > >> Hi all,
>> >> > > > >>
>> >> > > > >> I'm running into some confusion with the way edgengram works. I have the
>> >> > > > >> field set up as:
>> >> > > > >>
>> >> > > > >> <fieldType name="edgengram" class="solr.TextField"
>> >> > > > >> positionIncrementGap="1000">
>> >> > > > >>    <analyzer>
>> >> > > > >>      <tokenizer class="solr.LowerCaseTokenizerFactory" />
>> >> > > > >>      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
>> >> > > > >>        maxGramSize="100" side="front" />
>> >> > > > >>    </analyzer>
>> >> > > > >> </fieldType>
>> >> > > > >>
>> >> > > > >> I've also set up my own similarity class that returns 1 as the idf score.
>> >> > > > >> What I've found this does is if I match a string "abcdefg" against a field
>> >> > > > >> containing "abcdefghijklmnop", then the idf will score that as a 7:
>> >> > > > >>
>> >> > > > >> 7.0 = idf(myfield: a=51 ab=23 abc=2 abcd=2 abcde=2 abcdef=2 abcdefg=2)
>> >> > > > >>
>> >> > > > >> I get why that's happening, but is there a way to avoid that? Do I need to
>> >> > > > >> do a new field type to achieve the desired effect?
>> >> > > > >>
>> >> > > > >> Thanks,
>> >> > > > >>
>> >> > > > >> Brian Lamb
>> >> > > > >>
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Thanks and Regards,
>> >> > DakshinaMurthy BM
>> >> >
>> >>
>> >
>> >
>>
>
