Re: KeywordTokenizerFactory - trouble with "exact" matches

Aleksander Akerø Thu, 30 Jan 2014 00:02:47 -0800

Tried the following config for setting the autoGeneratePhraseQueries but it
didn't seem to change anything. Tested both "true" and "false".


<fieldType name="keyword" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
</fieldType>

Still I do not get any matches when searching for "FE 009" without quotes.

Set debugQuery to "on" and this is what it shows. Definitely looks like it
does this MultiPhraseQuery thing.
<lst name="debug">
<str name="rawquerystring">FE 009</str>
<str name="querystring">FE 009</str>
<str name="parsedquery">
(+(DisjunctionMaxQuery((number:FE))
DisjunctionMaxQuery((number:009))))/no_coord
</str>
<str name="parsedquery_toString">+((number:FE) (number:009))</str>
<lst name="explain"/>
<str name="QParser">ExtendedDismaxQParser</str>

I also looked into the query parsers, and it looks like the splitting on
whitespace is done by the dismax query parser before the terms are passed to
any analyzers. It is vital to me that I can control this on a per-field
basis.
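
Editorial sketch, not part of the original message: assuming the parser splits
on whitespace before analysis as described above, three commonly used ways to
keep "FE 009" together as a single term are a quoted phrase, a
backslash-escaped space, and the `{!term}` query parser selected through local
params. The helper `select_url` is a hypothetical name, and the host and core
name are the placeholders used elsewhere in this thread. Note that `{!term}`
skips analysis entirely, so it would not apply a lowercase filter; whether any
of these fits depends on the actual setup.

```python
from urllib.parse import urlencode

# Base URL copied from the thread; host and core name are placeholders.
SOLR_SELECT = "http://mysite.com/solr/corename/select"


def select_url(q, **params):
    """Hypothetical helper: build a /select URL with all parameters URL-encoded."""
    return SOLR_SELECT + "?" + urlencode({"q": q, **params})


# 1. Quoted phrase: the quotes keep both words in one clause.
phrase_url = select_url('number:"FE 009"')

# 2. Escaped whitespace: the backslash marks the space as part of the term.
escaped_url = select_url(r"number:FE\ 009")

# 3. Term query parser via local params: the whole value is looked up as one
#    indexed term in the "number" field, without any query-time analysis.
term_url = select_url("{!term f=number}FE 009")

for url in (phrase_url, escaped_url, term_url):
    print(url)
```

Only the request URLs are built here; nothing is sent to a server, so the
behaviour against a given Solr version still has to be checked with
debugQuery.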

*Aleksander Akerø*
Systemkonsulent
Mobil: 944 89 054
E-post: aleksan...@gurusoft.no

*Gurusoft AS*
Telefon: 92 44 09 99
Østre Kullerød
www.gurusoft.no


2014-01-29 Aleksander Akerø <aleksan...@gurusoft.no>

> Thanks a lot, I'll try the autoGeneratePhraseQueries property and see how
> that works.
>
> Regarding the reindexing tip, it's a good one, but due to my current
> "on the fly" setup on the servers at work I basically have to build a
> project with Maven and deploy it to Tomcat, wherein the index lies, and I
> therefore have to reindex each time, otherwise the index would be empty.
> Also, I usually use the "clean" parameter when testing with DIH. So that
> shouldn't be a problem.
>
>
> 2014-01-29 Alexandre Rafalovitch <arafa...@gmail.com>
>
>> I think the whitespace might also be the issue. The query gets parsed
>> by the standard component that splits it on whitespace before passing
>> individual components into the field searches.
>>
>> Try enabling autoGeneratePhraseQueries on the field (or field type)
>> and reindexing. See if that makes a difference.
>>
>> Regards,
>>   Alex.
>> Personal website: http://www.outerthoughts.com/
>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>> - Time is the quality of nature that keeps events from happening all
>> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
>> book)
>>
>>
>> On Wed, Jan 29, 2014 at 9:55 PM, Aleksander Akerø
>> <aleksan...@gurusoft.no> wrote:
>> > update:
>> >
>> > Guessing that this has nothing to do with the tokenizer. Tried to use
>> > the string fieldtype as well, but still the same results. So this must
>> > have to do with some other Solr config.
>> >
>> > What confuses me is that when I search for "1005", which is another
>> > valid value to search for, it works perfectly, but then again, this
>> > query contains no whitespace.
>> >
>> > Any ideas?
>> >
>> >
>> > 2014-01-29 Aleksander Akerø <aleksan...@gurusoft.no>
>> >
>> >> Thanks for the quick answer, but it doesn't help if I remove the
>> >> lowercase filter like so:
>> >>
>> >>         <fieldType name="keyword" class="solr.TextField" positionIncrementGap="100">
>> >>             <analyzer type="index">
>> >>                 <tokenizer class="solr.KeywordTokenizerFactory"/>
>> >>             </analyzer>
>> >>             <analyzer type="query">
>> >>                 <tokenizer class="solr.KeywordTokenizerFactory"/>
>> >>             </analyzer>
>> >>         </fieldType>
>> >>
>> >> I still need to add quotes to the search query to get results. And the
>> >> weird thing is that if I use the analysis tool and put in "FE 009"
>> >> (again, without quotes) for both index and query values, it highlights
>> >> the result to show a match, but when I search using the GUI it gives me
>> >> no results. The same happens when posting directly to the /select
>> >> requestHandler via GET.
>> >>
>> >> This is what I post using GET:
>> >> http://mysite.com/solr/corename/select?q=number:FE%20009&qf=number
>> >>   => this does not work
>> >> http://mysite.com/solr/corename/select?q=number:"FE%20009"&qf=number
>> >>   => this works
>> >>
>> >> Really starting to wonder if I am doing something terribly wrong
>> >> somewhere.
>> >>
>> >> This is my requestHandler btw, pretty basic:
>> >> <!-- #### Default handler #### -->
>> >>     <requestHandler name="/select" class="solr.SearchHandler">
>> >>         <lst name="defaults">
>> >>             <str name="echoParams">explicit</str>
>> >>             <str name="defType">edismax</str>
>> >>             <str name="q.alt">*:*</str>
>> >>             <str name="rows">10</str>
>> >>             <str name="fl">*,score</str>
>> >>             <str name="qf">number</str>
>> >>         </lst>
>> >>     </requestHandler>
>> >>
>> >>
>> >> 2014-01-29 Aruna Kumar Pamulapati <apamulap...@gmail.com>
>> >>
>> >>> Hi,
>> >>>
>> >>> I think the misunderstanding you are having is about the lowercase
>> >>> factory:
>> >>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseTokenizerFactory
>> >>>
>> >>> You are correct about KeywordTokenizerFactory, but the lowercase
>> >>> factory creates tokens by lowercasing all letters and dropping
>> >>> non-letters.
>> >>>
>> >>> The best place to play with and learn these pipelines is the Solr admin
>> >>> panel => analysis page.
>> >>>
>> >>>
>> >>> thanks,
>> >>> Arun
>> >>>
>> >>>
>> >>> On Wed, Jan 29, 2014 at 9:05 AM, Aleksander Akerø
>> >>> <aleksan...@gurusoft.no> wrote:
>> >>>
>> >>> > Hi, I'll try properly this time.
>> >>> >
>> >>> > According to the Solr documentation, solr.KeywordTokenizerFactory
>> >>> > should not do any tokenizing at all. Thus, if I understand this
>> >>> > correctly, it should only return exact matches, given that this is
>> >>> > the only analyzer defined in the field type. Such as the following
>> >>> > config:
>> >>> >
>> >>> > Fieldtypes:
>> >>> >         <fieldType name="keyword" class="solr.TextField" positionIncrementGap="100">
>> >>> >             <analyzer type="index">
>> >>> >                 <tokenizer class="solr.KeywordTokenizerFactory"/>
>> >>> >                 <filter class="solr.LowerCaseFilterFactory"/>
>> >>> >             </analyzer>
>> >>> >             <analyzer type="query">
>> >>> >                 <tokenizer class="solr.KeywordTokenizerFactory"/>
>> >>> >                 <filter class="solr.LowerCaseFilterFactory"/>
>> >>> >             </analyzer>
>> >>> >         </fieldType>
>> >>> >
>> >>> > Fields:
>> >>> >         <field name="number" type="keyword" indexed="true" stored="true"
>> >>> >                required="false" />
>> >>> >
>> >>> > But it seems not to be this way for me. In the index I have values
>> >>> > like "FE 009", "EE 009", "ED 009" and "FE 009-1" (without the quotes,
>> >>> > of course). But when I search for "FE 009" (without quotes), I get no
>> >>> > results. It seems that I have to add quotes to the search query in
>> >>> > order to retrieve any results, but that won't work for me, as I later
>> >>> > on have to expand the index with other fields that need whitespace
>> >>> > tokenization and such. Or would that work regardless of quotes? I have
>> >>> > come to understand that wrapping the query in quotes forces it to be
>> >>> > analyzed as one token, no matter what.
>> >>> >
>> >>> > If I get this to work I would also like to add the
>> >>> > "solr.EdgeNGramFilterFactory" to the index-side analyzer, thus adding
>> >>> > trailing wildcard matches. E.g. return "FE 009-1" and "FE 009-2" as
>> >>> > well as "FE 009" when searching for "FE 009", but not "EE 009" or
>> >>> > "ED 009". Would that be an OK way to do it?
>> >>> >
