Re: KeywordTokenizerFactory - trouble with "exact" matches

Aleksander Akerø Thu, 30 Jan 2014 07:28:28 -0800

I've come across something like this as well, can't remember where, but it
was often related to synonym functionality.


The following link shows a 3rd party QueryParser that seems to deal with
synonyms alongside edismax, and may be interesting to look at:
http://wiki.apache.org/solr/QueryParser

It is also mentioned as an issue when using the SynonymFilterFactory:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
"The Lucene QueryParser tokenizes on white space before giving any text to
the Analyzer, so if a person searches for the words sea biscit the analyzer
will be given the words "sea" and "biscit" separately, and will not know
that they match a synonym".
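
A common way around that limitation is to expand synonyms at index time
only, so the query side never has to recognise a multi-word synonym. The
following is only a sketch under assumptions: the field type name text_syn
and the file name synonyms.txt are placeholders, not taken from any real
setup.

```xml
<!-- Hypothetical field type: synonyms are applied only while indexing -->
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- synonyms.txt (placeholder) could hold a line such as:
         sea biscuit, sea biscit, seabiscuit -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- No synonym filter on the query side: the words arrive one at a
         time from the query parser, so a multi-word rule could not match -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With the expansion done while indexing, a query for sea biscit is still
split into "sea" and "biscit", but both words are then present in the index
for a document that contained "sea biscuit".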

Maybe the extended support for synonym handling is what will give us the
solution one day. For now I have solved my problem and will leave it at
that.

*Aleksander Akerø*
Systemkonsulent
Mobil: 944 89 054
E-post: [email protected]

*Gurusoft AS*
Telefon: 92 44 09 99
Østre Kullerød
www.gurusoft.no


2014-01-30 Jack Krupansky <[email protected]>:

> I vaguely recall that there was a Jira floating around for multi-word
> synonyms that dealt with parsing of spaces as well. And Robert Muir has
> (repeatedly) referred to this query parser feature as a "bug". Somehow,
> eventually, I think it will be dealt with, but the "difficulty" remains for
> now.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Aleksander Akerø
> Sent: Thursday, January 30, 2014 9:31 AM
>
> To: [email protected]
> Subject: Re: KeywordTokenizerFactory - trouble with "exact" matches
>
> Yes, I actually noted that about the filter vs. tokenizer. It's easy to get
> confused if you don't have a good understanding of the differences between
> tokenizers and filters.
>
> As for the query parser problem, there's always a workaround, but it was
> nice to be made aware of. It was sort of a ghost-like problem before.
> Although it would be great to have the option to "disable" the
> splitting on whitespace even for DisMax, I understand that it is probably
> not the most wanted feature for the next Solr release :)
>
> *Aleksander Akerø*
> Systemkonsulent
> Mobil: 944 89 054
> E-post: [email protected]
>
> *Gurusoft AS*
> Telefon: 92 44 09 99
> Østre Kullerød
> www.gurusoft.no
>
>
> 2014-01-30 Erick Erickson <[email protected]>:
>
>> Note, the comments about LowerCaseTokenizer were a red herring. You were
>> using LowerCaseFilterFactory. Note "Filter" rather than "Tokenizer". So it
>> would just do what you expected: lowercase the entire input. You would have
>> used LowerCaseTokenizerFactory in place of KeywordTokenizerFactory, not as
>> a Filter.
>>
>> As for the rest, I expect Jack is right, it's the query parsing above
>> the field input.
>>
>> Best
>> Erick
>>
>> On Thu, Jan 30, 2014 at 6:29 AM, Aleksander Akerø
>> <[email protected]> wrote:
>> > Hi Srinivasa
>> >
>> > Yes, I've come to understand that the analyzers will never "see" the
>> > whitespace, thus no need for pattern replacement, like Jack points out.
>> > So the solution would be to set which parser to use for the query. Also,
>> > Jack has pointed out that the "field" query parser should work in this
>> > particular setting -> http://wiki.apache.org/solr/QueryParser
>> >
>> > My problem, though, was that I needed this for only one of the fields in
>> > the schema; for all the other fields, e.g. name, description etc., I
>> > would very much like to make use of the eDisMax functionality. And it
>> > seems that only one query parser can be defined per query, in other
>> > words: for all fields. Jack, you may correct me if I'm wrong here :)
>> >
>> > This particular customer wanted a wildcard search at both ends of the
>> > phrase, and that complicated the problem somewhat. I therefore chose to
>> > strip all whitespace from this field in SQL at index time, using the DIH,
>> > and then to use EdgeNGramFilterFactory on both sides of the keyword, as
>> > in the config below. That seemed to work pretty nicely.
>> >
>> > <!-- #### WildCard search number #### -->
>> > <fieldType name="keyword" class="solr.TextField"
>> >            positionIncrementGap="100">
>> >   <analyzer type="index">
>> >     <tokenizer class="solr.KeywordTokenizerFactory"/>
>> >     <filter class="solr.LowerCaseFilterFactory"/>
>> >     <filter class="solr.EdgeNGramFilterFactory"
>> >             minGramSize="2" maxGramSize="25" side="front"/>
>> >     <filter class="solr.EdgeNGramFilterFactory"
>> >             minGramSize="2" maxGramSize="25" side="back"/>
>> >   </analyzer>
>> >   <analyzer type="query">
>> >     <tokenizer class="solr.KeywordTokenizerFactory"/>
>> >     <filter class="solr.LowerCaseFilterFactory"/>
>> >   </analyzer>
>> > </fieldType>
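>> >
>> > As a rough sketch of the whitespace stripping in the DIH: everything
>> > named below (the products table, the product_no column, the entity
>> > name, and the JDBC driver, URL and credentials) is a hypothetical
>> > placeholder for illustration, not the actual setup.
>> >
>> > ```xml
>> > <dataConfig>
>> >   <!-- Hypothetical connection details -->
>> >   <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
>> >               url="jdbc:mysql://localhost/shop"
>> >               user="solr" password="secret"/>
>> >   <document>
>> >     <!-- REPLACE removes every space, so "FE 009" is indexed as "FE009" -->
>> >     <entity name="product"
>> >             query="SELECT id, name,
>> >                           REPLACE(product_no, ' ', '') AS keyword
>> >                    FROM products">
>> >       <field column="id" name="id"/>
>> >       <field column="name" name="name"/>
>> >       <field column="keyword" name="keyword"/>
>> >     </entity>
>> >   </document>
>> > </dataConfig>
>> > ```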
>> >
>> > I also added a bit of extra weighting for the "keyword" field so that
>> > exact matches received a higher score.
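>> >
>> > For illustration only, such weighting could be expressed through the
>> > eDisMax qf parameter. The handler name, the field list and the boost
>> > value of 5 below are assumptions, not the actual configuration.
>> >
>> > ```xml
>> > <requestHandler name="/select" class="solr.SearchHandler">
>> >   <lst name="defaults">
>> >     <str name="defType">edismax</str>
>> >     <!-- keyword^5: matches on the keyword field get a boost of 5 -->
>> >     <str name="qf">name description keyword^5</str>
>> >   </lst>
>> > </requestHandler>
>> > ```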
>> >
>> > What this solution doesn't do is exclude values like "EE 009" when
>> > searching for "FE 009", but they are returned far down the list, which is
>> > OK for the customer, because these results are usually somewhat related
>> > or within the same category.
>> >
>> > *Aleksander Akerø*
>> > Systemkonsulent
>> > Mobil: 944 89 054
>> > E-post: [email protected]
>> >
>> > *Gurusoft AS*
>> > Telefon: 92 44 09 99
>> > Østre Kullerød
>> > www.gurusoft.no
>> >
>> >
>> > 2014-01-30 Jack Krupansky <[email protected]>
>> >
>> >> The standard, keyword-oriented query parsers will all treat unquoted,
>> >> unescaped white space as term delimiters and ignore the white space.
>> >> There is no way to bypass that behavior. So, your regex will never even
>> >> see the white space - unless you enclose the text and white space in
>> >> quotes or use a backslash to escape each white space character.
>> >>
>> >> You can use the "field" and "term" query parsers to pass a query string
>> >> as if it were fully enclosed in quotes, but that only handles a single
>> >> term and does not allow for multiple terms or any query operators. For
>> >> example:
>> >>
>> >> {!field f=myfield}Foo Bar
>> >>
>> >> See:
>> >> http://wiki.apache.org/solr/QueryParser
>> >>
>> >> You can also pre-configure the field query parser with the defType=field
>> >> parameter.
>> >>
>> >> -- Jack Krupansky
>> >>
>> >>
>> >> -----Original Message----- From: Srinivasa7
>> >> Sent: Thursday, January 30, 2014 6:37 AM
>> >>
>> >> To: [email protected]
>> >> Subject: Re: KeywordTokenizerFactory - trouble with "exact" matches
>> >>
>> >> Hi,
>> >>
>> >> I have a similar kind of problem, where I want to search for words with
>> >> spaces in them, and I want to search by stripping all the spaces.
>> >>
>> >> I have used the following schema for that:
>> >>
>> >> <fieldType name="nospaces" class="solr.TextField"
>> >> autoGeneratePhraseQueries="true"  >
>> >>            <analyzer type="index">
>> >>              <tokenizer class="solr.KeywordTokenizerFactory"/>
>> >>                <filter class="solr.LowerCaseFilterFactory"/>
>> >>                <filter class="solr.PatternReplaceFilterFactory"
>> >> pattern="[^\w]+"  replacement="" replace="all"/>
>> >>            </analyzer>
>> >>            <analyzer type="query">
>> >>
>> >>                <tokenizer class="solr.KeywordTokenizerFactory"/>
>> >>                <filter class="solr.LowerCaseFilterFactory"/>
>> >>                <filter class="solr.PatternReplaceFilterFactory"
>> >> pattern="[^\w]+"  replacement="" replace="all"/>
>> >>            </analyzer>
>> >>        </fieldType>
>> >>
>> >>
>> >> And
>> >>
>> >>
>> >> <field name="text_nospaces" type="nospaces"  indexed="true"
>> >> stored="true" omitNorms="true" />
>> >>        <copyField source="text" dest="text_nospaces" />
>> >>
>> >>
>> >>
>> >> But it is not matching the right terms. We are stripping the spaces and
>> >> indexing lowercase values when we do that.
>> >>
>> >> For example: East Enders
>> >>
>> >> When I search for the text 'east end ers', it returns no values, saying
>> >> no document found.
>> >>
>> >> I realised that Solr uses the QueryParser before passing the query
>> >> string to the query analyzer defined in the schema.
>> >>
>> >> And the QueryParser tokenizes the query string provided in the query,
>> >> so it sends each token separately to the query analyzer defined in the
>> >> schema.
>> >>
>> >> So is there any way that I can bypass this query parser, or use a query
>> >> parser that treats the entire string as a single phrase?
>> >>
>> >> At the moment I am using the dismax query parser.
>> >>
>> >> Any suggestion would be much appreciated.
>> >>
>> >> Thanks
>> >> Srinivasa
>> >>
>> >>
>> >>
>> >> --
>> >> View this message in context: http://lucene.472066.n3.nabble.com/KeywordTokenizerFactory-trouble-with-exact-matches-tp4114193p4114432.html
>> >> Sent from the Solr - User mailing list archive at Nabble.com.
>> >>
>>
>>
>
