Re: KeywordTokenizerFactory - trouble with "exact" matches

Aleksander Akerø Thu, 30 Jan 2014 06:36:17 -0800

Yes, I actually noted that about the filter vs. tokenizer. It's easy to get
confused if you don't have a good understanding of the differences between
tokenizers and filters.


As for the query parser problem, there's always a workaround, but it was
nice to be made aware of it. Before that, it was something of a ghost
problem. Although it would be great to be able to "disable" the splitting
on whitespace even for DisMax, I understand that it is probably not the
most wanted feature for the next Solr release :)

*Aleksander Akerø*
Systemkonsulent
Mobil: 944 89 054
E-post: aleksan...@gurusoft.no

*Gurusoft AS*
Telefon: 92 44 09 99
Østre Kullerød
www.gurusoft.no


2014-01-30 Erick Erickson <erickerick...@gmail.com>:

> Note, the comments about LowerCaseTokenizer were a red herring. You were
> using LowerCaseFilterFactory. Note "Filter" rather than "Tokenizer". So it
> would just do what you expected: lowercase the entire input. You would have
> used LowerCaseTokenizerFactory in place of KeywordTokenizerFactory, not as
> a Filter.
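To make the filter vs. tokenizer difference concrete, here is a small Python toy model. It is not Lucene code: the function names are made up for illustration, and the only behaviour assumed is that KeywordTokenizer emits the whole input as one token, while LowerCaseTokenizer splits at non-letters and lowercases.

```python
import re

def keyword_tokenizer(text):
    # Toy stand-in for KeywordTokenizerFactory: the whole input is one token.
    return [text]

def lowercase_filter(tokens):
    # Toy stand-in for LowerCaseFilterFactory: lowercases tokens, never splits.
    return [t.lower() for t in tokens]

def lowercase_tokenizer(text):
    # Toy stand-in for LowerCaseTokenizerFactory: splits at non-letters
    # (digits and whitespace are dropped) and lowercases each piece.
    return [m.lower() for m in re.findall(r"[^\W\d_]+", text)]

# Keyword tokenizer followed by the lowercase *filter*.
chain_tokens = lowercase_filter(keyword_tokenizer("FE 009"))

# Lowercase *tokenizer* used instead of the keyword tokenizer.
tokenizer_tokens = lowercase_tokenizer("FE 009")

print(chain_tokens)      # ['fe 009']
print(tokenizer_tokens)  # ['fe']
```

With the filter, the value stays in one piece; with the tokenizer in its place, the number part would be lost entirely.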
>
> As for the rest, I expect Jack is right, it's the query parsing above
> the field input.
>
> Best
> Erick
>
> On Thu, Jan 30, 2014 at 6:29 AM, Aleksander Akerø
> <aleksan...@gurusoft.no> wrote:
> > Hi Srinivasa
> >
> > Yes, I've come to understand that the analyzers will never "see" the
> > whitespace, so there is no need for pattern replacement, as Jack points
> > out. So the solution would be to set which parser to use for the query.
> > Jack has also pointed out that the "field" query parser should work in
> > this particular setting -> http://wiki.apache.org/solr/QueryParser
> >
> > My problem, though, was that I needed this for only one of the fields in
> > the schema. For all the other fields, e.g. name, description etc., I
> > would very much like to make use of the eDisMax functionality. And it
> > seems that only one query parser can be defined per query, in other
> > words: for all fields. Jack, you may correct me if I'm wrong here :)
> >
> > This particular customer wanted a wildcard search at both ends of the
> > phrase, and that complicated the problem somewhat. I therefore chose to
> > strip all whitespace from this field in SQL at index time, using the DIH,
> > and then to apply EdgeNGramFilterFactory on both sides of the keyword, as
> > in the config below. That seemed to work pretty nicely.
> >
> > <!-- #### WildCard search number #### -->
> > <fieldType name="keyword" class="solr.TextField"
> >            positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.KeywordTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.EdgeNGramFilterFactory"
> >             minGramSize="2" maxGramSize="25" side="front"/>
> >     <filter class="solr.EdgeNGramFilterFactory"
> >             minGramSize="2" maxGramSize="25" side="back"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.KeywordTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> > </fieldType>
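As a rough illustration of what the index-time chain in that config emits, here is a toy Python model (not Lucene code; the helper names are invented). It assumes the whitespace has already been stripped, so "FE 009" arrives as "FE009", and that the second n-gram filter is applied to every token produced by the first one.

```python
def edge_ngrams(token, min_gram=2, max_gram=25, side="front"):
    # Prefixes (side="front") or suffixes (side="back") of the token,
    # with lengths from min_gram up to max_gram.
    upper = min(max_gram, len(token))
    if side == "front":
        return [token[:n] for n in range(min_gram, upper + 1)]
    return [token[-n:] for n in range(min_gram, upper + 1)]

def index_terms(value):
    # Keyword tokenizer + lowercase filter: one lowercased token.
    token = value.lower()
    terms = set()
    # Front grams first, then back grams of each front gram.
    for gram in edge_ngrams(token, side="front"):
        terms.update(edge_ngrams(gram, side="back"))
    return terms

terms = index_terms("FE009")
print(sorted(terms))
```

In this model the two filters together yield every substring of at least two characters, which is what lets a plain term such as "e00" behave like a wildcard search on both ends.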
> >
> > I also added a bit of extra weighting for the "keyword" field so that
> > exact matches received a higher score.
> >
> > What this solution doesn't do is exclude values like "EE 009" when
> > searching for "FE 009", but they are returned far down the list, which is
> > OK for the customer, because these results are usually somewhat related
> > or within the same category.
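One plausible reading of why "EE 009" still shows up, sketched as a toy Python model. The assumptions are mine: the indexed terms are modelled as all substrings of 2 to 25 characters of the whitespace-stripped, lowercased value, the query is split on whitespace before analysis, and partial matches are allowed at all (which would depend on settings such as mm).

```python
def substrings(token, min_len=2, max_len=25):
    # All substrings of token with a length between min_len and max_len.
    return {
        token[i:j]
        for i in range(len(token))
        for j in range(i + min_len, min(len(token), i + max_len) + 1)
    }

def matched_terms(query, indexed_value):
    # Query terms (split on whitespace, lowercased) found among the grams.
    grams = substrings(indexed_value.lower())
    return [t for t in query.lower().split() if t in grams]

fe_hits = matched_terms("FE 009", "FE009")
ee_hits = matched_terms("FE 009", "EE009")
print(fe_hits)  # ['fe', '009']
print(ee_hits)  # ['009']
```

Under these assumptions the "EE 009" document matches on the "009" part only, which fits with it being returned but ranked further down.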
> >
> > *Aleksander Akerø*
> > Systemkonsulent
> > Mobil: 944 89 054
> > E-post: aleksan...@gurusoft.no
> >
> > *Gurusoft AS*
> > Telefon: 92 44 09 99
> > Østre Kullerød
> > www.gurusoft.no
> >
> >
> > 2014-01-30 Jack Krupansky <j...@basetechnology.com>
> >
> >> The standard, keyword-oriented query parsers will all treat unquoted,
> >> unescaped white space as term delimiters and ignore the white space.
> >> There is no way to bypass that behavior. So, your regex will never even
> >> see the white space - unless you enclose the text and white space in
> >> quotes or use a backslash to quote each white space character.
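The splitting behaviour described above can be mimicked with a small Python sketch. This is only a toy splitter written for illustration, not the actual Solr parser: it handles nothing beyond double quotes and backslash escapes.

```python
def split_query(query):
    # Split on whitespace unless it is inside double quotes or escaped
    # with a backslash. Returns the chunks an analyzer would receive.
    terms, current, in_quotes, i = [], [], False, 0
    while i < len(query):
        ch = query[i]
        if ch == "\\" and i + 1 < len(query):
            # Escaped character: keep it literally, even if it is a space.
            current.append(query[i + 1])
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes
        elif ch.isspace() and not in_quotes:
            if current:
                terms.append("".join(current))
                current = []
        else:
            current.append(ch)
        i += 1
    if current:
        terms.append("".join(current))
    return terms

print(split_query('east end ers'))        # ['east', 'end', 'ers']
print(split_query('"east end ers"'))      # ['east end ers']
print(split_query('east\\ end\\ ers'))    # ['east end ers']
```

Only in the quoted and escaped forms does the white space reach the analyzer as part of a single chunk.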
> >>
> >> You can use the "field" and "term" query parsers to pass a query string
> >> as if it were fully enclosed in quotes, but that only handles a single
> >> term and does not allow for multiple terms or any query operators. For
> >> example:
> >>
> >> {!field f=myfield}Foo Bar
> >>
> >> See:
> >> http://wiki.apache.org/solr/QueryParser
> >>
> >> You can also pre-configure the field query parser with the defType=field
> >> parameter.
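For anyone sending such a query from client code, here is a minimal Python sketch that URL-encodes the local-params form shown above. The host, port and core name in the URL are hypothetical placeholders, and "myfield" is just the example field name from the message.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical Solr endpoint; adjust host, port and core to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"

def field_query_url(field, value, base_url=SOLR_SELECT_URL):
    # Build a select URL whose q parameter uses the "field" query parser,
    # so the whole value is handed to the field's analyzer in one piece.
    params = {"q": "{!field f=%s}%s" % (field, value)}
    return base_url + "?" + urlencode(params)

url = field_query_url("myfield", "Foo Bar")
print(url)

# Round trip: decoding the query string gives back the original q value.
decoded = parse_qs(url.split("?", 1)[1])["q"][0]
print(decoded)  # {!field f=myfield}Foo Bar
```

The braces, the exclamation mark and the space all need encoding, which is easy to get wrong when building the URL by hand.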
> >>
> >> -- Jack Krupansky
> >>
> >>
> >> -----Original Message----- From: Srinivasa7
> >> Sent: Thursday, January 30, 2014 6:37 AM
> >>
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: KeywordTokenizerFactory - trouble with "exact" matches
> >>
> >> Hi,
> >>
> >> I have a similar kind of problem, where I want to search for words with
> >> spaces in them, and I want to search by stripping all the spaces.
> >>
> >> I have used the following schema for that:
> >>
> >> <fieldType name="nospaces" class="solr.TextField"
> >> autoGeneratePhraseQueries="true"  >
> >>            <analyzer type="index">
> >>              <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>                <filter class="solr.LowerCaseFilterFactory"/>
> >>                <filter class="solr.PatternReplaceFilterFactory"
> >> pattern="[^\w]+"  replacement="" replace="all"/>
> >>            </analyzer>
> >>            <analyzer type="query">
> >>
> >>                <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>                <filter class="solr.LowerCaseFilterFactory"/>
> >>                <filter class="solr.PatternReplaceFilterFactory"
> >> pattern="[^\w]+"  replacement="" replace="all"/>
> >>            </analyzer>
> >>        </fieldType>
> >>
> >>
> >> And
> >>
> >>
> >> <field name="text_nospaces" type="nospaces"  indexed="true" stored="true"
> >> omitNorms="true" />
> >> <copyField source="text" dest="text_nospaces" />
> >>
> >>
> >>
> >> But it is not matching the right terms. We are stripping the spaces and
> >> indexing lowercase values when we do that.
> >>
> >>
> >> For example: East Enders
> >>
> >> When I search for the text 'east end ers', it does not return any
> >> results, saying no document was found.
> >>
> >> I realised that Solr uses the QueryParser before passing the query
> >> string to the query analyzer defined in the schema.
> >>
> >> And the query parser tokenizes the query string provided in the query,
> >> so it sends each token separately to the query analyzer that is defined
> >> in the schema.
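The mismatch described here can be reproduced with a toy Python model of the "nospaces" analyzer (keyword tokenizer, lowercase, then the `[^\w]+` replacement). It is an illustration only, and the plain `split()` is my stand-in for the whitespace splitting done by the query parser.

```python
import re

def nospaces_analyzer(text):
    # Whole input as one token, lowercased, with every run of non-word
    # characters (including spaces) removed.
    return [re.sub(r"[^\w]+", "", text.lower())]

# Index time: the analyzer sees the full field value.
indexed = nospaces_analyzer("East Enders")

# Query time with a whitespace-splitting parser: the analyzer is called
# once per chunk, so it never sees the spaces it was meant to strip.
presplit = [t for part in "east end ers".split()
            for t in nospaces_analyzer(part)]

# Query time if the whole string reaches the analyzer in one piece.
whole = nospaces_analyzer("east end ers")

print(indexed)   # ['eastenders']
print(presplit)  # ['east', 'end', 'ers']
print(whole)     # ['eastenders']
```

None of the pre-split terms equals the single indexed term, which would explain the empty result.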
> >>
> >>
> >> So is there any way that I can bypass this query parser, or use a query
> >> parser that treats the entire string as a single phrase?
> >>
> >> At the moment I am using the dismax query parser.
> >>
> >> Any suggestion would be much appreciated.
> >>
> >> Thanks
> >> Srinivasa
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >> http://lucene.472066.n3.nabble.com/KeywordTokenizerFactory-trouble-with-exact-matches-tp4114193p4114432.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
>
