Re: Getting a hit on "the}" but not on "the" or "}"

Steven White Wed, 06 Jul 2016 06:29:32 -0700

Thanks Erick.  Moving the stopword factory to after WDFF fixed the problem: I
no longer get a hit on "the}" or on variations such as "the]", "the.", etc.  I
did not have to change preserveOriginal from 1 to 0.
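
As a rough sketch, the reordered chain would look something like the following. It reuses the filters from the analyzer quoted further down in this thread. Placing StopFilterFactory directly after WordDelimiterFilterFactory (rather than somewhere later in the chain) is an assumption, since only "after WDFF" is stated here.

```xml
<!-- Sketch only: same filters as the original analyzer, with the stop filter
     moved to directly after WDFF (the exact position is an assumption). -->
<fieldType name="all_raw_text" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- WDFF runs first, so "the." or "the}" is split into "the" (plus the
         preserved original) before any stopword check happens. -->
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
            generateNumberParts="1" splitOnCaseChange="0" catenateWords="1"
            splitOnNumerics="1" stemEnglishPossessive="1" generateWordParts="1"
            catenateAll="1" catenateNumbers="1"/>
    <!-- The stop filter now sees the bare "the" produced by WDFF. -->
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt"
            ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the same analyzer is used at index and query time, documents need to be reindexed after a change like this for the index to reflect the new chain.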


Regarding preserveOriginal in WDFF, I have it set to 1 because of my
understanding of what it does: if I have the text "a...@apache.org", then
with preserveOriginal set to 1, WDFF will give me "abc", "apache", "org" and
"a...@apache.org".  In effect, if someone searches on "abc" or "apache" or
"org", as well as on "a...@apache.org", I will get a hit.  If I instead set
preserveOriginal to 0, then searching for "a...@apache.org" will not give me a
hit.  My goal is to still get a hit on the original word, not just on the parts
that WDFF breaks it into.  Is my understanding correct?
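
For concreteness, the setting in question can be sketched as below. This is a trimmed, illustrative fragment showing only the two attributes relevant to the question, not the full filter definition. The token lists in the comments follow Erick's description of what happens to "the}" in the quoted reply below.

```xml
<!-- preserveOriginal="1": "the}" yields the word part "the"
     plus the original token "the}". -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" preserveOriginal="1"/>

<!-- preserveOriginal="0": "the}" yields only the word part "the". -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" preserveOriginal="0"/>
```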

Steve


On Tue, Jul 5, 2016 at 7:47 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Either that's a typo or your problem is that it should be terms.fl, not
> terms.f1 (lower-case ell as opposed to the number one). You should be seeing
> the raw terms in your index with TermsComponent, similar to the "load terms"
> in the schema browser, except that it allows you to query specific terms
> starting with terms.prefix.
>
> WordDelimiterFilterFactory (WDFF) is what's stripping off your
> non-alphanumeric characters. Your stopword factory is before WDFF, so
> anything like be. (notice the period) would NOT be removed as a stopword.
> Then when that token is passed through WDFF the period disappears. Order
> matters.
>
> You have preserveOriginal="1" in WDFF, which means the original token
> is preserved intact, so "the}" gets changed to two tokens, "the" and "the}".
>
> So you really have to look more closely at your analysis chain; that's
> pretty much where your problems appear to be.
>
> Best,
> Erick
>
> On Tue, Jul 5, 2016 at 4:30 PM, Steven White <swhite4...@gmail.com> wrote:
> > Hi Erick,
> >
> > By TermsComponent, I think you meant for me to try the following?
> >
> >   http://vottopg15.ottawa.ibm.com:8983/solr/testdata/terms?terms.f1=ALL_FIELDS&terms.prefix=the
> >
> > If so, I tried it and I'm getting 0 hits:
> >
> >   <response>
> >     <lst name="responseHeader">
> >       <int name="status">0</int>
> >       <int name="QTime">0</int>
> >     </lst>
> >     <lst name="terms"/>
> >   </response>
> >
> > In fact, I'm getting 0 hits on anything I pass to "terms.prefix"
> >
> > Another thing I noticed is this.  Using Solr Admin Console's Schema
> > Browser, after selecting the field "ALL_FIELDS" and clicking on the Load
> > Term Info button, I'm seeing "be" in the list!!  Like so:
> >
> >   4 localhost
> >     abc
> >     a...@localhost.com
> >     com
> >     intern
> >     be
> >     /intern
> >     abclocalhostcom
> >     user
> >
> > I don't understand what I'm looking at here (in the schema browser) or if
> > this is at all related to my issue (I'm seeing "be" listed here and
> > wondering if it has something to do with my issue).  If I click on any of
> > the listed words, I get a hit, but I get 0 hits when I click on "be".
> >
> > Thanks.
> >
> > Steve
> >
> >
> > On Tue, Jul 5, 2016 at 7:07 PM, Steven White <swhite4...@gmail.com> wrote:
> >
> >> Thanks for the quick reply Erick.
> >>
> >> Here is the analyzer I'm using:
> >>
> >>   <fieldType name="all_raw_text" class="solr.TextField"
> >>       positionIncrementGap="100" autoGeneratePhraseQueries="true">
> >>     <analyzer>
> >>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>       <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt"
> >>           ignoreCase="true"/>
> >>       <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
> >>           generateNumberParts="1" splitOnCaseChange="0" catenateWords="1"
> >>           splitOnNumerics="1" stemEnglishPossessive="1" generateWordParts="1"
> >>           catenateAll="1" catenateNumbers="1"/>
> >>       <filter class="solr.LowerCaseFilterFactory"/>
> >>       <filter class="solr.EnglishPossessiveFilterFactory"/>
> >>       <filter class="solr.KeywordMarkerFilterFactory"
> >>           protected="protwords.txt"/>
> >>       <filter class="solr.PorterStemFilterFactory"/>
> >>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>     </analyzer>
> >>   </fieldType>
> >>
> >> If in fact it is my analyzer, what part of it is causing this?  If not,
> >> I'm not clear about the "TermsComponent" that you suggested having me look
> >> into.  How do I "point" it at my field?  I have zero knowledge about this.
> >> Is this something I do from Solr's Admin Console via the Schema Browser
> >> link?
> >>
> >> Steve
> >>
> >>
> >> On Tue, Jul 5, 2016 at 6:51 PM, Erick Erickson <erickerick...@gmail.com>
> >> wrote:
> >>
> >>> My guess is that your field analysis isn't stripping the various
> >>> non-alphanumeric characters, thus "the]" is actually a token in your
> >>> index, square bracket and all. If that's true, it certainly doesn't match
> >>> the stopword "the".
> >>>
> >>> You can check by using the TermsComponent, pointing it at your field
> >>> and setting terms.prefix=the
> >>>
> >>> See:
> >>> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>> On Tue, Jul 5, 2016 at 2:34 PM, Steven White <swhite4...@gmail.com>
> >>> wrote:
> >>> > Hi everyone,
> >>> >
> >>> > I'm trying to understand why I get a hit when I search for "the}" but not
> >>> > when I search for "the" (searches are done without the quotes, and "the"
> >>> > is a stopword in my case).
> >>> >
> >>> > Here is the debugQuery output using "the}":
> >>> >   "debug": {
> >>> >     "rawquerystring": "the}",
> >>> >     "querystring": "the}",
> >>> >     "parsedquery": "(+DisjunctionMaxQuery(((ALL_FIELDS:the} ALL_FIELDS:the))~1.0))/no_coord",
> >>> >     "parsedquery_toString": "+((ALL_FIELDS:the} ALL_FIELDS:the))~1.0",
> >>> >     "explain": {
> >>> >       "-1.5.1804": "\n0.14220011 = sum of:\n  0.14220011 =
> >>> > weight(ALL_FIELDS:the in 0) [DefaultSimilarity], result of:\n
> >>> 0.14220011
> >>> > = score(doc=0,freq=2.0), product of:\n      0.51863563 = queryWeight,
> >>> > product of:\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
> >>> >  0.20899205 = queryNorm\n      0.27418116 = fieldWeight in 0, product
> >>> of:\n
> >>> >        1.4142135 = tf(freq=2.0), with freq of:\n          2.0 =
> >>> > termFreq=2.0\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
> >>> >  0.078125 = fieldNorm(doc=0)\n",
> >>> >       "-1.5.3552": "\n0.14220011 = sum of:\n  0.14220011 =
> >>> > weight(ALL_FIELDS:the in 0) [DefaultSimilarity], result of:\n
> >>> 0.14220011
> >>> > = score(doc=0,freq=2.0), product of:\n      0.51863563 = queryWeight,
> >>> > product of:\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
> >>> >  0.20899205 = queryNorm\n      0.27418116 = fieldWeight in 0, product
> >>> of:\n
> >>> >        1.4142135 = tf(freq=2.0), with freq of:\n          2.0 =
> >>> > termFreq=2.0\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
> >>> >  0.078125 = fieldNorm(doc=0)\n",
> >>> >       "-1.5.3554": "\n0.14220011 = sum of:\n  0.14220011 =
> >>> > weight(ALL_FIELDS:the in 1) [DefaultSimilarity], result of:\n
> >>> 0.14220011
> >>> > = score(doc=1,freq=2.0), product of:\n      0.51863563 = queryWeight,
> >>> > product of:\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
> >>> >  0.20899205 = queryNorm\n      0.27418116 = fieldWeight in 1, product
> >>> of:\n
> >>> >        1.4142135 = tf(freq=2.0), with freq of:\n          2.0 =
> >>> > termFreq=2.0\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
> >>> >  0.078125 = fieldNorm(doc=1)\n",
> >>> >       "-1.5.1802": "\n0.1137601 = sum of:\n  0.1137601 =
> >>> > weight(ALL_FIELDS:the in 0) [DefaultSimilarity], result of:\n
> >>> 0.1137601
> >>> > = score(doc=0,freq=2.0), product of:\n      0.51863563 = queryWeight,
> >>> > product of:\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
> >>> >  0.20899205 = queryNorm\n      0.21934493 = fieldWeight in 0, product
> >>> of:\n
> >>> >        1.4142135 = tf(freq=2.0), with freq of:\n          2.0 =
> >>> > termFreq=2.0\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
> >>> >  0.0625 = fieldNorm(doc=0)\n"
> >>> >     },
> >>> >     "QParser": "ExtendedDismaxQParser",
> >>> >     "altquerystring": null,
> >>> >     "boost_queries": null,
> >>> >     "parsed_boost_queries": [],
> >>> >     "boostfuncs": null,
> >>> >     "filter_queries": [
> >>> >       "ISBN_GROUP_ID:2"
> >>> >     ],
> >>> >     "parsed_filter_queries": [
> >>> >       "ISBN_GROUP_ID:2"
> >>> >     ],
> >>> >
> >>> > Here is the debugQuery output using "the"
> >>> >   "debug": {
> >>> >     "rawquerystring": "the",
> >>> >     "querystring": "the",
> >>> >     "parsedquery": "(+())/no_coord",
> >>> >     "parsedquery_toString": "+()",
> >>> >     "explain": {},
> >>> >     "QParser": "ExtendedDismaxQParser",
> >>> >     "altquerystring": null,
> >>> >     "boost_queries": null,
> >>> >     "parsed_boost_queries": [],
> >>> >     "boostfuncs": null,
> >>> >     "filter_queries": [
> >>> >       "ISBN_GROUP_ID:2"
> >>> >     ],
> >>> >     "parsed_filter_queries": [
> >>> >       "ISBN_GROUP_ID:2"
> >>> >     ],
> >>> >
> >>> > As expected, I get no hits when I search for just "}":
> >>> >   "debug": {
> >>> >     "rawquerystring": "}",
> >>> >     "querystring": "}",
> >>> >     "parsedquery": "(+DisjunctionMaxQuery((ALL_FIELDS:})~1.0))/no_coord",
> >>> >     "parsedquery_toString": "+(ALL_FIELDS:})~1.0",
> >>> >     "explain": {},
> >>> >     "QParser": "ExtendedDismaxQParser",
> >>> >     "altquerystring": null,
> >>> >     "boost_queries": null,
> >>> >     "parsed_boost_queries": [],
> >>> >     "boostfuncs": null,
> >>> >     "filter_queries": [
> >>> >       "ISBN_GROUP_ID:2"
> >>> >     ],
> >>> >     "parsed_filter_queries": [
> >>> >       "ISBN_GROUP_ID:2"
> >>> >     ],
> >>> >
> >>> > In case it matters, I'm also getting a hit when I search for "the." or
> >>> > "the]" or "the/" or "the," or "the=" etc.
> >>> >
> >>> > Thanks in advance.
> >>> >
> >>> > Steve
> >>>
> >>
> >>
>
