Re: Getting a hit on "the}" but not on "the" or "}"

Erick Erickson Tue, 05 Jul 2016 16:48:43 -0700

Either that's a typo or your problem is that the parameter should be
terms.fl, not terms.f1 (lowercase ell as opposed to the number one). With
TermsComponent you should be seeing the raw terms in your index, similar to
"Load Term Info" in the schema browser, except that it lets you restrict the
output to terms starting with terms.prefix.
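
As a minimal sketch of the corrected request (Python standard library only;
the host localhost:8983 is a placeholder, while the core name testdata and the
field ALL_FIELDS are taken from the URL quoted below), the query string could
be built like this:

```python
from urllib.parse import urlencode


def terms_url(base, field, prefix):
    """Build a TermsComponent URL; note terms.fl ends in a letter ell, not a one."""
    params = {"terms.fl": field, "terms.prefix": prefix}
    return f"{base}/terms?{urlencode(params)}"


# Placeholder host; substitute your own Solr base URL.
url = terms_url("http://localhost:8983/solr/testdata", "ALL_FIELDS", "the")
print(url)
```

Building the parameters from a dict rather than by hand makes a one-character
slip like f1 versus fl easier to spot.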


WordDelimiterFilterFactory (WDFF) is what's stripping off your
non-alphanumeric characters. Your StopFilterFactory comes before WDFF, so a
token like "be." (notice the period) does NOT match the stopword "be" and is
not removed. Then, when that token passes through WDFF, the period disappears
and "be" ends up in the index. Order matters.

You have preserveOriginal="1" in WDFF, which means the original token is
preserved intact, so "the}" becomes two tokens, "the" and "the}".
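
To make the ordering effect concrete, here is a rough Python simulation of the
first few steps of that chain. It is only a sketch under stated assumptions:
the stopword set is a made-up two-word subset, and word_delimiter() is a
drastically simplified stand-in that just pulls out alphanumeric runs, not
WDFF's actual algorithm (no catenation, stemming, or possessive handling).

```python
import re

# Assumed subset of lang/stopwords_en.txt, for illustration only.
STOPWORDS = {"the", "be"}


def stop_filter(tokens):
    """Drop tokens that exactly match a stopword (ignoreCase="true")."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def word_delimiter(tokens, preserve_original=True):
    """Simplified stand-in for WDFF: emit alphanumeric runs of each token."""
    out = []
    for tok in tokens:
        if preserve_original:
            out.append(tok)
        for part in re.findall(r"[A-Za-z0-9]+", tok):
            # Avoid emitting the token twice when it has no delimiters.
            if not (preserve_original and part == tok):
                out.append(part)
    return out


def analyze(text):
    """Whitespace tokenize -> stop filter -> word delimiter -> lowercase."""
    tokens = text.split()
    tokens = stop_filter(tokens)      # runs BEFORE delimiters are stripped
    tokens = word_delimiter(tokens)
    tokens = [t.lower() for t in tokens]
    return list(dict.fromkeys(tokens))  # remove duplicates, keep order


for query in ("the}", "the", "}", "be."):
    print(repr(query), "->", analyze(query))
```

In this sketch "the}" slips past the stop filter and comes out as both "the}"
and "the", while a bare "the" is removed entirely, which lines up with the
parsedquery values in the debug output quoted below.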

So you really have to look more closely at your analysis chain; that's pretty
much where your problems appear to be.

Best,
Erick

On Tue, Jul 5, 2016 at 4:30 PM, Steven White <swhite4...@gmail.com> wrote:
> Hi Erick,
>
> By TermsComponent, I think you meant for me to try the following?
>
>
> http://vottopg15.ottawa.ibm.com:8983/solr/testdata/terms?terms.f1=ALL_FIELDS&terms.prefix=the
>
> If so, I tried it and I'm getting 0 hits:
>
>   <response>
>     <lst name="responseHeader">
>       <int name="status">0</int>
>       <int name="QTime">0</int>
>     </lst>
>     <lst name="terms"/>
>   </response>
>
> In fact, I'm getting 0 hits on anything I pass to "terms.prefix"
>
> Another thing I noticed is this.  Using Solr Admin Console's Schema
> Browser, after selecting the field "ALL_FIELDS" and clicking on the Load
> Term Info button, I'm seeing "be" in the list!!  Like so:
>
>   4 localhost
>     abc
>     a...@localhost.com
>     com
>     intern
>     be
>     /intern
>     abclocalhostcom
>     user
>
> I don't understand what I'm looking at here (in the schema browser) or if
> this is at all related to my issue (I'm seeing "be" listed here and
> wondering if it has something to do with my issue).  If I click on any of
> the listed words, I get a hit, but I get 0 hits when I click on "be".
>
> Thanks.
>
> Steve
>
>
> On Tue, Jul 5, 2016 at 7:07 PM, Steven White <swhite4...@gmail.com> wrote:
>
>> Thanks for the quick reply Erick.
>>
>> Here is the analyzer I'm using:
>>
>>   <fieldType name="all_raw_text" class="solr.TextField"
>> positionIncrementGap="100" autoGeneratePhraseQueries="true">
>>     <analyzer>
>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>       <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt"
>> ignoreCase="true"/>
>>       <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
>> generateNumberParts="1" splitOnCaseChange="0" catenateWords="1"
>> splitOnNumerics="1" stemEnglishPossessive="1" generateWordParts="1"
>> catenateAll="1" catenateNumbers="1"/>
>>       <filter class="solr.LowerCaseFilterFactory"/>
>>       <filter class="solr.EnglishPossessiveFilterFactory"/>
>>       <filter class="solr.KeywordMarkerFilterFactory"
>> protected="protwords.txt"/>
>>       <filter class="solr.PorterStemFilterFactory"/>
>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>     </analyzer>
>>
>> If in fact it is my analyzer, what part of it is causing this?  If not,
>> I'm not clear about the "TermsComponent" that you suggested having me look
>> into.  How do I "point" it at my field?  I have zero knowledge about this.
>> Is this something I do from Solr's Admin Console via Schema Browser link?
>>
>> Steve
>>
>>
>> On Tue, Jul 5, 2016 at 6:51 PM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>>> My guess is that your field analysis isn't stripping the various
>>> non-alphanumeric characters, thus "the]" is actually a token in your
>>> index, square bracket and all. If that's true, it certainly doesn't match
>>> the stopword "the".
>>>
>>> You can check by using the TermsComponent, pointing it at your field
>>> and setting terms.prefix=the
>>>
>>> See:
>>> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
>>>
>>> Best,
>>> Erick
>>>
>>> On Tue, Jul 5, 2016 at 2:34 PM, Steven White <swhite4...@gmail.com>
>>> wrote:
>>> > Hi Everyone,
>>> >
>>> > I'm trying to understand why I get a hit when I search for "the}" but
>>> > not when I search for "the" (searches are done without the quotes and
>>> > "the" is a stopword in my case).
>>> >
>>> > Here is the debugQuery output using "the}":
>>> >   "debug": {
>>> >     "rawquerystring": "the}",
>>> >     "querystring": "the}",
>>> >     "parsedquery": "(+DisjunctionMaxQuery(((ALL_FIELDS:the} ALL_FIELDS:the))~1.0))/no_coord",
>>> >     "parsedquery_toString": "+((ALL_FIELDS:the} ALL_FIELDS:the))~1.0",
>>> >     "explain": {
>>> >       "-1.5.1804": "\n0.14220011 = sum of:\n  0.14220011 =
>>> > weight(ALL_FIELDS:the in 0) [DefaultSimilarity], result of:\n
>>> 0.14220011
>>> > = score(doc=0,freq=2.0), product of:\n      0.51863563 = queryWeight,
>>> > product of:\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
>>> >  0.20899205 = queryNorm\n      0.27418116 = fieldWeight in 0, product
>>> of:\n
>>> >        1.4142135 = tf(freq=2.0), with freq of:\n          2.0 =
>>> > termFreq=2.0\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
>>> >  0.078125 = fieldNorm(doc=0)\n",
>>> >       "-1.5.3552": "\n0.14220011 = sum of:\n  0.14220011 =
>>> > weight(ALL_FIELDS:the in 0) [DefaultSimilarity], result of:\n
>>> 0.14220011
>>> > = score(doc=0,freq=2.0), product of:\n      0.51863563 = queryWeight,
>>> > product of:\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
>>> >  0.20899205 = queryNorm\n      0.27418116 = fieldWeight in 0, product
>>> of:\n
>>> >        1.4142135 = tf(freq=2.0), with freq of:\n          2.0 =
>>> > termFreq=2.0\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
>>> >  0.078125 = fieldNorm(doc=0)\n",
>>> >       "-1.5.3554": "\n0.14220011 = sum of:\n  0.14220011 =
>>> > weight(ALL_FIELDS:the in 1) [DefaultSimilarity], result of:\n
>>> 0.14220011
>>> > = score(doc=1,freq=2.0), product of:\n      0.51863563 = queryWeight,
>>> > product of:\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
>>> >  0.20899205 = queryNorm\n      0.27418116 = fieldWeight in 1, product
>>> of:\n
>>> >        1.4142135 = tf(freq=2.0), with freq of:\n          2.0 =
>>> > termFreq=2.0\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
>>> >  0.078125 = fieldNorm(doc=1)\n",
>>> >       "-1.5.1802": "\n0.1137601 = sum of:\n  0.1137601 =
>>> > weight(ALL_FIELDS:the in 0) [DefaultSimilarity], result of:\n
>>> 0.1137601
>>> > = score(doc=0,freq=2.0), product of:\n      0.51863563 = queryWeight,
>>> > product of:\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
>>> >  0.20899205 = queryNorm\n      0.21934493 = fieldWeight in 0, product
>>> of:\n
>>> >        1.4142135 = tf(freq=2.0), with freq of:\n          2.0 =
>>> > termFreq=2.0\n        2.4816046 = idf(docFreq=4, maxDocs=22)\n
>>> >  0.0625 = fieldNorm(doc=0)\n"
>>> >     },
>>> >     "QParser": "ExtendedDismaxQParser",
>>> >     "altquerystring": null,
>>> >     "boost_queries": null,
>>> >     "parsed_boost_queries": [],
>>> >     "boostfuncs": null,
>>> >     "filter_queries": [
>>> >       "ISBN_GROUP_ID:2"
>>> >     ],
>>> >     "parsed_filter_queries": [
>>> >       "ISBN_GROUP_ID:2"
>>> >     ],
>>> >
>>> > Here is the debugQuery output using "the"
>>> >   "debug": {
>>> >     "rawquerystring": "the",
>>> >     "querystring": "the",
>>> >     "parsedquery": "(+())/no_coord",
>>> >     "parsedquery_toString": "+()",
>>> >     "explain": {},
>>> >     "QParser": "ExtendedDismaxQParser",
>>> >     "altquerystring": null,
>>> >     "boost_queries": null,
>>> >     "parsed_boost_queries": [],
>>> >     "boostfuncs": null,
>>> >     "filter_queries": [
>>> >       "ISBN_GROUP_ID:2"
>>> >     ],
>>> >     "parsed_filter_queries": [
>>> >       "ISBN_GROUP_ID:2"
>>> >     ],
>>> >
>>> > As expected, I get no hits when I search for just "}":
>>> >   "debug": {
>>> >     "rawquerystring": "}",
>>> >     "querystring": "}",
>>> >     "parsedquery": "(+DisjunctionMaxQuery((ALL_FIELDS:})~1.0))/no_coord",
>>> >     "parsedquery_toString": "+(ALL_FIELDS:})~1.0",
>>> >     "explain": {},
>>> >     "QParser": "ExtendedDismaxQParser",
>>> >     "altquerystring": null,
>>> >     "boost_queries": null,
>>> >     "parsed_boost_queries": [],
>>> >     "boostfuncs": null,
>>> >     "filter_queries": [
>>> >       "ISBN_GROUP_ID:2"
>>> >     ],
>>> >     "parsed_filter_queries": [
>>> >       "ISBN_GROUP_ID:2"
>>> >     ],
>>> >
>>> > In case it matters, I'm also getting a hit when I search for "the." or
>>> > "the]" or "the/" or "the," or "the=" etc.
>>> >
>>> > Thanks in advance.
>>> >
>>> > Steve
>>>
>>
>>
