Thanks Erick. Moving the stopword factory to after WDFF fixed the problem: I no longer get a hit on "the}" or on variations such as "the]", "the.", etc. I did not have to change preserveOriginal from 1 to 0.
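For reference, here is a sketch of the reordered chain. It assumes I only moved the StopFilterFactory line to directly after WDFF and left every other filter where it was in the analyzer I posted earlier (quoted below); the exact position and the closing </fieldType> tag are my assumptions, not something copied from a verified schema.

```xml
<!-- Sketch only: same filters and attributes as in the original analyzer,
     with StopFilterFactory moved to run after WordDelimiterFilterFactory.
     Placing it directly after WDFF is an assumption. -->
<fieldType name="all_raw_text" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- WDFF runs first, so punctuation is split off before stopword removal -->
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
            generateNumberParts="1" splitOnCaseChange="0" catenateWords="1"
            splitOnNumerics="1" stemEnglishPossessive="1" generateWordParts="1"
            catenateAll="1" catenateNumbers="1"/>
    <!-- The stop filter now sees the bare word parts that WDFF produced -->
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt"
            ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

As I understand your explanation, with this order WDFF first turns "the}" into "the" plus the preserved "the}", and the stop filter then drops the bare "the".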
Regarding preserveOriginal in WDFF, I have it set to 1 because of my understanding of what it does: given the text "a...@apache.org", with preserveOriginal set to 1 WDFF will give me "abc", "apache", "org" and "a...@apache.org". In effect, if someone searches on "abc", "apache" or "org", or on the full "a...@apache.org", I will get a hit. Conversely, if I set preserveOriginal to 0, then searching for "a...@apache.org" will not give me a hit. My goal is to still get a hit on the original word, not just on the breakdown that WDFF gives me. Is my understanding correct?

Steve

On Tue, Jul 5, 2016 at 7:47 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Either that's a typo or your problem is it should be terms.fl, not
> terms.f1 (lower case ell as opposed to the number one). You should be
> seeing the raw terms in your index with TermsComponent, similar to the
> "load terms" in the schema browser except it allows you to query specific
> terms starting with terms.prefix.
>
> WordDelimiterFilterFactory (WDFF) is what's stripping off your non
> alpha-numeric characters. Your stopword factory is before WDFF so
> anything like be. (notice the period) would NOT be stripped. Then when that
> token is passed through WDFF the period disappears. Order matters.
>
> You have preserverOriginal="1" in WDFF, which means the original token
> is preserved intact so "the}" gets changed to two tokens, "the" and "the}".
>
> So you really have to look more closely at your analysis chain, that's
> pretty much where your problems appear to be.
>
> Best,
> Erick
>
> On Tue, Jul 5, 2016 at 4:30 PM, Steven White <swhite4...@gmail.com> wrote:
> > Hi Erick,
> >
> > By TermsCoponent, I think you meant me to try the following?
> >
> > http://vottopg15.ottawa.ibm.com:8983/solr/testdata/terms?terms.f1=ALL_FIELDS&terms.prefix=the
> >
> > If so, I tried it and I'm getting 0 hits:
> >
> > <response>
> >   <lst name="responseHeader">
> >     <int name="status">0</int>
> >     <int name="QTime">0</int>
> >   </lst>
> >   <lst name="terms"/>
> > </response>
> >
> > In fact, I'm getting 0 hits on anything I pass to "terms.prefix"
> >
> > Another thing I noticed is this. Using Solr Admin Console's Schema
> > Browser, after selecting the field "ALL_FIELDS and clicking on Load Term
> > Info button, I'm seeing "be" in the list!! Like so:
> >
> > 4 localhost
> > abc
> > a...@localhost.com
> > com
> > intern
> > be
> > /intern
> > abclocalhostcom
> > user
> >
> > I don't understand what I'm looking at here (in the schema browser) or if
> > this is at all related to my issue (I'm seeing "be" listed here and
> > wandering if it has something to do with my issue). If I click on any of
> > the listed words, I get a hit, but I get 0 hits when I click on "be".
> >
> > Thanks.
> >
> > Steve
> >
> >
> > On Tue, Jul 5, 2016 at 7:07 PM, Steven White <swhite4...@gmail.com> wrote:
> >
> >> Thanks for the quick reply Erick.
> >>
> >> Here is the analyzer I'm using:
> >>
> >> <fieldType name="all_raw_text" class="solr.TextField"
> >>            positionIncrementGap="100" autoGeneratePhraseQueries="true">
> >>   <analyzer>
> >>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>     <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt"
> >>             ignoreCase="true"/>
> >>     <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
> >>             generateNumberParts="1" splitOnCaseChange="0" catenateWords="1"
> >>             splitOnNumerics="1" stemEnglishPossessive="1" generateWordParts="1"
> >>             catenateAll="1" catenateNumbers="1"/>
> >>     <filter class="solr.LowerCaseFilterFactory"/>
> >>     <filter class="solr.EnglishPossessiveFilterFactory"/>
> >>     <filter class="solr.KeywordMarkerFilterFactory"
> >>             protected="protwords.txt"/>
> >>     <filter class="solr.PorterStemFilterFactory"/>
> >>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>   </analyzer>
> >>
> >> If in fact it is my analyzer, what part of it is causing this? If not,
> >> I'm not clear about the "TermsComponent" that you suggested having me look
> >> into. How do I "point" it at my field? I have zero knowledge about this.
> >> Is this something I do from Solr's Admin Console via Schema Browser link?
> >>
> >> Steve
> >>
> >>
> >> On Tue, Jul 5, 2016 at 6:51 PM, Erick Erickson <erickerick...@gmail.com>
> >> wrote:
> >>
> >>> My guess is that your field analysis isn't stripping the various non
> >>> alpha-num characters, thus "the]" is actually a token in your index,
> >>> square bracket and all. If that's true, it certainly doesn't match the
> >>> stopword "the".
> >>>
> >>> You can check by using the TermsComponent, pointing it at your field
> >>> and setting terms.prefix=the
> >>>
> >>> See:
> >>> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>> On Tue, Jul 5, 2016 at 2:34 PM, Steven White <swhite4...@gmail.com>
> >>> wrote:
> >>> > HI Everyone,
> >>> >
> >>> > I'm trying to understand why I get a hit when I search for "the}" but not
> >>> > when I search for "the" (searches are done without the quotes and "the" is
> >>> > a stopword in my case).
> >>> >
> >>> > Here is the debugQuery output using "the}":
> >>> > "debug": {
> >>> >   "rawquerystring": "the}",
> >>> >   "querystring": "the}",
> >>> >   "parsedquery": "(+DisjunctionMaxQuery(((ALL_FIELDS:the} ALL_FIELDS:the))~1.0))/no_coord",
> >>> >   "parsedquery_toString": "+((ALL_FIELDS:the} ALL_FIELDS:the))~1.0",
> >>> >   "explain": {
> >>> >     "-1.5.1804": "\n0.14220011 = sum of:\n 0.14220011 = weight(ALL_FIELDS:the in 0) [DefaultSimilarity], result of:\n 0.14220011 = score(doc=0,freq=2.0), product of:\n 0.51863563 = queryWeight, product of:\n 2.4816046 = idf(docFreq=4, maxDocs=22)\n 0.20899205 = queryNorm\n 0.27418116 = fieldWeight in 0, product of:\n 1.4142135 = tf(freq=2.0), with freq of:\n 2.0 = termFreq=2.0\n 2.4816046 = idf(docFreq=4, maxDocs=22)\n 0.078125 = fieldNorm(doc=0)\n",
> >>> >     "-1.5.3552": "\n0.14220011 = sum of:\n 0.14220011 = weight(ALL_FIELDS:the in 0) [DefaultSimilarity], result of:\n 0.14220011 = score(doc=0,freq=2.0), product of:\n 0.51863563 = queryWeight, product of:\n 2.4816046 = idf(docFreq=4, maxDocs=22)\n 0.20899205 = queryNorm\n 0.27418116 = fieldWeight in 0, product of:\n 1.4142135 = tf(freq=2.0), with freq of:\n 2.0 = termFreq=2.0\n 2.4816046 = idf(docFreq=4, maxDocs=22)\n 0.078125 = fieldNorm(doc=0)\n",
> >>> >     "-1.5.3554": "\n0.14220011 = sum of:\n 0.14220011 = weight(ALL_FIELDS:the in 1) [DefaultSimilarity], result of:\n 0.14220011 = score(doc=1,freq=2.0), product of:\n 0.51863563 = queryWeight, product of:\n 2.4816046 = idf(docFreq=4, maxDocs=22)\n 0.20899205 = queryNorm\n 0.27418116 = fieldWeight in 1, product of:\n 1.4142135 = tf(freq=2.0), with freq of:\n 2.0 = termFreq=2.0\n 2.4816046 = idf(docFreq=4, maxDocs=22)\n 0.078125 = fieldNorm(doc=1)\n",
> >>> >     "-1.5.1802": "\n0.1137601 = sum of:\n 0.1137601 = weight(ALL_FIELDS:the in 0) [DefaultSimilarity], result of:\n 0.1137601 = score(doc=0,freq=2.0), product of:\n 0.51863563 = queryWeight, product of:\n 2.4816046 = idf(docFreq=4, maxDocs=22)\n 0.20899205 = queryNorm\n 0.21934493 = fieldWeight in 0, product of:\n 1.4142135 = tf(freq=2.0), with freq of:\n 2.0 = termFreq=2.0\n 2.4816046 = idf(docFreq=4, maxDocs=22)\n 0.0625 = fieldNorm(doc=0)\n"
> >>> >   },
> >>> >   "QParser": "ExtendedDismaxQParser",
> >>> >   "altquerystring": null,
> >>> >   "boost_queries": null,
> >>> >   "parsed_boost_queries": [],
> >>> >   "boostfuncs": null,
> >>> >   "filter_queries": [
> >>> >     "ISBN_GROUP_ID:2"
> >>> >   ],
> >>> >   "parsed_filter_queries": [
> >>> >     "ISBN_GROUP_ID:2"
> >>> >   ],
> >>> >
> >>> > Here is the debugQuery output using "the"
> >>> > "debug": {
> >>> >   "rawquerystring": "the",
> >>> >   "querystring": "the",
> >>> >   "parsedquery": "(+())/no_coord",
> >>> >   "parsedquery_toString": "+()",
> >>> >   "explain": {},
> >>> >   "QParser": "ExtendedDismaxQParser",
> >>> >   "altquerystring": null,
> >>> >   "boost_queries": null,
> >>> >   "parsed_boost_queries": [],
> >>> >   "boostfuncs": null,
> >>> >   "filter_queries": [
> >>> >     "ISBN_GROUP_ID:2"
> >>> >   ],
> >>> >   "parsed_filter_queries": [
> >>> >     "ISBN_GROUP_ID:2"
> >>> >   ],
> >>> >
> >>> > As expected, I get no hits when I search for just "}":
> >>> > "debug": {
> >>> >   "rawquerystring": "}",
> >>> >   "querystring": "}",
> >>> >   "parsedquery": "(+DisjunctionMaxQuery((ALL_FIELDS:})~1.0))/no_coord",
> >>> >   "parsedquery_toString": "+(ALL_FIELDS:})~1.0",
> >>> >   "explain": {},
> >>> >   "QParser": "ExtendedDismaxQParser",
> >>> >   "altquerystring": null,
> >>> >   "boost_queries": null,
> >>> >   "parsed_boost_queries": [],
> >>> >   "boostfuncs": null,
> >>> >   "filter_queries": [
> >>> >     "ISBN_GROUP_ID:2"
> >>> >   ],
> >>> >   "parsed_filter_queries": [
> >>> >     "ISBN_GROUP_ID:2"
> >>> >   ],
> >>> >
> >>> > In case it matters, I'm also getting a hit when I search for "the." or
> >>> > "the]" or "the/" or "the," or "the=" etc.
> >>> >
> >>> > Thanks in advanced.
> >>> >
> >>> > Steve
> >>>
> >>
> >>
>