Look at what’s returned when you specify &debug=query. Particularly the parsed query. That’ll show you the results of parsing. My bet: you’ll see something unexpected...
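A minimal sketch of such a request, assuming a local Solr at the default port and a core named `mycore` (both are placeholders, not taken from this thread). It only builds the URL; the parsed query should then appear in the `debug` section of the response.

```python
from urllib.parse import urlencode

# Hypothetical values: host, port, and core name are placeholders,
# not taken from this thread.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_debug_query_url(field, value):
    """Build a select URL that asks Solr to report how it parsed the query."""
    params = {
        "q": f"{field}:{value}",
        "debug": "query",  # request query-parsing debug output
        "rows": 0,         # the matching documents are not needed here
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


# One of the searches from the thread that returned nothing.
url = build_debug_query_url("account_number", "1258458-0659*")
print(url)
```

Comparing the parsed query for a search that works against one that does not is usually the quickest way to see where the two diverge.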
Best,
Erick

> On Apr 8, 2020, at 17:59, Teresa McMains <ter...@t14-consulting.com> wrote:
>
> I am still really struggling with this.
>
> Current field type as defined in schema.xml:
>
> <!-- String replace field for account number searches -->
> <fieldType name="TrimmedString" class="solr.TextField" omitNorms="true">
>   <analyzer>
>     <!-- Removes anything that isn't a letter or digit -->
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 pattern="([^A-Za-z0-9])" replacement=""/>
>     <tokenizer class="solr.KeywordTokenizerFactory" />
>
>     <!-- Normalizes token text to upper case -->
>     <filter class="solr.UpperCaseFilterFactory" />
>
>   </analyzer>
> </fieldType>
>
> Two fields are defined using this field type:
> <field name="account_number" type="TrimmedString" indexed="true"
>        stored="true" multivalued="false" required="false"/>
> <field name="transaction_key" type="TrimmedString" indexed="true"
>        stored="true" multivalued="false" required="false"/>
>
> A transaction key may look like: 107986541-85487JY_X4528745
> An account number may look like: 1258458-0659841
>
> After making this change, I stopped solr, deleted the data directory,
> restarted solr and ran indexing for all data.
>
> Transaction Key:
>   Searching for: 107986541-85487JY_X4528745 Returns: 107986541-85487JY_X4528745 (good)
>   Searching for: "107986541-85487JY_X4528745" Returns: 107986541-85487JY_X4528745 (good)
>   Searching for: 10798654185487JYX4528745 Returns: 107986541-85487JY_X4528745 (good)
>   Searching for: "107986541-85487JY_X4528745" Returns: 107986541-85487JY_X4528745 (good)
>   Searching for: 107986541 Returns: 107986541-85487JY_X4528745 (unexpected)
>   Searching for: 107986541* Returns: MANY MANY hits that all start with 107986541 (unexpected)
>
> Account Number:
>   Searching for: 1258458-0659841 Returns: NOTHING (bad)
>   Searching for: "1258458-0659841" Returns: 1258458-0659841 (good)
>   Searching for: 12584580659841 Returns: 1258458-0659841 (good)
>   Searching for: "12584580659841" Returns: 1258458-0659841 (good)
>   Searching for: 1258458-0659 Returns: 1258458-0659841 (good)
>   Searching for: 1258458-0659* Returns: NOTHING (bad)
>
> So my questions are:
> 1) Why does searching for 107986541 and 107986541* for transaction_key return
>    different results?
> 2) Why does searching for a full account number without quotes fail?
> 3) Why does specifying the wildcard character in the last account_number
>    search return nothing?
>
> Many many thanks.
> I'll get this some day,
> Teresa
>
>
>>
>>>> On Apr 6, 2020, at 12:38 PM, Teresa McMains <ter...@t14-consulting.com>
>>>> wrote:
>>>
>>> Erick, thank you so much for this. I'm going to try to implement with
>>> PatternReplaceCharFilterFactory as you recommended.
>>> What you mentioned about re-indexing from an empty state made sense to me
>>> (in terms of the observed behavior) but also surprised me. If I select
>>> "Clean" on the reindex, does it *not* start from an empty state?
>>>
>>> Thanks!!
>>> Teresa
>>>
>>>
>>> -----Original Message-----
>>> From: Erick Erickson <erickerick...@gmail.com>
>>> Sent: Friday, April 3, 2020 7:16 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: match string fields with embedded hyphens
>>>
>>> First, thanks for taking the time to write up a clear problem statement.
>>> Putting in the field type is _really_ helpful.
>>>
>>> By “partial string match”, I’m assuming you’re using wildcards, i.e. 123*.
>>> The problem is that wildcards are tricky, and this trips everybody up at
>>> one time or another.
>>>
>>> The quick background is that if there’s any possibility that the filter can
>>> produce multiple tokens for a single input token, that filter is skipped
>>> during analysis at _query_ time. Imagine that your replacement was a space
>>> rather than an empty string. Then 123--456 would become _two_ tokens in
>>> subsequent processing. Now anything you do is wrong sometime, somewhere.
>>>
>>> For instance, 123*456 would fail because it’d be looking for one token,
>>> which you wouldn’t expect. 12345* would also fail because there’s no single
>>> token like that. 123 would succeed (note no wildcard). You can see where
>>> this is going.
>>>
>>> Which doesn’t help you solve your use-case. There are several options:
>>>
>>> - use <charFilter class="solr.PatternReplaceCharFilterFactory"
>>>   pattern="[^A-Za-z0-9]" replacement=""/> instead of
>>>   PatternReplaceFilterFactory. charFilters are applied to the raw input
>>>   before analysis and don’t have the same problem with producing multiple
>>>   tokens.
>>>
>>> - WordDelimiter(Graph)FilterFactory is built for this kind of thing. There
>>>   are a number of options, and this is one of the few filters that’s often
>>>   different between index and query analysis chains. It can be tricky to
>>>   understand all the interactions of the parameters though.
>>>
>>> And as an aside, I don’t know how large your index is, but wildcards for
>>> one or two leading characters can get very expensive, i.e. 1*, 12* can get
>>> very costly. If you can require 3 or more leading characters there are
>>> rarely problems. You can also do a time/space tradeoff by including
>>> EdgeNGramFilterFactory in your chain at the cost of a larger index.
>>>
>>> And finally, (and this is a total nit) there’s no reason to specify
>>> lower-case characters in your existing pattern because the upper-case
>>> filter is first. You _will_ have to specify uppercase characters if you use
>>> the charfilter.
>>>
>>> As for why production is different than QA, my guess is that you overlaid
>>> the schema changes on an _existing_ index. Most of the time, to get
>>> consistent results, you must re-index everything starting from an _empty_
>>> index. This is a long and complicated explanation that I won’t go into
>>> here. In fact, I usually do one of two things:
>>>
>>> 1> define a new collection/core and index to that. If using SolrCloud, you
>>> can re-index and use collection aliasing to seamlessly switch.
>>>
>>> 2> stop Solr. Delete all the datadirs (the parent of tlog and index)
>>> associated with any of my replicas, restart with Solr and index. You may be
>>> able to get away with using delete-by-query to remove everything in your
>>> index then optimize (one of the very few times I’ll recommend optimizing),
>>> reloading your collection and indexing. The point is to get rid of all
>>> traces of anything generated from the old schema.
>>>
>>> Best,
>>> Erick
>>>
>>>> On Apr 3, 2020, at 3:40 PM, Teresa McMains <ter...@t14-consulting.com>
>>>> wrote:
>>>>
>>>> Forgive me if this is unclear, I am very much new here.
>>>>
>>>> I am working with a customer who needs to be able to query various
>>>> account/customer ID fields which may or may not have embedded dashes.
>>>> But they want to be able to search by entering the dashes or not and by
>>>> entering partial values or not.
>>>>
>>>> So we may have an account or customer ID like
>>>>
>>>> 1234-56AB45
>>>>
>>>> And they would like to retrieve this by searching for any of the following:
>>>> 1234-56AB45 (full string match)
>>>> 1234-56 (partial string match)
>>>> 123456AB45 (full string but no dashes)
>>>> 123456 (partial string no dashes)
>>>>
>>>> I've defined this field type in schema.xml as:
>>>>
>>>> <!-- String replace field for account number searches -->
>>>> <fieldType name="TrimmedString" class="solr.TextField" omitNorms="true">
>>>>   <analyzer>
>>>>     <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>
>>>>     <!-- Normalizes token text to upper case -->
>>>>     <filter class="solr.UpperCaseFilterFactory" />
>>>>
>>>>     <!-- Removes anything that isn't a letter or digit -->
>>>>     <filter class="solr.PatternReplaceFilterFactory"
>>>>             pattern="[^A-Za-z0-9]" replacement="" replace="all"/>
>>>>   </analyzer>
>>>> </fieldType>
>>>>
>>>> But the behavior I see is completely unexpected.
>>>> - Full string match works fine on the customer's DEV environment but
>>>>   not in QA (which is running the same version of SOLR)
>>>> - Partial string match works for some ID fields but not others
>>>> - A Partial string match when the user does not enter the dashes just
>>>>   never works
>>>>
>>>> I don't even know where to begin. The behavior is not consistent enough
>>>> to give me a sense.
>>>>
>>>> So perhaps I will just ask - how would you define a fieldType which should
>>>> ignore special characters like hyphens or underscores (or anything
>>>> non-alphanumeric) and works for full string or partial string search?
>>>>
>>>> Thank you.
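
A rough sketch of the normalization the field type in this thread is aiming for, written as plain Python rather than Solr configuration. It models only three steps: strip non-alphanumeric characters, keep the value as a single token, and upper-case it. The function names are made up for illustration, and treating a "partial" search as a prefix match is an assumption. It does not model Solr's query parser, so real results can differ, which is what the debug output is for.

```python
import re

# Same character class as the pattern quoted in the thread.
NON_ALNUM = re.compile(r"[^A-Za-z0-9]")


def normalize(text):
    """Strip non-alphanumerics and upper-case, keeping the value as one token."""
    return NON_ALNUM.sub("", text).upper()


def matches(indexed_value, user_input, partial=False):
    """Compare normalized forms; a partial search is modeled as a prefix match."""
    term = normalize(indexed_value)
    query = normalize(user_input)
    return term.startswith(query) if partial else term == query


stored = "1234-56AB45"
searches = [
    ("1234-56AB45", False),  # full string match
    ("1234-56", True),       # partial string match
    ("123456AB45", False),   # full string but no dashes
    ("123456", True),        # partial string no dashes
]
for user_input, partial in searches:
    print(user_input, matches(stored, user_input, partial))
```

In this toy model all four searches from the original question match, because both the stored value and the input reduce to the same normalized form before they are compared.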