I am still really struggling with this. Current field type as defined in schema.xml:
<!-- String replace field for account number searches -->
<fieldType name="TrimmedString" class="solr.TextField" omitNorms="true">
  <analyzer>
    <!-- Removes anything that isn't a letter or digit -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^A-Za-z0-9])" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <!-- Normalizes token text to upper case -->
    <filter class="solr.UpperCaseFilterFactory" />
  </analyzer>
</fieldType>

Two fields are defined using this field type:

<field name="account_number" type="TrimmedString" indexed="true" stored="true" multivalued="false" required="false"/>
<field name="transaction_key" type="TrimmedString" indexed="true" stored="true" multivalued="false" required="false"/>

A transaction key may look like: 107986541-85487JY_X4528745
An account number may look like: 1258458-0659841

After making this change, I stopped Solr, deleted the data directory,
restarted Solr and ran indexing for all data.

Transaction Key:

Searching for: 107986541-85487JY_X4528745
Returns: 107986541-85487JY_X4528745 (good)

Searching for: "107986541-85487JY_X4528745"
Returns: 107986541-85487JY_X4528745 (good)

Searching for: 10798654185487JYX4528745
Returns: 107986541-85487JY_X4528745 (good)

Searching for: "107986541-85487JY_X4528745"
Returns: 107986541-85487JY_X4528745 (good)

Searching for: 107986541
Returns: 107986541-85487JY_X4528745 (unexpected)

Searching for: 107986541*
Returns: MANY MANY hits that all start with 107986541 (unexpected)

Account Number:

Searching for: 1258458-0659841
Returns: NOTHING (bad)

Searching for: "1258458-0659841"
Returns: 1258458-0659841 (good)

Searching for: 12584580659841
Returns: 1258458-0659841 (good)

Searching for: "12584580659841"
Returns: 1258458-0659841 (good)

Searching for: 1258458-0659
Returns: 1258458-0659841 (good)

Searching for: 1258458-0659*
Returns: NOTHING (bad)

So my questions are:

1) Why does searching for 107986541 and 107986541* for transaction_key
return different results?
2) Why does searching for a full account number without quotes fail?
3) Why does specifying the wildcard character in the last account_number
search return nothing?

Many many thanks. I'll get this some day,
Teresa

> > On Apr 6, 2020, at 12:38 PM, Teresa McMains <ter...@t14-consulting.com>
> > wrote:
> >
> > Erick, thank you so much for this. I'm going to try to implement with
> > PatternReplaceCharFilterFactory as you recommended.
> > What you mentioned about re-indexing from an empty state made sense to me
> > (in terms of the observed behavior) but also surprised me. If I select
> > "Clean" on the reindex, does it *not* start from an empty state?
> >
> > Thanks!!
> > Teresa
> >
> >
> > -----Original Message-----
> > From: Erick Erickson <erickerick...@gmail.com>
> > Sent: Friday, April 3, 2020 7:16 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: match string fields with embedded hyphens
> >
> > First, thanks for taking the time to write up a clear problem statement.
> > Putting in the field type is _really_ helpful.
> >
> > By “partial string match”, I’m assuming you’re using wildcards, i.e. 123*.
> > The problem is that wildcards are tricky, and this trips everybody up at
> > one time or another.
> >
> > The quick background is that if there’s any possibility that the filter can
> > produce multiple tokens for a single input token, that filter is skipped
> > during analysis at _query_ time. Imagine that your replacement was a space
> > rather than an empty string. Then 123--456 would become _two_ tokens in
> > subsequent processing. Now anything you do is wrong sometime, somewhere.
> >
> > For instance, 123*456 would fail because it’d be looking for one token,
> > which you wouldn’t expect. 12345* would also fail because there’s no single
> > token like that. 123 would succeed (note no wildcard). You can see where
> > this is going.
> >
> > Which doesn’t help you solve your use-case.
> > There are several options:
> >
> > - use <charFilter class="solr.PatternReplaceCharFilterFactory"
> > pattern="[^A-Za-z0-9]" replacement=""/> instead of
> > PatternReplaceFilterFactory. charFilters are applied to the raw input
> > before analysis and don’t have the same problem with producing multiple
> > tokens.
> >
> > - WordDelimiter(Graph)FilterFactory is built for this kind of thing. There
> > are a number of options, and this is one of the few filters that’s often
> > different between index and query analysis chains. It can be tricky to
> > understand all the interactions of the parameters though.
> >
> > And as an aside, I don’t know how large your index is, but wildcards with
> > only one or two leading characters, i.e. 1* or 12*, can get very
> > expensive. If you can require 3 or more leading characters there are
> > rarely problems. You can also do a time/space tradeoff by including
> > EdgeNGramFilterFactory in your chain at the cost of a larger index.
> >
> > And finally, (and this is a total nit) there’s no reason to specify
> > lower-case characters in your existing pattern because the upper-case
> > filter is first. You _will_ have to specify lower-case characters if you
> > use the charFilter, since it runs before the upper-case filter.
> >
> > As for why production is different than QA, my guess is that you overlaid
> > the schema changes on an _existing_ index. Most of the time, to get
> > consistent results, you must re-index everything starting from an _empty_
> > index. This is a long and complicated explanation that I won’t go into
> > here. In fact, I usually do one of two things:
> >
> > 1> define a new collection/core and index to that. If using SolrCloud, you
> > can re-index and use collection aliasing to seamlessly switch.
> >
> > 2> stop Solr. Delete all the datadirs (the parent of tlog and index)
> > associated with any of my replicas, restart Solr and index.
> > You may be able to get away with using delete-by-query to remove
> > everything in your index then optimize (one of the very few times I’ll
> > recommend optimizing), reloading your collection and indexing. The point
> > is to get rid of all traces of anything generated from the old schema.
> >
> > Best,
> > Erick
> >
> >> On Apr 3, 2020, at 3:40 PM, Teresa McMains <ter...@t14-consulting.com>
> >> wrote:
> >>
> >> Forgive me if this is unclear, I am very much new here.
> >>
> >> I am working with a customer who needs to be able to query various
> >> account/customer ID fields which may or may not have embedded dashes. But
> >> they want to be able to search by entering the dashes or not and by
> >> entering partial values or not.
> >>
> >> So we may have an account or customer ID like
> >>
> >> 1234-56AB45
> >>
> >> And they would like to retrieve this by searching for any of the following:
> >> 1234-56AB45 (full string match)
> >> 1234-56 (partial string match)
> >> 123456AB45 (full string but no dashes)
> >> 123456 (partial string no dashes)
> >>
> >> I've defined this field type in schema.xml as:
> >>
> >> <!-- String replace field for account number searches -->
> >> <fieldType name="TrimmedString" class="solr.TextField" omitNorms="true">
> >>   <analyzer>
> >>     <tokenizer class="solr.KeywordTokenizerFactory" />
> >>     <!-- Normalizes token text to upper case -->
> >>     <filter class="solr.UpperCaseFilterFactory" />
> >>     <!-- Removes anything that isn't a letter or digit -->
> >>     <filter class="solr.PatternReplaceFilterFactory" pattern="[^A-Za-z0-9]" replacement="" replace="all"/>
> >>   </analyzer>
> >> </fieldType>
> >>
> >> But the behavior I see is completely unexpected.
> >> Full string match works fine on the customer's DEV environment but
> >> not in QA (which is running the same version of Solr)
> >> Partial string match works for some ID fields but not others
> >> A partial string match when the user does not enter the dashes just
> >> never works
> >>
> >> I don't even know where to begin. The behavior is not consistent enough
> >> to give me a sense.
> >>
> >> So perhaps I will just ask - how would you define a fieldType which should
> >> ignore special characters like hyphens or underscores (or anything
> >> non-alphanumeric) and works for full string or partial string search?
> >>
> >> Thank you.
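
For reference, the EdgeNGramFilterFactory trade-off mentioned in Erick's reply might look roughly like the field type below. This is only a sketch and has not been tested against the setup in this thread: the name TrimmedStringPrefix and the gram sizes (3 and 32) are assumptions chosen for illustration. The idea is to index every leading prefix of the normalized value so that partial searches need no wildcard, while the query side only normalizes the input.

```xml
<!-- Hypothetical field type (name and gram sizes are assumptions, not from the thread) -->
<fieldType name="TrimmedStringPrefix" class="solr.TextField" omitNorms="true">
  <analyzer type="index">
    <!-- Strip anything that isn't a letter or digit from the raw input -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^A-Za-z0-9]" replacement=""/>
    <!-- Keep the whole value as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Normalize token text to upper case -->
    <filter class="solr.UpperCaseFilterFactory"/>
    <!-- Emit leading prefixes of 3 to 32 characters; this is what enlarges the index -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="32"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same normalization as at index time, but no n-grams -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^A-Za-z0-9]" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.UpperCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Under these assumptions, a query for 1234-56 or 123456 (no wildcard) would normalize to the single token 123456 and match the indexed prefix of 1234-56AB45. Inputs shorter than minGramSize would match nothing, and a normalized value longer than maxGramSize would not be indexed in full, so both sizes would need to be chosen against the real data.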