First, thanks for taking the time to write up a clear problem statement. Putting in the field type is _really_ helpful.
By “partial string match”, I’m assuming you’re using wildcards, i.e. 123*. The problem is that wildcards are tricky, and this trips everybody up at one time or another.

The quick background is that if there’s any possibility that a filter can produce multiple tokens for a single input token, that filter is skipped when a wildcard term is analyzed at _query_ time. Imagine that your replacement was a space rather than an empty string. Then 123--456 would become _two_ tokens in subsequent processing. Now anything you do is wrong sometime, somewhere. For instance, 123*456 would fail because it’d be looking for one token, which you wouldn’t expect. 12345* would also fail because there’s no single token like that. 123 would succeed (note no wildcard). You can see where this is going.

None of which helps you solve your use-case. There are several options:

- use <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^A-Za-z0-9]" replacement=""/> instead of PatternReplaceFilterFactory. charFilters are applied to the raw input before analysis and don’t have the same problem with producing multiple tokens.

- WordDelimiter(Graph)FilterFactory is built for this kind of thing. There are a number of options, and this is one of the few filters that’s often different between index and query analysis chains. It can be tricky to understand all the interactions of the parameters though.

And as an aside, I don’t know how large your index is, but wildcards with only one or two leading characters, e.g. 1* or 12*, can get very expensive. If you can require 3 or more leading characters there are rarely problems. You can also do a time/space tradeoff by including EdgeNGramFilterFactory in your chain at the cost of a larger index.

And finally (and this is a total nit), there’s no reason to specify lower-case characters in your existing pattern because the upper-case filter is first. You _will_ have to specify both upper-case and lower-case characters if you use the charFilter, because it sees the raw input before the upper-case filter runs.
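
To make the first option above concrete, here is a minimal sketch of the field type with the token filter swapped for the charFilter. It reuses the TrimmedString name and the pattern from the original post; everything else is an assumption about how the pieces would fit together, not a tested configuration. In particular, I have not verified how it behaves for a wildcard in the middle of a term, so check it on the Analysis page of the admin UI before relying on it.

```xml
<!-- Sketch only: same field type as in the original post, with the pattern moved into a charFilter -->
<fieldType name="TrimmedString" class="solr.TextField" omitNorms="true">
  <analyzer>
    <!-- Runs on the raw input before tokenization, so the pattern must allow both cases -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^A-Za-z0-9]" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Normalizes token text to upper case -->
    <filter class="solr.UpperCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this chain, 1234-56AB45 should be indexed as the single token 123456AB45, and a prefix query such as 1234-56* should have its dash stripped before matching. Any change like this only takes effect for documents indexed after the schema change.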
As for why production is different than QA, my guess is that you overlaid the schema changes on an _existing_ index. Most of the time, to get consistent results, you must re-index everything starting from an _empty_ index. The reason is a long and complicated explanation that I won’t go into here. In fact, I usually do one of two things:

1> define a new collection/core and index to that. If using SolrCloud, you can re-index and use collection aliasing to seamlessly switch.

2> stop Solr. Delete all the data dirs (the parent of tlog and index) associated with any of my replicas, restart Solr and index.

You may be able to get away with using delete-by-query to remove everything in your index, then optimizing (one of the very few times I’ll recommend optimizing), reloading your collection and indexing. The point is to get rid of all traces of anything generated from the old schema.

Best,
Erick

> On Apr 3, 2020, at 3:40 PM, Teresa McMains <ter...@t14-consulting.com> wrote:
>
> Forgive me if this is unclear, I am very much new here.
>
> I am working with a customer who needs to be able to query various account/customer ID fields which may or may not have embedded dashes. But they want to be able to search by entering the dashes or not and by entering partial values or not.
>
> So we may have an account or customer ID like
>
> 1234-56AB45
>
> And they would like to retrieve this by searching for any of the following:
> 1234-56AB45 (full string match)
> 1234-56 (partial string match)
> 123456AB45 (full string but no dashes)
> 123456 (partial string no dashes)
>
> I've defined this field type in schema.xml as:
>
> <!-- String replace field for account number searches -->
> <fieldType name="TrimmedString" class="solr.TextField" omitNorms="true">
>   <analyzer>
>     <tokenizer class="solr.KeywordTokenizerFactory" />
>     <!-- Normalizes token text to upper case -->
>     <filter class="solr.UpperCaseFilterFactory" />
>     <!-- Removes anything that isn't a letter or digit -->
>     <filter class="solr.PatternReplaceFilterFactory" pattern="[^A-Za-z0-9]" replacement="" replace="all"/>
>   </analyzer>
> </fieldType>
>
> But the behavior I see is completely unexpected.
> Full string match works fine on the customer's DEV environment but not in QA (which is running the same version of SOLR)
> Partial string match works for some ID fields but not others
> A Partial string match when the user does not enter the dashes just never works
>
> I don't even know where to begin. The behavior is not consistent enough to give me a sense.
>
> So perhaps I will just ask - how would you define a fieldType which should ignore special characters like hyphens or underscores (or anything non-alphanumeric) and works for full string or partial string search?
>
> Thank you.