Re: match string fields with embedded hyphens

Erick Erickson Fri, 03 Apr 2020 16:16:08 -0700

First, thanks for taking the time to write up a clear problem statement. 
Putting in the field type is _really_ helpful.

By “partial string match”, I’m assuming you’re using wildcards, i.e. 123*. The 
problem is that wildcards are tricky, and this trips everybody up at one time 
or another.

The quick background is that if there’s any possibility that a filter can 
produce multiple tokens for a single input token, that filter is skipped when 
a wildcard term is analyzed at _query_ time. Imagine that your replacement was 
a space rather than an empty string. Then 123--456 would become _two_ tokens in 
subsequent processing. Now anything you do is wrong sometime, somewhere. 

For instance, 123*456 would fail because it’d be looking for one token, which 
you wouldn’t expect. 12345* would also fail because there’s no single token 
like that. 123 would succeed (note no wildcard). You can see where this is 
going.

Which doesn’t help you solve your use-case. There are several options:

- use <charFilter class="solr.PatternReplaceCharFilterFactory" 
pattern="[^A-Za-z0-9]" replacement=""/> instead of PatternReplaceFilterFactory. 
charFilters are applied to the raw input before analysis and don’t have the 
same problem with producing multiple tokens.

- WordDelimiter(Graph)FilterFactory is built for this kind of thing. There are 
a number of options, and this is one of the few filters that’s often different 
between index and query analysis chains. It can be tricky to understand all the 
interactions of the parameters though.
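
As a rough sketch of the first option, assuming the rest of your field type 
stays as you posted it, the analyzer might look something like the following. 
I haven’t run this against your data, so treat it as a starting point. The 
pattern keeps both a-z and A-Z because the charFilter sees the raw input before 
the upper-case filter runs.

```xml
<fieldType name="TrimmedString" class="solr.TextField" omitNorms="true">
  <analyzer>
    <!-- Runs on the raw input, before the tokenizer: drops anything
         that isn't a letter or digit -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^A-Za-z0-9]" replacement=""/>
    <!-- Keeps the whole (cleaned) value as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Normalizes token text to upper case -->
    <filter class="solr.UpperCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The Analysis page in the admin UI is a good place to check what the index and 
query sides actually produce for a value like 1234-56AB45.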

And as an aside, I don’t know how large your index is, but wildcards with only 
one or two leading characters, e.g. 1* or 12*, can get very expensive. If you 
can require 3 or more leading characters there are rarely problems. You can 
also do a time/space tradeoff by including EdgeNGramFilterFactory in your chain 
at the cost of a larger index.
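
If you go the edge n-gram route, a minimal sketch could look like the 
following. The field type name TrimmedStringPrefix and the gram sizes (3 and 
20) are placeholders I made up for illustration, so pick values that fit your 
IDs. The n-gram filter is only in the index-time chain, which means a plain 
query like 123456 (no wildcard) can match a stored prefix directly.

```xml
<!-- Hypothetical field type name; gram sizes are assumptions -->
<fieldType name="TrimmedStringPrefix" class="solr.TextField" omitNorms="true">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^A-Za-z0-9]" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.UpperCaseFilterFactory"/>
    <!-- Indexes every prefix of the cleaned value from 3 to 20 characters -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="20"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same cleanup as the index side, but no n-grams -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^A-Za-z0-9]" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.UpperCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

One thing to watch: with these settings a cleaned ID longer than maxGramSize 
would not be indexed in full, so a full-string search for it would miss.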

And finally (and this is a total nit), there’s no reason to specify lower-case 
characters in your existing pattern because the upper-case filter is first. You 
_will_ have to specify both cases if you use the charFilter, since it runs 
before the upper-case filter.

As for why QA is different from DEV, my guess is that you overlaid the 
schema changes on an _existing_ index. Most of the time, to get consistent 
results, you must re-index everything starting from an _empty_ index. The 
reasons are long and complicated, and I won’t go into them here. In practice, I 
usually do one of two things:

1> define a new collection/core and index to that. If using SolrCloud, you can 
re-index and use collection aliasing to seamlessly switch.

2> stop Solr, delete all the data dirs (the parent of tlog and index) 
associated with any of my replicas, then restart Solr and re-index. You may be 
able to get away with using delete-by-query to remove everything in your index, 
then optimizing (one of the very few times I’ll recommend optimizing), 
reloading your collection and re-indexing. The point is to get rid of all 
traces of anything generated from the old schema. 
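
For the delete-by-query route, the XML update messages would be roughly the 
following, each one POSTed on its own to the collection’s /update handler. 
This is only a sketch, and it wipes every document, so double-check which 
collection you’re pointing at first.

```xml
<!-- 1. Remove every document (match-all query) -->
<delete><query>*:*</query></delete>

<!-- 2. Make the deletes visible -->
<commit/>

<!-- 3. Merge segments so the deleted documents are physically purged -->
<optimize/>
```

After that, reload the collection so the new schema is picked up, then 
re-index from your source data.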

Best,
Erick

> On Apr 3, 2020, at 3:40 PM, Teresa McMains <ter...@t14-consulting.com> wrote:
> 
> Forgive me if this is unclear, I am very much new here.
> 
> I am working with a customer who needs to be able to query various 
> account/customer ID fields which may or may not have embedded dashes.  But 
> they want to be able to search by entering the dashes or not and by entering 
> partial values or not.
> 
> So we may have an account or customer ID like
> 
> 1234-56AB45
> 
> And they would like to retrieve this by searching for any of the following:
> 1234-56AB45     (full string match)
> 1234-56                (partial string match)
> 123456AB45        (full string but no dashes)
> 123456                  (partial string no dashes)
> 
> I've defined this field type in schema.xml as:
> 
> 
> <!-- String replace field for account number searches -->
> 
> <fieldType name="TrimmedString" class="solr.TextField" omitNorms="true">
> 
> <analyzer>
> 
>  <tokenizer class="solr.KeywordTokenizerFactory" />
> 
> 
>  <!-- Normalizes token text to upper case -->
> 
>  <filter class="solr.UpperCaseFilterFactory" />
> 
>  <!-- Removes anything that isn't a letter or digit -->
> 
>  <filter class="solr.PatternReplaceFilterFactory" pattern="[^A-Za-z0-9]" 
> replacement="" replace="all"/>
> 
> 
> 
> </analyzer>
> 
> </fieldType>
> 
> But the behavior I see is completely unexpected.
> Full string match works fine on the customer's DEV environment but not in QA 
> (which is running the same version of SOLR)
> Partial string match works for some ID fields but not others
> A Partial string match when the user does not enter the dashes just never 
> works
> 
> I don't even know where to begin.  The behavior is not consistent enough to 
> give me a sense.
> 
> So perhaps I will just ask - how would you define a fieldType which should 
> ignore special characters like hyphens or underscores (or anything 
> non-alphanumeric) and works for full string or partial string search?
> 
> Thank you.
> 
>
