Look at what’s returned when you specify &debug=query. Particularly the parsed 
query. That’ll show you the results of parsing. My bet: you’ll see something 
unexpected...
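For example, something like this (assuming a core named "accounts" on the default port; adjust the names to your setup), then read the "parsedquery" entry in the "debug" section of the response:

```shell
# Ask Solr to echo the parsed form of the query.
curl 'http://localhost:8983/solr/accounts/select?q=account_number:1258458-0659841&debug=query'
```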

Best,
Erick

> On Apr 8, 2020, at 17:59, Teresa McMains <ter...@t14-consulting.com> wrote:
> 
> I am still really struggling with this.
> 
> Current field type as defined in schema.xml:
> 
> <!-- String replace field for account number searches -->
> <fieldType name="TrimmedString" class="solr.TextField" omitNorms="true"> 
> <analyzer> 
>  <!-- Removes anything that isn't a letter or digit --> 
>  <charFilter class="solr.PatternReplaceCharFilterFactory" 
> pattern="([^A-Za-z0-9])" replacement=""/>
>  <tokenizer class="solr.KeywordTokenizerFactory" />
> 
>  <!-- Normalizes token text to upper case -->
>  <filter class="solr.UpperCaseFilterFactory" /> 
> 
> </analyzer> 
> </fieldType>  
> 
> Two fields are defined using this field type:
> <field name="account_number" type="TrimmedString" indexed="true" 
> stored="true" multiValued="false" required="false"/>
> <field name="transaction_key" type="TrimmedString" indexed="true" 
> stored="true" multiValued="false" required="false"/>
> 
> A transaction key may look like: 107986541-85487JY_X4528745
> An account number may look like: 1258458-0659841
> 
> After making this change, I stopped solr, deleted the data directory, 
> restarted solr and ran indexing for all data.
> 
> Transaction Key:
> Searching for: 107986541-85487JY_X4528745      Returns: 107986541-85487JY_X4528745 (good)
> Searching for: "107986541-85487JY_X4528745"    Returns: 107986541-85487JY_X4528745 (good)
> Searching for: 10798654185487JYX4528745        Returns: 107986541-85487JY_X4528745 (good)
> Searching for: "107986541-85487JY_X4528745"    Returns: 107986541-85487JY_X4528745 (good)
> Searching for: 107986541                       Returns: 107986541-85487JY_X4528745 (unexpected)
> Searching for: 107986541*                      Returns: MANY MANY hits that all start with 107986541 (unexpected)
> 
> Account Number: 
> Searching for: 1258458-0659841        Returns: NOTHING (bad)
> Searching for: "1258458-0659841"        Returns: 1258458-0659841 (good)
> Searching for: 12584580659841            Returns: 1258458-0659841 (good)
> Searching for: "12584580659841"        Returns: 1258458-0659841 (good)
> Searching for: 1258458-0659            Returns: 1258458-0659841 (good)
> Searching for: 1258458-0659*            Returns: NOTHING (bad)
> 
> So my questions are:
> 1) Why does searching for 107986541 And 107986541* For transaction_key return 
> different results?
> 2) Why does searching for a full account number without quotes fail?
> 3) Why does specifying the wildcard character in the last account_number 
> search return nothing?
> 
> Many many thanks.
> I'll get this some day,
> Teresa
> 
> 
>> 
>>>> On Apr 6, 2020, at 12:38 PM, Teresa McMains <ter...@t14-consulting.com> 
>>>> wrote:
>>> 
>>> Erick, thank you so much for this.  I'm going to try to implement with 
>>> PatternReplaceCharFilterFactory as you recommended.
>>> What you mentioned about re-indexing from an empty state made sense to me 
>>> (in terms of the observed behavior) but also surprised me.  If I select 
>>> "Clean" on the reindex, does it *not* start from an empty state?
>>> 
>>> Thanks!!
>>> Teresa
>>> 
>>> 
>>> -----Original Message-----
>>> From: Erick Erickson <erickerick...@gmail.com>
>>> Sent: Friday, April 3, 2020 7:16 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: match string fields with embedded hyphens
>>> 
>>> First, thanks for taking the time to write up a clear problem statement. 
>>> Putting in the field type is _really_ helpful.
>>> 
>>> By “partial string match”, I’m assuming you’re using wildcards, i.e. 123*. 
>>> The problem is that wildcards are tricky, and this trips everybody up at 
>>> one time or another.
>>> 
>>> The quick background is that if there’s any possibility that the filter can 
>>> produce multiple tokens for a single input token, that filter is skipped 
>>> during analysis at _query_ time. Imagine that your replacement was a space 
>>> rather than an empty string. Then 123--456 would become _two_ tokens in 
>>> subsequent processing. Now anything you do is wrong sometime, somewhere.
>>> 
>>> For instance, 123*456 would fail because it’d be looking for one token, 
>>> which you wouldn’t expect. 12345* would also fail because there’s no single 
>>> token like that. 123 would succeed (note no wildcard). You can see where 
>>> this is going.
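The two-token trap is easy to see outside Solr. This is plain Python regex standing in for the analysis chain, not actual Solr code:

```python
import re

raw = "123--456"

# If the replacement were a space, the stream would hold two tokens,
# so a wildcard like 123*456 has no single token to match against.
as_space = re.sub(r"[^A-Za-z0-9]", " ", raw).split()
print(as_space)   # ['123', '456']

# Replacement with the empty string keeps a single token.
as_empty = re.sub(r"[^A-Za-z0-9]", "", raw)
print(as_empty)   # 123456
```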
>>> 
>>> Which doesn’t help you solve your use-case. There are several options:
>>> 
>>> - use <charFilter class="solr.PatternReplaceCharFilterFactory"  
>>> pattern="[^A-Za-z0-9]" replacement=""/> instead of 
>>> PatternReplaceFilterFactory. charFilters are applied to the raw input 
>>> before analysis and don’t have the same problem with producing multiple 
>>> tokens.
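In other words, the charFilter runs on the raw character stream before the tokenizer ever sees it, so index-time and query-time input normalize to the same single token. A rough Python sketch of that chain (illustrative only, not Solr code):

```python
import re

def analyze(raw: str) -> str:
    # charFilter stage: strip non-alphanumerics from the raw input,
    # before tokenization, so only one token is ever produced.
    stripped = re.sub(r"[^A-Za-z0-9]", "", raw)
    # KeywordTokenizer emits the whole input as one token;
    # the upper-case filter then normalizes it.
    return stripped.upper()

print(analyze("1234-56ab45"))   # 123456AB45
print(analyze("123456AB45"))    # 123456AB45
```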
>>> 
>>> - WordDelimiter(Graph)FilterFactory is built for this kind of thing. There 
>>> are a number of options, and this is one of the few filters that’s often 
>>> different between index and query analysis chains. It can be tricky to 
>>> understand all the interactions of the parameters though.
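For reference, a sketch of what a WordDelimiterGraphFilterFactory-based type might look like; the option values here are illustrative, not a tested recommendation:

```xml
<fieldType name="AccountText" class="solr.TextField" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateAll="1" preserveOriginal="1"/>
    <!-- Graph filters at index time must be flattened before indexing -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateAll="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```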
>>> 
>>> And as an aside, I don’t know how large your index is, but wildcards for 
>>> one or two leading characters can get very expensive, i.e. 1*, 12* can get 
>>> very costly. If you can require 3 or more leading characters there are 
>>> rarely problems. You can also do a time/space tradeoff by including 
>>> EdgeNGramFilterFactory in your chain at the cost of a larger index.
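For the EdgeNGram route, the index-side analyzer might gain one extra filter, something like this (gram sizes are illustrative; tune them to your shortest and longest expected prefixes):

```xml
<analyzer type="index">
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="[^A-Za-z0-9]" replacement=""/>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.UpperCaseFilterFactory"/>
  <!-- Index every leading prefix so prefix searches become exact
       term lookups instead of wildcard expansions -->
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25"/>
</analyzer>
```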
>>> 
>>> And finally (and this is a total nit), there's no reason to specify 
>>> lower-case characters in your existing pattern because the upper-case 
>>> filter runs first. You _will_ have to specify both cases if you use 
>>> the charFilter, since it runs before the upper-case filter.
>>> 
>>> As for why production is different than QA, my guess is that you overlaid 
>>> the schema changes on an _existing_ index. Most of the time, to get 
>>> consistent results, you must re-index everything starting from an _empty_ 
>>> index. This is a long and complicated explanation that I won’t go into 
>>> here. In fact, I usually do one of two things:
>>> 
>>> 1> define a new collection/core and index to that. If using SolrCloud, you 
>>> can re-index and use collection aliasing to seamlessly switch.
>>> 
>>> 2> stop Solr. Delete all the datadirs (the parent of tlog and index) 
>>> associated with any of my replicas, restart Solr and re-index. You may be 
>>> able to get away with using delete-by-query to remove everything in your 
>>> index then optimize (one of the very few times I’ll recommend optimizing), 
>>> reloading your collection and indexing. The point is to get rid of all 
>>> traces of anything generated from the old schema.
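If you take the delete-by-query route, it looks something like this (assuming a core named "accounts"; point it at your own core):

```shell
# Remove every document, then commit; re-index from scratch afterwards.
curl 'http://localhost:8983/solr/accounts/update?commit=true' \
  -H 'Content-Type: text/xml' \
  --data-binary '<delete><query>*:*</query></delete>'
```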
>>> 
>>> Best,
>>> Erick
>>> 
>>>> On Apr 3, 2020, at 3:40 PM, Teresa McMains <ter...@t14-consulting.com> 
>>>> wrote:
>>>> 
>>>> Forgive me if this is unclear, I am very much new here.
>>>> 
>>>> I am working with a customer who needs to be able to query various 
>>>> account/customer ID fields which may or may not have embedded dashes.  But 
>>>> they want to be able to search by entering the dashes or not and by 
>>>> entering partial values or not.
>>>> 
>>>> So we may have an account or customer ID like
>>>> 
>>>> 1234-56AB45
>>>> 
>>>> And they would like to retrieve this by searching for any of the following:
>>>> 1234-56AB45     (full string match)
>>>> 1234-56                (partial string match)
>>>> 123456AB45        (full string but no dashes)
>>>> 123456                  (partial string no dashes)
>>>> 
>>>> I've defined this field type in schema.xml as:
>>>> 
>>>> 
>>>> <!-- String replace field for account number searches -->
>>>> <fieldType name="TrimmedString" class="solr.TextField" omitNorms="true">
>>>>   <analyzer>
>>>>     <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>     <!-- Normalizes token text to upper case -->
>>>>     <filter class="solr.UpperCaseFilterFactory" />
>>>>     <!-- Removes anything that isn't a letter or digit -->
>>>>     <filter class="solr.PatternReplaceFilterFactory" 
>>>>       pattern="[^A-Za-z0-9]" replacement="" replace="all"/>
>>>>   </analyzer>
>>>> </fieldType>
>>>> 
>>>> But the behavior I see is completely unexpected:
>>>> - Full string match works fine on the customer's DEV environment but not 
>>>>   in QA (which is running the same version of Solr).
>>>> - Partial string match works for some ID fields but not others.
>>>> - A partial string match when the user does not enter the dashes just 
>>>>   never works.
>>>> 
>>>> I don't even know where to begin.  The behavior is not consistent enough 
>>>> to give me a sense.
>>>> 
>>>> So perhaps I will just ask - how would you define a fieldType which should 
>>>> ignore special characters like hyphens or underscores (or anything 
>>>> non-alphanumeric) and works for full string or partial string search?
>>>> 
>>>> Thank you.
>>>> 
>>>> 
>>> 
> 
