RE: match string fields with embedded hyphens

Teresa McMains Wed, 08 Apr 2020 14:59:22 -0700

I am still really struggling with this.

Current field type as defined in schema.xml:


<!-- String replace field for account number searches -->
<fieldType name="TrimmedString" class="solr.TextField" omitNorms="true">
  <analyzer>
    <!-- Removes anything that isn't a letter or digit -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="([^A-Za-z0-9])" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>

    <!-- Normalizes token text to upper case -->
    <filter class="solr.UpperCaseFilterFactory"/>
  </analyzer>
</fieldType>

Two fields are defined using this field type:
<field name="account_number" type="TrimmedString" indexed="true" stored="true"
       multiValued="false" required="false"/>
<field name="transaction_key" type="TrimmedString" indexed="true" stored="true"
       multiValued="false" required="false"/>

A transaction key may look like: 107986541-85487JY_X4528745
An account number may look like: 1258458-0659841

After making this change, I stopped solr, deleted the data directory, restarted 
solr and ran indexing for all data.

Transaction Key:
Searching for: 107986541-85487JY_X4528745    Returns: 107986541-85487JY_X4528745 (good)
Searching for: "107986541-85487JY_X4528745"  Returns: 107986541-85487JY_X4528745 (good)
Searching for: 10798654185487JYX4528745      Returns: 107986541-85487JY_X4528745 (good)
Searching for: "107986541-85487JY_X4528745"  Returns: 107986541-85487JY_X4528745 (good)
Searching for: 107986541                     Returns: 107986541-85487JY_X4528745 (unexpected)
Searching for: 107986541*                    Returns: MANY MANY hits that all start with 107986541 (unexpected)

Account Number:
Searching for: 1258458-0659841               Returns: NOTHING (bad)
Searching for: "1258458-0659841"             Returns: 1258458-0659841 (good)
Searching for: 12584580659841                Returns: 1258458-0659841 (good)
Searching for: "12584580659841"              Returns: 1258458-0659841 (good)
Searching for: 1258458-0659                  Returns: 1258458-0659841 (good)
Searching for: 1258458-0659*                 Returns: NOTHING (bad)

So my questions are:
1) Why does searching for 107986541 and 107986541* for transaction_key return 
different results?
2) Why does searching for a full account number without quotes fail?
3) Why does specifying the wildcard character in the last account_number search 
return nothing?

Many many thanks.
I'll get this some day,
Teresa


>  
> > On Apr 6, 2020, at 12:38 PM, Teresa McMains <ter...@t14-consulting.com> 
> > wrote:
> > 
> > Erick, thank you so much for this.  I'm going to try to implement with 
> > PatternReplaceCharFilterFactory as you recommended.
> > What you mentioned about re-indexing from an empty state made sense to me 
> > (in terms of the observed behavior) but also surprised me.  If I select 
> > "Clean" on the reindex, does it *not* start from an empty state?
> > 
> > Thanks!!
> > Teresa
> > 
> > 
> > -----Original Message-----
> > From: Erick Erickson <erickerick...@gmail.com>
> > Sent: Friday, April 3, 2020 7:16 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: match string fields with embedded hyphens
> > 
> > First, thanks for taking the time to write up a clear problem statement. 
> > Putting in the field type is _really_ helpful.
> > 
> > By “partial string match”, I’m assuming you’re using wildcards, i.e. 123*. 
> > The problem is that wildcards are tricky, and this trips everybody up at 
> > one time or another.
> > 
> > The quick background is that if there’s any possibility that the filter can 
> > produce multiple tokens for a single input token, that filter is skipped 
> > during analysis at _query_ time. Imagine that your replacement was a space 
> > rather than an empty string. Then 123--456 would become _two_ tokens in 
> > subsequent processing. Now anything you do is wrong sometime, somewhere.
> > 
> > For instance, 123*456 would fail because it’d be looking for one token, 
> > which you wouldn’t expect. 12345* would also fail because there’s no single 
> > token like that. 123 would succeed (note no wildcard). You can see where 
> > this is going.
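
For reference, Solr also lets a field type declare a separate analyzer for wildcard and prefix terms with type="multiterm". The fragment below is an untested sketch of that idea applied to the field type from this thread. The name TrimmedStringMT is hypothetical, and whether this changes the behavior described here is an assumption to check on the Analysis page of the admin UI.

```xml
<!-- Hypothetical variant of TrimmedString with an explicit multiterm analyzer -->
<fieldType name="TrimmedStringMT" class="solr.TextField" omitNorms="true">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^A-Za-z0-9]" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.UpperCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^A-Za-z0-9]" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.UpperCaseFilterFactory"/>
  </analyzer>
  <!-- Chain used for wildcard/prefix terms; every step here yields one token -->
  <analyzer type="multiterm">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^A-Za-z0-9]" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.UpperCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

One thing to verify before relying on this: whether the character filter also sees a * or ? in the middle of a term such as 123*456 and strips it, which may depend on the Solr version.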
> > 
> > Which doesn’t help you solve your use-case. There are several options:
> > 
> > - use <charFilter class="solr.PatternReplaceCharFilterFactory"  
> > pattern="[^A-Za-z0-9]" replacement=""/> instead of 
> > PatternReplaceFilterFactory. charFilters are applied to the raw input 
> > before analysis and don’t have the same problem with producing multiple 
> > tokens.
> > 
> > - WordDelimiter(Graph)FilterFactory is built for this kind of thing. There 
> > are a number of options, and this is one of the few filters that’s often 
> > different between index and query analysis chains. It can be tricky to 
> > understand all the interactions of the parameters though.
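
As a rough illustration of the second option, here is an untested sketch of a field type built around WordDelimiterGraphFilterFactory, with different index and query chains. The field type name DelimitedId and every parameter value are assumptions chosen for illustration, not settings taken from this thread. On older Solr versions the non-graph WordDelimiterFilterFactory would be used instead, without the flatten step.

```xml
<!-- Hypothetical field type; parameter choices are illustrative assumptions -->
<fieldType name="DelimitedId" class="solr.TextField" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Split on delimiters such as - and _, and also emit the joined
         form (catenateAll) and the untouched original (preserveOriginal) -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateAll="1" preserveOriginal="1"
            splitOnCaseChange="0" splitOnNumerics="0"/>
    <!-- Graph filters must be flattened in index-time chains -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.UpperCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Query side only splits; it does not add joined or original forms -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateAll="0" preserveOriginal="0"
            splitOnCaseChange="0" splitOnNumerics="0"/>
    <filter class="solr.UpperCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The intent is that an ID typed with or without dashes lines up with one of the forms stored at index time. How the parameters interact is easiest to confirm by running sample IDs through the Analysis page.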
> > 
> > And as an aside, I don’t know how large your index is, but wildcards for 
> > one or two leading characters can get very expensive, i.e. 1*, 12* can get 
> > very costly. If you can require 3 or more leading characters there are 
> > rarely problems. You can also do a time/space tradeoff by including 
> > EdgeNgramFilterFactory in your chain at the cost of a larger index.
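
A rough, untested sketch of the EdgeNGram trade-off mentioned above: prefixes are generated at index time only, so a partial value can match without a trailing wildcard. The field type name TrimmedStringPrefix and the gram sizes (3 to 30) are assumptions for illustration, not values from this thread.

```xml
<!-- Hypothetical field type; name and gram sizes are assumptions -->
<fieldType name="TrimmedStringPrefix" class="solr.TextField" omitNorms="true">
  <analyzer type="index">
    <!-- Strip everything that is not a letter or digit -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^A-Za-z0-9]" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.UpperCaseFilterFactory"/>
    <!-- Index every leading substring of 3 to 30 characters -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same normalization, but no n-grams: the query stays one token -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^A-Za-z0-9]" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.UpperCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, a query for 1258458-0659 should be reduced to 12584580659 and compared against the indexed prefixes of 12584580659841. Queries shorter than minGramSize would not match, and the index grows with the number of prefixes stored per value.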
> > 
> > And finally, (and this is a total nit) there’s no reason to specify 
> > lower-case characters in your existing pattern because the upper-case 
> > filter is first. You _will_ have to specify both cases if you use the 
> > charFilter, since it runs before the upper-case filter.
> > 
> > As for why production is different than QA, my guess is that you overlaid 
> > the schema changes on an _existing_ index. Most of the time, to get 
> > consistent results, you must re-index everything starting from an _empty_ 
> > index. This is a long and complicated explanation that I won’t go into 
> > here. In fact, I usually do one of two things:
> > 
> > 1> define a new collection/core and index to that. If using SolrCloud, you 
> > can re-index and use collection aliasing to seamlessly switch.
> > 
> > 2> stop Solr. Delete all the datadirs (the parent of tlog and index) 
> > associated with any of my replicas, restart Solr, and index. You may be 
> > able to get away with using delete-by-query to remove everything in your 
> > index then optimize (one of the very few times I’ll recommend optimizing), 
> > reloading your collection and indexing. The point is to get rid of all 
> > traces of anything generated from the old schema.
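
The delete-by-query route in option 2 could look roughly like the XML update messages below, each posted separately to the core's update handler. How and where they are posted depends on your setup and is not specified in this thread, so treat this as a sketch rather than a tested procedure.

```xml
<!-- Message 1: delete every document in the index -->
<delete><query>*:*</query></delete>

<!-- Message 2: commit so the deletion takes effect -->
<commit/>

<!-- Message 3: optimize, merging segments so deleted documents are purged -->
<optimize/>
```

After these, the collection would still need to be reloaded and fully re-indexed, as described above.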
> > 
> > Best,
> > Erick
> > 
> >> On Apr 3, 2020, at 3:40 PM, Teresa McMains <ter...@t14-consulting.com> 
> >> wrote:
> >> 
> >> Forgive me if this is unclear, I am very much new here.
> >> 
> >> I am working with a customer who needs to be able to query various 
> >> account/customer ID fields which may or may not have embedded dashes.  But 
> >> they want to be able to search by entering the dashes or not and by 
> >> entering partial values or not.
> >> 
> >> So we may have an account or customer ID like
> >> 
> >> 1234-56AB45
> >> 
> >> And they would like to retrieve this by searching for any of the following:
> >> 1234-56AB45     (full string match)
> >> 1234-56                (partial string match)
> >> 123456AB45        (full string but no dashes)
> >> 123456                  (partial string no dashes)
> >> 
> >> I've defined this field type in schema.xml as:
> >> 
> >> 
> >> <!-- String replace field for account number searches -->
> >> 
> >> <fieldType name="TrimmedString" class="solr.TextField"
> >> omitNorms="true">
> >> 
> >> <analyzer>
> >> 
> >> <tokenizer class="solr.KeywordTokenizerFactory" />
> >> 
> >> 
> >> <!-- Normalizes token text to upper case -->
> >> 
> >> <filter class="solr.UpperCaseFilterFactory" />
> >> 
> >> <!-- Removes anything that isn't a letter or digit -->
> >> 
> >> <filter class="solr.PatternReplaceFilterFactory" 
> >> pattern="[^A-Za-z0-9]" replacement="" replace="all"/>
> >> 
> >> 
> >> 
> >> </analyzer>
> >> 
> >> </fieldType>
> >> 
> >> But the behavior I see is completely unexpected.
> >> - Full string match works fine on the customer's DEV environment but 
> >>   not in QA (which is running the same version of SOLR).
> >> - Partial string match works for some ID fields but not others.
> >> - A partial string match when the user does not enter the dashes just 
> >>   never works.
> >> 
> >> I don't even know where to begin.  The behavior is not consistent enough 
> >> to give me a sense.
> >> 
> >> So perhaps I will just ask - how would you define a fieldType which should 
> >> ignore special characters like hyphens or underscores (or anything 
> >> non-alphanumeric) and works for full string or partial string search?
> >> 
> >> Thank you.
> >> 
> >> 
> >
