Re: match string fields with embedded hyphens

Chris Hostetter Fri, 03 Apr 2020 16:24:07 -0700


: I am working with a customer who needs to be able to query various 
: account/customer ID fields which may or may not have embedded dashes.  
: But they want to be able to search by entering the dashes or not and by 
: entering partial values or not.
: 
: So we may have an account or customer ID like
: 
: 1234-56AB45
: 
: And they would like to retrieve this by searching for any of the following:
: 1234-56AB45     (full string match)
: 1234-56                (partial string match)
: 123456AB45        (full string but no dashes)
: 123456                  (partial string no dashes)

To answer your last question first...

: So perhaps I will just ask - how would you define a fieldType which
: should ignore special characters like hyphens or underscores (or
: anything non-alphanumeric) and works for full string or partial string
: search?

This is pretty much exactly what the "Word Delimiter Filter" was designed
for, and i encourage you to play with it and its various options and
see what happens...

https://lucene.apache.org/solr/guide/8_5/filter-descriptions.html#word-delimiter-graph-filter

You'll definitely need to enable some "non-default" options (like
"catenateNumbers=true") to ensure that you get indexed terms like
"123456" from input "1234-56AB45"

One thing that's not entirely clear from your question & input is how you
define "partial string" ... for example: are you expecting a query of "12"
to match your input document? because WDF won't help with that.
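
As a rough illustration of the two points above, here is a small Python
model of the kind of splitting and number catenation being described. It
is NOT Lucene's implementation: the function name and the regex are made
up for this sketch, and it ignores token positions, lowercasing and every
other WDF option. Whether a real query matches also depends on the query
parser and the query-time analyzer.

```python
import re

def wdf_like_terms(text, catenate_numbers=True):
    """Toy stand-in for word-delimiter splitting (hypothetical helper)."""
    # Split on non-alphanumerics and at letter/digit boundaries.
    parts = re.findall(r"[0-9]+|[A-Za-z]+", text)
    terms = set(parts)
    if catenate_numbers:
        # Join each run of adjacent all-digit parts into one extra term.
        run = []
        for part in parts + [""]:  # "" is a sentinel that flushes the last run
            if part.isdigit():
                run.append(part)
            else:
                if len(run) > 1:
                    terms.add("".join(run))
                run = []
    return terms

indexed = wdf_like_terms("1234-56AB45")
for query in ["1234-56AB45", "1234-56", "123456AB45", "123456", "12"]:
    # Crude check: are all of the query's terms among the indexed terms?
    print(query, wdf_like_terms(query) <= indexed)
```

In this model the first four inputs only produce terms that are also
produced for "1234-56AB45", while "12" produces a term that was never
indexed, which is the "partial string" case WDF alone would not cover.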

: But the behavior I see is completely unexpected. Full string match works
: fine on the customer's DEV environment but not in QA (which is running
: the same version of SOLR)

I guarantee you there is some difference between your DEV and QA
environments. Either in terms of the documents in the index, or the
schema THAT WAS USED WHEN INDEXING THE DOCS --
which might have been changed after the indexing happened, or
the "current" schema being used when the queries are getting
parsed, or the default request options in solrconfig.xml ... something is
absolutely different.

: Partial string match works for some ID fields but not others
: A Partial string match when the user does not enter the dashes just
: never works

I'm assuming these last 2 comments refer to behavior you see on *both*
your DEV and QA instances?

Depending on your definition of "partial string" (see the question i asked
above) then I _think_ the analyzer you have should work -- at least for
all the examples you've provided.

The missing piece of information is *how* you are querying: what query
parser you are using, what exactly the input looks like; and also: the
output: what does "never works" mean? ... does it match 0 docs? does it
match docs you don't expect?

It would help to see the exact request URLs you are trying, with
"debug=true&echoParams=all" added, and the full output of those requests,
so we can see things like the header, where we can confirm what default
params might be getting added; the query parser debug info, to double
check how your query is being parsed; and the "explain" info, to see why
any docs that are matching (unexpectedly) are there.
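
For anyone who wants to script this, a minimal Python sketch of building
such a request URL follows. The host, the collection name "accounts" and
the field name "acct_id" are placeholders invented for the example; only
the "debug=true" and "echoParams=all" params come from the advice above.

```python
from urllib.parse import urlencode

def debug_query_url(base_url, collection, query):
    """Build a /select URL with the debugging params added."""
    params = {
        "q": query,
        "debug": "true",       # include query parser and "explain" debug info
        "echoParams": "all",   # echo all params, including configured defaults
    }
    return f"{base_url}/{collection}/select?{urlencode(params)}"

# Placeholder host, collection and field names; substitute your own.
print(debug_query_url("http://localhost:8983/solr", "accounts", "acct_id:1234-56"))
```

Running the same URL against both DEV and QA and comparing the echoed
params and parsed query is one way to spot where the two differ.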

More tips on details that can be useful to include to "help us help
you"...

https://cwiki.apache.org/confluence/display/SOLR/UsingMailingLists

-Hoss
http://www.lucidworks.com/
