bq: Have now begun writing my own.

I hope by that you mean defining your own <fieldType>, at least until you're sure that none of the zillion things you can do with an analysis chain suit your needs.
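For reference, a custom <fieldType> along those lines might look like the sketch below. This is a minimal, illustrative fragment (the name "text_catchall" and the exact attribute values are assumptions, not taken from the thread); it pairs WhitespaceTokenizerFactory with WordDelimiterFilterFactory so that a token like 1234LT is split on the letter/digit transition while preserveOriginal="1" also keeps the unsplit token in the index:

```xml
<!-- Minimal sketch of a custom fieldType; names and values are illustrative. -->
<fieldType name="text_catchall" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- The one and only tokenizer: breaks the stream on whitespace. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- A filter: splits "1234LT" into "1234"/"LT"; preserveOriginal
         keeps "1234LT" itself as a token too. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The admin/analysis page will show you exactly what each stage of a chain like this emits for any sample input.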
If you haven't already, look _seriously_ at the admin/analysis page (you have to choose a core to have it available). Fuzzy matching won't help you with the 1234-LT example at all.

BTW, you (perhaps unintentionally) changed the problem: 1234LT as input is vastly different from 1234-LT. The latter will be made into two tokens by some tokenizers, whereas 1234LT is always passed through the tokenizer as a single "word", _then_ broken up by WordDelimiterFilterFactory if it's a filter in the analysis chain.

Do note that when I say "tokenizer" I'm referring to the specific class that breaks the incoming stream up. The simplest example is WhitespaceTokenizer, which... you guessed it, breaks up the stream on whitespace. Once something gets through the one and only tokenizer in an analysis chain, each token passes through 0 or more "filters", and WordDelimiterFilterFactory is one of these.

Pardon me for being somewhat pedantic here, but unless the analysis chain is understood you'll go through endless thrashing. This is where the admin/analysis page is invaluable.

Best,
Erick

On Tue, Feb 2, 2016 at 12:49 PM, John Blythe <j...@curvolabs.com> wrote:
> I had been using text_general at the time of my email's writing. Have tried
> a couple of the other stock ones (text_en, text_en_splitting, _tight). Have
> now begun writing my own. I began to wonder if simply doing one of the
> above, such as text_general, with a fuzzy distance (probably just ~1) would
> be best suited. Another example would be an indexed value of "Phasaix"
> (which is a typo in the original data) being searched for with the correct
> spelling of "Phasix" and returning nothing. Adding ~1 in that case works.
> For some reason it doesn't in the case of the 1234-L and 1234-LT example.
>
> Thanks for any insight-
>
> --
> *John Blythe*
> Product Manager & Lead Developer
>
> 251.605.3071 | j...@curvolabs.com
> www.curvolabs.com
>
> 58 Adams Ave
> Evansville, IN 47713
>
> On Mon, Feb 1, 2016 at 3:30 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> Likely you also have WordDelimiterFilterFactory in
>> your fieldType; that's what will split on alphanumeric
>> transitions.
>>
>> So you should be able to use wildcards here, i.e. 1234L*
>>
>> However, that'll only work if you have preserveOriginal set in
>> WordDelimiterFilterFactory in your indexing chain.
>>
>> And just to make life "interesting", there are some peculiarities
>> with parsing wildcards at query time, so be sure to see the
>> admin/analysis page....
>>
>> Best,
>> Erick
>>
>> On Mon, Feb 1, 2016 at 12:20 PM, John Blythe <j...@curvolabs.com> wrote:
>> > Hi there
>> >
>> > I have a catch-all field called 'text' that I copy my item description,
>> > manufacturer name, and the item's catalog number into. I'm having an issue
>> > with keeping the broadness of the tokenizers in place whilst still allowing
>> > some good precision in the case of very specific queries.
>> >
>> > The results are generally good. But, for instance, the products named 1234L
>> > and 1234LT aren't behaving how I would like. If I search 1234 they both
>> > show. If I search 1234L only the first one is returned. I'm guessing this
>> > is due to the splitting of the numeric and string portions. The "1234" and
>> > the "L" both hit in the first case ("1234" and "L") but the L is of no
>> > value in the "1234" and "LT" indexed item.
>> >
>> > What is the best way around this so that a small Levenshtein distance, for
>> > instance, is picked up?
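The splitting behavior discussed in this thread can be sketched with a rough simulation. This is not Solr code, just an illustrative model of how WordDelimiterFilterFactory-style splitting on letter/digit transitions explains the 1234L vs. 1234LT symptom, and what preserveOriginal adds:

```python
import re

def word_delimiter_split(token, preserve_original=False):
    """Rough simulation of WordDelimiterFilterFactory splitting one token
    on letter/digit transitions. Illustrative only, not Solr's implementation."""
    parts = re.findall(r"[A-Za-z]+|[0-9]+", token)
    if preserve_original and token not in parts:
        parts.append(token)
    return parts

# Indexed "1234LT" becomes ["1234", "LT"]; a query for "1234L" becomes
# ["1234", "L"], so "1234" matches both products while "L" matches nothing,
# which is why 1234L cannot single out the 1234L product over 1234LT.
print(word_delimiter_split("1234LT"))                          # ['1234', 'LT']
print(word_delimiter_split("1234L"))                           # ['1234', 'L']
print(word_delimiter_split("1234LT", preserve_original=True))  # ['1234', 'LT', '1234LT']
```

With preserveOriginal in the indexing chain, the whole token "1234LT" is also indexed, which is what makes the wildcard query 1234L* mentioned above workable.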