yo, erick: thanks for the reply. Yes, I was only referring to my own custom fieldType; my bad for not sticking with my original example. I've been using the StandardTokenizerFactory to break out the stream. While I understand the tokenization/stream on paper, perhaps I'm not connecting all the dots I need to. In this case, maybe I need not break the tokens down so much before WDF starts operating.
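Here's a rough sketch of what I'm trying now. The type name is just a placeholder from my sandbox, and the attribute mix is my first guess rather than a known-good config:

<fieldType name="text_catalog" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- whitespace-only tokenizing, so WDF (not the tokenizer) decides how 1234-LT splits -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- preserveOriginal keeps the intact token (1234LT) alongside the split parts (1234, LT) -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>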
susheel: thanks, I'll continue sharing as I explore and run into various walls.

--
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | j...@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713

On Tue, Feb 2, 2016 at 5:21 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> bq: Have now begun writing my own.
>
> I hope by that you mean defining your own <fieldType>,
> at least until you're sure that none of the zillion things
> you can do with an analysis chain suit your needs.
>
> If you haven't already, look _seriously_ at the admin/analysis
> page (you have to choose a core to have it available). Fuzzy
> matching won't help you with the 1234-LT example at all.
>
> BTW, you (perhaps unintentionally) changed the problem:
> 1234LT as input is vastly different from 1234-LT. The latter
> will be made into two tokens by some tokenizers, whereas
> 1234LT is always passed through the tokenizer as a single
> "word", _then_ broken up by WordDelimiterFilterFactory if
> it's a filter in the analysis chain.
>
> Do note that when I use "tokenizer" I'm referring to the
> specific class that breaks the incoming stream up. The
> simplest example is WhitespaceTokenizer, which... you
> guessed it, breaks up the stream on whitespace.
>
> Once something gets through the one and only tokenizer
> in an analysis chain, each token passes through 0
> or more "Filters", and WordDelimiterFilterFactory is
> one of these.
>
> Pardon me for being somewhat pedantic here, but unless the
> analysis chain is understood you'll go through endless
> thrashing. This is where the admin/analysis page is
> invaluable.
>
> Best,
> Erick
>
> On Tue, Feb 2, 2016 at 12:49 PM, John Blythe <j...@curvolabs.com> wrote:
> > I had been using text_general at the time of my email's writing. Have
> > tried a couple of the other stock ones (text_en, text_en_splitting,
> > _tight). Have now begun writing my own. I began to wonder if simply
> > doing one of the above, such as text_general, with a fuzzy distance
> > (probably just ~1) would be best suited. Another example would be an
> > indexed value of "Phasaix" (which is a typo in the original data)
> > being searched for with the correct spelling of "Phasix" and
> > returning nothing. Adding ~1 in that case works. For some reason it
> > doesn't in the case of the 1234-L and 1234-LT example.
> >
> > Thanks for any insight-
> >
> > --
> > *John Blythe*
> > Product Manager & Lead Developer
> >
> > 251.605.3071 | j...@curvolabs.com
> > www.curvolabs.com
> >
> > 58 Adams Ave
> > Evansville, IN 47713
> >
> > On Mon, Feb 1, 2016 at 3:30 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> Likely you also have WordDelimiterFilterFactory in
> >> your fieldType; that's what will split on alphanumeric
> >> transitions.
> >>
> >> So you should be able to use wildcards here, i.e. 1234L*
> >>
> >> However, that'll only work if you have preserveOriginal set in
> >> WordDelimiterFilterFactory in your indexing chain.
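> >>
> >> Something like this in the index-time analyzer (untested, and the
> >> rest of the WDF attributes here are only illustrative; they'll
> >> depend on your data):
> >>
> >>   <filter class="solr.WordDelimiterFilterFactory"
> >>           generateWordParts="1" generateNumberParts="1"
> >>           preserveOriginal="1"/>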
> >>
> >> And just to make life "interesting", there are some peculiarities
> >> with parsing wildcards at query time, so be sure to check the
> >> admin/analysis page....
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Feb 1, 2016 at 12:20 PM, John Blythe <j...@curvolabs.com> wrote:
> >> > Hi there,
> >> >
> >> > I have a catch-all field called 'text' that I copy my item
> >> > description, manufacturer name, and the item's catalog number
> >> > into. I'm having an issue with keeping the broadness of the
> >> > tokenizers in place whilst still allowing some good precision in
> >> > the case of very specific queries.
> >> >
> >> > The results are generally good. But, for instance, the products
> >> > named 1234L and 1234LT aren't behaving how I would like. If I
> >> > search 1234, they both show. If I search 1234L, only the first
> >> > one is returned. I'm guessing this is due to the splitting of the
> >> > numeric and string portions. The "1234" and the "L" both hit in
> >> > the first case ("1234" and "L"), but the L is of no value in the
> >> > "1234" and "LT" indexed item.
> >> >
> >> > What is the best way around this so that a small Levenshtein
> >> > distance, for instance, is picked up?
> >> >