yo,

Erick: thanks for the reply. Yes, I was only referring to my own custom
fieldType; my bad for not sticking with my original example. I've been
using the StandardTokenizerFactory to break up the stream. While I
understand the tokenization/stream on paper, perhaps I'm not connecting
all the dots I need to. In this case, maybe I need not break the tokens
down so much before WDF starts operating.
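
In other words, something like the sketch below. To be clear, the type
name and the WDF parameter choices here are just what I'm experimenting
with right now, not anything settled — the idea is to let the tokenizer
leave "1234LT" intact as one token and have WordDelimiterFilterFactory
do the splitting, with preserveOriginal keeping the unsplit token around
so wildcard queries like 1234L* can still match:

```xml
<!-- Rough sketch of a custom fieldType (name is a placeholder).
     WhitespaceTokenizerFactory passes "1234LT" through as a single
     token; WordDelimiterFilterFactory then splits it into "1234"
     and "LT", and preserveOriginal="1" also keeps "1234LT" itself. -->
<fieldType name="text_catchall" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            splitOnNumerics="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            splitOnNumerics="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

I'll run variations of this through the admin/analysis page and report
back what the token streams actually look like.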

Susheel: thanks, I'll continue sharing as I explore and run into various
walls.

-- 
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | j...@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713

On Tue, Feb 2, 2016 at 5:21 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> bq: Have now begun writing my own.
>
> I hope by that you mean defining your own <fieldType>,
> at least until you're sure that none of the zillion things
> you can do with an analysis chain suits your needs.
>
> If you haven't already, look _seriously_ at the admin/analysis
> page (you have to choose a core to have it available). Fuzzy
> matching won't help you with the 1234-LT example at all.
>
> BTW, you (perhaps unintentionally) changed the problem:
> 1234LT as input is vastly different from 1234-LT. The latter
> will be made into two tokens by some tokenizers, whereas
> 1234LT is always passed through the tokenizer as a single
> "word", _then_ broken up by WordDelimiterFilterFactory if
> it's a filter in the analysis chain.
>
> Do note that when I use "tokenizer" I'm referring to the
> specific class that breaks the incoming stream up. The
> simplest example is WhitespaceTokenizer, which, you
> guessed it, breaks up the stream on whitespace.
>
> Once something gets through the one and only tokenizer
> in an analysis chain, each token passes through 0
> or more "Filters", and WordDelimiterFilterFactory is
> one of these.
>
> Pardon me for being somewhat pedantic here but unless the
> analysis chain is understood, you'll go through endless
> thrashing. This is where the admin/analysis page is
> invaluable.
>
> Best,
> Erick
>
> On Tue, Feb 2, 2016 at 12:49 PM, John Blythe <j...@curvolabs.com> wrote:
> > I had been using text_general at the time of my email's writing. Have
> > tried a couple of the other stock ones (text_en, text_en_splitting,
> > _tight). Have now begun writing my own. I began to wonder if simply
> > doing one of the above, such as text_general, with a fuzzy distance
> > (probably just ~1) would be best suited. Another example would be an
> > indexed value of "Phasaix" (which is a typo in the original data)
> > being searched for with the correct spelling of "Phasix" and
> > returning nothing. Adding ~1 in that case works. For some reason it
> > doesn't in the case of the 1234-L and 1234-LT example.
> >
> > Thanks for any insight-
> >
> > --
> > *John Blythe*
> > Product Manager & Lead Developer
> >
> > 251.605.3071 | j...@curvolabs.com
> > www.curvolabs.com
> >
> > 58 Adams Ave
> > Evansville, IN 47713
> >
> > On Mon, Feb 1, 2016 at 3:30 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> Likely you also have WordDelimiterFilterFactory in
> >> your fieldType; that's what will split on alphanumeric
> >> transitions.
> >>
> >> So you should be able to use wildcards here, i.e. 1234L*
> >>
> >> However, that'll only work if you have preserveOriginal set in
> >> WordDelimiterFilterFactory in your indexing chain.
> >>
> >> And just to make life "interesting", there are some peculiarities
> >> with parsing wildcards at query time, so be sure to see the
> >> admin/analysis page....
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Feb 1, 2016 at 12:20 PM, John Blythe <j...@curvolabs.com>
> >> wrote:
> >> > Hi there
> >> >
> >> > I have a catch-all field called 'text' that I copy my item
> >> > description, manufacturer name, and the item's catalog number
> >> > into. I'm having an issue with keeping the broadness of the
> >> > tokenizers in place whilst still allowing some good precision in
> >> > the case of very specific queries.
> >> >
> >> > The results are generally good. But, for instance, the products
> >> > named 1234L and 1234LT aren't behaving how I would like. If I
> >> > search 1234 they both show. If I search 1234L only the first one
> >> > is returned. I'm guessing this is due to the splitting of the
> >> > numeric and string portions. The "1234" and the "L" both hit in
> >> > the first case ("1234" and "L") but the L is of no value in the
> >> > "1234" and "LT" indexed item.
> >> >
> >> > What is the best way around this so that a small Levenshtein
> >> > distance, for instance, is picked up?
> >>
>