Thanks for sharing, David! The custom attribute approach sounds interesting indeed.
Markus

 
 
-----Original message-----
> From:david.w.smi...@gmail.com <david.w.smi...@gmail.com>
> Sent: Tuesday 10th March 2015 16:53
> To: solr-user@lucene.apache.org
> Subject: Re: Delimited payloads input issue
> 
> Hi Markus,
> 
> I’ve found this problem too. I’ve worked around it:
> * write a custom attribute that has the data you want to carry-forward, one
> with a no-op clear().  The no-op clear defeats WDF and other ill-behaved
> filters (e.g. common-grams).
> * Use a custom tokenizer that populates the attribute, and which can truly
> clear the custom attribute
> * Write a custom filter at the end of the chain that actually encodes the
> attribute data into the payload.
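> 
> In rough code, the first and last pieces look something like this (a
> simplified sketch, not my exact classes; the custom tokenizer that actually
> sets the value is omitted, and each type goes in its own file):
> 
>   import java.io.IOException;
>   import org.apache.lucene.analysis.TokenFilter;
>   import org.apache.lucene.analysis.TokenStream;
>   import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
>   import org.apache.lucene.util.Attribute;
>   import org.apache.lucene.util.AttributeImpl;
>   import org.apache.lucene.util.AttributeReflector;
>   import org.apache.lucene.util.BytesRef;
> 
>   // the carried-forward data, as a plain attribute interface
>   public interface SentenceIdAttribute extends Attribute {
>     void setSentenceId(int id);
>     int getSentenceId();
>   }
> 
>   // Lucene finds this via the "<interface name>Impl" naming convention
>   public class SentenceIdAttributeImpl extends AttributeImpl
>       implements SentenceIdAttribute {
>     private int sentenceId;
> 
>     public void setSentenceId(int id) { this.sentenceId = id; }
>     public int getSentenceId() { return sentenceId; }
> 
>     @Override
>     public void clear() {
>       // intentionally a no-op: filters that call clearAttributes() on the
>       // tokens they generate can no longer wipe the value
>     }
> 
>     @Override
>     public void copyTo(AttributeImpl target) {
>       ((SentenceIdAttribute) target).setSentenceId(sentenceId);
>     }
> 
>     @Override
>     public void reflectWith(AttributeReflector reflector) {
>       reflector.reflect(SentenceIdAttribute.class, "sentenceId", sentenceId);
>     }
>   }
> 
>   // last filter in the chain: turn the attribute into an actual payload
>   public final class SentenceIdToPayloadFilter extends TokenFilter {
>     private final SentenceIdAttribute sentAtt =
>         addAttribute(SentenceIdAttribute.class);
>     private final PayloadAttribute payloadAtt =
>         addAttribute(PayloadAttribute.class);
> 
>     public SentenceIdToPayloadFilter(TokenStream input) {
>       super(input);
>     }
> 
>     @Override
>     public boolean incrementToken() throws IOException {
>       if (!input.incrementToken()) {
>         return false;
>       }
>       // encode however you like; one byte is enough for this sketch
>       payloadAtt.setPayload(
>           new BytesRef(new byte[] { (byte) sentAtt.getSentenceId() }));
>       return true;
>     }
>   }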
> 
> This scheme has worked for me for sentence/paragraph IDs, which effectively
> hold constant throughout the sentence.  It may be more complicated when the
> data varies word-by-word since some Filters won’t work well.
> 
> I suppose the real solution is better Filters that use captureState and/or
> an improved tokenStream design to clone attributes.  It’s better to clone a
> state when introducing a new token than to clear it!
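> 
> For instance, a filter that introduces a stacked token could clone the
> current state rather than clearing it -- a rough sketch (hypothetical
> filter, only to show the pattern):
> 
>   import java.io.IOException;
>   import org.apache.lucene.analysis.TokenFilter;
>   import org.apache.lucene.analysis.TokenStream;
>   import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>   import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
>   import org.apache.lucene.util.AttributeSource;
> 
>   // emits an extra token cut at the first comma, cloning the original
>   // token's attribute state (payload included) instead of clearing it
>   public final class CloneStateSplitFilter extends TokenFilter {
>     private final CharTermAttribute termAtt =
>         addAttribute(CharTermAttribute.class);
>     private final PositionIncrementAttribute posIncAtt =
>         addAttribute(PositionIncrementAttribute.class);
>     private AttributeSource.State pending;
>     private String pendingPart;
> 
>     public CloneStateSplitFilter(TokenStream input) {
>       super(input);
>     }
> 
>     @Override
>     public boolean incrementToken() throws IOException {
>       if (pendingPart != null) {
>         restoreState(pending);                  // clone, don't clear
>         termAtt.setEmpty().append(pendingPart); // only the term changes
>         posIncAtt.setPositionIncrement(0);      // stack on the original
>         pendingPart = null;
>         return true;
>       }
>       if (!input.incrementToken()) {
>         return false;
>       }
>       String term = termAtt.toString();
>       int cut = term.indexOf(',');
>       if (cut > 0) {
>         pending = captureState();               // remember everything
>         pendingPart = term.substring(0, cut);
>       }
>       return true;
>     }
> 
>     @Override
>     public void reset() throws IOException {
>       super.reset();
>       pending = null;
>       pendingPart = null;
>     }
>   }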
> 
> ~ David Smiley
> Freelance Apache Lucene/Solr Search Consultant/Developer
> http://www.linkedin.com/in/davidwsmiley
> 
> On Fri, Mar 6, 2015 at 1:16 PM, Markus Jelsma <markus.jel...@openindex.io>
> wrote:
> 
> > Well, the only work-around we found to actually work properly is to
> > override the problem-causing tokenizer implementations one by one. Regarding
> > the WordDelimiterFilter, the quickest fix is enabling keepOriginal; if you
> > don't want the original to stick around, the filter implementation must be
> > modified to carry the original PayloadAttribute over to its descendants.
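> >
> > For reference, the chain with keepOriginal looks roughly like this as a
> > Lucene CustomAnalyzer (the Solr fieldType takes the same parameters on the
> > factories; the delimiter and flag values here are only illustrative):
> >
> >   import java.io.IOException;
> >   import org.apache.lucene.analysis.Analyzer;
> >   import org.apache.lucene.analysis.custom.CustomAnalyzer;
> >
> >   public class KeepOriginalChain {
> >     public static Analyzer build() throws IOException {
> >       // peel off the |-delimited payload first, then let the
> >       // WordDelimiterFilter split while keeping the original token --
> >       // the original is the one that still carries the payload
> >       return CustomAnalyzer.builder()
> >           .withTokenizer("whitespace")
> >           .addTokenFilter("delimitedPayload",
> >               "delimiter", "|", "encoder", "identity")
> >           .addTokenFilter("wordDelimiter",
> >               "keepOriginal", "1", "generateWordParts", "1")
> >           .build();
> >     }
> >   }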
> >
> > Markus
> >
> >
> > -----Original message-----
> > > From:Markus Jelsma <markus.jel...@openindex.io>
> > > Sent: Friday 27th February 2015 17:28
> > > To: solr-user <solr-user@lucene.apache.org>
> > > Subject: Delimited payloads input issue
> > >
> > > Hi - we attempt to use payloads to identify different parts of extracted
> > > HTML pages and use the DelimitedPayloadTokenFilter to assign the correct
> > > payload to the tokens. However, we are having issues with some language
> > > analyzers, and with some types of content for most regular analyzers.
> > >
> > > If we, for example, want to assign payloads to the text within an H1
> > > field that contains non-alphanumerics such as `Hello, i am a heading!`, and
> > > use |5 as delimiter and payload, we send the following to Solr: `Hello,|5
> > > i|5 am|5 a|5 heading!|5`.
> > > This is not going to work because, due to the WordDelimiterFilter, the
> > > tokens Hello and heading obviously lose their payload. We also cannot put
> > > the payload between the last alphanumeric and the following comma or
> > > exclamation mark, because then those characters would become part of the
> > > payload if we use the identity encoder, or it would fail if we use another
> > > encoder. We could solve this using a custom encoder that only takes the
> > > first character and ignores the rest, but this seems rather ugly.
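> > >
> > > For completeness, that ugly encoder would be roughly the following (an
> > > illustrative sketch, not something we actually run):
> > >
> > >   import org.apache.lucene.analysis.payloads.AbstractEncoder;
> > >   import org.apache.lucene.analysis.payloads.PayloadEncoder;
> > >   import org.apache.lucene.util.BytesRef;
> > >
> > >   // keeps only the first character of the delimited payload text and
> > >   // ignores whatever trailing punctuation ended up behind it
> > >   public class FirstCharPayloadEncoder extends AbstractEncoder
> > >       implements PayloadEncoder {
> > >     @Override
> > >     public BytesRef encode(char[] buffer, int offset, int length) {
> > >       if (length == 0) {
> > >         return null;
> > >       }
> > >       return new BytesRef(new byte[] { (byte) buffer[offset] });
> > >     }
> > >   }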
> > >
> > > On the other hand, we have issues using language-specific tokenizers
> > > such as Kuromoji, which will immediately dump the delimited payload so it
> > > never reaches the DelimitedPayloadTokenFilter. And if we try Chinese text
> > > and have the StandardTokenizer enabled, we also lose the delimited payload.
> > >
> > > Any of you have dealt with this before? Hints to share?
> > >
> > > Many thanks,
> > > Markus
> > >
> >
> 
