Thanks David for sharing! The custom attribute approach sounds interesting indeed.

Markus
-----Original message-----
> From: david.w.smi...@gmail.com <david.w.smi...@gmail.com>
> Sent: Tuesday 10th March 2015 16:53
> To: solr-user@lucene.apache.org
> Subject: Re: Delimited payloads input issue
>
> Hi Markus,
>
> I've found this problem too. I've worked around it:
> * Write a custom attribute that has the data you want to carry forward, one with a no-op clear(). The no-op clear defeats WDF and other ill-behaved filters (e.g. common-grams).
> * Use a custom tokenizer that populates the attribute, and which can truly clear the custom attribute.
> * Write a custom filter at the end of the chain that actually encodes the attribute data into the payload.
>
> This scheme has worked for me for sentence/paragraph IDs, which effectively hold constant throughout the sentence. It may be more complicated when the data varies word-by-word, since some Filters won't work well.
>
> I suppose the real solution is better Filters that use captureState and/or an improved TokenStream design to clone attributes. It's better to clone a state when introducing a new token than to clear it!
>
> ~ David Smiley
> Freelance Apache Lucene/Solr Search Consultant/Developer
> http://www.linkedin.com/in/davidwsmiley
>
> On Fri, Mar 6, 2015 at 1:16 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>
> > Well, the only work-around we found to actually work properly is to override the problem-causing tokenizer implementations one by one. Regarding the WordDelimiterFilter, the quickest fix is enabling keepOriginal; if you don't want the original to stick around, the filter implementation must be modified to carry the original PayloadAttribute over to its descendants.
> >
> > Markus
> >
> >
> > -----Original message-----
> > > From: Markus Jelsma <markus.jel...@openindex.io>
> > > Sent: Friday 27th February 2015 17:28
> > > To: solr-user <solr-user@lucene.apache.org>
> > > Subject: Delimited payloads input issue
> > >
> > > Hi - we attempt to use payloads to identify different parts of extracted HTML pages and use the DelimitedPayloadTokenFilter to assign the correct payload to the tokens. However, we are having issues with some language analyzers, and issues with some types of content for most regular analyzers.
> > >
> > > If we, for example, want to assign payloads to the text within an H1 field that contains non-alphanumerics, such as `Hello, i am a heading!`, and use |5 as delimiter and payload, we send the following to Solr: `Hello,|5 i|5 am|5 a|5 heading!|5`.
> > > This is not going to work because, due to the WordDelimiterFilter, the tokens Hello and heading obviously lose their payload. We also cannot put the payload between the last alphanumeric and the following comma or exclamation mark, because then those characters would become part of the payload if we use the identity encoder, or it would fail if we use another encoder. We could solve this using a custom encoder that only takes the first character and ignores the rest, but this seems rather ugly.
> > >
> > > On the other hand, we have issues using language-specific tokenizers such as Kuromoji: it will immediately dump the delimited payload so it never reaches the DelimitedPayloadTokenFilter. And if we try Chinese and have the StandardTokenizer enabled, we also lose the delimited payload.
> > >
> > > Any of you have dealt with this before? Hints to share?
> > >
> > > Many thanks,
> > > Markus
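
For reference, a minimal sketch of the custom-attribute scheme David describes above, assuming Lucene's analysis attribute APIs. The names MarkerAttribute, MarkerAttributeImpl and MarkerToPayloadFilter are hypothetical; a real setup would also need the custom tokenizer that populates the attribute and calls reallyClear() between tokens.

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl;
import org.apache.lucene.util.AttributeReflector;
import org.apache.lucene.util.BytesRef;

/** Hypothetical attribute carrying the data to forward into a payload. */
interface MarkerAttribute extends Attribute {
  void setMarker(byte marker);
  byte getMarker();
}

/** Impl with a no-op clear() so ill-behaved filters cannot wipe the value. */
class MarkerAttributeImpl extends AttributeImpl implements MarkerAttribute {
  private byte marker;

  @Override public void setMarker(byte marker) { this.marker = marker; }
  @Override public byte getMarker() { return marker; }

  // Deliberately a no-op: clearAttributes() calls from WDF and friends keep the value.
  @Override public void clear() {}

  // Called explicitly by the custom tokenizer to truly reset the value.
  void reallyClear() { marker = 0; }

  @Override public void copyTo(AttributeImpl target) {
    ((MarkerAttributeImpl) target).marker = marker;
  }

  @Override public void reflectWith(AttributeReflector reflector) {
    reflector.reflect(MarkerAttribute.class, "marker", marker);
  }
}

/** Last filter in the chain: turns the carried marker into an actual payload. */
final class MarkerToPayloadFilter extends TokenFilter {
  private final MarkerAttribute markerAtt = addAttribute(MarkerAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  MarkerToPayloadFilter(TokenStream input) { super(input); }

  @Override public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    payloadAtt.setPayload(new BytesRef(new byte[] { markerAtt.getMarker() }));
    return true;
  }
}
```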
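
And a sketch of the "first character only" encoder Markus mentions as the uglier alternative, assuming Lucene's PayloadEncoder interface used by DelimitedPayloadTokenFilter; the class name is hypothetical.

```java
import org.apache.lucene.analysis.payloads.PayloadEncoder;
import org.apache.lucene.util.BytesRef;

/**
 * Hypothetical encoder that keeps only the first character after the
 * delimiter, so trailing punctuation (e.g. the "," in "Hello|5,") is ignored.
 */
public class FirstCharPayloadEncoder implements PayloadEncoder {

  @Override
  public BytesRef encode(char[] buffer) {
    return encode(buffer, 0, buffer.length);
  }

  @Override
  public BytesRef encode(char[] buffer, int offset, int length) {
    if (length == 0) {
      return new BytesRef(); // nothing after the delimiter
    }
    // Keep only the first character, e.g. '5' from "5," or "5!".
    return new BytesRef(new byte[] { (byte) buffer[offset] });
  }
}
```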