Well, the only workaround we found to actually work properly is to override 
the problem-causing tokenizer implementations one by one. Regarding the 
WordDelimiterFilter, the quickest fix is enabling preserveOriginal; if you 
don't want the original to stick around, the filter implementation must be 
modified to carry the original PayloadAttribute over to the tokens it emits.
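
A minimal sketch of what such an analyzer chain could look like (the field 
type name is made up, and the exact filter options depend on your schema; 
note that in Solr's WordDelimiterFilterFactory the attribute is spelled 
preserveOriginal):

```xml
<!-- Illustrative field type, not a drop-in config. The payload filter
     strips the |N suffix and stores it in the PayloadAttribute; with
     preserveOriginal="1" the WordDelimiterFilter keeps the original
     token (which still carries the payload) alongside the split parts. -->
<fieldType name="text_payloads" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory"
            delimiter="|" encoder="integer"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            preserveOriginal="1"/>
  </analyzer>
</fieldType>
```

The ordering matters here: the payload filter has to run before the 
WordDelimiterFilter, otherwise the delimiter itself gets split away.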

Markus
 
 
-----Original message-----
> From:Markus Jelsma <markus.jel...@openindex.io>
> Sent: Friday 27th February 2015 17:28
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Delimited payloads input issue
> 
> Hi - we attempt to use payloads to identify different parts of extracted HTML 
> pages and use the DelimitedPayloadTokenFilter to assign the correct payload 
> to the tokens. However, we are having issues for some language analyzers and 
> issues with some types of content for most regular analyzers.
> 
> If we, for example, want to assign payloads to the text within an H1 field 
> that contains non-alphanumerics such as `Hello, i am a heading!`, and use |5 
> as delimiter and payload, we send the following to Solr, `Hello,|5 i|5 am|5 
> a|5 heading!|5`.
> This is not going to work because, due to the WordDelimiterFilter, the tokens 
> Hello and heading obviously lose their payload. We also cannot put the 
> payload between the last alphanumeric and the following comma or exclamation 
> mark, because then those characters would become part of the payload if we 
> use the identity encoder, or decoding would fail if we use another encoder. 
> We could solve this with a custom encoder that only takes the first character 
> and ignores the rest, but that seems rather ugly.
> 
> On the other hand, we have issues with language-specific tokenizers such as 
> Kuromoji: it will immediately dump the delimited payload, so it never reaches 
> the DelimitedPayloadTokenFilter. And if we try Chinese text with the 
> StandardTokenizer enabled, we also lose the delimited payload.
> 
> Any of you have dealt with this before? Hints to share?
> 
> Many thanks,
> Markus
> 
