Hi - we are trying to use payloads to identify the different parts of extracted 
HTML pages, using the DelimitedPayloadTokenFilter to assign the correct payload 
to each token. However, we are running into problems with some language-specific 
analyzers, and with certain types of content even for most regular analyzers.

If, for example, we want to assign payloads to the text of an H1 field that 
contains non-alphanumerics, such as `Hello, i am a heading!`, and use `|` as 
the delimiter and `5` as the payload, we send the following to Solr: 
`Hello,|5 i|5 am|5 a|5 heading!|5`.
This does not work because of the WordDelimiterFilter: it splits on the 
punctuation, so the tokens Hello and heading lose their payload. We also cannot 
put the payload between the last alphanumeric and the following comma or 
exclamation mark (as in `Hello|5,`), because then those trailing characters 
would become part of the payload if we use the identity encoder, and would 
cause a failure with any other encoder. We could solve this with a custom 
encoder that takes only the first character and ignores the rest, but that 
seems rather ugly.
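
For reference, here is a minimal sketch of the kind of encoder we have in mind 
(the class name FirstCharEncoder is ours, and this assumes a Lucene version 
where PayloadEncoder returns a BytesRef):

```java
import org.apache.lucene.analysis.payloads.PayloadEncoder;
import org.apache.lucene.util.BytesRef;

// Sketch of an encoder that keeps only the first character of the
// delimited payload, so "5," and "5!" both encode to the single byte '5'.
public class FirstCharEncoder implements PayloadEncoder {

  @Override
  public BytesRef encode(char[] buffer) {
    return encode(buffer, 0, buffer.length);
  }

  @Override
  public BytesRef encode(char[] buffer, int offset, int length) {
    if (length == 0) {
      return new BytesRef();
    }
    // Everything after the first payload character (commas, exclamation
    // marks, ...) is silently dropped.
    return new BytesRef(new byte[] { (byte) buffer[offset] });
  }
}
```

It would be passed to the DelimitedPayloadTokenFilter in place of the identity 
encoder, which is exactly the ugliness we would like to avoid.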

On the other hand, we have issues with language-specific tokenizers such as 
Kuromoji: it immediately throws away the delimited payload, so it never reaches 
the DelimitedPayloadTokenFilter. And if we index Chinese text with the 
StandardTokenizer enabled, we also lose the delimited payload.
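
A quick standalone illustration of what we see (written against a recent 
Lucene; older versions take a Version and Reader in the tokenizer constructor):

```java
import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// The '|' is treated as punctuation and dropped during tokenization, so
// the delimiter never survives to reach the DelimitedPayloadTokenFilter
// and the payload digit comes out as a separate token.
public class DelimiterLossDemo {
  public static void main(String[] args) throws Exception {
    StandardTokenizer tok = new StandardTokenizer();
    tok.setReader(new StringReader("你好|5 世界|5"));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      System.out.println(term); // 你, 好, 5, 世, 界, 5 -- no '|' left
    }
    tok.end();
    tok.close();
  }
}
```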

Have any of you dealt with this before? Any hints to share?

Many thanks,
Markus
