RE: Tokenizers and DelimitedPayloadTokenFilterFactory

Markus Jelsma Tue, 25 Aug 2015 11:53:03 -0700

Well, if I remember correctly (I have no testing facility at hand), 
WordDelimiterFilter maintains payloads on emitted sub terms. So if you use a 
KeywordTokenizer, input 'some text^PAYLOAD', and have a 
DelimitedPayloadTokenFilter, the entire string gets a payload. You can then 
split that string up again into individual tokens. It is possible to abuse 
WordDelimiterFilter for this because it has a types parameter that you can use 
to make it split on whitespace, provided its input is not trimmed. Otherwise 
you can use any other character instead of a space in your input.


This is a crazy idea, but it might work. 
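
To make the ordering concrete, here is a minimal sketch in plain Python. It 
does not use Lucene or Solr at all; Token, keyword_tokenize, 
whitespace_tokenize, delimited_payload and split_subterms are hypothetical 
stand-ins for the analysis components named above. It also simply assumes that 
the splitting step copies the parent payload onto every sub term, which is the 
very thing that would need to be verified against the real WordDelimiterFilter.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    term: str
    payload: Optional[bytes]


def keyword_tokenize(text: str) -> List[Token]:
    # Stand-in for a keyword tokenizer: the whole input is one token.
    return [Token(text, None)]


def whitespace_tokenize(text: str) -> List[Token]:
    # Stand-in for a whitespace tokenizer: one token per word.
    return [Token(word, None) for word in text.split()]


def delimited_payload(tokens: List[Token], delimiter: str = "^") -> List[Token]:
    # Stand-in for a delimited payload filter with an identity-style encoder:
    # everything after the first delimiter in a token becomes its payload.
    out = []
    for tok in tokens:
        term, sep, raw = tok.term.partition(delimiter)
        out.append(Token(term, raw.encode("utf-8")) if sep else tok)
    return out


def split_subterms(tokens: List[Token]) -> List[Token]:
    # ASSUMPTION: splitting keeps the parent's payload on each sub term.
    return [Token(part, tok.payload) for tok in tokens for part in tok.term.split()]


text = "some text^PAYLOAD"

# Tokenize first, then extract the payload: only the last token carries it.
naive = delimited_payload(whitespace_tokenize(text))

# Extract the payload from the whole string first, then split it up again.
proposed = split_subterms(delimited_payload(keyword_tokenize(text)))

print(naive)
print(proposed)
```

The first chain reproduces the behaviour from the original question (only the 
last token has a payload); the second is the order suggested here.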
 
-----Original message-----
> From:Jamie Johnson <jej2...@gmail.com>
> Sent: Tuesday 25th August 2015 19:37
> To: solr-user@lucene.apache.org
> Subject: Re: Tokenizers and DelimitedPayloadTokenFilterFactory
> 
> To be clear, we are using payloads as a way to attach authorizations to
> individual tokens within Solr.  The payloads are normal Solr payloads,
> though we are not using floats; we are using the identity payload encoder
> (org.apache.lucene.analysis.payloads.IdentityEncoder) which allows for
> storing a byte[] of our choosing into the payload field.
> 
> This works great for text, but now that I'm indexing more than just text I
> need a way to specify the payload on the other field types.  Does that make
> more sense?
> 
> On Tue, Aug 25, 2015 at 12:52 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
> 
> > This really sounds like an XY problem. Or when you use
> > "payload" it's not the Solr payload.
> >
> > So Solr Payloads are a float value that you can attach to
> > individual terms to influence the scoring. Attaching the
> > _same_ payload to all terms in a field is much the same
> > thing as boosting on any matches in the field at query time
> > or boosting on the field at index time (this latter assuming
> > that different docs would have different boosts).
> >
> > So can you back up a bit and tell us what you're trying to
> > accomplish? Then maybe we can be sure we're both talking about
> > the same thing ;)
> >
> > Best,
> > Erick
> >
> > On Tue, Aug 25, 2015 at 9:09 AM, Jamie Johnson <jej2...@gmail.com> wrote:
> > > I would like to specify a particular payload for all tokens emitted from a
> > > tokenizer, but don't see a clear way to do this.  Ideally I could specify
> > > that something like the DelimitedPayloadTokenFilter be run on the entire
> > > field and then standard analysis be done on the rest of the field, so in
> > > the case that I had the following text
> > >
> > > this is a test\Foo
> > >
> > > I would like to create tokens "this", "is", "a", "test", each with a payload
> > > of Foo.  From what I'm seeing, though, only "test" gets the payload.  Is there
> > > any way to accomplish this, or will I need to implement a custom tokenizer?
> >
>
