Re: Tokenizers and DelimitedPayloadTokenFilterFactory

Jamie Johnson Tue, 25 Aug 2015 14:44:22 -0700

We were originally using this approach, i.e. running things through
KeywordTokenizer -> DelimitedPayloadFilter -> WordDelimiterFilter.  Again,
this works fine for text, though I had wanted to use the StandardTokenizer
in the chain.  Is there an equivalent filter that does what the
StandardTokenizer does?
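
For readers following along, here is a rough plain-Java model of what that
chain does to a field value.  It is only a sketch: PayloadChainSketch, Token,
splitPayload, splitWords and analyze are made-up names, not Lucene or Solr
classes, and cutting at the last delimiter is an assumption of the sketch
rather than a statement about how DelimitedPayloadFilter behaves.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the chain; none of these names are Lucene or Solr classes.
class PayloadChainSketch {

    // A term together with the raw payload bytes attached to it (null if none).
    static final class Token {
        final String term;
        final byte[] payload;

        Token(String term, byte[] payload) {
            this.term = term;
            this.payload = payload;
        }
    }

    // Stages 1 and 2: treat the whole field as one token, then cut the payload
    // off at the last delimiter (sketch assumption) and keep its bytes unchanged.
    static Token splitPayload(String field, char delimiter) {
        int idx = field.lastIndexOf(delimiter);
        if (idx < 0) {
            return new Token(field, null);
        }
        byte[] payload = field.substring(idx + 1).getBytes(StandardCharsets.UTF_8);
        return new Token(field.substring(0, idx), payload);
    }

    // Stage 3: split the single token on whitespace; every sub-term carries
    // the same payload as the token it came from.
    static List<Token> splitWords(Token whole) {
        List<Token> out = new ArrayList<>();
        for (String term : whole.term.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                out.add(new Token(term, whole.payload));
            }
        }
        return out;
    }

    // Runs the three modelled stages in order.
    static List<Token> analyze(String field, char delimiter) {
        return splitWords(splitPayload(field, delimiter));
    }
}
```

Calling PayloadChainSketch.analyze("this is a test^Foo", '^') gives four
tokens ("this", "is", "a", "test"), each carrying the bytes of "Foo".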


All of this said, it doesn't address the issue of the primitive field
types, which at this point is the bigger issue.  Given this use case, should
there be another way to provide payloads?

My current thinking is that I will need to provide custom implementations
for all of the field types I would like to support payloads on.  These will
essentially be copies of the standard versions with some extra "sugar" to
read/write the payloads.  I don't see a way to wrap/delegate these at this
point, because AttributeSource has the attribute-retrieval methods declared
final, so I can't simply wrap another tokenizer and return my added
attributes + the wrapped attributes.  I know my use case is a bit strange,
but I had not expected to need to do this, given that Lucene/Solr supports
payloads on these field types; they just aren't exposed.

As always I appreciate any ideas if I'm barking up the wrong tree here.

On Tue, Aug 25, 2015 at 2:52 PM, Markus Jelsma <markus.jel...@openindex.io>
wrote:

> Well, if I remember correctly (I have no testing facility at hand),
> WordDelimiterFilter maintains payloads on emitted sub terms. So if you use
> a KeywordTokenizer, input 'some text^PAYLOAD', and have a
> DelimitedPayloadFilter, the entire string gets a payload. You can then
> split that string up again into individual tokens. It is possible to abuse
> WordDelimiterFilter for this because it has a types parameter that you can
> use to split on whitespace if its input is not trimmed. Otherwise you
> can use any other character instead of a space as your input.
>
> This is a crazy idea, but it might work.
>
> -----Original message-----
> > From:Jamie Johnson <jej2...@gmail.com>
> > Sent: Tuesday 25th August 2015 19:37
> > To: solr-user@lucene.apache.org
> > Subject: Re: Tokenizers and DelimitedPayloadTokenFilterFactory
> >
> > To be clear, we are using payloads as a way to attach authorizations to
> > individual tokens within Solr.  The payloads are normal Solr Payloads
> > though we are not using floats, we are using the identity payload encoder
> > (org.apache.lucene.analysis.payloads.IdentityEncoder) which allows for
> > storing a byte[] of our choosing into the payload field.
> >
> > This works great for text, but now that I'm indexing more than just text, I
> > need a way to specify the payload on the other field types.  Does that make
> > more sense?
> >
> > On Tue, Aug 25, 2015 at 12:52 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> > > This really sounds like an XY problem. Or when you use
> > > "payload" it's not the Solr payload.
> > >
> > > So Solr Payloads are a float value that you can attach to
> > > individual terms to influence the scoring. Attaching the
> > > _same_ payload to all terms in a field is much the same
> > > thing as boosting on any matches in the field at query time
> > > or boosting on the field at index time (this latter assuming
> > > that different docs would have different boosts).
> > >
> > > So can you back up a bit and tell us what you're trying to
> > > accomplish? Then maybe we can be sure we're both talking about
> > > the same thing ;)
> > >
> > > Best,
> > > Erick
> > >
> > > On Tue, Aug 25, 2015 at 9:09 AM, Jamie Johnson <jej2...@gmail.com> wrote:
> > > > I would like to specify a particular payload for all tokens emitted from a
> > > > tokenizer, but don't see a clear way to do this.  Ideally I could specify
> > > > that something like the DelimitedPayloadTokenFilter be run on the entire
> > > > field and then standard analysis be done on the rest of the field, so in
> > > > the case that I had the following text
> > > >
> > > > this is a test\Foo
> > > >
> > > > I would like to create tokens "this", "is", "a", "test", each with a payload
> > > > of Foo.  From what I'm seeing, though, only "test" gets the payload.  Is there
> > > > any way to accomplish this, or will I need to implement a custom tokenizer?
> > >
> >
>
