Thanks for your comments, Yonik!
All for it... depending on what one means by "payload functionality" of course.
We should probably hold off on adding a new lucene version to Solr
until the Payload API has stabilized (it will most likely be changing
very soon).
It sounds like Lucene 2.3 is going to be released soonish
(http://www.nabble.com/How%27s-2.3-doing--tf4802426.html#a13740605). As
best I can tell it will include the Payload stuff marked experimental.
The new Lucene version will have many improvements besides Payloads
which would benefit Solr (examples galore in CHANGES.txt
http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?view=log).
So I find it hard to believe that the new release will not be included.
I recognize that the experimental status would be worrisome. What will
it take to get Payloads to the point where they would be accepted for
use in the Solr community? You probably know more about the projected
changes to the API than I. Care to fill me in or suggest who I should
ask? On the [EMAIL PROTECTED] list Grant Ingersoll
suggested that the Payload object would be done away with and the API
would just deal with byte arrays directly.
That's a lot of data to associate with every token... I wonder how
others have accomplished this?
One could compress it with a dictionary somewhere.
I wonder if one could index special begin_tag and end_tag tokens, and
somehow use span queries?
I agree that is a lot of data to associate with every token - especially
since the data is repetitive in nature. Erik Hatcher suggested storing
a representation of the structure of the document in a separate field,
storing a numeric mapping from each token to that structure as the
token's payload, and then, at query time, looking up the numeric mapping
in the payload at the position hit to get the structure/context back for
the token.
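A minimal sketch of that scheme in plain Java (no Lucene dependency; the class and method names here are hypothetical, invented for illustration): the document's structure paths live once in a separate stored field, and each token's payload carries only a small fixed-width integer index into that table.

```java
// Hypothetical helpers for Erik's scheme: the structure table is stored
// once per document in a separate field, and each token's payload holds
// only an int index into that table, encoded as 4 big-endian bytes.
public class StructurePayload {

    // Encode a structure index as the byte[] that would back the payload.
    public static byte[] encode(int structureIndex) {
        return new byte[] {
            (byte) (structureIndex >>> 24),
            (byte) (structureIndex >>> 16),
            (byte) (structureIndex >>> 8),
            (byte) structureIndex
        };
    }

    // Decode the payload bytes back into the structure index at query time.
    public static int decode(byte[] payload) {
        return ((payload[0] & 0xFF) << 24)
             | ((payload[1] & 0xFF) << 16)
             | ((payload[2] & 0xFF) << 8)
             |  (payload[3] & 0xFF);
    }

    public static void main(String[] args) {
        // The stored "structure" field might hold paths like these; a
        // token's payload then names its position in this table.
        String[] structureTable = { "/article/title", "/article/body/p[1]" };
        byte[] payload = encode(1);
        System.out.println(structureTable[decode(payload)]);
        // prints "/article/body/p[1]"
    }
}
```

The win is that the repetitive data is stored once per document: every token pays only 4 payload bytes instead of repeating its full structural path.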
I'm also wondering how others have accomplished this. Grant Ingersoll
noted that one of the original use cases was XPath queries so I'm
particularly interested in finding out if anyone has implemented that,
and how.
Yes, this will be an issue for many custom tokenizers that don't yet
know about payloads but that create tokens. It's not clear what to do
in some cases when multiple tokens are created from one... should
identical payloads be created for the new tokens... it depends on what
the semantics of those payloads are.
I suppose that it is only fair to take this on a case by case basis.
Maybe we will have to write new TokenFilters for each Tokenizer that
uses Payloads (but I sure hope not!). Maybe we can build some optional
configuration options into the TokenFilter constructors that guide their
behavior with regard to Payloads. Maybe there is something stored in
the TokenStream that dictates how the Payloads are handled by the
TokenFilters. Maybe there is no case where identical payloads would not
be created for new tokens and we can just change the TokenFilter to deal
with payloads directly in a uniform way.
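The "identical payloads for the new tokens" default can be modeled with a toy example (plain Java, not the Lucene API; Tok and splitToken are hypothetical names): when one token is split, each derived token receives a copy of the original payload.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A toy model (not the Lucene API) of the "copy the payload to every
// derived token" default discussed above.
public class PayloadCopyingSplit {

    // Minimal stand-in for a token carrying a byte[] payload.
    static class Tok {
        final String text;
        final byte[] payload;
        Tok(String text, byte[] payload) {
            this.text = text;
            this.payload = payload;
        }
    }

    // Split a hyphenated token into parts; each new token gets an
    // identical copy of the original payload. This is only correct when
    // the payload's semantics apply to every part equally, which is
    // exactly the case-by-case question raised above.
    static List<Tok> splitToken(Tok original) {
        List<Tok> out = new ArrayList<>();
        for (String part : original.text.split("-")) {
            byte[] copy = original.payload == null
                    ? null
                    : Arrays.copyOf(original.payload, original.payload.length);
            out.add(new Tok(part, copy));
        }
        return out;
    }

    public static void main(String[] args) {
        Tok t = new Tok("wi-fi", new byte[] { 0, 0, 0, 7 });
        for (Tok part : splitToken(t)) {
            System.out.println(part.text + " -> " + Arrays.toString(part.payload));
        }
        // prints:
        // wi -> [0, 0, 0, 7]
        // fi -> [0, 0, 0, 7]
    }
}
```

If copying is in fact always the right behavior, this logic could live once in a shared filter base class rather than in every Tokenizer, which is the uniform-handling option floated above.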
Tricia