Thanks for your comments, Yonik!
All for it... depending on what one means by "payload functionality" of course.
We should probably hold off on adding a new lucene version to Solr
until the Payload API has stabilized (it will most likely be changing
very soon).

It sounds like Lucene 2.3 is going to be released soonish (http://www.nabble.com/How%27s-2.3-doing--tf4802426.html#a13740605). As best I can tell, it will include the Payload functionality marked as experimental. The new Lucene version will have many improvements besides Payloads that would benefit Solr (there are examples galore in CHANGES.txt: http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?view=log), so I find it hard to believe that the new release will not be included. I recognize that the experimental status is worrisome, though. What will it take to get Payloads to the point where they would be accepted for use in the Solr community? You probably know more about the projected changes to the API than I do. Care to fill me in, or suggest whom I should ask? On the [EMAIL PROTECTED] list, Grant Ingersoll suggested that the Payload object would be done away with and the API would deal with byte arrays directly.
That's a lot of data to associate with every token... I wonder how
others have accomplished this?
One could compress it with a dictionary somewhere.
I wonder if one could index special begin_tag and end_tag tokens, and
somehow use span queries?

I agree that is a lot of data to associate with every token, especially since the data is repetitive in nature. Erik Hatcher suggested that I store a representation of the structure of the document in a separate field, store a numeric mapping from each token to that structure as the token's payload, and then do a lookup at query time, using the payload at the position hit, to get the structure/context back for the token.
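To make Erik's suggestion concrete, here is a minimal sketch of the lookup scheme in Python. This is just the encoding idea, not the Lucene Payload API; the field name, the paths, and the helper functions are all hypothetical.

```python
import struct

# Hypothetical example: the document stores one list of structure paths in a
# separate field, and each token's payload is just the index of its path in
# that list, packed into a few bytes.
structure_paths = ["/doc/title", "/doc/body/p[1]", "/doc/body/p[2]"]

def encode_payload(path_index):
    """Pack the structure index into a compact 4-byte payload."""
    return struct.pack(">i", path_index)

def decode_payload(payload, paths):
    """At query time, map the payload at a position hit back to its path."""
    (index,) = struct.unpack(">i", payload)
    return paths[index]

# A token inside the first body paragraph carries index 1 as its payload:
payload = encode_payload(1)
print(decode_payload(payload, structure_paths))  # /doc/body/p[1]
```

The point of the indirection is that the repetitive path strings are stored once per document rather than once per token; each token only pays for a small fixed-size payload.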

I'm also wondering how others have accomplished this. Grant Ingersoll noted that one of the original use cases was XPath queries, so I'm particularly interested in finding out whether anyone has implemented that, and how.
Yes, this will be an issue for many custom tokenizers that don't yet
know about payloads but that create tokens.  It's not clear what to do
in some cases when multiple tokens are created from one... should
identical payloads be created for the new tokens... it depends on what
the semantics of those payloads are.

I suppose it is only fair to take this on a case-by-case basis. Maybe we will have to write new TokenFilters for each Tokenizer that uses Payloads (but I sure hope not!). Maybe we can build optional configuration options into the TokenFilter constructors that guide their behavior with regard to Payloads. Maybe something stored in the TokenStream could dictate how Payloads are handled by the TokenFilters. Or maybe there is no case where identical payloads would not be created for new tokens, and we can just change the TokenFilters to deal with payloads directly in a uniform way.
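One of those options, copying the source token's payload to every token produced from it, could be sketched like this. This is a toy model, not the Lucene TokenFilter API; the Token class and split policy here are purely illustrative.

```python
# Hypothetical sketch of one payload-propagation policy: when a filter splits
# a single token into several, duplicate the source payload onto each piece.
# Whether this is correct depends entirely on what the payload means.

class Token:
    def __init__(self, text, payload=None):
        self.text = text
        self.payload = payload

def split_token(token, sep="-"):
    """Split a token on `sep`, copying its payload to each new token."""
    return [Token(part, token.payload) for part in token.text.split(sep)]

parts = split_token(Token("wi-fi", payload=b"\x00\x00\x00\x07"))
print([(t.text, t.payload) for t in parts])
# [('wi', b'\x00\x00\x00\x07'), ('fi', b'\x00\x00\x00\x07')]
```

For a payload that records structural context (as in Erik's scheme), copying is probably right, since both pieces sit in the same place in the document; for a payload that encodes, say, a per-token weight, it might not be.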

Tricia
