Thanks for your comments, Yonik!
All for it... depending on what one means by "payload functionality" of course.
We should probably hold off on adding a new lucene version to Solr
until the Payload API has stabilized (it will most likely be changing
very soon).
It sounds like Lucene 2.3 is going to be released soonish
(http://www.nabble.com/How%27s-2.3-doing--tf4802426.html#a13740605). As
best I can tell it will include the Payload stuff marked experimental.
The new Lucene version will have many improvements besides Payloads
which would benefit Solr (examples galore in CHANGES.txt
http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?view=log).
So I find it hard to believe that the new release will not be included.
I recognize that the experimental status would be worrisome. What will
it take to get Payloads to the point where they would be accepted for
use in the Solr community? You probably know more about the projected
changes to the API than I. Care to fill me in or suggest who I should
ask? On the [EMAIL PROTECTED] list Grant Ingersoll
suggested that the Payload object would be done away with and the API
would just deal with byte arrays directly.
That's a lot of data to associate with every token... I wonder how
others have accomplished this?
One could compress it with a dictionary somewhere.
I wonder if one could index special begin_tag and end_tag tokens, and
somehow use span queries?
I agree that is a lot of data to associate with every token - especially
since the data is repetitive in nature. Erik Hatcher suggested storing
a representation of the structure of the document in a separate field,
storing a numeric mapping from each token to that structure as the
token's payload, and then, at query time, looking up the numeric mapping
in the payload at the position hit to get the structure/context back for
the token.
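A minimal sketch of that scheme in plain Java (no Lucene dependency; the class and method names here are hypothetical, invented for illustration): the document's structure paths live once in a separate stored field, and each token's payload carries only a small fixed-width integer index into that table.

```java
// Hypothetical helpers for Erik's scheme: the structure table is stored
// once per document in a separate field, and each token's payload holds
// only an int index into that table, encoded as 4 big-endian bytes.
public class StructurePayload {

    // Encode a structure index as the byte[] that would back the payload.
    public static byte[] encode(int structureIndex) {
        return new byte[] {
            (byte) (structureIndex >>> 24),
            (byte) (structureIndex >>> 16),
            (byte) (structureIndex >>> 8),
            (byte) structureIndex
        };
    }

    // Decode the payload bytes back into the structure index at query time.
    public static int decode(byte[] payload) {
        return ((payload[0] & 0xFF) << 24)
             | ((payload[1] & 0xFF) << 16)
             | ((payload[2] & 0xFF) << 8)
             |  (payload[3] & 0xFF);
    }

    public static void main(String[] args) {
        // The stored "structure" field might hold paths like these; a
        // token's payload then names its position in this table.
        String[] structureTable = { "/article/title", "/article/body/p[1]" };
        byte[] payload = encode(1);
        System.out.println(structureTable[decode(payload)]);
        // prints "/article/body/p[1]"
    }
}
```

The win is that the repetitive data is stored once per document: every token pays only 4 payload bytes instead of repeating its full structural path.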
I'm also wondering how others have accomplished this. Grant Ingersoll
noted that one of the original use cases was XPath queries so I'm
particularly interested in finding out if anyone has implemented that,
and how.
Yes, this will be an issue for many custom tokenizers that don't yet
know about payloads but that create tokens. It's not clear what to do
in some cases when multiple tokens are created from one... should
identical payloads be created for the new tokens... it depends on what
the semantics of those payloads are.
I suppose that it is only fair to take this on a case by case basis.
Maybe we will have to write new TokenFilters for each Tokenizer that
uses Payloads (but I sure hope not!). Maybe we can build some optional
configuration options into the TokenFilter constructors that guide their
behavior with regard to Payloads. Maybe there is something stored in
the TokenStream that dictates how the Payloads are handled by the
TokenFilters. Maybe there is no case where identical payloads would not
be created for new tokens and we can just change the TokenFilter to deal
with payloads directly in a uniform way.
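The "identical payloads for the new tokens" default can be modeled with a toy example (plain Java, not the Lucene API; Tok and splitToken are hypothetical names): when one token is split, each derived token receives a copy of the original payload.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A toy model (not the Lucene API) of the "copy the payload to every
// derived token" default discussed above.
public class PayloadCopyingSplit {

    // Minimal stand-in for a token carrying a byte[] payload.
    static class Tok {
        final String text;
        final byte[] payload;
        Tok(String text, byte[] payload) {
            this.text = text;
            this.payload = payload;
        }
    }

    // Split a hyphenated token into parts; each new token gets an
    // identical copy of the original payload. This is only correct when
    // the payload's semantics apply to every part equally, which is
    // exactly the case-by-case question raised above.
    static List<Tok> splitToken(Tok original) {
        List<Tok> out = new ArrayList<>();
        for (String part : original.text.split("-")) {
            byte[] copy = original.payload == null
                    ? null
                    : Arrays.copyOf(original.payload, original.payload.length);
            out.add(new Tok(part, copy));
        }
        return out;
    }

    public static void main(String[] args) {
        Tok t = new Tok("wi-fi", new byte[] { 0, 0, 0, 7 });
        for (Tok part : splitToken(t)) {
            System.out.println(part.text + " -> " + Arrays.toString(part.payload));
        }
        // prints:
        // wi -> [0, 0, 0, 7]
        // fi -> [0, 0, 0, 7]
    }
}
```

If copying is in fact always the right behavior, this logic could live once in a shared filter base class rather than in every Tokenizer, which is the uniform-handling option floated above.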
Tricia