Hi All,
I was wondering how Solr people feel about the inclusion of Payload
functionality in the Solr codebase?
From a recent message to the [EMAIL PROTECTED] mailing list:
I'm working on https://issues.apache.org/jira/browse/SOLR-380, a
feature request for indexing a "Structured Document" (anything that
can be represented as XML) so that hits in the result set carry more
context. This lets us query the index for "Canada" and report not
only that the query matched a document titled "Some Nonsense" but
also that the query term appeared on page 7 of chapter 1. We can then
take this one step further and mark up or highlight the image of that
page based on our OCR data and the hit position.
For example:
<book title='Some Nonsense'><chapter title='One'><page name='1'>Some
text from page one of a book.</page><page name='7'>Some more text from
page seven of a book. Oh and I'm from Canada.</page></chapter></book>
I accomplished this by creating a custom Tokenizer which strips the
XML elements and stores them as a Payload on each of the Tokens
created from the character data in the input. The payload is a
string that describes the XPath at that location. So for the token
<Canada> the payload is
"/book[title='Some Nonsense']/chapter[title='One']/page[name='7']"
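
To make the idea concrete, here is a minimal sketch of how path
payloads could be attached to tokens. It is not the actual Tokenizer
from the patch and does not use the Lucene API: the class
XmlPathPayloadSketch, the PathToken type, and the whitespace-only
word splitting are all assumptions made for illustration. It reads
the XML with the JDK's StAX parser and keeps a stack of element
steps.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical stand-in for the custom Tokenizer; plain JDK, no Lucene.
class XmlPathPayloadSketch {

    // One token: position, term text, and the path string used as payload.
    static final class PathToken {
        final int position;
        final String text;
        final String payload;

        PathToken(int position, String text, String payload) {
            this.position = position;
            this.text = text;
            this.payload = payload;
        }

        @Override
        public String toString() {
            return "{" + position + ",<" + text + ">,<" + payload + ">}";
        }
    }

    // Strips the markup and tags every whitespace-separated word with the
    // path of the element that encloses it.
    static List<PathToken> tokenize(String xml) {
        List<PathToken> tokens = new ArrayList<>();
        Deque<String> path = new ArrayDeque<>();
        int position = 0;
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Merge adjacent character chunks so words are not split.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    // Build one path step, e.g. page[name='7'].
                    StringBuilder step = new StringBuilder(reader.getLocalName());
                    for (int i = 0; i < reader.getAttributeCount(); i++) {
                        step.append('[').append(reader.getAttributeLocalName(i))
                            .append("='").append(reader.getAttributeValue(i))
                            .append("']");
                    }
                    path.addLast(step.toString());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    path.removeLast();
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    String payload = "/" + String.join("/", path);
                    for (String word : reader.getText().trim().split("\\s+")) {
                        if (!word.isEmpty()) {
                            tokens.add(new PathToken(++position, word, payload));
                        }
                    }
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String xml = "<book title='Some Nonsense'><chapter title='One'>"
                + "<page name='7'>I'm from Canada.</page></chapter></book>";
        for (PathToken token : tokenize(xml)) {
            System.out.println(token);
        }
    }
}
```

In a real Tokenizer the path string would be encoded to bytes and set
as the Token's payload; the sketch keeps it as a String to stay
self-contained.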
The other part of this work is the SolrHighlighter, which is less
important to this list. I retrieve the TermPositions for the Query's
Terms and use them to get back the payload for each hit, then build
output which shows hit positions categorized by the payload they are
associated with.
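
The grouping step can be sketched as follows. This is only an
illustration in plain Java: HitsByPayloadSketch and its Hit type are
hypothetical names, and the hits are assumed to have already been
read from the index together with their decoded payload strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the highlighter's grouping step; no Lucene.
class HitsByPayloadSketch {

    // A single hit: the matched term, its position, and its decoded payload.
    static final class Hit {
        final String term;
        final int position;
        final String payload;

        Hit(String term, int position, String payload) {
            this.term = term;
            this.position = position;
            this.payload = payload;
        }
    }

    // Collects hit positions under the payload (path) they belong to,
    // keeping payloads in the order they were first seen.
    static Map<String, List<Integer>> groupByPayload(List<Hit> hits) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Hit hit : hits) {
            grouped.computeIfAbsent(hit.payload, key -> new ArrayList<>())
                   .add(hit.position);
        }
        return grouped;
    }

    public static void main(String[] args) {
        String page7 =
            "/book[title='Some Nonsense']/chapter[title='One']/page[name='7']";
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("canada", 22, page7));
        groupByPayload(hits).forEach((payload, positions) ->
            System.out.println(payload + " -> " + positions));
    }
}
```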
Using Payloads requires me to include lucene-core-2.3-dev.jar which
might be a barrier. Also, using my Tokenizer with Solr-specific
TokenFilter(s) loses the Payload at modified tokens. I probably
shouldn't generalize this but I suspect it is true. My only issue has
come from the WordDelimiterFilter so far.
In the following example I will denote a token by {pos,<term
text>,<payload>}:
input: <class name='mammalia'>Dog, and Cat</class>
XmlPayloadTokenizer:
{1,<Dog,>,</class[name='mammalia'][startPos='0']>},{2,<and>,</class[name='mammalia'][startPos='0']>},{3,<Cat>,</class[name='mammalia'][startPos='0']>}
StopFilter:
{1,<Dog,>,</class[name='mammalia'][startPos='0']>},{2,<Cat>,</class[name='mammalia'][startPos='0']>}
WordDelimiterFilter:
{1,<Dog>,<>},{2,<Cat>,</class[name='mammalia'][startPos='0']>}
LowerCaseFilter:
{1,<dog>,<>},{2,<cat>,</class[name='mammalia'][startPos='0']>}
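
The trace above is what one would see if a filter builds a fresh
token for any text it modifies and does not copy the payload across.
Whether that is what WordDelimiterFilter actually does is an
assumption here, not something checked against its source. A minimal
sketch of the difference, using a hypothetical Tok type and a
simplified punctuation-stripping filter rather than the Lucene or
Solr classes:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a token-modifying filter; not Solr's implementation.
class PayloadPreservingFilterSketch {

    // Simplified token; a null payload means the payload was lost.
    static final class Tok {
        final int position;
        final String text;
        final String payload;

        Tok(int position, String text, String payload) {
            this.position = position;
            this.text = text;
            this.payload = payload;
        }
    }

    // Strips punctuation from each token. Unmodified tokens pass through
    // untouched. Modified tokens are rebuilt; the payload survives only
    // when copyPayload is true.
    static List<Tok> stripPunctuation(List<Tok> input, boolean copyPayload) {
        List<Tok> output = new ArrayList<>();
        for (Tok tok : input) {
            String cleaned = tok.text.replaceAll("\\p{Punct}", "");
            if (cleaned.equals(tok.text)) {
                output.add(tok);
            } else if (!cleaned.isEmpty()) {
                String payload = copyPayload ? tok.payload : null;
                output.add(new Tok(tok.position, cleaned, payload));
            }
        }
        return output;
    }

    public static void main(String[] args) {
        String path = "/class[name='mammalia'][startPos='0']";
        List<Tok> input = new ArrayList<>();
        input.add(new Tok(1, "Dog,", path));
        input.add(new Tok(2, "Cat", path));
        for (Tok tok : stripPunctuation(input, true)) {
            System.out.println(tok.position + " " + tok.text + " " + tok.payload);
        }
    }
}
```

With copyPayload set to false the first token comes out as "Dog" with
no payload, matching the trace; with it set to true both tokens keep
their path.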
Should I create a JIRA issue about the Filters and post a patch?
Thanks,
Tricia