Payloads in Solr

Tricia Williams Sat, 17 Nov 2007 11:19:23 -0800

Hi All,

I was wondering how Solr people feel about the inclusion of Payload functionality in the Solr codebase?


   From a recent message to the [EMAIL PROTECTED] mailing list:

I'm working on the issue https://issues.apache.org/jira/browse/SOLR-380 which is a feature request that allows one to index a "Structured Document" which is anything that can be represented by XML in order to provide more context to hits in the result set. This allows us to do things like query the index for "Canada" and be able to not only say that that query matched a document titled "Some Nonsense" but also that the query term appeared on page 7 of chapter 1. We can then take this one step further and markup/highlight the image of this page based on our OCR and position hit.
For example:
<book title='Some Nonsense'><chapter title='One'><page name='1'>Some text from page one of a book.</page><page name='7'>Some more text from page seven of a book. Oh and I'm from Canada.</page></chapter></book>
I accomplished this by creating a custom Tokenizer which strips the XML elements and stores them as a Payload at each of the Tokens created from the character data in the input. The payload is the string that describes the XPath at that location. So for <Canada> the payload is "/book[title='Some Nonsense']/chapter[title='One']/page[name='7']".
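The idea can be sketched without any Lucene classes. The following is only an illustration under stated assumptions, not the Tokenizer from SOLR-380: the class XPathPayloadSketch, the PayloadToken type and the tokenize method are hypothetical names, the XML is read with the JDK's StAX parser, and tokens are split on whitespace only. It keeps a stack of element steps and attaches the joined path to every token produced from character data.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class XPathPayloadSketch {

    /** Hypothetical stand-in for a Lucene Token that carries a payload. */
    public static final class PayloadToken {
        public final int position;
        public final String term;
        public final String payload;

        public PayloadToken(int position, String term, String payload) {
            this.position = position;
            this.term = term;
            this.payload = payload;
        }

        @Override
        public String toString() {
            return "{" + position + ",<" + term + ">,<" + payload + ">}";
        }
    }

    /** Strips the markup and tags each whitespace-separated term with its element path. */
    public static List<PayloadToken> tokenize(String xml) {
        List<PayloadToken> tokens = new ArrayList<>();
        Deque<String> path = new ArrayDeque<>(); // one entry per open element, root first
        int position = 0;
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Coalescing delivers adjacent text as one CHARACTERS event, so words are not split.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    // Build a step such as page[name='7'] from the element and its attributes.
                    StringBuilder step = new StringBuilder(reader.getLocalName());
                    for (int i = 0; i < reader.getAttributeCount(); i++) {
                        step.append('[').append(reader.getAttributeLocalName(i))
                            .append("='").append(reader.getAttributeValue(i)).append("']");
                    }
                    path.addLast(step.toString());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    path.removeLast();
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    String payload = "/" + String.join("/", path);
                    for (String term : reader.getText().trim().split("\\s+")) {
                        if (!term.isEmpty()) {
                            tokens.add(new PayloadToken(++position, term, payload));
                        }
                    }
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML input", e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String xml = "<book title='Some Nonsense'><chapter title='One'>"
                + "<page name='7'>Oh and I'm from Canada.</page></chapter></book>";
        for (PayloadToken token : tokenize(xml)) {
            System.out.println(token);
        }
    }
}
```

In a real Tokenizer the path string would presumably be encoded as bytes and set as the Token's payload; that step is left out here because it depends on the Lucene version in use.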
The other part of this work is the SolrHighlighter which is less important to this list. I retrieve the TermPositions for the Query's Terms and use the TermPosition functionality to get back the payload for the hits and build output which shows hit positions categorized by the payload they are associated with.
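The grouping step on its own might look roughly like the sketch below. It assumes the hit positions and their payload strings have already been read from the index; HitsByPayloadSketch, Hit and groupByPayload are hypothetical names, and no Lucene TermPositions calls are shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HitsByPayloadSketch {

    /** Hypothetical record of one query-term hit: its position and decoded payload. */
    public static final class Hit {
        public final int position;
        public final String payload;

        public Hit(int position, String payload) {
            this.position = position;
            this.payload = payload;
        }
    }

    /** Groups hit positions by payload, keeping payloads in first-seen order. */
    public static Map<String, List<Integer>> groupByPayload(List<Hit> hits) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Hit hit : hits) {
            grouped.computeIfAbsent(hit.payload, key -> new ArrayList<>()).add(hit.position);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(12, "/book[title='Some Nonsense']/chapter[title='One']/page[name='7']"));
        hits.add(new Hit(20, "/book[title='Some Nonsense']/chapter[title='One']/page[name='7']"));
        for (Map.Entry<String, List<Integer>> entry : groupByPayload(hits).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```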

Using Payloads requires me to include lucene-core-2.3-dev.jar which might be a barrier. Also, using my Tokenizer with Solr-specific TokenFilter(s) loses the Payload at modified tokens. I probably shouldn't generalize this but I suspect it is true. My only issue has come from the WordDelimiterFilter so far.

In the following example I will denote a token by {pos,<termtext>,<payload>}:
input: <class name='mammalia'>Dog, and Cat</class>

XmlPayloadTokenizer:
{1,<Dog,>,</class[name='mammalia'][startPos='0']>},{2,<and>,</class[name='mammalia'][startPos='0']>},{3,<Cat>,</class[name='mammalia'][startPos='0']>}
StopFilter:
{1,<Dog,>,</class[name='mammalia'][startPos='0']>},{2,<Cat>,</class[name='mammalia'][startPos='0']>}
WordDelimiterFilter:
{1,<Dog>,<>},{2,<Cat>,</class[name='mammalia'][startPos='0']>}
LowerCaseFilter:
{1,<dog>,<>},{2,<cat>,</class[name='mammalia'][startPos='0']>}
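
One plausible explanation for the behaviour above, which is an assumption and not something confirmed by looking at WordDelimiterFilter itself, is that a filter which changes the term text builds a new token and does not copy the payload across. The sketch below models that with plain Java objects; PayloadPreservingFilterSketch, Tok and both strip methods are hypothetical and are not Solr code.

```java
import java.util.ArrayList;
import java.util.List;

public class PayloadPreservingFilterSketch {

    /** Hypothetical stand-in for a token with a position, term text and payload. */
    public static final class Tok {
        public final int position;
        public final String term;
        public final String payload; // null models a missing payload

        public Tok(int position, String term, String payload) {
            this.position = position;
            this.term = term;
            this.payload = payload;
        }
    }

    /** Strips trailing punctuation; modified tokens are rebuilt WITHOUT their payload. */
    public static List<Tok> stripPunctuationLossy(List<Tok> input) {
        return strip(input, false);
    }

    /** Same filter, but modified tokens keep the payload of the token they replace. */
    public static List<Tok> stripPunctuationPreserving(List<Tok> input) {
        return strip(input, true);
    }

    private static List<Tok> strip(List<Tok> input, boolean keepPayload) {
        List<Tok> output = new ArrayList<>();
        for (Tok token : input) {
            String cleaned = token.term.replaceAll("\\p{Punct}+$", "");
            if (cleaned.equals(token.term)) {
                output.add(token); // untouched tokens pass through as-is
            } else {
                output.add(new Tok(token.position, cleaned, keepPayload ? token.payload : null));
            }
        }
        return output;
    }

    public static void main(String[] args) {
        String payload = "/class[name='mammalia'][startPos='0']";
        List<Tok> input = new ArrayList<>();
        input.add(new Tok(1, "Dog,", payload));
        input.add(new Tok(2, "Cat", payload));
        for (Tok token : stripPunctuationPreserving(input)) {
            System.out.println(token.position + " " + token.term + " " + token.payload);
        }
    }
}
```

If that assumption holds, a patch to the affected filters would amount to the one-line difference between the two variants: carrying the original payload over whenever a replacement token is created.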

   Should I create a JIRA issue about the Filters and post a patch?

Thanks,
Tricia
