(I'm taking this discussion to solr-user, as Mike Klaas suggested; sorry
for using JIRA for it. Previous discussion is at
https://issues.apache.org/jira/browse/SOLR-380).

I think the requirements I mentioned in a comment
(https://issues.apache.org/jira/browse/SOLR-380#action_12535296) justify
abandoning the one-page-per-document approach. The increment-gap
approach would break cross-page searching, and would involve about as
much work as the stored map, since the gap would have to vary depending
on the number of terms on each page, wouldn't it? (If there are 100
terms on page one, the gap has to be 900 to get page two to start at
position 1000 - or can you specify the absolute position you want for a
term?)
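To make that arithmetic concrete, here's a rough sketch (a hypothetical helper, not Solr API) of what the variable gap would have to look like, assuming each page gets a block of 1000 positions and holds fewer than 1000 terms:

```python
def page_start_positions(terms_per_page, page_size=1000):
    """For each page, compute its starting term position and the
    position-increment gap needed after it so the next page starts
    at the next multiple of page_size. Assumes each page has fewer
    than page_size terms."""
    starts, gaps = [], []
    pos = 0
    for n in terms_per_page:
        starts.append(pos)
        gaps.append(page_size - n)   # e.g. 100 terms -> gap of 900
        pos += page_size             # next page starts at the next multiple
    return starts, gaps
```

So a 100-term first page needs a gap of 900, a 250-term second page a gap of 750, and so on - the gap is a per-page computation, not a fixed schema setting.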

I think the problem of indexing books (or any text with arbitrary
subdivisions) is common enough that a generic approach like this would
be useful to more people than just me, and justifies some enhancements
within Solr to make the solution easy to reuse; but maybe once we've
figured out the best approach it will become clear how much of it is
worth packaging into Solr.

Assuming the two-field approach works
(https://issues.apache.org/jira/browse/SOLR-380#action_12535755), then
what we're really talking about is two things: a token filter to
generate and store the map, and a process like the highlighter to
generate the output. Suppose the map is stored in tokens with the
starting term position for each page, like this:

0:1
345:2
827:3

The output function would then imitate the highlighter to discover term
positions, use the map (either by loading all its terms or by doing
lookups) to convert them to page positions, and generate the appropriate
output. I'm not clear where that output process should live, but we can
just imitate the highlighter.
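As an illustrative sketch of that conversion step (plain Python, not tied to any Solr API; the map format is the start:page tokens above), a binary search over the starting positions finds the page for any hit position:

```python
import bisect

def parse_page_map(tokens):
    """Parse map tokens of the form 'start:page' (e.g. '0:1', '345:2')."""
    pairs = sorted(tuple(map(int, t.split(":"))) for t in tokens)
    starts = [s for s, _ in pairs]
    pages = [p for _, p in pairs]
    return starts, pages

def page_for_position(starts, pages, pos):
    """Return the page whose starting term position is the largest
    one that is <= pos (binary search via bisect)."""
    i = bisect.bisect_right(starts, pos) - 1
    return pages[i]
```

Loading all the map terms up front would work like this; doing individual lookups against the index instead would trade memory for per-hit term lookups.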

(and just to clarify roles: Tricia's the one who'll actually be coding
this, if it's feasible; I'm just helping to think out requirements and
approaches based on a project in hand.)

Peter


-----Original Message-----
From: Mike Klaas (JIRA) [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 17, 2007 4:26 PM
To: Binkley, Peter
Subject: [jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".


    [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535768 ]

Mike Klaas commented on SOLR-380:
---------------------------------

In my opinion the best solution is to create one solr document per page
and denormalize the container data across each page.

If I had to implement it the other way, I would probably index the pages
as a multivalued field with a large position increment gap (say 1000),
store term vectors, and use the position information from the term
vectors to determine the page hits (e.g., pos 4668 -> page 5; pos 668 ->
page 1; pos 9999 -> page 10).  Assumes < 1000 tokens per page, of
course.
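With a fixed gap of 1000, the position-to-page conversion is just integer division; a minimal sketch (illustrative only, not Solr code, assuming positions start at 0 and page one occupies positions 0-999):

```python
def page_for_pos(pos, page_size=1000):
    """With a fixed increment gap, page boundaries fall at multiples
    of page_size, so the page number is just integer division."""
    return pos // page_size + 1
```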

Incidentally, this discussion doesn't really belong here.  It would be
better to sketch out ideas on solr-user, then move to JIRA to track a
resulting patch (if it gets that far).  I actually don't think that
there is anything to add to Solr here--it seems more of a question of
customization.



> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>
> "Paged-Text" FieldType for Solr
> A chance to dig into the guts of Solr. The problem: If we index a
monograph in Solr, there's no way to convert search results into
page-level hits. The solution: have a "paged-text" fieldtype which keeps
track of page divisions as it indexes, and reports page-level hits in
the search results.
> The input would contain page milestones: <page id="234"/>. As Solr
processed the tokens (using its standard tokenizers and filters), it
would concurrently build a structural map of the item, indicating which
term position marked the beginning of which page: <page id="234"
firstterm="14324"/>. This map would be stored in an unindexed field in
some efficient format.
> At search time, Solr would retrieve term positions for all hits that
are returned in the current request, and use the stored map to determine
page ids for each term position. The results would imitate the results
for highlighting, something like:
> <lst name="pages">
>   <lst name="doc1">
>     <int name="pageid">234</int>
>     <int name="pageid">236</int>
>   </lst>
>   <lst name="doc2">
>     <int name="pageid">19</int>
>   </lst>
> </lst>
> <lst name="hitpos">
>   <lst name="doc1">
>     <lst name="234">
>       <int name="pos">14325</int>
>     </lst>
>   </lst>
>   ...
> </lst>

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
