(I'm taking this discussion to solr-user, as Mike Klaas suggested; sorry for using JIRA for it. Previous discussion is at https://issues.apache.org/jira/browse/SOLR-380).
I think the requirements I mentioned in a comment (https://issues.apache.org/jira/browse/SOLR-380#action_12535296) justify abandoning the one-page-per-document approach. The increment-gap approach would break cross-page searching, and would involve about as much work as the stored map, since the gap would have to vary depending on the number of terms on each page, wouldn't it? (If there are 100 terms on page one, the gap has to be 900 to get page two to start at position 1000 - or can you specify the absolute position you want for a term?)

I think the problem of indexing books (or any text with arbitrary subdivisions) is common enough that a generic approach like this would be useful to more people than just me, and justifies some enhancements within Solr to make the solution easy to reuse; but maybe once we've figured out the best approach it will become clear how much of it is worth packing into Solr.

Assuming the two-field approach works (https://issues.apache.org/jira/browse/SOLR-380#action_12535755), what we're really talking about is two things: a token filter to generate and store the map, and a process like the highlighter to generate the output. Suppose the map is stored in tokens giving the starting term position of each page, like this:

0:1 345:2 827:3

The output function would then imitate the highlighter to discover term positions, use the map (either by loading all its terms or by doing lookups) to convert them to page positions, and generate the appropriate output. I'm not clear where that output process should live, but we can just imitate the highlighter.

(And just to clarify roles: Tricia's the one who'll actually be coding this, if it's feasible; I'm just helping to think out requirements and approaches based on a project in hand.)
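To make the lookup step concrete, here's a rough sketch in plain Java (no Solr/Lucene APIs - the PageMap class name is just illustrative, and the stored-map format is the "0:1 345:2 827:3" style above): a binary search over the page start positions converts a hit's term position into a page id.

```java
import java.util.Arrays;

public class PageMap {
    // Parallel arrays: starts[i] is the first term position on page pageIds[i].
    // Built from a stored map string like "0:1 345:2 827:3".
    private final int[] starts;
    private final int[] pageIds;

    public PageMap(String stored) {
        String[] entries = stored.trim().split("\\s+");
        starts = new int[entries.length];
        pageIds = new int[entries.length];
        for (int i = 0; i < entries.length; i++) {
            String[] parts = entries[i].split(":");
            starts[i] = Integer.parseInt(parts[0]);
            pageIds[i] = Integer.parseInt(parts[1]);
        }
    }

    // Returns the page whose start position is the largest one <= pos.
    public int pageFor(int pos) {
        int idx = Arrays.binarySearch(starts, pos);
        if (idx < 0) idx = -idx - 2; // not an exact start: insertion point minus one
        return pageIds[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        PageMap map = new PageMap("0:1 345:2 827:3");
        System.out.println(map.pageFor(14325)); // prints 3 given this map
    }
}
```

The same logic would work whether the map is loaded whole or queried entry by entry; binary search just assumes the entries are stored in position order.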
Peter

-----Original Message-----
From: Mike Klaas (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 17, 2007 4:26 PM
To: Binkley, Peter
Subject: [jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

[ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535768 ]

Mike Klaas commented on SOLR-380:
---------------------------------

In my opinion the best solution is to create one Solr document per page and denormalize the container data across each page.

If I had to implement it the other way, I would probably index the pages as a multivalued field with a large position increment gap (say 1000), store term vectors, and use the position information from the term vectors to determine the page hits (e.g., pos 4668 -> page 5; pos 668 -> page 1; pos 9999 -> page 10). This assumes < 1000 tokens per page, of course.

Incidentally, this discussion doesn't really belong here. It would be better to sketch out ideas on solr-user, then move to JIRA to track a resulting patch (if it gets that far). I actually don't think that there is anything to add to Solr here - it seems more a question of customization.

> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>
> Key: SOLR-380
> URL: https://issues.apache.org/jira/browse/SOLR-380
> Project: Solr
> Issue Type: New Feature
> Components: search
> Reporter: Tricia Williams
> Priority: Minor
>
> "Paged-Text" FieldType for Solr
>
> A chance to dig into the guts of Solr. The problem: if we index a monograph in Solr, there's no way to convert search results into page-level hits. The solution: have a "paged-text" fieldtype which keeps track of page divisions as it indexes, and reports page-level hits in the search results.
> The input would contain page milestones: <page id="234"/>. As Solr processed the tokens (using its standard tokenizers and filters), it would concurrently build a structural map of the item, indicating which term position marked the beginning of which page: <page id="234" firstterm="14324"/>. This map would be stored in an unindexed field in some efficient format.
>
> At search time, Solr would retrieve term positions for all hits that are returned in the current request, and use the stored map to determine page ids for each term position. The results would imitate the results for highlighting, something like:
>
> <lst name="pages">
>   <lst name="doc1">
>     <int name="pageid">234</int>
>     <int name="pageid">236</int>
>   </lst>
>   <lst name="doc2">
>     <int name="pageid">19</int>
>   </lst>
> </lst>
> <lst name="hitpos">
>   <lst name="doc1">
>     <lst name="234">
>       <int name="pos">14325</int>
>     </lst>
>   </lst>
>   ...
> </lst>

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
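P.S. Just to round out the picture, the indexing-side step described in the issue (building the map as the tokens stream past) could be sketched in plain Java along these lines. This is only an illustration, independent of Lucene's actual analysis API: it makes the simplifying assumptions that terms are whitespace-separated and that each <page id="..."/> milestone stands alone as a token; the PageMapBuilder name and buildMap method are invented for the sketch, and the output uses the "position:pageid" map format discussed above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PageMapBuilder {
    private static final Pattern MILESTONE = Pattern.compile("<page id=\"([^\"]+)\"/>");

    // Walks the input, treating whitespace-separated words as terms and
    // <page id="..."/> milestones as page boundaries; returns space-separated
    // entries of the form "firstTermPosition:pageId".
    public static String buildMap(String text) {
        StringBuilder map = new StringBuilder();
        int pos = 0; // position the next real term will occupy
        for (String chunk : text.trim().split("\\s+")) {
            Matcher m = MILESTONE.matcher(chunk);
            if (m.matches()) {
                // Page boundary: record where this page's first term will land.
                if (map.length() > 0) map.append(' ');
                map.append(pos).append(':').append(m.group(1));
            } else {
                pos++; // an ordinary term consumes one position
            }
        }
        return map.toString();
    }

    public static void main(String[] args) {
        String text = "<page id=\"234\"/> some page text <page id=\"235\"/> more text";
        System.out.println(buildMap(text)); // prints "0:234 3:235"
    }
}
```

In a real implementation this counting would presumably live in the token filter Peter mentions, tracking the position increments the analyzer actually assigns rather than naively counting whitespace-split words.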