You could index both pages and chapters, with a type field to distinguish them. Or you could index by chapter, with the page number as a payload on each token.
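[A minimal sketch of the first option, as a non-authoritative illustration. The field names (`type`, `book`, `chapter`, `page`, `text`) and document IDs are hypothetical, not from the thread; both granularities live in one collection and a filter query selects the level you want:]

```
# Hypothetical documents indexed into a single collection
# (field names are illustrative only):
[
  { "id": "b1_c1",    "type": "chapter", "book": "Moby Dick",
    "chapter": 1, "text": "full chapter text ..." },
  { "id": "b1_c1_p4", "type": "page",    "book": "Moby Dick",
    "chapter": 1, "page": 4, "text": "single page text ..." }
]

# Phrase search against chapters (so phrases spanning a page break match):
#   q=text:"broken sentence"&fq=type:chapter
# Then fetch the displayable pages for a chapter the user selected:
#   q=text:broken OR text:sentence
#   fq=type:page AND book:"Moby Dick" AND chapter:1
```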
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Mar 1, 2016, at 5:50 AM, Zaccheo Bagnati <zacch...@gmail.com> wrote:
>
> Thank you, Jack, for your answer.
> There are two reasons:
> 1. The requirement is to show both books and chapters, grouped, in the
>    result list. So I would have to run the query grouping by book,
>    retrieve the first, say, 10 books (sorted by relevance), and then for
>    each book repeat the query grouping by chapter (again ordered by
>    relevance) to obtain what we need (unfortunately it is not up to me to
>    define the requirements... but it does make sense). Unless there is
>    some Solr feature that can do this in a single call (that would be
>    great!).
> 2. Searching on pages will not match phrases that span two pages (e.g. if
>    the last word of page 1 is "broken" and the first word of page 2 is
>    "sentence", searching for "broken sentence" will not match).
> However, if we do not find a better solution, your proposal is not so
> bad... I hope that reason #2 is negligible and that #1 performs
> reasonably fast even though we are multiplying queries.
>
> On Tue, Mar 1, 2016 at 14:28, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
>> Any reason not to use the simplest structure - each page is one Solr
>> document with a book field, a chapter field, and a page text field? You
>> can then use grouping to group results by book (title text) or even by
>> chapter (title text and/or number). Maybe initially group by book and
>> then, if the user selects a book group, re-query with the specific book
>> and group by chapter.
>>
>> -- Jack Krupansky
>>
>> On Tue, Mar 1, 2016 at 8:08 AM, Zaccheo Bagnati <zacch...@gmail.com>
>> wrote:
>>
>>> Original data is quite well structured: it comes in XML, with chapters
>>> and with tags marking the original page breaks of the paper version.
>>> This way we can restructure it almost however we want before building
>>> the Solr index.
>>>
>>> On Tue, Mar 1, 2016 at 14:04, Jack Krupansky
>>> <jack.krupan...@gmail.com> wrote:
>>>
>>>> To start, what is the form of your input data - is it already divided
>>>> into chapters and pages? Or are you starting with raw PDF files?
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati <zacch...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>> I'm looking for ideas on how to define the schema and how to perform
>>>>> queries for this use case: we have to index books; each book is
>>>>> split into chapters, and chapters are split into pages (pages
>>>>> reflect the original page breaks of the printed version). We need to
>>>>> show results grouped by book, by chapter (within the same book), and
>>>>> by page (within the same chapter). As far as I know, we have two
>>>>> options:
>>>>>
>>>>> 1. Index pages as Solr documents. We could then theoretically
>>>>>    retrieve chapters (and books?) using grouping, but:
>>>>>    a. we will miss matches that span two contiguous pages (page
>>>>>       breaks exist only for typographical reasons, so concepts can
>>>>>       be split across them, as in printed books);
>>>>>    b. I don't know whether Solr can group results on two different
>>>>>       levels (books and chapters).
>>>>>
>>>>> 2. Index chapters as Solr documents. In this case we get the right
>>>>>    matches, but how do we obtain the matching pages? (We need pages
>>>>>    because the client can only display pages.)
>>>>>
>>>>> We have been struggling with this problem for a long time and have
>>>>> not been able to find a suitable solution, so I'm asking whether
>>>>> anyone has ideas or has already solved a similar problem.
>>>>> Thanks
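[Jack's page-per-document approach maps onto Solr's result grouping. A sketch of the two-pass flow he describes, using standard `group`, `group.field`, and `group.limit` parameters; the query term, field names, and book title are illustrative assumptions, not from the thread:]

```
# First pass: top 10 books by relevance, showing the best page per book
#   q=text:whale
#   group=true&group.field=book&group.limit=1
#   rows=10&sort=score desc

# Second pass, after the user selects a book: chapters within that book,
# with a few matching pages per chapter
#   q=text:whale
#   fq=book:"Moby Dick"
#   group=true&group.field=chapter&group.limit=3
```

[This keeps each request to a single grouping level, which sidesteps Zaccheo's concern #1 about two-level grouping at the cost of one extra round trip per drill-down; concern #2, phrases spanning a page break, remains unless chapter-level documents are also indexed.]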