Thanks Alexandre, your solution seems very good: I'll certainly try it and let you know. I like the idea of mixing block joins and grouping!
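
A minimal, untested sketch of how steps 4 to 6 of the approach quoted below might be combined into a single request. The field names (`doc_type`, `book_id`, `page_text`, `page_boundary`) are made up for illustration, and whether the `[child]` transformer cooperates with collapsing is exactly the open question Alexandre raises, so treat this as a starting point only:

```python
from urllib.parse import urlencode

# All field names below (doc_type, book_id, page_text, page_boundary)
# are hypothetical; substitute the ones from the real schema.
PARENT_FILTER = "doc_type:chapter"


def build_params(phrase, rows=10, pages_per_chapter=5):
    """Return Solr request parameters for steps 4 to 6 as a dict."""
    # Escape backslashes and double quotes so the phrase stays one phrase.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    page_q = f'page_text:"{escaped}" OR page_boundary:"{escaped}"'
    # Step 6: return matching pages under each chapter.
    # Assumption: the transformer resolves $page_q like other local
    # params; if it does not, inline the filter query here instead.
    child = (
        f"[child parentFilter={PARENT_FILTER} "
        f"childFilter=$page_q limit={pages_per_chapter}]"
    )
    return {
        # Step 4: match child pages, return their parent chapters.
        "q": f"{{!parent which={PARENT_FILTER} v=$page_q}}",
        "page_q": page_q,
        # Step 5: keep the best chapter per book, expand the others.
        "fq": "{!collapse field=book_id}",
        "expand": "true",
        "expand.rows": 5,
        "fl": f"id,book_id,{child}",
        "rows": rows,
    }


if __name__ == "__main__":
    print(urlencode(build_params("broken sentence")))
```

The resulting query string can be appended to the collection's `/select` URL. Collapsing needs `book_id` to be a single-valued field, which matches the string field from step 3.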
On Wed, 2 Mar 2016 at 04:46, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> Here is an - untested - possible approach. I might be missing
> something by combining these things in too many layers, but.....
>
> 1) Have chapters as parent documents and pages as children within them.
> Block index them together.
> 2) On pages, include the page text (probably not stored) as one field.
> Also include a second field that has the last paragraph of that page as
> well as the first paragraph of the next page. This gives you phrase
> matches across boundaries. Also include pageId, etc.
> 3) On chapters, include the book id as a string field.
> 4) Use a block join query to search against pages, but return (parent)
> chapters:
> https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers
> 5) Use grouping or collapsing+expanding by book id to group chapters
> within a book:
> https://cwiki.apache.org/confluence/display/solr/Result+Grouping
> or
> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
> 6) Use the [child] DocumentTransformer to get pages back, with
> childFilter to re-limit them by your query:
> https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents#TransformingResultDocuments-[child]-ChildDocTransformerFactory
>
> The main question is whether 6) will be able to piggyback on the
> output of 5)...... And, of course, the performance...
>
> I would love to know if this works, even partially. Either on the
> mailing list or directly.
>
> Regards,
> Alex.
>
> ----
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 2 March 2016 at 00:50, Zaccheo Bagnati <zacch...@gmail.com> wrote:
> > Thank you, Jack, for your answer.
> > There are 2 reasons:
> > 1. The requirement is to show both books and chapters, grouped, in
> > the result list. So I would have to execute the query grouping by
> > book, retrieve the first, let's say, 10 books (sorted by relevance),
> > and then for each book repeat the query grouping by chapter (always
> > ordering by relevance) to obtain what we need (unfortunately it is
> > not up to me to define the requirements... but they do make sense).
> > Unless there is some Solr feature to do this in only one call (and
> > that would be great!).
> > 2. Searching on pages will not match phrases that span 2 pages
> > (e.g. if the last word of page 1 is "broken" and the first word of
> > page 2 is "sentence", searching for "broken sentence" will not match).
> > However, if we do not find a better solution, I think that your
> > proposal is not so bad... I hope that reason #2 could be negligible
> > and that #1 performs quite fast even though we are multiplying
> > queries.
> >
> > On Tue, 1 Mar 2016 at 14:28, Jack Krupansky
> > <jack.krupan...@gmail.com> wrote:
> >
> >> Any reason not to use the simplest structure - each page is one Solr
> >> document with a book field, a chapter field, and a page text field?
> >> You can then use grouping to group results by book (title text) or
> >> even chapter (title text and/or number). Maybe initially group by
> >> book, and then if the user selects a book group you can re-query
> >> with the specific book and group by chapter.
> >>
> >>
> >> -- Jack Krupansky
> >>
> >> On Tue, Mar 1, 2016 at 8:08 AM, Zaccheo Bagnati <zacch...@gmail.com>
> >> wrote:
> >>
> >> > The original data is quite well structured: it comes as XML with
> >> > chapters and tags that mark the original page breaks of the paper
> >> > version. This way we can restructure it almost as we want before
> >> > creating the Solr index.
> >> >
> >> > On Tue, 1 Mar 2016 at 14:04, Jack Krupansky
> >> > <jack.krupan...@gmail.com> wrote:
> >> >
> >> > > To start, what is the form of your input data - is it already
> >> > > divided into chapters and pages? Or... are you starting with
> >> > > raw PDF files?
> >> > >
> >> > >
> >> > > -- Jack Krupansky
> >> > >
> >> > > On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati
> >> > > <zacch...@gmail.com> wrote:
> >> > >
> >> > > > Hi all,
> >> > > > I'm looking for ideas on how to define the schema and how to
> >> > > > perform queries for this use case: we have to index books;
> >> > > > each book is split into chapters, and chapters are split into
> >> > > > pages (pages correspond to the page breaks of the printed
> >> > > > version). We should show the results grouped by book, then by
> >> > > > chapter (within the same book), then by page (within the same
> >> > > > chapter). As far as I know, we have 2 options:
> >> > > >
> >> > > > 1. Index pages as Solr documents. This way we could
> >> > > > theoretically retrieve chapters (and books?) using grouping,
> >> > > > but
> >> > > >    a. we will miss matches across two contiguous pages (page
> >> > > >       breaks are only due to typographical needs, so concepts
> >> > > >       could be split... as in printed books)
> >> > > >    b. I don't know if it is possible in Solr to group results
> >> > > >       on two different levels (books and chapters)
> >> > > >
> >> > > > 2. Index chapters as Solr documents. In this case we will have
> >> > > > the right matches, but how do we obtain the matching pages?
> >> > > > (we need pages because the client can only display pages)
> >> > > >
> >> > > > We have been struggling with this problem for a long time and
> >> > > > have not been able to find a suitable solution, so I'm asking
> >> > > > whether someone has ideas or has already solved a similar
> >> > > > issue.
> >> > > > Thanks
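
For reference, steps 1 to 3 of Alexandre's approach (chapters as parents, pages as children, plus a second page field that bridges the page break) could be prepared roughly as below. This is only a sketch: the field names are invented, and the input is assumed to be already split into pages and paragraphs from the XML. The `_childDocuments_` key is the one Solr's JSON update format uses for nested documents.

```python
def build_chapter_doc(book_id, chapter_id, pages):
    """Build one block (chapter plus child pages) for block indexing.

    pages: list of (page_id, [paragraph, ...]) tuples in reading order.
    Field names (doc_type, book_id, page_text, page_boundary) are
    hypothetical and must match whatever the real schema defines.
    """
    children = []
    for i, (page_id, paragraphs) in enumerate(pages):
        # Last paragraph of this page plus first paragraph of the next,
        # so a phrase split by the page break is contiguous in this field.
        boundary = paragraphs[-1] if paragraphs else ""
        if i + 1 < len(pages) and pages[i + 1][1]:
            boundary = f"{boundary} {pages[i + 1][1][0]}".strip()
        children.append({
            "id": page_id,
            "doc_type": "page",
            "page_text": " ".join(paragraphs),
            "page_boundary": boundary,
        })
    return {
        "id": chapter_id,
        "doc_type": "chapter",
        "book_id": book_id,  # step 3: string field used for grouping
        "_childDocuments_": children,
    }


if __name__ == "__main__":
    doc = build_chapter_doc(
        "book-1",
        "book-1-ch-1",
        [
            ("book-1-p-1", ["First paragraph.", "It was a broken"]),
            ("book-1-p-2", ["sentence indeed.", "More text."]),
        ],
    )
    print(doc["_childDocuments_"][0]["page_boundary"])
```

With this layout a search for "broken sentence" matches the boundary field of page 1 even though the second word is printed on page 2, so the client would still need to decide which of the two pages to display.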