Thanks Jack,
the chapter is definitely the optimal unit to search in, and your solution
seems quite a good approach. The downside is that, depending on how much
text we choose to share between two adjacent pages, we will see some
errors. For example, it will always be possible to find a matching chapter
but no matching page (because the search terms are too far apart). Let's
see whether this is tolerable.

Il giorno mar 1 mar 2016 alle ore 17:44 Jack Krupansky <
jack.krupan...@gmail.com> ha scritto:

> The chapter seems like the optimal unit for initial searches - just combine
> the page texts with a line break between them, or index the pages as a
> multivalued field and set the position increment gap to 1 so that phrases work.
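A minimal sketch of what that could look like in schema.xml (the field and type names here are illustrative assumptions, not from the thread). positionIncrementGap sets how many token positions separate consecutive values of a multivalued field: with a gap of 0 an exact phrase can match across a page boundary, while with a gap of 1 a phrase slop of 1 may be needed.

```xml
<!-- Hypothetical schema.xml fragment: one Solr document per chapter,
     one value per page in a multivalued field. -->
<fieldType name="text_chapter" class="solr.TextField" positionIncrementGap="1">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="page_text" type="text_chapter" indexed="true" stored="true"
       multiValued="true"/>
```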
>
> You could have a separate collection for pages, with each page as a Solr
> document, but include the last line of text from the previous page and the
> first line of text from the next page so that phrases will match across
> page boundaries. Unfortunately, that may also result in false hits if the
> full phrase is found entirely within the two borrowed lines. That would
> require some special filtering to eliminate those false positives.
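The overlap for the page collection could be produced at indexing time with something like this sketch (pure Python; the function and field names are made up for illustration, not part of any Solr API):

```python
def build_page_docs(pages):
    """Turn a list of page texts into page documents, padding each page
    with the last line of the previous page and the first line of the
    next page so phrases can match across page boundaries."""
    docs = []
    for i, text in enumerate(pages):
        prev_tail = pages[i - 1].splitlines()[-1] if i > 0 else ""
        next_head = pages[i + 1].splitlines()[0] if i < len(pages) - 1 else ""
        padded = "\n".join(filter(None, [prev_tail, text, next_head]))
        docs.append({"page_no": i + 1, "text": padded})
    return docs

docs = build_page_docs(["first page ends with\nsplit concept",
                        "continues here\nsecond page"])
print(docs[0]["text"])  # page 1 text plus the first line of page 2
```

A phrase like "split concept continues here" would then match the padded page 1, at the cost of the false-positive risk described above.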
>
> There is also the question of maximum phrase size - most phrases tend to be
> reasonably short, but sometimes people may want to search for an entire
> paragraph (e.g., a quote) that may span multiple lines on two adjacent
> pages.
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 11:30 AM, Emir Arnautovic <
> emir.arnauto...@sematext.com> wrote:
>
> > Hi,
> > Off the top of my head - this probably does not solve the problem
> > completely, but it may trigger brainstorming: index chapters and include
> > page-break tokens. Use highlighting to return matches, and make sure the
> > fragment size is large enough to include the page-break token. In such a
> > scenario you should use slop for phrase searches...
> >
> > The more I write it, the less I like it, but I will not delete it...
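The page-break-token idea might be sketched like this (the sentinel token and function names are assumptions; a real sentinel should be a string the analyzer keeps as a single term and that cannot occur in the book text):

```python
PAGE_BREAK = "xxpagebreakxx"  # hypothetical sentinel token

def chapter_text(pages):
    """Concatenate page texts into one chapter field value,
    with a page-break sentinel between consecutive pages."""
    return f" {PAGE_BREAK} ".join(pages)

def page_of_match(chapter, match_start):
    """Given the character offset of a match in the chapter text,
    count the preceding page-break tokens to recover the page number."""
    return chapter[:match_start].count(PAGE_BREAK) + 1

ch = chapter_text(["alpha beta", "gamma delta"])
print(page_of_match(ch, ch.find("gamma")))  # → 2
```

This is only the client-side half of the idea: the offset (or the sentinel's presence in a returned highlight fragment) would come from Solr's highlighting response.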
> >
> > Regards,
> > Emir
> >
> >
> > On 01.03.2016 12:56, Zaccheo Bagnati wrote:
> >
> >> Hi all,
> >> I'm looking for ideas on how to define the schema and how to perform
> >> queries in this use case: we have to index books; each book is split
> >> into chapters, and chapters are split into pages (the pages represent
> >> the original page cuts in the printed version). We have to show the
> >> results grouped by book, by chapter (within the same book), and by page
> >> (within the same chapter). As far as I know, we have 2 options:
> >>
> >> 1. Index pages as Solr documents. This way we could theoretically
> >> retrieve chapters (and books?) using grouping, but
> >>      a. we will miss matches that span two contiguous pages (page cuts
> >> are only due to typographical needs, so concepts can be split, as in
> >> printed books)
> >>      b. I don't know whether it is possible in Solr to group results on
> >> two different levels (books and chapters)
> >>
> >> 2. Index chapters as Solr documents. In this case we will get the right
> >> matches, but how do we obtain the matching pages? (We need pages because
> >> the client can only display pages.)
> >>
> >> We have been struggling with this problem for a long time and have not
> >> been able to find a suitable solution, so I'm asking whether someone has
> >> ideas or has already solved a similar issue.
> >> Thanks
> >>
> >>
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >
>
