I echo the apology for using JIRA to work out ideas on this.
Just thinking out loud here:
* Is there any reason why the page id should be an integer? I mean
could the page identifier be an alphanumeric string?
* Ideally our project would like to store some page-level metadata
(especially a URL link to the page content). Would this be contorting
the use of a field too much? If we stored the URL in a dynamic
field URL_*, how would we retrieve it at query time?
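For what it's worth, one way to hang a URL off each page without a dedicated schema change might be a stored-only dynamic field; this is just a sketch, and the field name url_* is my own invention for illustration:

```xml
<!-- schema.xml sketch (assumed names): one stored URL per page,
     e.g. url_1, url_2, ..., not indexed since we only read it back -->
<dynamicField name="url_*" type="string" indexed="false" stored="true"/>
```

As far as I know the fields would then have to be requested by their concrete names in the field list at query time (e.g. fl=id,url_1,url_2), since I don't believe fl expands wildcards.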
* Is there a way to alter FieldType to use the Composite design
pattern? (http://en.wikipedia.org/wiki/Composite_pattern) In
this way a document could be composed of fields, which could in
turn be composed of fields. For example: the monograph is a document, a
page in the monograph is a field in the document, the text on the
page is a field in the field, a single piece of metadata for the
page is a field in the field, etc. ( monograph
( page ( fulltext, page_metadata_1, page_metadata_2, etc ),
monograph_metadata_1, monograph_metadata_2, etc ) ). Maybe what
I'm trying to describe is that Documents can contain Documents?
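To make the composite idea concrete, here is a plain-Java sketch of that field tree, deliberately independent of Solr's actual FieldType class (all class names here are invented for illustration):

```java
// Sketch only: a Composite-style field tree. A LeafField holds text; a
// CompositeField holds child fields, so a "monograph" field can contain
// "page" fields, which contain fulltext and metadata fields, and so on.
import java.util.ArrayList;
import java.util.List;

abstract class DocField {
    final String name;
    DocField(String name) { this.name = name; }
    abstract String fullText();            // concatenated text of the subtree
}

class LeafField extends DocField {
    final String value;
    LeafField(String name, String value) { super(name); this.value = value; }
    String fullText() { return value; }
}

class CompositeField extends DocField {
    final List<DocField> children = new ArrayList<DocField>();
    CompositeField(String name) { super(name); }
    CompositeField add(DocField f) { children.add(f); return this; }
    String fullText() {
        StringBuilder sb = new StringBuilder();
        for (DocField f : children) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(f.fullText());
        }
        return sb.toString();
    }
}

public class CompositeFieldDemo {
    public static String demo() {
        CompositeField monograph = new CompositeField("monograph");
        CompositeField page1 = new CompositeField("page");
        page1.add(new LeafField("fulltext", "once upon a time"))
             .add(new LeafField("page_metadata_1", "url:http://example.org/p1"));
        monograph.add(page1)
                 .add(new LeafField("monograph_metadata_1", "author:anon"));
        return monograph.fullText();       // flattens the whole tree
    }
    public static void main(String[] args) { System.out.println(demo()); }
}
```

The point of the sketch is only that walking the tree recovers the flat fulltext for indexing while the structure stays available for grouping results by page.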
Following the path of least resistance, I think the first step is to
create a highlighter which returns positions instead of highlighted
text. The next step would be to create an Analyzer and/or Filter and/or
Tokenizer, as well as a FieldType, which create the page mappings. The
last step (and the one I am least certain about) is to evolve the
position highlighter to apply the position-to-page mapping and
group the positions by page (number or id), or alternatively just write
out the page (number or id) and drop the position.
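The analysis step above might work something like this plain-Java sketch (not a real Lucene TokenFilter); the page-break sentinel token is an invented convention for illustration:

```java
// Sketch: walk a token stream in which a sentinel token marks a page
// break, and record the term position at which each page starts, giving
// entries of the form "startPosition:pageNumber".
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PageMapSketch {
    // Invented sentinel: unlikely to collide with real terms.
    static final String PAGE_BREAK = "\u0001PAGE\u0001";

    /** Returns entries like "0:1", "3:2" (startPosition:pageNumber). */
    public static List<String> buildPageMap(List<String> tokens) {
        List<String> map = new ArrayList<String>();
        int position = 0;
        int page = 1;
        map.add("0:1");                  // page one starts at position 0
        for (String tok : tokens) {
            if (PAGE_BREAK.equals(tok)) {
                page++;
                map.add(position + ":" + page);
            } else {
                position++;              // page breaks consume no position
            }
        }
        return map;
    }

    public static void main(String[] args) {
        List<String> toks = new ArrayList<String>();
        Collections.addAll(toks, "a", "b", "c", PAGE_BREAK, "d", "e");
        System.out.println(buildPageMap(toks)); // [0:1, 3:2]
    }
}
```

A real implementation would presumably live in a TokenFilter that passes the tokens through unchanged while accumulating the map as a side effect.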
Tricia
Binkley, Peter wrote:
(I'm taking this discussion to solr-user, as Mike Klaas suggested; sorry
for using JIRA for it. Previous discussion is at
https://issues.apache.org/jira/browse/SOLR-380).
I think the requirements I mentioned in a comment
(https://issues.apache.org/jira/browse/SOLR-380#action_12535296) justify
abandoning the one-page-per-document approach. The increment-gap
approach would break cross-page searching, and would involve about
as much work as the stored map, since the gap would have to vary
with the number of terms on each page, wouldn't it? (If there
are 100 terms on page one, the gap has to be 900 to get page two to
start at 1000 - or can you specify the absolute position you want for a
term?)
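Just to spell out the arithmetic behind that varying gap (the stride of 1000 is taken from the example above, and this assumes no page exceeds it):

```java
// If page p is to start at position (p - 1) * PAGE_STRIDE, the position
// increment gap inserted after a page must be PAGE_STRIDE minus the number
// of terms on that page - i.e. it depends on each page's length.
public class GapMath {
    static final int PAGE_STRIDE = 1000;

    /** Gap needed after a page containing termCount terms (termCount < PAGE_STRIDE). */
    public static int gapAfterPage(int termCount) {
        return PAGE_STRIDE - termCount;
    }

    public static void main(String[] args) {
        // 100 terms on page one -> gap of 900, so page two starts at 1000
        System.out.println(gapAfterPage(100));
    }
}
```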
I think the problem of indexing books (or any text with arbitrary
subdivisions) is common enough that a generic approach like this would
be useful to more people than just me, and justifies some enhancements
within Solr to make the solution easy to reuse; but maybe when we've
figured out the best approach it will become clear how much of it is
worth packing into Solr.
Assuming the two-field approach works
(https://issues.apache.org/jira/browse/SOLR-380#action_12535755), then
what we're really talking about is two things: a token filter to
generate and store the map, and a process like the highlighter to
generate the output. Suppose the map is stored in tokens with the
starting term position for each page, like this:
0:1
345:2
827:3
The output function would then imitate the highlighter to discover term
positions, use the map (either by loading all its terms or by doing
lookups) to convert them to page positions, and generate the appropriate
output. I'm not clear where that output process should live, but we can
just imitate the highlighter.
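The lookup half of that output process could be as simple as a floor lookup over the stored map; a sketch under that assumption, using the sample entries above:

```java
// Sketch: load the stored map entries ("startPos:page") into a TreeMap and
// convert any term position to its page with a floor lookup - the kind of
// conversion a highlighter-like component would do after finding matches.
import java.util.TreeMap;

public class PagePositionLookup {
    private final TreeMap<Integer, Integer> startToPage =
        new TreeMap<Integer, Integer>();

    public PagePositionLookup(String[] mapEntries) {
        for (String entry : mapEntries) {
            String[] parts = entry.split(":");
            startToPage.put(Integer.parseInt(parts[0]),
                            Integer.parseInt(parts[1]));
        }
    }

    /** Page containing the given term position. */
    public int pageOf(int termPosition) {
        // Greatest page-start position <= termPosition wins.
        return startToPage.floorEntry(termPosition).getValue();
    }

    public static void main(String[] args) {
        PagePositionLookup lookup =
            new PagePositionLookup(new String[] {"0:1", "345:2", "827:3"});
        System.out.println(lookup.pageOf(400)); // position 400 falls on page 2
    }
}
```

Whether the entries are loaded all at once or looked up lazily, the conversion itself stays this simple.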
(and just to clarify roles: Tricia's the one who'll actually be coding
this, if it's feasible; I'm just helping to think out requirements and
approaches based on a project in hand.)
Peter