Martin,
You may want to follow Mark Miller's effort
https://issues.apache.org/jira/browse/LUCENE-1286 as it develops --
perhaps even help with it. He's developing a Lucene highlighter which
would "run through query terms by using their offsets" making
highlighting large documents much more time efficient. I would be
interested to see something like this end up as a Solr highlighting option.
Revisiting some of your original thoughts:
What I see though is that the highlighting functionality is heavily tied
to the fragment (highlight context) functionality. This actually makes
it interesting to write a plain highlight method that just returns
metadata (so some other process can do the actual highlighting in some
custom fashion).
So is it worthwhile to make sure that Solr is able to do multiple
different kinds of highlighting, even if it means passing metadata back
in the request? Should we have standard ways to index and read back
payload information if we're dealing with pages, books, coordinates
(for highlighting images) and other metadata which is used for
highlights (char offset, term offset, et cetera)? I also noticed much of
the highlighting code to do with fragments being duplicated in custom
code.
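To make the metadata-only idea concrete, here is a rough sketch of what
I understand by it: a method that reports where the query terms occur
and leaves the markup to the caller. It is plain Java with no Lucene or
Solr classes. The names OffsetHighlightSketch and findOffsets are made
up for illustration, and the regex tokenizer is only a stand-in for a
real analyzer, so treat it as an assumption-laden example rather than
how either project does it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: not part of Lucene, Solr, LUCENE-1286 or SOLR-380.
class OffsetHighlightSketch {
    // Crude stand-in for an analyzer: a token is a run of letters or digits.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    /**
     * Returns {startOffset, endOffset} pairs (end exclusive) for every token
     * in the text whose lower-cased form is one of the given terms.
     * No fragments are built and no markup is added.
     */
    static List<int[]> findOffsets(String text, Set<String> lowerCaseTerms) {
        List<int[]> hits = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group().toLowerCase(Locale.ROOT);
            if (lowerCaseTerms.contains(token)) {
                hits.add(new int[] { m.start(), m.end() });
            }
        }
        return hits;
    }
}
```

A client could take the returned character offsets and wrap the matches
in tags, or map them to page or image coordinates, without going
through the fragmenting code at all.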
My idea for highlighting based on
https://issues.apache.org/jira/browse/SOLR-380 was to include the
coordinates for highlighting images as just another attribute in the
input XML. The PayloadComponent then returns the coordinates
associated with a given query as part of the XPath. I have written some
code beyond what is posted there that takes some extra parameters and
reconstructs the XPath into useful results based on the granularity of
the information that is requested (roughly based on XQuery). Is that a
"standard" enough way or is there something else you're thinking about?
If you find anything I've contributed useful, feel free to improve
it for the benefit of those who use Solr and Lucene.
Tricia