Ludovic,

>> How do you index your HTML files? I mean, do you create fields for
different parts of your document (for different stop word lists,
stemming, etc.)? With DIH, SolrJ, or something else? <<

At present, we send them over HTTP and use Tika to strip the HTML.

We do not split the document itself into separate fields, but what we
index includes a bunch of metadata that has been extracted by processes
earlier in the pipeline. These fields don't enter into the
HTML-hit-highlighting question.
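
As a rough illustration of that kind of request, here is a minimal
Python sketch. It assumes the documents are posted to Solr's extracting
request handler (/update/extract, which runs Tika on the server) and
that the pipeline metadata is passed as literal.* parameters. The
message does not say which endpoint is actually used, and the base URL,
document id, and field name "source" are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_base, doc_id, html_bytes, metadata):
    """Build (but do not send) a POST to Solr's /update/extract handler.

    Assumption: Tika runs inside Solr via the extracting request handler.
    """
    params = {"literal.id": doc_id, "commit": "true"}
    # Metadata from earlier pipeline stages goes into its own fields,
    # separate from the extracted body text.
    for field, value in metadata.items():
        params["literal." + field] = value
    url = solr_base.rstrip("/") + "/update/extract?" + urlencode(params)
    return Request(
        url,
        data=html_bytes,
        headers={"Content-Type": "text/html"},
        method="POST",
    )


# Hypothetical usage; urllib.request.urlopen(req) would actually send it.
req = build_extract_request(
    "http://localhost:8983/solr",
    "doc-1",
    b"<html><body>Hello</body></html>",
    {"source": "crawler"},
)
```

The request is only built here, not sent, so the sketch can be read and
run without a Solr instance.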

>> This week I developed a new highlighter module which transfers the
field highlighting to the original document (XML in my case). I use
payloads to store the offsets and lengths of fields in the index. This
way, I use the right analyzers to do the highlighting correctly, and
then I replace the different field parts in the document with the
highlighted parts. It is not finished yet, but I already have some good
results. <<

Yes, I have been thinking along very similar lines. If you arrive at
something you're happy with, I encourage you to share it.
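
The replacement step described in the quoted text can be sketched in a
few lines of Python. This is not Ludovic's module: it assumes the field
offsets and lengths are already available as plain tuples (however they
were stored), that the spans do not overlap, and that the highlighted
text for each field has already been produced. The function name and
sample document are hypothetical.

```python
def apply_highlights(original, spans):
    """Splice highlighted field text back into the original document.

    spans: iterable of (offset, length, highlighted_text), where offset
    and length locate the field's text in `original`. Spans are assumed
    not to overlap.
    """
    result = original
    # Work from the last span backwards so that earlier offsets remain
    # valid after each replacement changes the string length.
    for offset, length, highlighted in sorted(spans, reverse=True):
        result = result[:offset] + highlighted + result[offset + length:]
    return result


# Hypothetical usage with a tiny XML document and two fields.
doc = "<doc><title>Solr tips</title><body>Use Solr well</body></doc>"
title, body = "Solr tips", "Use Solr well"
spans = [
    (doc.index(title), len(title), "<em>Solr</em> tips"),
    (doc.index(body), len(body), "Use <em>Solr</em> well"),
]
highlighted_doc = apply_highlights(doc, spans)
```

Applying the spans in reverse offset order avoids having to adjust the
remaining offsets after each splice.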

>> This is a client request too. Let me know if iorixxx's solution is
not enough for your particular use case. <<

I'm enough of a Solr newb that I'll need to study his suggestion for a
bit, to figure out what it does and does not do. When I've done so, I'll
respond to his message.

Thanks,

-- Bryan
