Ludovic,

>> How do you index your HTML files? I mean, do you create fields for different parts of your document (for different stop word lists, stemming, etc.)? With DIH, SolrJ, or something else? <<

At present, we are sending them over HTTP and using Tika to strip the HTML. We do not split the document itself into separate fields, but what we index includes a bunch of metadata that has been extracted by processes earlier in the pipeline. Those fields don't enter into the HTML hit-highlighting question.

>> I developed this week a new highlighter module which transfers the field highlighting to the original document (XML in my case). (I use payloads to store the offsets and lengths of fields in the index.) This way, I use the right analyzers to do the highlighting correctly, and then I replace the different field parts in the document with the highlighted parts. It is not finished yet, but I already have some good results. <<

Yes, I have been thinking along very similar lines. If you arrive at something you're happy with, I encourage you to share it.

>> This is a client request too. Let me know if iorixxx's solution is not enough for your particular use case. <<

I'm enough of a Solr newb that I'll need to study his suggestion for a bit to figure out what it does and does not do. When I've done so, I'll respond to his message.

Thanks,
-- Bryan
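
P.S. To make the offset idea concrete, here is a minimal Python sketch of the splicing step as I understand it: given the original document and, for each field part, its offset and length in that document plus the already-highlighted text, put the highlighted text back in place. This is not Ludovic's module and it does not touch Solr or payloads at all. The function name `splice_highlights` and the (offset, length, highlighted_text) triples are hypothetical, and the sketch assumes the highlighter has already produced the marked-up text for each field.

```python
from typing import List, Tuple


def splice_highlights(document: str, spans: List[Tuple[int, int, str]]) -> str:
    """Replace regions of `document` with their highlighted versions.

    Each span is a hypothetical (offset, length, highlighted_text) triple.
    Offsets refer to the ORIGINAL document, so spans are applied from the
    end of the document backwards; that way an earlier offset is never
    shifted by a replacement made later in the text.
    """
    result = document
    last_start = len(document)
    for offset, length, highlighted in sorted(spans, key=lambda s: s[0], reverse=True):
        end = offset + length
        if offset < 0 or length < 0 or end > last_start:
            raise ValueError("spans overlap or fall outside the document")
        result = result[:offset] + highlighted + result[end:]
        last_start = offset
    return result


# Example: two field parts inside a small XML document.
doc = "<doc><title>Solr tips</title><body>Highlighting with Solr</body></doc>"
title, body = "Solr tips", "Highlighting with Solr"
spans = [
    (doc.index(title), len(title), "<em>Solr</em> tips"),
    (doc.index(body), len(body), "Highlighting with <em>Solr</em>"),
]
highlighted_doc = splice_highlights(doc, spans)
```

Working from the last span backwards is just one way to keep the stored offsets valid; building the output left to right from the untouched gaps would do equally well.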