--- On Thu, 6/9/11, Bryan Loofbourrow <bloofbour...@knowledgemosaic.com> wrote:
> From: Bryan Loofbourrow <bloofbour...@knowledgemosaic.com>
> Subject: Displaying highlights in formatted HTML document
> To: solr-user@lucene.apache.org
> Date: Thursday, June 9, 2011, 2:14 AM
>
> Here is my use case:
>
> I have a large number of HTML documents, sizes in the 0.5K-50M range, most
> around, say, 10M.
>
> I want to be able to present the user with the formatted HTML document, with
> the hits tagged, so that he may iterate through them, and see them in the
> context of the document, with the document looking as it would be presented
> by a browser; that is, fully formatted, with its tables and italics and font
> sizes and all.
>
> This is something that the user would explicitly request from within a set
> of search results, not something I’d expect to have returned from an initial
> search – the initial search merely returns the snippets around the hits. But
> if the user wants to dive into one of the returned results and see them in
> context, I need to be able to go get that.
>
> We are currently solving this problem by using an entirely separate search
> engine (dtSearch), which performs the tagging of the hits in the HTML just
> fine. But the solution is unsatisfactory because there are Solr searches
> that dtSearch’s capabilities cannot reasonably match.
>
> Can anyone suggest a good way to use Solr/Lucene for this instead? I’m
> thinking a separate core for this purpose might make sense, so as not to
> burden the primary search core with the full contents of the document. But
> after that, I’m stuck. How can I get Solr to express the highlighting in the
> context of the formatted HTML document?
>
> If Solr does not do this currently, and anyone can suggest ways to add the
> feature, any tips on how this might best be incorporated into the
> implementation would be welcome.
I am doing the same thing (Solr trunk) using the following field type:

<fieldType name="HTMLText" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_index.txt" ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  </analyzer>
</fieldType>

In your separate core, which is queried when the user wants to dive into one of the returned results, feed your HTML files into this field. You may also want to increase the maximum number of analyzed characters:

<int name="hl.maxAnalyzedChars">147483647</int>
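
For completeness, here is a rough sketch of how the rest of that separate core might be wired up. It is only an illustration under assumptions, not a tested setup: the field name html_body and the handler name /hitview are made up for this example, and the span markup for the hits is arbitrary. The idea is that the HTML is stripped only during analysis, while the stored value remains the original markup, so the highlighter should be able to insert its tags into the formatted document. The field has to be stored for that to work, and hl.fragsize=0 asks the highlighter to return the whole field value instead of short snippets.

```xml
<!-- schema.xml: hypothetical field name; it must be stored so the
     highlighter can return the original HTML with the hits tagged -->
<field name="html_body" type="HTMLText" indexed="true" stored="true"/>

<!-- solrconfig.xml: hypothetical handler used only for the "view in context" request -->
<requestHandler name="/hitview" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">true</str>
    <str name="hl.fl">html_body</str>
    <!-- 0 = no fragmenting, highlight the whole field value -->
    <int name="hl.fragsize">0</int>
    <!-- same limit as suggested above -->
    <int name="hl.maxAnalyzedChars">147483647</int>
    <!-- arbitrary markup wrapped around each hit -->
    <str name="hl.simple.pre"><![CDATA[<span class="hit">]]></str>
    <str name="hl.simple.post"><![CDATA[</span>]]></str>
  </lst>
</requestHandler>
```

With something like this in place, the request would carry the user's original query plus a filter on the selected document's unique key, and the highlighting section of the response should then hold the full HTML with the hits wrapped.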