> -----Original Message-----
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: Wednesday, June 08, 2011 11:56 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Displaying highlights in formatted HTML document
>
> --- On Thu, 6/9/11, Bryan Loofbourrow <bloofbour...@knowledgemosaic.com> wrote:
>
> > From: Bryan Loofbourrow <bloofbour...@knowledgemosaic.com>
> > Subject: Displaying highlights in formatted HTML document
> > To: solr-user@lucene.apache.org
> > Date: Thursday, June 9, 2011, 2:14 AM
> >
> > Here is my use case:
> >
> > I have a large number of HTML documents, sizes in the 0.5K-50M range,
> > most around, say, 10M.
> >
> > I want to be able to present the user with the formatted HTML document,
> > with the hits tagged, so that he may iterate through them, and see them
> > in the context of the document, with the document looking as it would be
> > presented by a browser; that is, fully formatted, with its tables and
> > italics and font sizes and all.
> >
> > This is something that the user would explicitly request from within a
> > set of search results, not something I'd expect to have returned from an
> > initial search - the initial search merely returns the snippets around
> > the hits. But if the user wants to dive into one of the returned results
> > and see them in context, I need to be able to go get that.
> >
> > We are currently solving this problem by using an entirely separate
> > search engine (dtSearch), which performs the tagging of the hits in the
> > HTML just fine. But the solution is unsatisfactory because there are
> > Solr searches that dtSearch's capabilities cannot reasonably match.
> >
> > Can anyone suggest a good way to use Solr/Lucene for this instead? I'm
> > thinking a separate core for this purpose might make sense, so as not to
> > burden the primary search core with the full contents of the document.
> > But after that, I'm stuck. How can I get Solr to express the
> > highlighting in the context of the formatted HTML document?
> >
> > If Solr does not do this currently, and anyone can suggest ways to add
> > the feature, any tips on how this might best be incorporated into the
> > implementation would be welcome.
>
> I am doing the same thing (solr trunk) using the following field type:
>
> <fieldType name="HTMLText" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
>     <charFilter class="solr.HTMLStripCharFilterFactory" mapping="mappings.txt"/>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.TurkishLowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms_index.txt"
>             ignoreCase="true" expand="true"/>
>   </analyzer>
>   <analyzer type="query">
>     <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.TurkishLowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" enablePositionIncrements="true"/>
>   </analyzer>
> </fieldType>
>
> In your separate core - which is queried when the user wants to dive into
> one of the returned results - feed your HTML files into this field.
>
> You may want to increase max analyzed chars too.
> <int name="hl.maxAnalyzedChars">147483647</int>

OK, I think I see what you're up to. Might be pretty viable for me as well. Can you talk about anything in your mappings.txt files that is an important part of the solution?

Also, isn't there another piece? Don't you need to force it to return the whole document, rather than its usual context chunks? Or are you somehow able to map the returned chunks into the separately-stored documents?

We have another requirement I forgot to mention: we want to associate a sequence number with each hit. I imagine I can deal with that by putting some sort of identifiable char sequence in a custom prefix for the highlighting, then replacing that with a sequence number in postprocessing.

I'm also wondering about the performance of this approach with large documents, vs. something like what Ludovic is talking about, where you would just get positions back from Solr, and fetch the document separately from a filestore.

-- Bryan
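
For what it's worth, the prefix-then-renumber idea could be sketched roughly as below, in Python. This is only an illustration under several assumptions: the marker strings `@@HIT@@` and `@@/HIT@@`, the function names, the `<span>` output format, and the `id` and `content` field names are all hypothetical. The request parameters themselves (`hl.fl`, `hl.fragsize`, `hl.maxAnalyzedChars`, `hl.simple.pre`, `hl.simple.post`) are standard Solr highlighting parameters; `hl.fragsize=0` is the setting that asks for the whole field value rather than context chunks, which may be the missing piece for returning the full document. It also assumes the highlighter's offsets map back onto the original stored markup, which is what the quoted field type appears intended to achieve.

```python
import re
from itertools import count

# Hypothetical marker strings; choose sequences that cannot occur in the documents.
HIT_PRE = "@@HIT@@"
HIT_POST = "@@/HIT@@"


def highlight_params(query, doc_id, field="content"):
    """Build request parameters asking Solr for one fully highlighted field.

    The 'id' and 'content' field names are placeholders for your own schema.
    """
    return {
        "q": query,
        "fq": f'id:"{doc_id}"',              # restrict to the selected document
        "hl": "true",
        "hl.fl": field,
        "hl.fragsize": "0",                  # 0 = no fragmenting, whole field value
        "hl.maxAnalyzedChars": "147483647",  # value taken from the quoted message
        "hl.simple.pre": HIT_PRE,
        "hl.simple.post": HIT_POST,
    }


def number_hits(highlighted_html):
    """Replace each marker pair with a span carrying a running sequence number."""
    counter = count(1)

    def opening(_match):
        # Each opening marker gets the next sequence number, in document order.
        return f'<span class="hit" id="hit-{next(counter)}">'

    numbered = re.sub(re.escape(HIT_PRE), opening, highlighted_html)
    return numbered.replace(HIT_POST, "</span>")
```

Calling `number_hits()` on the highlighted field value returned by Solr would give each hit its own `id`, so the page can step from one hit to the next. Whether this performs acceptably on 10M documents, compared with fetching positions only, would need to be measured.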