I'm not sure if I have a good suggestion, but I have a question. :) What is considered "junk"? Would it be possible to eliminate the junk before it even goes into the index in order to avoid GIGO (Garbage In Garbage Out)?
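
To make the idea concrete, here is a rough sketch of cleaning the OCR text outside of Solr before indexing. Everything in it is an assumption for illustration: the "good word" heuristic (purely alphabetic, at least three letters, contains a vowel), the helper names, and the field names text_raw and text_clean are all hypothetical, and a real definition of "junk" would depend on your documents.

```python
import re

# Hypothetical heuristic, not anything built into Solr: a token counts as
# "good" if it is purely alphabetic, long enough, and contains a vowel.
_ALPHA = re.compile(r"^[A-Za-z]+$")
_VOWEL = re.compile(r"[aeiouyAEIOUY]")
_PUNCT = ".,;:!?\"'()"


def is_good_token(token, min_len=3):
    """Return True if the token looks like a real word under the heuristic."""
    return (
        len(token) >= min_len
        and _ALPHA.match(token) is not None
        and _VOWEL.search(token) is not None
    )


def clean_ocr_text(raw, min_len=3):
    """Keep only the 'good' words, preserving their original order."""
    stripped = (t.strip(_PUNCT) for t in raw.split())
    return " ".join(t for t in stripped if is_good_token(t, min_len))


def make_doc(doc_id, raw):
    """Build a document with both versions of the text.

    The field names are assumptions; they would need to match your schema.
    """
    return {"id": doc_id, "text_raw": raw, "text_clean": clean_ocr_text(raw)}


if __name__ == "__main__":
    sample = "Th3 qu1ck ~~ brown |||| fox, jumped xzq"
    print(make_doc("page-1", sample))
```

With something like this, the cleaned field could be the one you search and highlight on (for example by naming it in hl.fl, assuming the field is stored), while the raw field keeps the original OCR output.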

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Terence Gannon <butzi0...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Monday, January 12, 2009 11:00:31 AM
> Subject: Improving Readability of Hit Highlighting
>
> I'm indexing text from an OCR of an old document. Many words get read
> perfectly, but they're typically embedded in a lot of junk. I would
> like the hit highlighting to show only the 'good' words, in the order
> in which they appeared in the original document. Is it possible to
> use output of the filter classes as the text used in hit highlighting?
> Or do you have to do all the text cleanup outside of Solr and present it
> with two fields to index, one with the original text, and one with the
> cleaned up text? The objective of the hit highlighting is to give the
> user a *sense* of the original context, even if it's not provided
> verbatim from the original document. Thanks in advance.
>
> TerryG