Hi,

Quick note: please include a copy of the previous email when replying, so people can be reminded of the context.

You mentioned junk getting highlighted. In your case, is CONTRACTORINMPRIMENTAYIVE getting highlighted, and is that the junk? If so, and if you have some rules for what constitutes a junk token (e.g., a token not in a dictionary), why not augment your indexing to throw out junk tokens?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Terence Gannon <butzi0...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Monday, January 12, 2009 4:07:57 PM
> Subject: Re: Improving Readability of Hit Highlighting
>
> To answer your questions specifically, here is an example of the raw OCR
> output;
>
> "CONTRACTORINMPRIMENTAYIVE : mom Ale ACCEPT INFORMATIONON TOUR SHEET TO ea"
>
> to which I would like to see;
>
> "mom ale access tour sheet to"
>
> in the hit highlight. My schema for this field is pretty much
> standard, as follows;
>
>
>
>
>
>
>
>
>
> When I examine the effect of each of these with the Analyzer, it seems
> like if I could use the output after LowerCaseFilterFactory in the hit
> highlight, I would come close to achieving what I want.
>
> I'm not averse to doing the text cleanup external to Solr before the
> indexing, but only if it's *not* redundant to what the filter
> factories are going to do anyway. Thanks for your help!
>
> TerryG
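
To make the "token not in dictionary" rule concrete, here is a minimal sketch of cleanup done outside Solr, before indexing. Everything in it is an assumption for illustration: the tiny DICTIONARY set stands in for a real word list, and drop_junk_tokens is a hypothetical helper, not part of Solr or Lucene.

```python
import re

# Hypothetical word list for illustration; a real one would be loaded
# from a dictionary file appropriate to the documents being indexed.
DICTIONARY = {"mom", "ale", "accept", "tour", "sheet", "to"}


def drop_junk_tokens(text, dictionary=DICTIONARY):
    """Lowercase the OCR text and keep only tokens found in the dictionary."""
    # Split on anything that is not a letter, which also drops stray punctuation.
    tokens = re.findall(r"[A-Za-z]+", text)
    return " ".join(t.lower() for t in tokens if t.lower() in dictionary)


raw = "CONTRACTORINMPRIMENTAYIVE : mom Ale ACCEPT INFORMATIONON TOUR SHEET TO ea"
print(drop_junk_tokens(raw))  # prints: mom ale accept tour sheet to
```

The cleaned string could then be indexed and stored in the field used for highlighting. Note that a strict dictionary check also drops legitimate words missing from the list (names, abbreviations), so the rule would likely need tuning against real OCR output.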