To answer your questions specifically, here is an example of the raw OCR output:

"CONTRACTORINMPRIMENTAYIVE : mom Ale ACCEPT INFORMATIONON TOUR SHEET TO ea"

for which I would like to see

"mom ale access tour sheet to"

in the hit highlight.

My schema for this field is pretty much standard, as follows:

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ...
<filter class="solr.WordDelimiterFilterFactory" ...
<filter class="solr.LowerCaseFilterFactory" ...
<filter class="solr.EnglishPorterFilterFactory" ...
<filter class="solr.RemoveDuplicatesTokenFilterFactory" ...

When I examine the effect of each of these with the Analyzer, it seems that if I could use the output after LowerCaseFilterFactory in the hit highlight, I would come close to achieving what I want.

I'm not averse to doing the text cleanup external to Solr before indexing, but only if it's *not* redundant with what the filter factories are going to do anyway.

Thanks for your help!

TerryG