To answer your questions specifically, here is an example of the raw OCR output:

"CONTRACTORINMPRIMENTAYIVE : mom Ale ACCEPT INFORMATIONON TOUR SHEET TO ea"

What I would like to see instead is:

"mom ale access tour sheet to"

in the hit highlight.  My schema for this field is pretty much
standard, as follows:

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ...
<filter class="solr.WordDelimiterFilterFactory" ...
<filter class="solr.LowerCaseFilterFactory" ...
<filter class="solr.EnglishPorterFilterFactory" ...
<filter class="solr.RemoveDuplicatesTokenFilterFactory" ...

When I examine the effect of each of these with the Analyzer, it seems
that if I could use the output after LowerCaseFilterFactory in the hit
highlight, I would come close to achieving what I want.

I'm not averse to doing the text cleanup outside Solr before
indexing, but only if it's *not* redundant with what the filter
factories are going to do anyway.  Thanks for your help!
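
To make the external-cleanup option concrete, here is a rough sketch of
what such a pre-indexing pass might look like. It is only an
illustration: the function name clean_ocr_text and the KNOWN_WORDS
vocabulary are hypothetical, and a real word list would have to come
from somewhere else. It lowercases the text, splits on whitespace,
strips surrounding punctuation, and keeps only tokens found in the
vocabulary.

```python
# Hypothetical vocabulary (an assumption for this sketch); in practice
# this would be a real dictionary or domain word list.
KNOWN_WORDS = {"mom", "ale", "accept", "tour", "sheet", "to"}


def clean_ocr_text(raw, vocabulary=KNOWN_WORDS):
    """Return the whitespace-separated tokens of `raw` that, once
    lowercased and stripped of surrounding punctuation, appear in
    `vocabulary`, joined by single spaces."""
    kept = []
    for token in raw.lower().split():
        # Drop punctuation at the edges of a token; a lone ":" becomes "".
        word = token.strip(".,:;!?\"'()[]")
        if word in vocabulary:
            kept.append(word)
    return " ".join(kept)


raw = "CONTRACTORINMPRIMENTAYIVE : mom Ale ACCEPT INFORMATIONON TOUR SHEET TO ea"
print(clean_ocr_text(raw))
```

The lowercasing step here overlaps with what LowerCaseFilterFactory
already does at index time. The vocabulary check does not correspond
to any filter in the chain shown above.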

TerryG
