Re: Improving Readability of Hit Highlighting

Otis Gospodnetic Mon, 12 Jan 2009 21:24:03 -0800

Hi,

Quick note: please include a copy of the previous email when replying, so 
people can be reminded of the context.


You mentioned junk getting highlighted.  In your case, is 
CONTRACTORINMPRIMENTAYIVE getting highlighted, and is that the junk?  If so, 
why not augment your indexing to throw out junk tokens, if you have some rules 
for what constitutes a junk token (e.g. the token is not in a dictionary)?
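
To make the "token not in dictionary" rule concrete, here is a minimal sketch 
of that kind of cleanup done outside Solr, before the text is sent for 
indexing.  The word list, the function name and the letters-only tokenization 
are all assumptions for illustration; a real setup would load a proper 
dictionary and might well need different tokenization rules.

```python
import re

# Hypothetical word list; a real deployment would load a full dictionary file.
DICTIONARY = {"mom", "ale", "accept", "tour", "sheet", "to"}

# Assumption: a token is a run of ASCII letters; punctuation is discarded.
TOKEN_RE = re.compile(r"[A-Za-z]+")


def drop_junk_tokens(text, dictionary=DICTIONARY):
    """Lowercase the text and keep only the tokens found in the dictionary."""
    tokens = TOKEN_RE.findall(text.lower())
    return " ".join(token for token in tokens if token in dictionary)


raw = "CONTRACTORINMPRIMENTAYIVE : mom Ale ACCEPT INFORMATIONON TOUR SHEET TO ea"
print(drop_junk_tokens(raw))  # prints "mom ale accept tour sheet to"
```

Because this rewrites the text itself rather than only the indexed tokens, 
whatever is stored and later highlighted would be the cleaned version.  The 
cost is that any real word missing from the dictionary is silently dropped.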


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Terence Gannon <butzi0...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Monday, January 12, 2009 4:07:57 PM
> Subject: Re: Improving Readability of Hit Highlighting
> 
> To answer your questions specifically, here is an example of the raw OCR 
> output:
> 
> "CONTRACTORINMPRIMENTAYIVE : mom Ale ACCEPT INFORMATIONON TOUR SHEET TO ea"
> 
> for which I would like to see:
> 
> "mom ale access tour sheet to"
> 
> in the hit highlight.  My schema for this field is pretty much
> standard, as follows:
> 
> 
> 
> 
> 
> 
> 
> 
> When I examine the effect of each of these with the Analyzer, it seems
> like if I could use the output after LowerCaseFilterFactory in the hit
> highlight, I would come close to achieving what I want.
> 
> I'm not averse to doing the text cleanup external to Solr before the
> indexing, but only if it's *not* redundant to what the filter
> factories are going to do anyway.  Thanks for your help!
> 
> TerryG
