Re: tika/pdfbox knobs & levers

Markus Jelsma Wed, 13 Apr 2011 14:17:34 -0700

Hi,

I'm not sure Solr lets you adjust these Tika settings to get the 
desired output. At least a few useful Tika subsystems, such as Tika's 
BoilerpipeContentHandler, cannot be called from the 
ExtractingRequestHandler. I'm also not convinced that normalizing 
diacritics in the Tika output is a good idea: the stored data would then 
be normalized as well, which is not desirable.


You can, however, normalize diacritics in your field analyzer. That way 
indexing and searching are both normalized, but the stored data returned 
in results still holds the original diacritics.
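
As a rough sketch of what I mean, something like the following in 
schema.xml should work. The type name "text_folded" and the field name 
"content" are made up for illustration, and I'm assuming a Solr version 
that ships ASCIIFoldingFilterFactory; adjust the tokenizer and filters to 
your own setup.

```xml
<!-- Hypothetical field type: folds accented characters to ASCII equivalents
     in the indexed and queried tokens only; the stored value is untouched. -->
<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <!-- A single <analyzer> with no type attribute is used at both index and query time -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field that receives the extracted text -->
<field name="content" type="text_folded" indexed="true" stored="true"/>
```

With a setup along these lines, a query for "resume" should match a 
document containing "résumé", while the stored field still returns the 
text as Tika extracted it.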

Cheers,

> Hi all,
> 
> I'm wondering if there are any knobs or levers I can set in
> solrconfig.xml that affect how pdfbox text extraction is performed by
> the extraction handler. I would like to take advantage of pdfbox's
> ability to normalize diacritics and ligatures [1], but that doesn't
> seem to be the default behavior. Is there a way to enable this?
> 
> Thanks,
> --jay
> 
> [1]
> http://pdfbox.apache.org/apidocs/index.html?org/apache/pdfbox/util/TextNormalize.html
