Hi,

I'm not sure whether Solr lets you adjust these Tika settings to get the output you want. At least a few useful Tika subsystems, such as Tika's BoilerpipeContentHandler, cannot be called from the ExtractingRequestHandler. I'm also not sure it's a good idea to normalize diacritics in the Tika output: the stored data would then be normalized as well, which is not desirable.
You can, however, normalize diacritics in your field analyzer. This way your search is normalized but the returned data still holds the diacritics, which is what you want.

Cheers,

> Hi all,
>
> I'm wondering if there are any knobs or levers I can set in
> solrconfig.xml that affect how pdfbox text extraction is performed by
> the extraction handler. I would like to take advantage of pdfbox's
> ability to normalize diacritics and ligatures [1], but that doesn't
> seem to be the default behavior. Is there a way to enable this?
>
> Thanks,
> --jay
>
> [1] http://pdfbox.apache.org/apidocs/index.html?org/apache/pdfbox/util/TextNormalize.html
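
For the field-analyzer approach suggested in the reply, here is a minimal sketch of what the schema.xml side could look like. The field type name "text_folded" and the field name "content" are made up for illustration, and the tokenizer and lowercase filter are just one plausible chain, so adapt it to your own schema. The relevant part is solr.ASCIIFoldingFilterFactory, which folds accented characters to their ASCII equivalents in the indexed and query terms only.

```xml
<!-- schema.xml fragment; "text_folded" and "content" are hypothetical names -->
<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Folds diacritics to ASCII in the terms; the stored value is not changed -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<!-- stored="true" returns the original extracted text, diacritics included -->
<field name="content" type="text_folded" indexed="true" stored="true"/>
```

Because the same analyzer runs at index and query time, a search for "resume" should match a document containing "résumé", while the returned field value keeps the original characters. You need to reindex after changing the analyzer.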