Ahh, makes sense. I did have a feeling I was barking up the wrong tree since it's an Extraction issue, but I thought I'd throw it out there, anyway.
Thanks so much for the information!

On Wed, Feb 17, 2016 at 4:49 PM, Rachel Lynn Underwood <r.lynn.underw...@gmail.com> wrote:

> This is an error being thrown by Apache PDFBox/Tika. You're seeing it now
> because Solr 4.x uses a different Tika version than Solr 3.x.
>
> It looks like this error is thrown when you parse a PDF with Tika, and a
> font in that PDF doesn't have a ToUnicode mapping.
> https://issues.apache.org/jira/browse/PDFBOX-1408
>
> Another user reported that this might be related to special characters, but
> PDFBox developers haven't been able to reproduce the bug.
> https://issues.apache.org/jira/browse/PDFBOX-1706
>
> Since this isn't an issue in the Solr code, if you're concerned about it,
> you'll probably have better luck asking the PDFBox developers directly, via
> Jira or their mailing list.
>
> On Tue, Feb 16, 2016 at 12:08 PM, Joseph Hagerty <joa...@gmail.com> wrote:
>
> > Does literally nobody else see this error in their logs? I see this error
> > hundreds of times per day, in occasional bursts. Should I file this as a
> > bug?
> >
> > On Mon, Feb 15, 2016 at 4:56 PM, Joseph Hagerty <joa...@gmail.com> wrote:
> >
> > > After migrating from 3.5 to 4.10.3, I'm seeing the following error with
> > > alarming regularity in the master's error log:
> > >
> > > 2/15/2016, 4:32:22 PM ERROR PDSimpleFont Can't determine the width of the
> > > space character using 250 as default
> > >
> > > I can't seem to glean much information about this one from the web. Has
> > > anyone else fought this error?
> > >
> > > In case this helps, here's some technical/miscellaneous info:
> > >
> > > - I'm running a master-slave set-up.
> > >
> > > - I rely on the ERH (tika/solr-cell/whatever) for extracting plaintext
> > > from .docs and .pdfs. I'm guessing that PDSimpleFont is a component of
> > > this, but I don't know the first thing about it.
> > >
> > > - I have the clients specifying 'autocommit=6s' in their requests, which I
> > > realize is a pretty aggressive commit interval, but so far that hasn't
> > > caused any problems I couldn't surmount.
> > >
> > > - There are north of 11 million docs in my index, which is 36 gigs thick.
> > > The storage volume is only 10% full.
> > >
> > > - When I migrated from 3.5 to 4.10.3, I correctly performed a reindex due
> > > to incompatibility between versions.
> > >
> > > - Both master and slave are running on AWS instances, C4.4XL's (16 cores,
> > > 30 gigs of RAM).
> > >
> > > So far, I have been unable to reproduce this error on my own: I can only
> > > observe it in the logs. I haven't been able to tie it to any specific
> > > document.
> > >
> > > Let me know if further information would be helpful.
> >
> > --
> > - Joe

--
- Joe
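
Since the error could only be observed in the logs and not tied to a specific document, one possible way to narrow it down is to count the PDSimpleFont entries per minute and compare the bursts against the times PDFs were submitted for extraction. The following is a minimal sketch, not a tested tool: it assumes every log line has the same layout as the single sample line quoted in this thread, and the helper name count_per_minute and the sample data are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line layout, based only on the one sample quoted in this thread:
#   2/15/2016, 4:32:22 PM ERROR PDSimpleFont Can't determine the width ...
LINE_RE = re.compile(
    r"^(?P<ts>\d{1,2}/\d{1,2}/\d{4}, \d{1,2}:\d{2}:\d{2} [AP]M)\s+"
    r"(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+(?P<msg>.*)$"
)


def count_per_minute(lines, logger="PDSimpleFont"):
    """Count entries from `logger` per minute (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        # Skip lines that do not fit the assumed layout or come from
        # a different logger.
        if not match or match.group("logger") != logger:
            continue
        ts = datetime.strptime(match.group("ts"), "%m/%d/%Y, %I:%M:%S %p")
        counts[ts.replace(second=0)] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines that reuse the message quoted in the thread.
    msg = ("ERROR PDSimpleFont Can't determine the width of the "
           "space character using 250 as default")
    sample = [
        "2/15/2016, 4:32:22 PM " + msg,
        "2/15/2016, 4:32:40 PM " + msg,
        "2/15/2016, 4:33:01 PM " + msg,
    ]
    for minute, n in sorted(count_per_minute(sample).items()):
        print(minute.strftime("%Y-%m-%d %H:%M"), n)
```

To run it against a real log, pass an open file object instead of the sample list, and adjust LINE_RE if the log file on disk uses a different timestamp format than the one shown in the sample line.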