Ahh, makes sense. I did have a feeling I was barking up the wrong tree
since it's an Extraction issue, but I thought I'd throw it out there,
anyway.

Thanks so much for the information!
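
For anyone else trying to tie these bursts to specific documents, here is a
minimal sketch of how the errors could be bucketed by minute so the bursts can
be lined up against indexing requests. It assumes the log lines look exactly
like the one quoted further down in this thread (same timestamp layout, then
ERROR, then PDSimpleFont); the function name count_per_minute and the regex
are my own and would need adjusting for a different log format.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: timestamps look like "2/15/2016, 4:32:22 PM", followed by
# the level (ERROR) and the logger name (PDSimpleFont), as in the quoted log.
LINE_RE = re.compile(
    r"^(\d{1,2}/\d{1,2}/\d{4}, \d{1,2}:\d{2}:\d{2} [AP]M)\s+ERROR\s+PDSimpleFont\b"
)


def count_per_minute(lines):
    """Return a Counter mapping each minute to its number of PDSimpleFont errors."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not a PDSimpleFont error line
        stamp = datetime.strptime(match.group(1), "%m/%d/%Y, %I:%M:%S %p")
        counts[stamp.replace(second=0)] += 1  # truncate to the minute
    return counts
```

Passing an open log file to count_per_minute and sorting the result by key
gives a per-minute timeline that can be compared with the times the PDFs
were submitted for extraction.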

On Wed, Feb 17, 2016 at 4:49 PM, Rachel Lynn Underwood <
r.lynn.underw...@gmail.com> wrote:

> This is an error being thrown by Apache PDFBox/Tika. You're seeing it now
> because Solr 4.x uses a different Tika version than Solr 3.x.
>
> It looks like this error is thrown when you parse a PDF with Tika, and a
> font in that PDF doesn't have a ToUnicode mapping.
> https://issues.apache.org/jira/browse/PDFBOX-1408
>
> Another user reported that this might be related to special characters, but
> PDFBox developers haven't been able to reproduce the bug.
> https://issues.apache.org/jira/browse/PDFBOX-1706
>
> Since this isn't an issue in the Solr code, if you're concerned about it,
> you'll probably have better luck asking the PDFBox developers directly, via
> Jira or their mailing list.
>
>
> On Tue, Feb 16, 2016 at 12:08 PM, Joseph Hagerty <joa...@gmail.com> wrote:
>
> > Does literally nobody else see this error in their logs? I see this error
> > hundreds of times per day, in occasional bursts. Should I file this as a
> > bug?
> >
> > On Mon, Feb 15, 2016 at 4:56 PM, Joseph Hagerty <joa...@gmail.com> wrote:
> >
> > > After migrating from 3.5 to 4.10.3, I'm seeing the following error with
> > > alarming regularity in the master's error log:
> > >
> > > 2/15/2016, 4:32:22 PM ERROR PDSimpleFont Can't determine the width of
> > > the space character using 250 as default
> > >
> > > I can't seem to glean much information about this one from the web. Has
> > > anyone else fought this error?
> > >
> > > In case this helps, here's some technical/miscellaneous info:
> > >
> > > - I'm running a master-slave set-up.
> > >
> > > - I rely on the ERH (tika/solr-cell/whatever) for extracting plaintext
> > > from .docs and .pdfs. I'm guessing that PDSimpleFont is a component of
> > > this, but I don't know the first thing about it.
> > >
> > > - I have the clients specifying 'autocommit=6s' in their requests,
> > > which I realize is a pretty aggressive commit interval, but so far that
> > > hasn't caused any problems I couldn't surmount.
> > >
> > > - There are north of 11 million docs in my index, which is 36 gigs
> > > thick. The storage volume is only 10% full.
> > >
> > > - When I migrated from 3.5 to 4.10.3, I correctly performed a reindex
> > > due to incompatibility between versions.
> > >
> > > - Both master and slave are running on AWS instances, C4.4XL's (16
> > > cores, 30 gigs of RAM).
> > >
> > > So far, I have been unable to reproduce this error on my own: I can
> > > only observe it in the logs. I haven't been able to tie it to any
> > > specific document.
> > >
> > > Let me know if further information would be helpful.
> > >
> > >
> > >
> > >
> >
> >
> > --
> > - Joe
> >
>



-- 
- Joe
