Hmm, that is an ugly thing in PDFBox.  We should probably take this over to the 
PDFBox project.  How many threads are you indexing with?
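
For anyone following along, the quoted trace shows every indexing thread blocked in Collections$SynchronizedMap.get(), reached from PDFont.getAFM(). A minimal sketch of that pattern and one possible alternative is below. The class, the cache fields, and loadMetrics are hypothetical stand-ins, not the actual PDFBox code, and the ConcurrentHashMap swap is only an assumption about how such contention is commonly reduced.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a shared font-metrics cache; not PDFBox code. */
final class MetricsCacheSketch {

    // Pattern suggested by the trace: every get() takes the same monitor,
    // so concurrent readers serialize even when the entry already exists.
    static final Map<String, String> SYNCHRONIZED_CACHE =
            Collections.synchronizedMap(new HashMap<>());

    // Possible alternative: get() on existing entries does not lock.
    static final ConcurrentHashMap<String, String> CONCURRENT_CACHE =
            new ConcurrentHashMap<>();

    // Placeholder for an expensive load, such as parsing a metrics file.
    static String loadMetrics(String fontName) {
        return "metrics:" + fontName;
    }

    static String lookupSynchronized(String fontName) {
        String metrics = SYNCHRONIZED_CACHE.get(fontName);
        if (metrics == null) {
            metrics = loadMetrics(fontName);
            SYNCHRONIZED_CACHE.put(fontName, metrics);
        }
        return metrics;
    }

    static String lookupConcurrent(String fontName) {
        // Fast path: plain get() for entries that are already cached.
        String metrics = CONCURRENT_CACHE.get(fontName);
        if (metrics != null) {
            return metrics;
        }
        // Slow path: load at most once per key.
        return CONCURRENT_CACHE.computeIfAbsent(fontName, MetricsCacheSketch::loadMetrics);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<String>> results = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            results.add(pool.submit(() -> lookupConcurrent("Helvetica")));
        }
        for (Future<String> result : results) {
            System.out.println(result.get());
        }
        pool.shutdown();
    }
}
```

Both lookups return the same value. The difference is only in how readers behave once the cache is warm, which is the state a long bulk-indexing run spends most of its time in.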

FWIW, for that many documents, I might consider running Tika on the client
side to save a lot of network traffic.

-Grant

On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:

> I've been trying to bulk index about 11 million PDFs, and while profiling our 
> Solr instance, I noticed that all of the threads that are processing indexing 
> requests are constantly blocking each other during this call:
> 
> http-8080-Processor39 [BLOCKED] CPU time: 9:35
> java.util.Collections$SynchronizedMap.get(Object)
> org.pdfbox.pdmodel.font.PDFont.getAFM()
> org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
> org.pdfbox.util.PDFStreamEngine.showString(byte[])
> org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
> org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
> org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, COSStream)
> org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)
> org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)
> org.pdfbox.util.PDFTextStripper.processPages(List)
> org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)
> org.pdfbox.util.PDFTextStripper.getText(PDDocument)
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, Metadata)
> org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, Metadata)
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, SolrQueryResponse, ContentStream)
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, ServletResponse, FilterChain)
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, ServletResponse)
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, ServletResponse)
> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
> org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
> org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, Object[])
> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, TcpConnection, Object[])
> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
> java.lang.Thread.run()
> 
> Has anyone run into this before? Any ideas on how to reduce the contention?
> 
> Thanks,
> Gio.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search
