I found that this solution from Tim Allison on Stack Overflow works:
http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files

Regards,
Edwin

On 19 March 2017 at 19:47, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

> These are my settings in the PDFParser.properties file
> under tika-parsers-1.13.jar:
>
> enableAutoSpace true
> extractAnnotationText true
> sortByPosition false
> suppressDuplicateOverlappingText false
> extractAcroFormContent true
> extractInlineImages true
> extractUniqueInlineImagesOnly true
> checkExtractAccessPermission false
> allowExtractionForAccessibility true
> ifXFAExtractOnlyXFA false
> catchIntermediateIOExceptions true
>
> Regards,
> Edwin
>
>
> On 19 March 2017 at 09:08, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> wrote:
>
>> Hi Rick,
>>
>> Thanks for your reply.
>> I saw this error message for the file that failed.
>> Can such files be indexed in the same indexing threads as the other
>> files that store text as an image?
>>
>>
>> 2017-03-19 01:02:26.610 INFO (qtp1543727556-19) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.u.DirectUpdateHandler2 start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
>> 2017-03-19 01:02:26.610 INFO (qtp1543727556-19) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.u.SolrIndexWriter Calling setCommitData with IW:org.apache.solr.update.SolrIndexWriter@2330f07c
>> 2017-03-19 01:02:26.610 ERROR (updateExecutor-2-thread-4-processing-n:192.168.99.1:8983_solr x:collection1_shard1_replica2 s:shard1 c:collection1 r:core_node1) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.u.SolrCmdDistributor org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://192.168.99.1:8984/solr/collection1_shard1_replica1: Expected mime type application/octet-stream but got text/html. <html>
>> <head>
>> <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
>> <title>Error 404 </title>
>> </head>
>> <body>
>> <h2>HTTP ERROR: 404</h2>
>> <p>Problem accessing /solr/collection1_shard1_replica1/update. Reason:
>> <pre> Not Found</pre></p>
>> <hr />
>> </body>
>> </html>
>>
>> at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:578)
>> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279)
>> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268)
>> at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient.request(ConcurrentUpdateSolrClient.java:430)
>> at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
>> at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:293)
>> at org.apache.solr.update.SolrCmdDistributor.lambda$submit$0(SolrCmdDistributor.java:282)
>> at java.util.concurrent.FutureTask.run(Unknown Source)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>> at java.util.concurrent.FutureTask.run(Unknown Source)
>> at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
>> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>> at java.lang.Thread.run(Unknown Source)
>>
>> 2017-03-19 01:02:26.657 INFO (qtp1543727556-19) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.s.SolrIndexSearcher Opening [Searcher@77e108d5[collection1_shard1_replica2] main]
>> 2017-03-19 01:02:26.658 INFO (qtp1543727556-19) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.u.DirectUpdateHandler2 end_commit_flush
>> 2017-03-19 01:02:26.658 INFO (searcherExecutor-16-thread-1-processing-n:192.168.99.1:8983_solr x:collection1_shard1_replica2 s:shard1 c:collection1 r:core_node1) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.c.QuerySenderListener QuerySenderListener sending requests to Searcher@77e108d5[collection1_shard1_replica2] main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(6.4.2):C3)))}
>> 2017-03-19 01:02:26.658 INFO (searcherExecutor-16-thread-1-processing-n:192.168.99.1:8983_solr x:collection1_shard1_replica2 s:shard1 c:collection1 r:core_node1) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.c.QuerySenderListener QuerySenderListener done.
>> 2017-03-19 01:02:26.659 INFO (searcherExecutor-16-thread-1-processing-n:192.168.99.1:8983_solr x:collection1_shard1_replica2 s:shard1 c:collection1 r:core_node1) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.c.SolrCore [collection1_shard1_replica2] Registered new searcher Searcher@77e108d5[collection1_shard1_replica2] main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(6.4.2):C3)))}
>> 2017-03-19 01:02:26.659 INFO (qtp1543727556-19) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.u.p.LogUpdateProcessorFactory [collection1_shard1_replica2] webapp=/solr path=/update params={update.distrib=FROMLEADER&update.chain=files-update-processor&waitSearcher=true&openSearcher=true&commit=true&softCommit=false&distrib.from=http://192.168.99.1:8983/solr/collection1_shard1_replica2/&commit_end_point=true&wt=javabin&version=2&expungeDeletes=false}{commit=} 0 49
>> 2017-03-19 01:02:26.662 WARN (qtp1543727556-139) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.u.p.DistributedUpdateProcessor Error sending update to http://192.168.99.1:8984/solr
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://192.168.99.1:8984/solr/collection1_shard1_replica1: Expected mime type application/octet-stream but got text/html. <html>
>> <head>
>> <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
>> <title>Error 404 </title>
>> </head>
>> <body>
>> <h2>HTTP ERROR: 404</h2>
>> <p>Problem accessing /solr/collection1_shard1_replica1/update. Reason:
>> <pre> Not Found</pre></p>
>> <hr />
>> </body>
>> </html>
>>
>> at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:578)
>> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279)
>> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268)
>> at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient.request(ConcurrentUpdateSolrClient.java:430)
>> at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
>> at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:293)
>> at org.apache.solr.update.SolrCmdDistributor.lambda$submit$0(SolrCmdDistributor.java:282)
>> at java.util.concurrent.FutureTask.run(Unknown Source)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>> at java.util.concurrent.FutureTask.run(Unknown Source)
>> at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
>> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>> at java.lang.Thread.run(Unknown Source)
>> 2017-03-19 01:02:26.662 INFO (qtp1543727556-139) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.u.p.LogUpdateProcessorFactory [collection1_shard1_replica2] webapp=/solr path=/update params={commit=true}{commit=} 0 66
>> 2017-03-19 01:02:43.019 INFO (qtp1543727556-21) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.c.S.Request [collection1_shard1_replica2] webapp=/solr path=/admin/file params={wt=json&_=1489885363012} status=0 QTime=4
>> 2017-03-19 01:02:45.453 INFO (qtp1543727556-19) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.c.PluginBag Going to create a new requestHandler with {type = requestHandler,name = /select,class = solr.SearchHandler,attributes = {enable=true, startup=lazy, name=/select, class=solr.SearchHandler},args = {defaults={echoParams=explicit,rows=10,wt=json,indent=true,df=text,fl=id, content, content_type, content_cat, content_subcat, creation_date, subject, userid, author, entity, location, geolocation, visibility, accesslevel, accessgroup, reference, crossreference, resourcename, importance, tag, popularity, language_s, score}}}
>> 2017-03-19 01:02:45.461 INFO (qtp1543727556-19) [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2] o.a.s.c.S.Request [collection1_shard1_replica2] webapp=/solr path=/select params={q=*:*&indent=true&wt=json&_=1489885365450} hits=3 status=0 QTime=8
>>
>>
>> Regards,
>> Edwin
>>
>>
>> On 19 March 2017 at 06:31, Rick Leir <rl...@leirtech.com> wrote:
>>
>>> Hi Edwin
>>> The PDF file format can store text as an image, in which case you need
>>> OCR to get the text. More commonly, though, the text is not stored as an
>>> image, and then you should not use OCR to get it.
>>>
>>> Do you get an error message when you have a failure?
>>> Cheers -- Rick
>>>
>>> On March 18, 2017 12:01:17 PM EDT, Zheng Lin Edwin Yeo <
>>> edwinye...@gmail.com> wrote:
>>> >Hi,
>>> >
>>> >Occasionally, Tesseract OCR is not able to extract the words from a
>>> >PDF file that is attached to an EML file, so they are not indexed
>>> >into Solr. Most of the time, however, the text is extracted.
>>> >
>>> >What could cause the OCR extraction to fail for the file in the
>>> >email attachment?
>>> >
>>> >I'm using Solr 6.4.2.
>>> >
>>> >Regards,
>>> >Edwin
>>>
>>> --
>>> Sent from my Android device with K-9 Mail. Please excuse my brevity.
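
A minimal sketch for anyone comparing their own configuration against the PDFParser.properties settings quoted in this thread. It assumes the settings are written as `key value` (or `key=value`) lines, parses them, and checks that extractInlineImages is enabled. As far as I can tell, Tika 1.13 only passes the images embedded in a PDF on to the OCR parser when that setting is true. The helpers parse_settings and ocr_ready are hypothetical names made up for this illustration; they are not part of Tika or Solr.

```python
import re

# Settings as quoted in this thread (PDFParser.properties, tika-parsers-1.13.jar).
QUOTED_SETTINGS = """\
enableAutoSpace true
extractAnnotationText true
sortByPosition false
suppressDuplicateOverlappingText false
extractAcroFormContent true
extractInlineImages true
extractUniqueInlineImagesOnly true
checkExtractAccessPermission false
allowExtractionForAccessibility true
ifXFAExtractOnlyXFA false
catchIntermediateIOExceptions true
"""


def parse_settings(text):
    """Parse 'key value', 'key=value' or 'key:value' lines into a dict of booleans."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        # Split once on the first separator: '=', ':' or whitespace.
        parts = re.split(r"\s*[=:\s]\s*", line, maxsplit=1)
        if len(parts) == 2:
            key, value = parts
            settings[key] = value.strip().lower() == "true"
    return settings


def ocr_ready(settings):
    """Hypothetical check: OCR needs the PDF's inline images to be extracted."""
    return settings.get("extractInlineImages", False)


settings = parse_settings(QUOTED_SETTINGS)
print("extractInlineImages:", settings["extractInlineImages"])
print("inline images available to OCR:", ocr_ready(settings))
```

This only inspects the configuration text; it does not confirm that Tesseract itself is installed or reachable from the Solr process.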