These are my settings in the PDFParser.properties file inside tika-parsers-1.13.jar:
enableAutoSpace true
extractAnnotationText true
sortByPosition false
suppressDuplicateOverlappingText false
extractAcroFormContent true
extractInlineImages true
extractUniqueInlineImagesOnly true
checkExtractAccessPermission false
allowExtractionForAccessibility true
ifXFAExtractOnlyXFA false
catchIntermediateIOExceptions true

Regards,
Edwin

On 19 March 2017 at 09:08, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

> Hi Rick,
>
> Thanks for your reply.
> I saw this error message for the file which has a failure.
> Am I able to index such files together with the other files which store
> text as an image together in the same indexing threads?
>
>
> 2017-03-19 01:02:26.610 INFO (qtp1543727556-19) [c:collection1 s:shard1
> r:core_node1 x:collection1_shard1_replica2] o.a.s.u.DirectUpdateHandler2
> start commit{,optimize=false,openSearcher=true,waitSearcher=true,
> expungeDeletes=false,softCommit=false,prepareCommit=false}
> 2017-03-19 01:02:26.610 INFO (qtp1543727556-19) [c:collection1 s:shard1
> r:core_node1 x:collection1_shard1_replica2] o.a.s.u.SolrIndexWriter Calling
> setCommitData with IW:org.apache.solr.update.SolrIndexWriter@2330f07c
> 2017-03-19 01:02:26.610 ERROR (updateExecutor-2-thread-4-processing-n:
> 192.168.99.1:8983_solr x:collection1_shard1_replica2 s:shard1
> c:collection1 r:core_node1) [c:collection1 s:shard1 r:core_node1
> x:collection1_shard1_replica2] o.a.s.u.SolrCmdDistributor
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> Error from server at http://192.168.99.1:8984/solr/
> collection1_shard1_replica1: Expected mime type application/octet-stream
> but got text/html. <html>
> <head>
> <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
> <title>Error 404 </title>
> </head>
> <body>
> <h2>HTTP ERROR: 404</h2>
> <p>Problem accessing /solr/collection1_shard1_replica1/update. Reason:
> <pre> Not Found</pre></p>
> <hr />
> </body>
> </html>
>
> at org.apache.solr.client.solrj.impl.HttpSolrClient.
> executeMethod(HttpSolrClient.java:578)
> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(
> HttpSolrClient.java:279)
> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(
> HttpSolrClient.java:268)
> at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient.request(
> ConcurrentUpdateSolrClient.java:430)
> at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
> at org.apache.solr.update.SolrCmdDistributor.doRequest(
> SolrCmdDistributor.java:293)
> at org.apache.solr.update.SolrCmdDistributor.lambda$
> submit$0(SolrCmdDistributor.java:282)
> at java.util.concurrent.FutureTask.run(Unknown Source)
> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> at java.util.concurrent.FutureTask.run(Unknown Source)
> at com.codahale.metrics.InstrumentedExecutorService$
> InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.
> lambda$execute$0(ExecutorUtil.java:229)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
>
> 2017-03-19 01:02:26.657 INFO (qtp1543727556-19) [c:collection1 s:shard1
> r:core_node1 x:collection1_shard1_replica2] o.a.s.s.SolrIndexSearcher
> Opening [Searcher@77e108d5[collection1_shard1_replica2] main]
> 2017-03-19 01:02:26.658 INFO (qtp1543727556-19) [c:collection1 s:shard1
> r:core_node1 x:collection1_shard1_replica2] o.a.s.u.DirectUpdateHandler2
> end_commit_flush
> 2017-03-19 01:02:26.658 INFO (searcherExecutor-16-thread-1-processing-n:
> 192.168.99.1:8983_solr x:collection1_shard1_replica2 s:shard1
> c:collection1 r:core_node1) [c:collection1 s:shard1 r:core_node1
> x:collection1_shard1_replica2] o.a.s.c.QuerySenderListener
> QuerySenderListener sending requests to
> Searcher@77e108d5[collection1_shard1_replica2]
> main{ExitableDirectoryReader(UninvertingDirectoryReader(
> Uninverting(_0(6.4.2):C3)))}
> 2017-03-19 01:02:26.658 INFO (searcherExecutor-16-thread-1-processing-n:
> 192.168.99.1:8983_solr x:collection1_shard1_replica2 s:shard1
> c:collection1 r:core_node1) [c:collection1 s:shard1 r:core_node1
> x:collection1_shard1_replica2] o.a.s.c.QuerySenderListener
> QuerySenderListener done.
> 2017-03-19 01:02:26.659 INFO (searcherExecutor-16-thread-1-processing-n:
> 192.168.99.1:8983_solr x:collection1_shard1_replica2 s:shard1
> c:collection1 r:core_node1) [c:collection1 s:shard1 r:core_node1
> x:collection1_shard1_replica2] o.a.s.c.SolrCore
> [collection1_shard1_replica2] Registered new searcher Searcher@77e108d5
> [collection1_shard1_replica2] main{ExitableDirectoryReader(
> UninvertingDirectoryReader(Uninverting(_0(6.4.2):C3)))}
> 2017-03-19 01:02:26.659 INFO (qtp1543727556-19) [c:collection1 s:shard1
> r:core_node1 x:collection1_shard1_replica2]
> o.a.s.u.p.LogUpdateProcessorFactory
> [collection1_shard1_replica2] webapp=/solr path=/update
> params={update.distrib=FROMLEADER&update.chain=files-
> update-processor&waitSearcher=true&openSearcher=true&commit=
> true&softCommit=false&distrib.from=http://192.168.99.1:8983/
> solr/collection1_shard1_replica2/&commit_end_point=
> true&wt=javabin&version=2&expungeDeletes=false}{commit=} 0 49
> 2017-03-19 01:02:26.662 WARN (qtp1543727556-139) [c:collection1 s:shard1
> r:core_node1 x:collection1_shard1_replica2]
> o.a.s.u.p.DistributedUpdateProcessor
> Error sending update to http://192.168.99.1:8984/solr
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> Error from server at http://192.168.99.1:8984/solr/
> collection1_shard1_replica1: Expected mime type application/octet-stream
> but got text/html. <html>
> <head>
> <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
> <title>Error 404 </title>
> </head>
> <body>
> <h2>HTTP ERROR: 404</h2>
> <p>Problem accessing /solr/collection1_shard1_replica1/update.
> Reason:
> <pre> Not Found</pre></p>
> <hr />
> </body>
> </html>
>
> at org.apache.solr.client.solrj.impl.HttpSolrClient.
> executeMethod(HttpSolrClient.java:578)
> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(
> HttpSolrClient.java:279)
> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(
> HttpSolrClient.java:268)
> at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient.request(
> ConcurrentUpdateSolrClient.java:430)
> at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
> at org.apache.solr.update.SolrCmdDistributor.doRequest(
> SolrCmdDistributor.java:293)
> at org.apache.solr.update.SolrCmdDistributor.lambda$
> submit$0(SolrCmdDistributor.java:282)
> at java.util.concurrent.FutureTask.run(Unknown Source)
> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> at java.util.concurrent.FutureTask.run(Unknown Source)
> at com.codahale.metrics.InstrumentedExecutorService$
> InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.
> lambda$execute$0(ExecutorUtil.java:229)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
> 2017-03-19 01:02:26.662 INFO (qtp1543727556-139) [c:collection1 s:shard1
> r:core_node1 x:collection1_shard1_replica2]
> o.a.s.u.p.LogUpdateProcessorFactory
> [collection1_shard1_replica2] webapp=/solr path=/update
> params={commit=true}{commit=} 0 66
> 2017-03-19 01:02:43.019 INFO (qtp1543727556-21) [c:collection1 s:shard1
> r:core_node1 x:collection1_shard1_replica2] o.a.s.c.S.Request
> [collection1_shard1_replica2] webapp=/solr path=/admin/file
> params={wt=json&_=1489885363012} status=0 QTime=4
> 2017-03-19 01:02:45.453 INFO (qtp1543727556-19) [c:collection1 s:shard1
> r:core_node1 x:collection1_shard1_replica2] o.a.s.c.PluginBag Going to
> create a new requestHandler with {type = requestHandler,name =
> /select,class = solr.SearchHandler,attributes = {enable=true, startup=lazy,
> name=/select, class=solr.SearchHandler},args = {defaults={echoParams=
> explicit,rows=10,wt=json,indent=true,df=text,fl=id, content,
> content_type, content_cat, content_subcat, creation_date, subject, userid,
> author, entity, location, geolocation, visibility, accesslevel,
> accessgroup, reference, crossreference, resourcename, importance, tag,
> popularity, language_s, score}}}
> 2017-03-19 01:02:45.461 INFO (qtp1543727556-19) [c:collection1 s:shard1
> r:core_node1 x:collection1_shard1_replica2] o.a.s.c.S.Request
> [collection1_shard1_replica2] webapp=/solr path=/select
> params={q=*:*&indent=true&wt=json&_=1489885365450} hits=3 status=0 QTime=8
>
>
> Regards,
> Edwin
>
>
> On 19 March 2017 at 06:31, Rick Leir <rl...@leirtech.com> wrote:
>
>> Hi Edwin
>> The pdf file format can store text as an image, and then you need OCR to
>> get the text.
>> However, text is more commonly not stored as an image in the pdf, and
>> then you should not use OCR to get the text.
>>
>> Do you get an error message when you have a failure?
>> Cheers -- Rick
>>
>> On March 18, 2017 12:01:17 PM EDT, Zheng Lin Edwin Yeo
>> <edwinye...@gmail.com> wrote:
>> >Hi,
>> >
>> >I'm facing an issue where Tesseract OCR is occasionally unable to extract
>> >the words from a PDF file attached to an EML file and index them into Solr.
>> >Most of the time, however, the text is extracted.
>> >
>> >What could cause OCR extraction to fail for a PDF in an email attachment?
>> >
>> >I'm using Solr 6.4.2.
>> >
>> >Regards,
>> >Edwin
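
To illustrate Rick's point that OCR is only needed when a PDF stores its text as an image, here is a minimal sketch of how one might decide, per file, whether to run OCR at all. It assumes you have already run a plain (non-OCR) text extraction and know the page count. The function name `needs_ocr` and the `min_chars_per_page` threshold are hypothetical and are not part of Tika, Tesseract, or Solr; the threshold value is a guess that would need tuning on real documents.

```python
def needs_ocr(extracted_text: str, page_count: int,
              min_chars_per_page: int = 20) -> bool:
    """Guess whether a PDF stores its text as images.

    Heuristic (an assumption, not a Tika feature): if a plain text
    extraction yields almost no visible characters per page, the pages
    are probably scanned images and OCR is worth trying.
    """
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    # Count only non-whitespace characters so that blank lines and
    # form feeds emitted by the extractor do not look like real text.
    visible_chars = sum(1 for ch in extracted_text if not ch.isspace())
    return visible_chars / page_count < min_chars_per_page


if __name__ == "__main__":
    # A scanned PDF typically extracts to (nearly) nothing without OCR.
    print(needs_ocr("", page_count=3))
    # A PDF with a real text layer extracts plenty of characters.
    print(needs_ocr("Hello world " * 50, page_count=2))
```

With a check like this, text-based PDFs and image-only PDFs could go through the same indexing run, with OCR applied only to the files that appear to need it.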