I found that this solution from Tim Allison on Stack Overflow works.

http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files
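
For anyone checking their own setup: the switch that seems to matter for scanned PDFs is extractInlineImages, since Tesseract can only OCR page images that the PDF parser actually extracts. That reading is an assumption here and is not spelled out in this thread. Below is a minimal Python sketch that checks the flag in a PDFParser.properties-style file like the one quoted further down. The function names and the sample text are made up for illustration and are not part of Tika, and the parser only handles the whitespace-separated "key value" form.

```python
def parse_settings(text):
    """Parse whitespace-separated 'key value' lines into a dict of booleans.

    Blank lines, '#' comments, and lines without a value are skipped.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts
        settings[key] = value.strip().lower() == "true"
    return settings


def inline_images_enabled(settings):
    """Return True only if extractInlineImages is present and set to true."""
    return settings.get("extractInlineImages", False)


# Illustrative sample, shortened from the settings quoted in this thread.
SAMPLE = """\
enableAutoSpace true
extractInlineImages true
extractUniqueInlineImagesOnly true
sortByPosition false
"""

if __name__ == "__main__":
    parsed = parse_settings(SAMPLE)
    print("extractInlineImages enabled:", inline_images_enabled(parsed))
```

To check a real file, read its contents into a string and pass that to parse_settings in place of SAMPLE.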

Regards,
Edwin

On 19 March 2017 at 19:47, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

> These are my settings in the PDFParser.properties file
> inside tika-parsers-1.13.jar:
>
> enableAutoSpace true
> extractAnnotationText true
> sortByPosition false
> suppressDuplicateOverlappingText false
> extractAcroFormContent true
> extractInlineImages true
> extractUniqueInlineImagesOnly true
> checkExtractAccessPermission false
> allowExtractionForAccessibility true
> ifXFAExtractOnlyXFA false
> catchIntermediateIOExceptions true
>
> Regards,
> Edwin
>
>
> On 19 March 2017 at 09:08, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> wrote:
>
>> Hi Rick,
>>
>> Thanks for your reply.
>> I saw this error message for the file that failed.
>> Can I index such files, in the same indexing threads, together with the
>> other files that store text as an image?
>>
>>
>> 2017-03-19 01:02:26.610 INFO  (qtp1543727556-19) [c:collection1 s:shard1
>> r:core_node1 x:collection1_shard1_replica2] o.a.s.u.DirectUpdateHandler2
>> start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
>> 2017-03-19 01:02:26.610 INFO  (qtp1543727556-19) [c:collection1 s:shard1
>> r:core_node1 x:collection1_shard1_replica2] o.a.s.u.SolrIndexWriter Calling
>> setCommitData with IW:org.apache.solr.update.SolrIndexWriter@2330f07c
>> 2017-03-19 01:02:26.610 ERROR
>> (updateExecutor-2-thread-4-processing-n:192.168.99.1:8983_solr
>> x:collection1_shard1_replica2 s:shard1 c:collection1 r:core_node1)
>> [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2]
>> o.a.s.u.SolrCmdDistributor
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> Error from server at
>> http://192.168.99.1:8984/solr/collection1_shard1_replica1: Expected mime
>> type application/octet-stream but got text/html. <html>
>> <head>
>> <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
>> <title>Error 404 </title>
>> </head>
>> <body>
>> <h2>HTTP ERROR: 404</h2>
>> <p>Problem accessing /solr/collection1_shard1_replica1/update. Reason:
>> <pre>    Not Found</pre></p>
>> <hr />
>> </body>
>> </html>
>>
>> at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:578)
>> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279)
>> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268)
>> at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient.request(ConcurrentUpdateSolrClient.java:430)
>> at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
>> at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:293)
>> at org.apache.solr.update.SolrCmdDistributor.lambda$submit$0(SolrCmdDistributor.java:282)
>> at java.util.concurrent.FutureTask.run(Unknown Source)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>> at java.util.concurrent.FutureTask.run(Unknown Source)
>> at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
>> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>> at java.lang.Thread.run(Unknown Source)
>>
>> 2017-03-19 01:02:26.657 INFO  (qtp1543727556-19) [c:collection1 s:shard1
>> r:core_node1 x:collection1_shard1_replica2] o.a.s.s.SolrIndexSearcher
>> Opening [Searcher@77e108d5[collection1_shard1_replica2] main]
>> 2017-03-19 01:02:26.658 INFO  (qtp1543727556-19) [c:collection1 s:shard1
>> r:core_node1 x:collection1_shard1_replica2] o.a.s.u.DirectUpdateHandler2
>> end_commit_flush
>> 2017-03-19 01:02:26.658 INFO
>> (searcherExecutor-16-thread-1-processing-n:192.168.99.1:8983_solr
>> x:collection1_shard1_replica2 s:shard1 c:collection1 r:core_node1)
>> [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2]
>> o.a.s.c.QuerySenderListener QuerySenderListener sending requests to
>> Searcher@77e108d5[collection1_shard1_replica2]
>> main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(6.4.2):C3)))}
>> 2017-03-19 01:02:26.658 INFO
>> (searcherExecutor-16-thread-1-processing-n:192.168.99.1:8983_solr
>> x:collection1_shard1_replica2 s:shard1
>> c:collection1 r:core_node1) [c:collection1 s:shard1 r:core_node1
>> x:collection1_shard1_replica2] o.a.s.c.QuerySenderListener
>> QuerySenderListener done.
>> 2017-03-19 01:02:26.659 INFO
>> (searcherExecutor-16-thread-1-processing-n:192.168.99.1:8983_solr
>> x:collection1_shard1_replica2 s:shard1 c:collection1 r:core_node1)
>> [c:collection1 s:shard1 r:core_node1 x:collection1_shard1_replica2]
>> o.a.s.c.SolrCore [collection1_shard1_replica2] Registered new searcher
>> Searcher@77e108d5[collection1_shard1_replica2]
>> main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(6.4.2):C3)))}
>> 2017-03-19 01:02:26.659 INFO  (qtp1543727556-19) [c:collection1 s:shard1
>> r:core_node1 x:collection1_shard1_replica2] 
>> o.a.s.u.p.LogUpdateProcessorFactory
>> [collection1_shard1_replica2]  webapp=/solr path=/update
>> params={update.distrib=FROMLEADER&update.chain=files-update-processor&waitSearcher=true&openSearcher=true&commit=true&softCommit=false&distrib.from=http://192.168.99.1:8983/solr/collection1_shard1_replica2/&commit_end_point=true&wt=javabin&version=2&expungeDeletes=false}{commit=}
>> 0 49
>> 2017-03-19 01:02:26.662 WARN  (qtp1543727556-139) [c:collection1 s:shard1
>> r:core_node1 x:collection1_shard1_replica2] 
>> o.a.s.u.p.DistributedUpdateProcessor
>> Error sending update to http://192.168.99.1:8984/solr
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> Error from server at
>> http://192.168.99.1:8984/solr/collection1_shard1_replica1: Expected mime
>> type application/octet-stream but got text/html. <html>
>> <head>
>> <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
>> <title>Error 404 </title>
>> </head>
>> <body>
>> <h2>HTTP ERROR: 404</h2>
>> <p>Problem accessing /solr/collection1_shard1_replica1/update. Reason:
>> <pre>    Not Found</pre></p>
>> <hr />
>> </body>
>> </html>
>>
>> at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:578)
>> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279)
>> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268)
>> at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient.request(ConcurrentUpdateSolrClient.java:430)
>> at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
>> at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:293)
>> at org.apache.solr.update.SolrCmdDistributor.lambda$submit$0(SolrCmdDistributor.java:282)
>> at java.util.concurrent.FutureTask.run(Unknown Source)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>> at java.util.concurrent.FutureTask.run(Unknown Source)
>> at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
>> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>> at java.lang.Thread.run(Unknown Source)
>> 2017-03-19 01:02:26.662 INFO  (qtp1543727556-139) [c:collection1 s:shard1
>> r:core_node1 x:collection1_shard1_replica2] 
>> o.a.s.u.p.LogUpdateProcessorFactory
>> [collection1_shard1_replica2]  webapp=/solr path=/update
>> params={commit=true}{commit=} 0 66
>> 2017-03-19 01:02:43.019 INFO  (qtp1543727556-21) [c:collection1 s:shard1
>> r:core_node1 x:collection1_shard1_replica2] o.a.s.c.S.Request
>> [collection1_shard1_replica2]  webapp=/solr path=/admin/file
>> params={wt=json&_=1489885363012} status=0 QTime=4
>> 2017-03-19 01:02:45.453 INFO  (qtp1543727556-19) [c:collection1 s:shard1
>> r:core_node1 x:collection1_shard1_replica2] o.a.s.c.PluginBag Going to
>> create a new requestHandler with {type = requestHandler,name =
>> /select,class = solr.SearchHandler,attributes = {enable=true, startup=lazy,
>> name=/select, class=solr.SearchHandler},args =
>> {defaults={echoParams=explicit,rows=10,wt=json,indent=true,df=text,fl=id,
>> content, content_type, content_cat, content_subcat, creation_date, subject,
>> userid, author, entity, location, geolocation, visibility, accesslevel,
>> accessgroup, reference, crossreference, resourcename, importance, tag,
>> popularity, language_s, score}}}
>> 2017-03-19 01:02:45.461 INFO  (qtp1543727556-19) [c:collection1 s:shard1
>> r:core_node1 x:collection1_shard1_replica2] o.a.s.c.S.Request
>> [collection1_shard1_replica2]  webapp=/solr path=/select
>> params={q=*:*&indent=true&wt=json&_=1489885365450} hits=3 status=0
>> QTime=8
>>
>>
>> Regards,
>> Edwin
>>
>>
>> On 19 March 2017 at 06:31, Rick Leir <rl...@leirtech.com> wrote:
>>
>>> Hi Edwin
>>> The pdf file format can store text as an image, and then you need OCR to
>>> get the text. However, text is more commonly not stored as an image in the
>>> pdf, and then you should not use OCR to get the text.
>>>
>>> Do you get an error message when you have a failure?
>>> Cheers -- Rick
>>>
>>> On March 18, 2017 12:01:17 PM EDT, Zheng Lin Edwin Yeo <
>>> edwinye...@gmail.com> wrote:
>>> >Hi,
>>> >
>>> >Occasionally, Tesseract OCR fails to extract the words from a PDF file
>>> >attached to an EML file, so the text is not indexed into Solr. Most of
>>> >the time, however, the extraction works.
>>> >
>>> >What could cause OCR extraction to fail for a PDF in an email
>>> >attachment?
>>> >
>>> >I'm using Solr 6.4.2.
>>> >
>>> >Regards,
>>> >Edwin
>>>
>>> --
>>> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>>
>>
>>
>
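
The Solr log quoted above shows an update being forwarded to http://192.168.99.1:8984/solr/collection1_shard1_replica1 and coming back as an HTML "HTTP ERROR: 404" page rather than the expected application/octet-stream response. A rough Python sketch for pulling such failures out of a log follows. It assumes the log text has been unwrapped, that is, without the mail client's hard line breaks and ">>" quote markers. The function name and the sample text are made up for illustration.

```python
import re

# Matches the RemoteSolrException message format seen in the quoted log.
REMOTE_ERROR = re.compile(
    r"Error from server at (?P<url>\S+): "
    r"Expected mime type (?P<expected>\S+) but got (?P<actual>\S+)"
)
# Matches the status line in the HTML error page that follows the message.
HTTP_STATUS = re.compile(r"HTTP ERROR:? (?P<code>\d{3})")


def summarize_remote_errors(log_text):
    """Return one dict per remote error: url, expected/actual MIME type, status."""
    matches = list(REMOTE_ERROR.finditer(log_text))
    results = []
    for i, match in enumerate(matches):
        # Only look for the HTTP status between this error and the next one.
        stop = matches[i + 1].start() if i + 1 < len(matches) else len(log_text)
        status = HTTP_STATUS.search(log_text, match.end(), stop)
        results.append({
            "url": match.group("url"),
            "expected": match.group("expected"),
            # The log ends the sentence with a period; strip it from the type.
            "actual": match.group("actual").rstrip("."),
            "status": int(status.group("code")) if status else None,
        })
    return results


# Illustrative, unwrapped excerpt modeled on the log in this thread.
SAMPLE_LOG = (
    "ERROR o.a.s.u.SolrCmdDistributor "
    "org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: "
    "Error from server at "
    "http://192.168.99.1:8984/solr/collection1_shard1_replica1: "
    "Expected mime type application/octet-stream but got text/html. <html>\n"
    "<h2>HTTP ERROR: 404</h2>\n"
)

if __name__ == "__main__":
    for entry in summarize_remote_errors(SAMPLE_LOG):
        print(entry)
```

This only reports which endpoint returned the error page. It does not say why the replica was not reachable at that URL.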
