Re: Using Tesseract OCR to extract PDF files in EML file attachment

2019-10-11 Thread Retro
AJ Weber wrote:
> There are alternative, paid libraries to parse and extract attachments from EML files as well
> EML attachments will have a mimetype associated with their metadata.

Hello, can you give a hint as to which commercial libraries would do the job? We need to index MSG files
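
For readers following the quoted point about mimetypes: a minimal sketch, assuming plain EML (RFC 5322) input and using only Python's standard `email` package, of how attachments can be selected by their declared content type. The function names `pdf_attachments` and `build_demo_eml` are hypothetical helpers made up for this illustration. This does not cover Outlook MSG files, which use a different container format that the standard library does not parse.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def pdf_attachments(raw_eml: bytes):
    """Return (filename, payload_bytes) for each top-level PDF attachment.

    Hypothetical helper: selection relies on the declared MIME type only,
    so a PDF mislabeled as application/octet-stream would be missed.
    """
    # policy.default makes the parser return EmailMessage objects,
    # which provide iter_attachments().
    msg = message_from_bytes(raw_eml, policy=policy.default)
    found = []
    for part in msg.iter_attachments():
        if part.get_content_type() == "application/pdf":
            # decode=True undoes the transfer encoding (e.g. base64).
            found.append((part.get_filename(), part.get_payload(decode=True)))
    return found


def build_demo_eml() -> bytes:
    """Build a small in-memory EML message with one PDF and one text attachment."""
    msg = EmailMessage()
    msg["Subject"] = "demo"
    msg["From"] = "sender@example.com"
    msg["To"] = "recipient@example.com"
    msg.set_content("See attached.")
    msg.add_attachment(b"%PDF-1.4 demo", maintype="application",
                       subtype="pdf", filename="scan.pdf")
    msg.add_attachment("plain notes", filename="notes.txt")
    return msg.as_bytes()


if __name__ == "__main__":
    for name, data in pdf_attachments(build_demo_eml()):
        print(name, len(data), "bytes")
```

The extracted bytes could then be handed to whatever OCR or indexing step is in use; attachments nested inside forwarded messages would need a recursive walk instead of `iter_attachments()`.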

RE: Using Tesseract OCR to extract PDF files in EML file attachment

2019-10-14 Thread Retro
Hello, thanks for the answer, but let me explain the setup. We are running our own backup solution for emails (messages from Exchange in MSG format). The content of these messages is then indexed in SOLR. But SOLR cannot process attachments within those MSG files and cannot OCR them. This is what I need - to O

Re: regarding Extracting text from Images

2020-01-17 Thread Retro
Hello, can you please advise me how to configure Solr so that the embedded Tika is able to use Tesseract to OCR images? I have installed the following software: SOLR 7.4.0; Tesseract 4.1.1-rc2-20-g01fb; TIKA 1.18. Tesseract is installed into the following directory: /u

Re: regarding Extracting text from Images

2020-01-21 Thread Retro
Yes, I did. This manual refers to the standalone version of TIKA, while I have the built-in version. -- Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: regarding Extracting text from Images

2020-01-21 Thread Retro
Hello, thank you for the info, I will look into this as well. Yes, we plan to use it in production, but in the longer run. For the moment I just need to make it work as a test case.

Re: regarding Extracting text from Images

2020-01-22 Thread Retro
Good day,
We solved the situation. Here is what was used and changed: in our installation we used Tesseract version 3.05, Tika version 1.17, and SOLR version 7.4. (We actually had TIKA version 1.17, not 1.18.)
1. Changed from HOCR to TXT in the file parseContext.xml
2. Had to start SOLR as a root