AJ Weber wrote
> There are alternative, paid libraries to parse and extract attachments
> from EML files as well.
> EML attachments will have a mimetype associated with their metadata.
Hello, can you give a hint as to which commercial libraries would do
the job? We need to index MSG files.
Hello, thanks for the answer, but let me explain the setup. We are running our
own backup solution for emails (messages from Exchange in MSG format).
The content of these messages is then indexed in SOLR. But SOLR cannot process
attachments within those MSG files and cannot OCR them. This is what I need -
to O
Hello, can you please advise me how to configure Solr so that the embedded Tika
is able to use Tesseract to OCR images? I have installed the
following software -
SOLR - 7.4.0
Tesseract - 4.1.1-rc2-20-g01fb
TIKA - 1.18
Tesseract is installed into the following directory:
/u
Yes, I did. This manual refers to the standalone version of TIKA, while I
have the built-in version.
--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Hello, thank you for the info, I will look into this as well. Yes, we plan to
use it in production, but in the longer run. For the moment I just need to
make it work as a test case.
Good day,
We solved the situation. Here is what was used and changed:
In our installation we used Tesseract version 3.05, Tika version 1.17, and SOLR
version 7.4. We actually had TIKA version 1.17, not 1.18.
1. Changed from HOCR to TXT >>>
in file parseContext.xml
2. Had to start SOLR as root
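
For anyone finding this later, a rough sketch of what the parseContext.xml change in step 1 might look like. This is not our exact file. It assumes the entry/property layout that Solr Cell reads from parseContext.xml and the outputType property of Tika's TesseractOCRConfig, so please check both against your own Solr and Tika versions:

```xml
<entries>
  <!-- Assumed entry: passes Tesseract OCR settings to the embedded Tika -->
  <entry class="org.apache.tika.parser.ocr.TesseractOCRConfig"
         impl="org.apache.tika.parser.ocr.TesseractOCRConfig">
    <!-- Ask for plain-text OCR output (TXT) instead of hOCR markup (HOCR) -->
    <property name="outputType" value="TXT"/>
  </entry>
</entries>
```

The file is only picked up if the extract request handler in solrconfig.xml points at it (in a stock setup, via the parseContext.config setting), and Solr needs a restart after the change.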