In my experience, enabling Tika at the server level can exhaust heap memory
under a high volume of extraction and bring down Solr entirely. This is likely
due to the garbage collector not being able to keep up with the load; even
tuning the garbage collector didn't resolve the problem completely. Not recommen
Good day,
We solved the situation. Here is what was used and changed:
In our installation we used Tesseract version 3.05, Tika version 1.17, and Solr
version 7.4. Note that we actually had Tika version 1.17, not 1.18.
1. Changed the output type from HOCR to TXT in the file parseContext.xml
2. Had to start Solr as root
Hello, thank you for the info, I will look into this as well. Yes, we plan to
use it in production, but in the longer run. For the moment I just need to
make it work as a test case.
--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Yes, I did. This manual refers to the standalone version of Tika, while I
have the built-in version.
Are you intending to use the solution in production? If so, combining Tika
and Tesseract on the same server may not be a good choice.
Tika and Tesseract are heavy consumers of processing power and can harm the
main service of the solution, which in your case is Solr.
I had the same situation here, and the com
Have you checked this?
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR
> On 17.01.2020 at 10:54, Retro wrote:
>
> Hello, can you please advise me, how to configure Solr so that embedded Tika
> is able to use Tesseract to do the ocr of images? I have installed the
> following softwar
Hello, can you please advise me how to configure Solr so that embedded Tika
is able to use Tesseract to do OCR of images? I have installed the
following software:
SOLR - 7.4.0
Tesseract - 4.1.1-rc2-20-g01fb
TIKA - TIKA 1.18
Tesseract is installed in the following directory:
/u
Maybe some additional considerations:
If you need to upgrade Solr then eventually you need to reindex.
If you change fields or add fields then you need to reindex.
Both are much faster if you have an external program that converts rich
documents (PDF, Word, OCR) to text once and you use the text
I would do neither. I’d put it all on an external server and use _that_, then
send the finished docs to Solr.
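A minimal sketch of that flow, assuming a Tika Server instance running on a
separate machine and a Solr collection that accepts JSON updates. The host
names, the collection name "mycollection" and the field name "content_txt"
are placeholders for illustration, not anything from this thread:

```python
import json
import urllib.request

# Hypothetical endpoints: adjust hosts, ports and collection to your setup.
TIKA_URL = "http://tika-host:9998/tika"
SOLR_UPDATE_URL = "http://solr-host:8983/solr/mycollection/update?commit=true"


def build_tika_request(data, tika_url=TIKA_URL):
    """Build a PUT request asking Tika Server for the plain text of a document."""
    return urllib.request.Request(
        tika_url, data=data, method="PUT", headers={"Accept": "text/plain"}
    )


def build_solr_request(doc_id, text, solr_url=SOLR_UPDATE_URL):
    """Build a POST request that sends one already-extracted document to Solr."""
    # "content_txt" is an assumed field name; use whatever your schema defines.
    body = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_file(path, doc_id):
    """Extract text on the external Tika server, then index the text in Solr."""
    with open(path, "rb") as fh:
        tika_request = build_tika_request(fh.read())
    with urllib.request.urlopen(tika_request) as response:
        text = response.read().decode("utf-8")
    with urllib.request.urlopen(build_solr_request(doc_id, text)) as response:
        response.read()
```

With this split, the OCR and parsing load stays on the Tika machine, and the
extracted text could also be cached so that a later reindex does not have to
repeat the extraction.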
The problem with putting this all on Solr is at least three-fold:
1> you’re talking heavy-duty work here to do the OCR, which takes away from the
available resources for searching and in
No. You should install tesseract-ocr on the same box as your Solr instance,
and configure Solr so that embedded Tika is able to use Tesseract to do
OCR of images.
Best,
Edward
On Wed, Oct 23, 2019, 20:08, suresh pendap
wrote:
> Hi Alex,
> Thanks for your reply. How do we integrate te
Just to stir the pot on this topic, here is an article about why and how to use
Tika inside of Solr:
https://opensourceconnections.com/blog/2019/10/24/it-s-okay-to-run-tika-inside-of-solr-if-and-only-if/
> On Oct 23, 2019, at 7:21 PM, Erick Erickson wrote:
>
> Here’s a blog about why and how t
Here’s a blog about why and how to use Tika outside Solr (and an RDBMS too, but
you can pull that part out pretty easily):
https://lucidworks.com/post/indexing-with-solrj/
> On Oct 23, 2019, at 7:16 PM, Alexandre Rafalovitch wrote:
>
> Again, I think you are best to do it out of Solr.
>
> Bu
Again, I think you are best off doing it outside of Solr.
But even if you want to get it to work in Solr, I think you should start by
getting it to work directly in Tika. Then, get the missing libraries and
configuration into Solr.
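One way to check the "directly in Tika" step is a small script that confirms
the tesseract binary is on the PATH and then runs the standalone Tika app on a
test image. This is only a sketch: the jar file name is a placeholder, and it
assumes Tika is left to find Tesseract via the PATH rather than through an
explicitly configured path:

```python
import shutil
import subprocess

# Hypothetical location of the standalone Tika app jar.
TIKA_APP_JAR = "tika-app-1.17.jar"


def tesseract_on_path():
    """Return True if a tesseract executable can be found on the PATH."""
    return shutil.which("tesseract") is not None


def tika_text_command(document, jar=TIKA_APP_JAR):
    """Build the command line that asks the Tika app for plain-text output."""
    return ["java", "-jar", jar, "--text", document]


def extract_text(document):
    """Run the Tika app on one document and return whatever text it produced."""
    if not tesseract_on_path():
        raise RuntimeError("tesseract not found on PATH; OCR would be skipped")
    result = subprocess.run(
        tika_text_command(document), capture_output=True, text=True, check=True
    )
    return result.stdout
```

If extract_text returns the text of a scanned image here, the remaining work
is getting the same libraries and configuration into Solr.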
Regards,
Alex
On Wed, Oct 23, 2019, 7:08 PM suresh pendap, wrote:
> Hi Al
Hi Alex,
Thanks for your reply. How do we integrate Tesseract with Solr? Do we have
to implement a custom update processor or extend the
ExtractingRequestProcessor?
Regards
Suresh
On Wed, Oct 23, 2019 at 11:21 AM Alexandre Rafalovitch
wrote:
> I believe Tika that powers this can do so with extra
I believe Tika, which powers this, can do so with extra libraries (Tesseract?),
but Solr does not bundle those extras.
In any case, you may want to run Tika externally so that the
conversion/extraction process is not a burden on Solr itself.
Regards,
Alex
On Wed, Oct 23, 2019, 1:58 PM suresh penda