Re: regarding Extracting text from Images

2020-01-22 Thread Steve Ge
In my experience, enabling Tika at the server level can exhaust heap memory under a high volume of extraction and bring down Solr entirely. This is likely because the garbage collector cannot keep up with the load; even tuning the garbage collector didn't resolve the problem completely.  Not recommen

Re: regarding Extracting text from Images

2020-01-22 Thread Retro
Good day, we solved the situation. Here is what was used and changed: in our installation we used Tesseract version 3.05, Tika version 1.17, SOLR version 7.4. We actually had TIKA version 1.17, not 1.18.
1. Changed from HOCR to TXT in the file parseContext.xml
2. Had to start SOLR as a root

Re: regarding Extracting text from Images

2020-01-21 Thread Retro
Hello, thank you for the info, I will look into this as well. Yes, we plan to use it in production, but in the longer run. For the moment I just need to make it work as a test case.

Re: regarding Extracting text from Images

2020-01-21 Thread Retro
Yes, I did. This manual refers to the standalone version of TIKA, while I have the built-in version.

Re: regarding Extracting text from Images

2020-01-17 Thread Marco Reis
Do you intend to use the solution in production? If so, combining Tika and Tesseract on the same server may not be a good choice. Tika and Tesseract are heavy consumers of processing power and can harm the main service of the solution, in your case the Solr service. I had the same situation here, and the com

Re: regarding Extracting text from Images

2020-01-17 Thread Jörn Franke
Have you checked this? https://cwiki.apache.org/confluence/display/TIKA/TikaOCR
> On 17.01.2020 at 10:54, Retro wrote:
>
> Hello, can you please advise me, how to configure Solr so that embedded Tika is able to use Tesseract to do the ocr of images? I have installed the following softwar

Re: regarding Extracting text from Images

2020-01-17 Thread Retro
Hello, can you please advise me how to configure Solr so that embedded Tika is able to use Tesseract to do the OCR of images? I have installed the following software:
SOLR - 7.4.0
Tesseract - 4.1.1-rc2-20-g01fb
TIKA - 1.18
Tesseract is installed into the following directory: /u

Re: regarding Extracting text from Images

2019-10-27 Thread Jörn Franke
Maybe an additional consideration: If you need to upgrade Solr then eventually you need to reindex. If you change fields or add fields then you need to reindex. Both are much faster if you have an external program that converts rich documents (pdf, word, ocr) to text once and you use the text
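The "convert once, reuse the text" idea above can be sketched as a small on-disk cache. This is only an illustration under assumptions: the function name `cached_text`, the cache layout, and the `extract` callable (which would wrap Tika or Tesseract in a real setup) are hypothetical and not taken from the thread.

```python
import hashlib
from pathlib import Path


def cached_text(source, cache_dir, extract):
    """Return extracted text for `source`, calling `extract` only on a cache miss.

    `extract` is a hypothetical callable (path -> str) standing in for the
    expensive Tika/Tesseract conversion step.
    """
    source = Path(source)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Key the cache on the file's content, so a changed document is re-extracted
    # while an unchanged one is served from disk during a reindex.
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    cached = cache_dir / f"{digest}.txt"

    if cached.exists():
        return cached.read_text(encoding="utf-8")

    text = extract(source)
    cached.write_text(text, encoding="utf-8")
    return text
```

With something like this in place, a reindex after a Solr upgrade or schema change would mostly read plain text files instead of repeating the OCR.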

Re: regarding Extracting text from Images

2019-10-27 Thread Erick Erickson
I would do neither. I’d put it all on an external server and use _that_, then send the finished docs to Solr. The problem with putting this all on Solr is at least three-fold: 1> you’re talking heavy-duty work here to do the OCR, which takes away from the available resources for searching and in
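A minimal sketch of the external approach described above, assuming the `tesseract` command-line tool is installed on the extraction machine and that Solr accepts JSON documents at its `/update` endpoint. The host, the collection name `docs`, and the field names `id` and `text_t` are placeholders chosen for illustration, not values from the thread.

```python
import json
import subprocess
import urllib.request


def ocr_image(image_path, lang="eng"):
    """Run the tesseract CLI on one image and return the recognized text."""
    # Passing "stdout" as the output base makes tesseract print the text
    # instead of writing it to a file.
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", lang],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def build_update_request(solr_base, collection, docs):
    """Build (but do not send) a JSON update request for finished documents."""
    url = f"{solr_base.rstrip('/')}/solr/{collection}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_image(image_path, solr_base="http://localhost:8983", collection="docs"):
    """OCR one image outside Solr, then send only the resulting text to Solr."""
    doc = {"id": image_path, "text_t": ocr_image(image_path)}
    request = build_update_request(solr_base, collection, [doc])
    with urllib.request.urlopen(request) as response:
        return response.status
```

The point of the split is that the OCR work in `ocr_image` can run on a separate machine, so Solr only ever receives plain text.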

Re: regarding Extracting text from Images

2019-10-26 Thread Edward Ribeiro
No. You should install tesseract-ocr on the same box as your Solr instance, and configure Solr so that embedded Tika is able to use Tesseract to do the OCR of images. Best, Edward
On Wed, Oct 23, 2019, 20:08, suresh pendap wrote:
> Hi Alex,
> Thanks for your reply. How do we integrate te

Re: regarding Extracting text from Images

2019-10-25 Thread Eric Pugh
Just to stir the pot on this topic, here is an article about why and how to use Tika inside of Solr: https://opensourceconnections.com/blog/2019/10/24/it-s-okay-to-run-tika-inside-of-solr-if-and-only-if/ > On Oct 23, 2019, at 7:21 PM, Erick Erickson wrote: > > Here’s a blog about why and how t

Re: regarding Extracting text from Images

2019-10-23 Thread Erick Erickson
Here’s a blog about why and how to use Tika outside Solr (and an RDBMS too, but you can pull that part out pretty easily): https://lucidworks.com/post/indexing-with-solrj/ > On Oct 23, 2019, at 7:16 PM, Alexandre Rafalovitch wrote: > > Again, I think you are best to do it out of Solr. > > Bu

Re: regarding Extracting text from Images

2019-10-23 Thread Alexandre Rafalovitch
Again, I think you are best off doing it outside of Solr. But even if you want to get it to work in Solr, I think you should start by getting it to work directly in Tika. Then, get the missing libraries and configuration into Solr. Regards, Alex On Wed, Oct 23, 2019, 7:08 PM suresh pendap wrote: > Hi Al
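One way to follow the "get it working directly in Tika first" advice is to check the pieces outside Solr before touching its configuration. The sketch below assumes a standalone tika-app jar is available and uses its `--text` option; the jar path and image name are placeholders, and the helper names are hypothetical.

```python
import shutil
import subprocess


def tesseract_available(binary="tesseract"):
    """Return True if the binary is on PATH and answers `--version`."""
    path = shutil.which(binary)
    if path is None:
        return False
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    return result.returncode == 0


def tika_text_command(tika_app_jar, image_path):
    """Build the command line for extracting plain text with standalone tika-app."""
    # The jar location is an assumption; point it at wherever tika-app was downloaded.
    return ["java", "-jar", str(tika_app_jar), "--text", str(image_path)]


def extract_with_tika(tika_app_jar, image_path):
    """Run tika-app on one file and return whatever text it extracts."""
    result = subprocess.run(
        tika_text_command(tika_app_jar, image_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

If `tesseract_available()` is false, or `extract_with_tika` returns no text for a scanned image, the problem is in the Tika/Tesseract setup rather than in Solr.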

Re: regarding Extracting text from Images

2019-10-23 Thread suresh pendap
Hi Alex, Thanks for your reply. How do we integrate Tesseract with Solr? Do we have to implement a custom update processor or extend the ExtractingRequestProcessor? Regards Suresh On Wed, Oct 23, 2019 at 11:21 AM Alexandre Rafalovitch wrote: > I believe Tika that powers this can do so with extra

Re: regarding Extracting text from Images

2019-10-23 Thread Alexandre Rafalovitch
I believe Tika, which powers this, can do so with extra libraries (tesseract?), but Solr does not bundle those extras. In any case, you may want to run Tika externally to keep the conversion/extraction process from being a burden to Solr itself. Regards, Alex On Wed, Oct 23, 2019, 1:58 PM suresh penda