Re: Fwd: configuring Solr with Tesseract

Rick Leir Mon, 06 Nov 2017 04:06:30 -0800

Anand,
As Charlie says, you should have a separate process for this. Also, if you go 
back about ten months in this mailing list you will see some discussion about 
how OCR can take minutes of CPU per page, and needs some preprocessing with 
ImageMagick or GraphicsMagick. You will want to do some fine tuning with this, 
then save your OCR output in a DB or the filesystem. That way you can re-index 
Solr easily as you fine tune Solr, without repeating the OCR.
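
A minimal sketch of the "save your OCR output" step, assuming the tesseract 
command-line tool is on the PATH. The function names and the cache layout are 
hypothetical, made up for illustration; any ImageMagick or GraphicsMagick 
preprocessing would run before the OCR call and is left out here.

```python
import hashlib
import subprocess
from pathlib import Path


def run_tesseract(image_path: Path) -> str:
    """Run the tesseract CLI on one image and return the recognised text."""
    # "tesseract <image> stdout" writes the text to standard output.
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def ocr_with_cache(image_path: Path, cache_dir: Path, ocr=run_tesseract) -> str:
    """Return OCR text for image_path, reusing a saved copy when one exists."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache on the file contents (an assumption of this sketch),
    # so a renamed or moved image is not OCRed a second time.
    digest = hashlib.sha256(image_path.read_bytes()).hexdigest()
    cached = cache_dir / f"{digest}.txt"
    if cached.exists():
        return cached.read_text(encoding="utf-8")
    text = ocr(image_path)  # the slow step: can take minutes of CPU per page
    cached.write_text(text, encoding="utf-8")
    return text
```

The ocr parameter is there so the slow step can be swapped out, for example 
for a wrapper that preprocesses the image first.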


Yes, use Python or your preferred scripting language.
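
For the re-indexing side, something along these lines might do: a rough 
sketch that reads saved OCR text files and posts them to Solr's JSON update 
handler. The directory layout and the field name text_txt are assumptions 
(text_txt relies on a *_txt dynamic field being defined in your schema), so 
adjust them to your own setup.

```python
import json
import urllib.request
from pathlib import Path


def build_docs(text_dir: Path) -> list:
    """Turn every saved .txt file in text_dir into a Solr document dict."""
    docs = []
    for path in sorted(text_dir.glob("*.txt")):
        # The file name (without extension) doubles as the document id.
        # "text_txt" is an assumed field name, not something Solr requires.
        docs.append({"id": path.stem, "text_txt": path.read_text(encoding="utf-8")})
    return docs


def post_to_solr(docs: list, update_url: str) -> int:
    """POST the documents as a JSON array to Solr; return the HTTP status."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

With a core called mycore (a made-up name), update_url would look like 
http://localhost:8983/solr/mycore/update?commit=true. Because the text is 
read back from disk, this can be re-run after every schema change without 
touching Tesseract.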
Cheers -- Rick

On November 6, 2017 4:05:42 AM EST, Charlie Hull <char...@flax.co.uk> wrote:
>On 03/11/2017 15:32, Admin eLawJournal wrote:
>> Hi,
>> I have read that we can use tesseract with solr to index image files. I
>> would like some guidance on setting this up.
>> 
>> Currently, I am using solr for searching my wordpress installation via the
>> WPSOLR plugin.
>> 
>> I have Solr 6.6 installed on ubuntu 14.04 which is working fine with
>> wordpress.
>> 
>> I have also installed tesseract but have no clue on configuring it.
>> 
>> 
>> I am new to solr so will greatly appreciate a detailed step by step
>> instruction.
>
>Hi,
>
>I'm guessing if you're using a preconfigured Solr plugin for WP you 
>probably haven't got your hands properly dirty with Solr yet.
>
>One way to use Tesseract would be via Apache Tika 
>https://wiki.apache.org/tika/TikaOCR which is an awesome library for 
>extracting plain text from many different document formats and types. 
>There's a direct way to use Tesseract from within Solr (the 
>ExtractingRequestHandler 
>https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html#uploading-data-with-solr-cell-using-apache-tika)
>but we don't generally recommend this, as dodgy files can sometimes eat
>all your resources during parsing and if Tika dies then so does Solr. We
>usually process the files externally and then feed them to Solr using its
>HTTP API.
>
>Here's one way to do it - a simple server wrapper around Tika 
>https://github.com/mattflax/dropwizard-tika-server written by my 
>colleague Matt Pearce.
>
>So you're going to need to do some coding I think - Python would be a 
>good choice - to feed your source files to Tika for OCR and extraction,
>and then the resulting text to Solr for indexing.
>
>Cheers
>
>Charlie
>
>> 
>> Thank you very much
>> 
>
>
>-- 
>Charlie Hull
>Flax - Open Source Enterprise Search
>
>tel/fax: +44 (0)8700 118334
>mobile:  +44 (0)7767 825828
>web: www.flax.co.uk

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com
