Re: ExtractingRequestHandler

Erick Erickson Sun, 01 Apr 2012 10:50:36 -0700

Ahhh, OK. Sure, anything you store in Solr you can get back. The key
is not Tika but your schema.xml file: set stored="true" on any field
whose content you want returned.


bq: So my question was if I can index the original doc via
ExtractingRequestHandler in Solr AND get back the text output, in a single
call.

I know of no way to do this using Solr Cell. That said, you can always
use SolrJ and Tika on the client to separate the Tika parsing from
the indexing steps. Then you have all the parts available on the
client to do whatever you want.
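
As a rough sketch of that client-side split (not taken from the thread),
the Python below keeps a copy of the extracted text on disk so a later
reindex can skip the slow extraction, and builds a JSON update request for
Solr. The names cache_extracted_text and build_update_request, the "text"
field, and the cache layout are all hypothetical; the extract callable
stands in for whatever Tika call you use on the client.

```python
import json
import urllib.request
from pathlib import Path
from typing import Callable


def cache_extracted_text(doc_path: Path, cache_dir: Path,
                         extract: Callable[[Path], str]) -> str:
    """Return the text for doc_path, running `extract` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: file names are unique; use a hash of the full path otherwise.
    cached = cache_dir / (doc_path.name + ".txt")
    if cached.exists():
        return cached.read_text(encoding="utf-8")
    text = extract(doc_path)  # the slow step (e.g. OCR) runs once per document
    cached.write_text(text, encoding="utf-8")
    return text


def build_update_request(solr_update_url: str, doc_id: str,
                         text: str) -> urllib.request.Request:
    """Build (but do not send) a JSON update request carrying the text."""
    # Assumption: the schema has an "id" key and an indexed "text" field.
    body = json.dumps([{"id": doc_id, "text": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_update_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request is then a matter of passing it to
urllib.request.urlopen, with solr_update_url pointing at your own core's
update handler. Because the text lives in the cache directory, it remains
available for reindexing or other processing without being stored in Solr.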

Solr Cell is great for proof-of-concept, but for heavy-duty applications,
you're offloading all the processing onto the Solr server, which can be a
problem.

Here's a writeup describing how to use Tika independently of
Solr while indexing data to Solr that might help:

http://www.lucidimagination.com/blog/2012/02/14/indexing-with-solrj/

Hope that helps
Erick

On Sun, Apr 1, 2012 at 1:27 PM,  <[email protected]> wrote:
> Hi Erick,
>
> I think we have some misunderstanding.
>
> I want to index the text of the docs in Solr (only indexed, NOT stored).
>
> But I want the text (Tika output) back for:
>
> * faster reindexing later (some text extraction, such as OCR, takes really long)
> * using the text for other processing
>
> The original doc is NOT stored in solr.
>
>
> So my question was if I can index the original doc via
> ExtractingRequestHandler in Solr AND get back the text output, in a single
> call.
>
> AFAIK I can do it only in 2 calls:
>
> 1) ExtractingRequestHandler?ext.extract.only=true -> Text
> 2) Index the text from 1) in solr
>
>
> Thx
>
>> Yes, you can, but... Generally, storing the raw input in Solr is
>> not the best approach. The problem here is that pretty soon
>> you get a huge index that contains *everything*. Solr was not
>> intended to be a data store.
>>
>> Besides, you then need to store the binary form of the file. Solr
>> only deals with text, not markup.
>>
>> Most people index the text in Solr, and enough information
>> so the application knows where to go to fetch the original
>> document when the user drills down (e.g. file path, database
>> PK, etc). Would that work for your situation?
>>
>> Best
>> Erick
>>
>> On Sat, Mar 31, 2012 at 3:55 PM,  <[email protected]> wrote:
>> > Hi,
>> >
>> > I want to index various filetypes in Solr; this can easily be done with
>> > ExtractingRequestHandler. But I also need the extracted content back.
>> > I know ext.extract.only but then nothing gets indexed, right?
>> >
>> > Can I index the document AND get the content back as with ext.extract.only?
>> > In a single request?
>> >
>> > Thank you
>> >
>> >
>>
>
