Hi Erick,

I think there is a misunderstanding.

I want to index the text of the docs in Solr (only indexed, NOT stored).

But I also want the extracted text (the Tika output) back, for two reasons:

* faster reindexing later (some text extraction, such as OCR, takes a really long time)
* using the text for other processing

The original doc is NOT stored in Solr.


So my question was whether I can index the original doc via
ExtractingRequestHandler in Solr AND get the text output back, in a single
call.

AFAIK I can only do it in 2 calls:

1) ExtractingRequestHandler?ext.extract.only=true -> text
2) Index the text from 1) in Solr
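
For reference, the two calls above could look roughly like this in Python.
This is only a sketch under assumptions: the base URL, the handler path
/update/extract, and the field names "id" and "text" are placeholders for
whatever solrconfig.xml and schema.xml define. As far as I know, released
Solr versions spell the parameter extractOnly (ext.extract.only being the
older name), so the name is left configurable. Where exactly the extracted
text sits in the response is also an assumption.

```python
import json
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

SOLR = "http://localhost:8983/solr"  # assumption: local Solr, default core


def extract_url(base=SOLR, only_param="extractOnly"):
    """URL for call 1: run Tika, return the text, index nothing."""
    query = urllib.parse.urlencode(
        {only_param: "true", "extractFormat": "text", "wt": "json"}
    )
    return f"{base}/update/extract?{query}"


def pick_text(response):
    """Pull the extracted text out of the parsed JSON response.

    Assumption: the text sits under a key named after the content stream,
    next to 'responseHeader' and a '<name>_metadata' entry.
    """
    for key, value in response.items():
        if key != "responseHeader" and not key.endswith("_metadata"):
            return value
    raise KeyError("no extracted text in response")


def add_message(doc_id, text, id_field="id", text_field="text"):
    """XML update message for call 2 (field names are assumptions)."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name=id_field).text = doc_id
    ET.SubElement(doc, "field", name=text_field).text = text
    return ET.tostring(add, encoding="utf-8")


def post(url, body, content_type):
    """POST raw bytes and return the raw response body."""
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def extract_then_index(path, doc_id):
    """Call 1 (extract only), then call 2 (index the extracted text)."""
    with open(path, "rb") as fh:
        raw = post(extract_url(), fh.read(), "application/octet-stream")
    text = pick_text(json.loads(raw))
    post(f"{SOLR}/update?commit=true", add_message(doc_id, text),
         "text/xml; charset=utf-8")
    return text  # keep this outside Solr for later reindexing
```

The text returned by extract_then_index is what I would keep outside Solr, so
that a later reindex only needs call 2 and can skip the slow extraction.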


Thanks

> Yes, you can, but... Generally, storing the raw input in Solr is
> not the best approach. The problem here is that pretty soon
> you get a huge index that contains *everything*. Solr was not
> intended to be a data store.
> 
> Besides, you then need to store the binary form of the file. Solr
> only deals with text, not markup.
> 
> Most people index the text in Solr, and enough information
> so the application knows where to go to fetch the original
> document when the user drills down (e.g. file path, database
> PK, etc). Would that work for your situation?
> 
> Best
> Erick
> 
> On Sat, Mar 31, 2012 at 3:55 PM,  <spr...@gmx.eu> wrote:
> > Hi,
> >
> > I want to index various filetypes in Solr; this can easily be done with
> > ExtractingRequestHandler. But I also need the extracted content back.
> > I know ext.extract.only but then nothing gets indexed, right?
> >
> > Can I index the document AND get the content back as with ext.extract.only?
> > In a single request?
> >
> > Thank you
> >
> >
> 
