Re: What is the best way of Indexing different formats of documents?

Upayavira Tue, 07 Apr 2015 06:06:13 -0700


On Tue, Apr 7, 2015, at 11:48 AM, sangeetha.subraman...@gtnexus.com
wrote:
> Hi,
> 
> I am new to Solr and come from a database background. We have a
> requirement to index files in different formats (X12, EDIFACT,
> CSV, XML).
> The input files can be in any of these formats, and we need to run
> content-based searches on them.
> 
> From the web I understand we can use the Tika processor to extract the
> content and store it in Solr. Is there a better approach for indexing
> files in Solr? Can we index the documents by streaming them directly
> from the application? If so, what are the disadvantages of doing that
> (compared with DIH, which fetches from the database)? Could someone
> share some insight on this? Are there any web links I can refer to
> for ideas? Please help.


You can have Solr do the Tika work for you by posting to the
update/extract handler. See here:

https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

You can only post one document at a time, and you will have to provide
extra metadata fields in the URL you post to (e.g. the document ID).
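
As a rough illustration of what such a request could look like, here is
a minimal Python sketch that only builds the POST without sending it.
The host, the core name "mycore", the document ID and the "format" field
are made up for the example; literal.<field> and commit are the
parameters described on the Solr Cell page linked above.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_core_url, doc_id, file_bytes, extra_fields=None):
    """Build (but do not send) a POST to the extracting update handler.

    One request carries exactly one file; metadata such as the document
    ID travels in the query string as literal.<fieldname> parameters.
    """
    params = {"literal.id": doc_id, "commit": "true"}
    for name, value in (extra_fields or {}).items():
        params["literal." + name] = value
    url = solr_core_url.rstrip("/") + "/update/extract?" + urlencode(params)
    return Request(
        url,
        data=file_bytes,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


# Assumed values: a local Solr with a core named "mycore" and a schema
# that has a "format" field. Adjust to your own setup.
req = build_extract_request(
    "http://localhost:8983/solr/mycore",
    "order-1",
    b"<order/>",
    {"format": "xml"},
)
print(req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

Committing on every request is only reasonable for a quick test; for
bulk loads you would normally leave commit out and commit once at the
end.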

If the extracting update handler can handle what you need, then you are
good. Otherwise, you will want to write your own code to call Tika, then
push the extracted content as a plain document.
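
A sketch of that second route, again in Python and again only building
the request: extract_text() is a hypothetical stand-in for your own Tika
call or format-specific parser (X12, EDIFACT and so on), and the
"content" field name, host and core name are assumptions that depend on
your schema and setup.

```python
import json
from urllib.request import Request


def extract_text(raw_bytes):
    # Hypothetical placeholder: real code would call Tika or a parser
    # for the specific format. Here the bytes are just decoded as text.
    return raw_bytes.decode("utf-8", errors="replace")


def build_update_request(solr_core_url, doc_id, raw_bytes):
    """Build (but do not send) a JSON update carrying one plain document."""
    doc = {"id": doc_id, "content": extract_text(raw_bytes)}
    body = json.dumps([doc]).encode("utf-8")  # JSON array of documents
    url = solr_core_url.rstrip("/") + "/update?commit=true"
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed values: local Solr, core "mycore", a small CSV payload.
update_req = build_update_request(
    "http://localhost:8983/solr/mycore", "inv-7", b"a,b\n1,2"
)
print(update_req.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(update_req)
```

Because the body is a JSON array, this route is not limited to one
document per request the way update/extract is.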

Solr is just an HTTP server, so your application can either post binary
files for Solr to ingest with Tika, or post content it has already
extracted itself.

Upayavira
