RE: What is the best way of Indexing different formats of documents?

sangeetha.subraman...@gtnexus.com Wed, 08 Apr 2015 01:51:23 -0700

Hi Swaraj,

Thanks for the answers.

From my understanding, we can index:

·       Using DIH from a database

·       Using DIH from the filesystem - this is what I am concentrating on.

o   For this we can use SolrJ with Tika (Solr Cell) in the Java layer to 
extract the content and send the data through the REST API to the Solr server

o   Or we can use the ExtractingRequestHandler to do the job.

I want to index only certain documents, and the indexed documents will not 
be updated afterwards.

In our existing system we already have DIH implemented, which indexes 
documents from SQL Server (based on last index time, as you said). In this 
case the metadata is available in the database.

But if we are streaming via URL, we would need to append the metadata too; 
correct me if I am wrong. And how does the indexing happen here: based on 
last index time, or something else? Also, for the ExtractingRequestHandler, 
when you say manual operation, what are you referring to? Can you please 
clarify?
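
For reference, a minimal sketch of how metadata might be appended when streaming a file to /update/extract. It relies on the handler's literal.<fieldname> request parameters. The host, the core name "documents", and the field names doc_format and source_system are assumptions for illustration only; they would have to match the actual Solr setup and schema. The sketch only builds the URL; the file itself would still be sent as the body of a POST request.

```python
from urllib.parse import urlencode


def build_extract_url(solr_base, core, doc_id, metadata, commit=True):
    """Build an /update/extract URL that carries metadata as literal.* params."""
    # literal.id sets the unique key of the resulting Solr document.
    params = [("literal.id", doc_id)]
    # Each metadata entry becomes literal.<field>=<value>.
    # The field names are hypothetical and must exist in the schema.
    for field, value in sorted(metadata.items()):
        params.append((f"literal.{field}", value))
    if commit:
        params.append(("commit", "true"))
    return f"{solr_base.rstrip('/')}/{core}/update/extract?{urlencode(params)}"


# Assumed host, core name and metadata fields, for illustration only.
url = build_extract_url(
    "http://localhost:8983/solr",
    "documents",
    "invoice-001",
    {"doc_format": "edifact", "source_system": "erp"},
)
print(url)
```
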

Thanks

Sangeetha

-----Original Message-----
From: Swaraj Kumar [mailto:swaraj2...@gmail.com]
Sent: 07 April 2015 18:02
To: solr-user@lucene.apache.org
Subject: Re: What is the best way of Indexing different formats of documents?

You can choose either DIH or /update/extract to index docs in Solr.

DIH has several benefits, which I am listing below:

1. Clean and update using a single command.

2. DIH can also optimize the index using optimize=true.

3. You can do a delta-import based on last index time, whereas with 
/update/extract you need to handle delta imports manually.

4. You can use multiple entity processors and transformers with DIH, which 
is very useful for indexing exactly the data you want.

5. The query parameter "rows" limits the number of records.
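
To make points 1 to 3 concrete, here is a rough sketch, not a definitive implementation. The first function builds DIH request URLs using the command, clean and optimize parameters. The second shows one possible form of the "manual operation" for /update/extract: the application remembers when it last indexed and picks only files modified since then. The host, the core name "documents", the /dataimport handler path (it is whatever is registered in solrconfig.xml) and both function names are assumptions for illustration.

```python
import os
from urllib.parse import urlencode


def dih_url(solr_base, core, command, **options):
    """Build a DataImportHandler request URL, e.g. for full-import or delta-import."""
    params = [("command", command)]
    for name, value in sorted(options.items()):
        # Solr expects lowercase true/false for boolean flags.
        if isinstance(value, bool):
            value = "true" if value else "false"
        params.append((name, value))
    # Assumes the handler is registered at /dataimport in solrconfig.xml.
    return f"{solr_base.rstrip('/')}/{core}/dataimport?{urlencode(params)}"


def files_changed_since(root, last_index_time):
    """Manual delta: list files under root modified after last_index_time (epoch seconds)."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > last_index_time:
                changed.append(path)
    return sorted(changed)


# Clean and re-index in a single command, optimizing afterwards.
full = dih_url("http://localhost:8983/solr", "documents", "full-import",
               clean=True, optimize=True)
# Index only what changed since the last import; DIH tracks that time itself.
delta = dih_url("http://localhost:8983/solr", "documents", "delta-import")
print(full)
print(delta)
```

With /update/extract, the caller would have to store the last index time itself (for example in a small file or a database row), send only the files returned by files_changed_since, and then update the stored time.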

Regards,

Swaraj Kumar

Senior Software Engineer I

MakeMyTrip.com

Mob No- 9811774497

On Tue, Apr 7, 2015 at 4:18 PM, sangeetha.subraman...@gtnexus.com wrote:

> Hi,
>
> I am a newbie to Solr, coming from a database background. We have
> a requirement to index files of different formats (X12, EDIFACT, CSV, XML).
> The input files can be of any format, and we need to do a
> content-based search on them.
>
> From the web I understand we can use the Tika processor to extract the
> content and store it in Solr. What I want to know is: is there any
> better approach for indexing files in Solr? Can we index the documents
> by streaming directly from the application? If so, what is the
> disadvantage of doing so (compared with DIH, which fetches from the
> database)? Could someone share some insight on this? Is there any
> web link I can refer to for some ideas? Please do help.
>
> Thanks
> Sangeetha
