Hi Swaraj,

Thanks for the answers. From my understanding, we can index in the following ways:

· Using DIH from a database
· Using DIH from the filesystem (this is what I am concentrating on)
  o For this we can use SolrJ with Tika (Solr Cell) from the Java layer to extract the content and send the data through the REST API to the Solr server
  o Or we can use the ExtractingRequestHandler to do the job

I want to index only certain documents, and there will be no updates to a document once it has been indexed.

In our existing system we already have DIH implemented, which indexes documents from SQL Server (as you said, based on last index time). In that case the metadata is available in the database. But if we are streaming via URL, we would need to append the metadata too. Correct me if I am wrong. And how does the indexing happen here: is it based on last index time or on something else?

Also, for the ExtractingRequestHandler, when you say "manual operation", what are you referring to? Can you please clarify?

Thanks
Sangeetha

-----Original Message-----
From: Swaraj Kumar [mailto:swaraj2...@gmail.com]
Sent: 07 April 2015 18:02
To: solr-user@lucene.apache.org
Subject: Re: What is the best way of Indexing different formats of documents?

You can always choose either DIH or /update/extract to index docs in Solr. DIH has several benefits, which I am listing below:

1. Clean and update using a single command.
2. DIH can also optimize the index after the import using optimize=true.
3. You can do a delta-import based on last index time, whereas with /update/extract you need to handle delta imports manually.
4. You can use multiple entity processors and transformers with DIH, which is very useful for indexing exactly the data you want.
5. The query parameter "rows" limits the number of records.

Regards,
Swaraj Kumar
Senior Software Engineer I
MakeMyTrip.com
Mob No- 9811774497

On Tue, Apr 7, 2015 at 4:18 PM, sangeetha.subraman...@gtnexus.com <sangeetha.subraman...@gtnexus.com> wrote:

> Hi,
>
> I am a newbie to SOLR and basically from database background. We have
> a requirement of indexing files of different formats (x12, edifact, csv, xml).
> The files which are inputted can be of any format and we need to do a
> content based search on it.
>
> From the web I understand we can use TIKA processor to extract the
> content and store it in SOLR. What I want to know is, is there any
> better approach for indexing files in SOLR ? Can we index the document
> through streaming directly from the Application ? If so what is the
> disadvantage of using it (against DIH which fetches from the
> database)? Could someone share me some insight on this ? Is there any
> web links which I can refer to get some idea on it ? Please do help.
>
> Thanks
> Sangeetha
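
For readers following the thread, the two request styles discussed above can be sketched roughly as below. This is only an illustration under assumptions, not anything posted by either author: the host, the core name "docs", the "/dataimport" handler path, and the metadata field names are all hypothetical and depend on your solrconfig.xml and schema. It shows metadata being attached to a streamed file through literal.<field> parameters on /update/extract, next to the single-command DIH delta-import request.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed host and core name; adjust to your own installation.
SOLR_BASE = "http://localhost:8983/solr/docs"


def extract_url(doc_id, metadata, commit=True):
    """Build an /update/extract URL that passes metadata as literal.* fields."""
    params = {"literal.id": doc_id}
    for field, value in metadata.items():
        # Each literal.<field> value is stored on the document as-is,
        # alongside the content that Tika extracts from the file.
        params["literal." + field] = value
    if commit:
        params["commit"] = "true"
    return SOLR_BASE + "/update/extract?" + urlencode(params)


def delta_import_url(optimize=False):
    """Build a DIH delta-import URL (handler path /dataimport is assumed)."""
    params = {"command": "delta-import", "clean": "false", "commit": "true"}
    if optimize:
        params["optimize"] = "true"
    return SOLR_BASE + "/dataimport?" + urlencode(params)


def post_file(path, url, content_type="application/octet-stream"):
    """Stream one file to Solr; needs a running server, so it is not called here."""
    with open(path, "rb") as fh:
        body = fh.read()
    req = Request(url, data=body,
                  headers={"Content-Type": content_type}, method="POST")
    with urlopen(req) as resp:
        return resp.status


if __name__ == "__main__":
    # Hypothetical metadata fields; they must exist in your schema.
    print(extract_url("doc-1", {"format": "x12", "source": "upload"}))
    print(delta_import_url(optimize=True))
```

With /update/extract the application decides which files to send and when, which is one reading of the "manual operation" mentioned in point 3: nothing on the Solr side records a last index time for streamed files, so the caller has to track that itself.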