Hi Sangeetha,

/update/extract refers to the ExtractingRequestHandler.
If you only want to index the data, you can do it with the ExtractingRequestHandler. I don't think it requires metadata, but you need to pass literal.id to supply the value of the unique id field.

For more information:
https://wiki.apache.org/solr/ExtractingRequestHandler
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

Regards,

Swaraj Kumar
Senior Software Engineer I
MakeMyTrip.com
Mob No- 9811774497

On Wed, Apr 8, 2015 at 2:20 PM, sangeetha.subraman...@gtnexus.com <sangeetha.subraman...@gtnexus.com> wrote:

> Hi Swaraj,
>
> Thanks for the answers.
>
> From my understanding, we can index:
>
> · Using DIH from the db
> · Using DIH from the filesystem - this is what I am concentrating on.
>   o For this we can use SolrJ with Tika (Solr Cell) from the Java layer in
>     order to extract the content and send the data through the REST API to
>     the Solr server
>   o Or we can use the extractrequesthandler to do the job.
>
> I want to index only certain documents, and there will not be any
> updates to the indexed documents.
>
> In our existing system we already have DIH implemented, which indexes
> documents from SQL Server (as you said, based on last index time). In this
> case the metadata is available in the database.
>
> But if we are streaming via URL, we would need to append the metadata too.
> Correct me if I am wrong. And how does the indexing happen here: based on
> last index time or something else? Also, for extractrequesthandler, when
> you say manual operation, what are you referring to? Can you please
> clarify?
>
> Thanks
> Sangeetha
>
> -----Original Message-----
> From: Swaraj Kumar [mailto:swaraj2...@gmail.com]
> Sent: 07 April 2015 18:02
> To: solr-user@lucene.apache.org
> Subject: Re: What is the best way of Indexing different formats of documents?
>
> You can always choose either DIH or /update/extract to index docs in Solr.
>
> Now there are multiple benefits of DIH, which I am listing below:
>
> 1. Clean and update using a single command.
> 2. DIH can also optimize the index using optimize=true.
> 3. You can do a delta-import based on last index time, whereas with
>    /update/extract you need to handle delta imports manually.
> 4. You can use multiple entity processors and transformers with DIH,
>    which is very useful for indexing exactly the data you want.
> 5. The query parameter "rows" limits the number of records.
>
> Regards,
>
> Swaraj Kumar
> Senior Software Engineer I
> MakeMyTrip.com
> Mob No- 9811774497
>
> On Tue, Apr 7, 2015 at 4:18 PM, sangeetha.subraman...@gtnexus.com
> <sangeetha.subraman...@gtnexus.com> wrote:
>
> > Hi,
> >
> > I am a newbie to SOLR and basically from a database background. We have
> > a requirement to index files of different formats (x12, edifact, csv, xml).
> > The input files can be of any format, and we need to do a
> > content-based search on them.
> >
> > From the web I understand we can use the TIKA processor to extract the
> > content and store it in SOLR. What I want to know is: is there any
> > better approach for indexing files in SOLR? Can we index the documents
> > by streaming directly from the application? If so, what is the
> > disadvantage of using it (against DIH, which fetches from the
> > database)? Could someone share some insight on this? Are there any
> > web links I can refer to for some idea on it? Please do help.
> >
> > Thanks
> > Sangeetha
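
P.S. To make the literal.id point concrete, here is a rough sketch of posting one file to /update/extract from application code. It is only an illustration under assumptions: the base URL, the helper names build_extract_url and index_file, and the default content type are all made up for this example, and it assumes your unique key field is called "id" and that solr_base already includes the core name if your setup needs one.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_base, doc_id, commit=True):
    """Build the /update/extract URL; literal.id carries the unique id value.

    solr_base is an assumed base URL such as http://localhost:8983/solr
    (hypothetical; adjust to your own host and core).
    """
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


def index_file(solr_base, path, doc_id, content_type="application/octet-stream"):
    """POST the raw file bytes; text extraction happens on the Solr side."""
    with open(path, "rb") as fh:
        body = fh.read()
    request = Request(
        build_extract_url(solr_base, doc_id),
        data=body,  # supplying data makes urllib send a POST
        headers={"Content-Type": content_type},
    )
    with urlopen(request) as response:
        return response.status
```

There is no last index time here, so the "manual operation" for delta imports means your application decides which files are new or changed and calls something like index_file only for those.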