Re: What is the best way of Indexing different formats of documents?

Swaraj Kumar Wed, 08 Apr 2015 02:32:28 -0700

Hi Sangeetha,

/update/extract refers to the ExtractingRequestHandler.


If you only want to index the data, you can do it with the
ExtractingRequestHandler. I don't think it requires metadata, but you need
to pass literal.id to supply the value of the unique id field for each
document.
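
As a rough illustration of the point about literal.id, here is a minimal
Python sketch that posts one file to /update/extract. The host, port, the
core name "mycore", and the helper names build_extract_url and index_file
are assumptions made for this example, not something from your setup;
adjust them as needed.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: a local Solr instance with a core named "mycore".
SOLR_EXTRACT_URL = "http://localhost:8983/solr/mycore/update/extract"


def build_extract_url(doc_id, base_url=SOLR_EXTRACT_URL, commit=True):
    """Return the /update/extract URL with literal.id set to doc_id."""
    params = {"literal.id": doc_id}
    if commit:
        # Commit right away so the document becomes searchable.
        params["commit"] = "true"
    return base_url + "?" + urlencode(params)


def index_file(path, doc_id, content_type="application/octet-stream"):
    """POST the raw file body; Tika extracts the content on the Solr side."""
    with open(path, "rb") as fh:
        body = fh.read()
    request = Request(
        build_extract_url(doc_id),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urlopen(request) as response:
        return response.status
```

Calling index_file("orders.csv", "orders-1", "text/csv") would send the file
with literal.id=orders-1; the file name and id here are made up. Committing
on every request is simple but slow for bulk loads, so you may prefer to
pass commit=False and commit once at the end.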

For more information:
https://wiki.apache.org/solr/ExtractingRequestHandler
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika



Regards,


Swaraj Kumar
Senior Software Engineer I
MakeMyTrip.com
Mob No- 9811774497

On Wed, Apr 8, 2015 at 2:20 PM, sangeetha.subraman...@gtnexus.com <
sangeetha.subraman...@gtnexus.com> wrote:

> Hi Swaraj,
>
>
>
> Thanks for the answers.
>
> From my understanding We can index,
>
> ·       Using DIH from db
>
> ·       Using DIH from filesystem - this is where I am concentrating on.
>
> o   For this we can use SolrJ with Tika(solr cell) from Java layer in
> order to extract the content and send the data through REST API to
> solrserver
>
> o   Or we can use extractrequesthandler to do the job.
>
>
>
> I just want to index only certain documents and there will not be any
> update happening on the indexed document.
>
>
>
> In our existing system we already have DIH implemented which indexes
> document from sql server (As you said based on last index time). In this
> case the metadata is there available in database.
>
>
>
> But if we are streaming via url, we would need to append the metadata too.
> correct me if i am wrong. And how does the indexing happening here based on
> last index time or something else ? Also for  extractrequesthandler when
> you say manual operation what is it you are talking about ? Can you please
> clarify.
>
>
>
> Thanks
>
> Sangeetha
>
>
>
> -----Original Message-----
> From: Swaraj Kumar [mailto:swaraj2...@gmail.com]
> Sent: 07 April 2015 18:02
> To: solr-user@lucene.apache.org
> Subject: Re: What is the best way of Indexing different formats of
> documents?
>
>
>
> You can always choose either DIH or /update/extract to index docs in solr.
>
> Now there are multiple benefits of DIH which I am listing below :-
>
>
>
> 1. Clean and update using a single command.
>
> 2. DIH also optimizes the index using optimize=true.
>
> 3. You can do delta-import based on last index time, whereas in case of
> /update/extract you need to do manual operation in case of delta import.
>
> 4. You can use multiple entity processor and transformers in case of DIH
> which is very useful to index exact data you want.
>
> 5. Query parameter "rows" limits the num of records.
>
>
>
> Regards,
>
>
>
>
>
> Swaraj Kumar
>
> Senior Software Engineer I
>
> MakeMyTrip.com
>
> Mob No- 9811774497
>
>
>
> On Tue, Apr 7, 2015 at 4:18 PM, sangeetha.subraman...@gtnexus.com wrote:
>
>
>
> > Hi,
> >
> > I am a newbie to SOLR and basically from database background. We have
> > a requirement of indexing files of different formats (x12, edifact,
> > csv, xml).
> > The files which are inputted can be of any format and we need to do a
> > content based search on it.
> >
> > From the web I understand we can use TIKA processor to extract the
> > content and store it in SOLR. What I want to know is, is there any
> > better approach for indexing files in SOLR ? Can we index the document
> > through streaming directly from the Application ? If so what is the
> > disadvantage of using it (against DIH which fetches from the
> > database)? Could someone share me some insight on this ? Is there any
> > web links which I can refer to get some idea on it ? Please do help.
> >
> > Thanks
> > Sangeetha
>
