If you have tens of millions of documents to parse and you do that job inside Solr, it will put a significant load on Solr. If many queries also hit your Solr node, consider that CPU and RAM may not be enough to handle parsing while users are querying your system at the same time.
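For reference, "parsing inside Solr" usually means sending raw files to the ExtractingRequestHandler (Solr Cell), which runs Tika inside the Solr JVM. A minimal sketch of such a request might look like this (the core name, document id, and file name here are placeholders, not values from this thread):

```shell
# Post a PDF to Solr's ExtractingRequestHandler so Tika parses it
# inside the Solr process. "collection1", "doc1", and "mydoc.pdf"
# are hypothetical placeholder values.
curl "http://localhost:8983/solr/collection1/update/extract?literal.id=doc1&commit=true" \
  -F "myfile=@mydoc.pdf"
```

Every such request spends CPU and memory on Tika parsing in the same JVM that serves queries, which is the load trade-off described above.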
Parsing documents in Nutch is batch processing. If you parse inside Solr instead, you avoid the step of sending parsed documents from Nutch to Solr. On the other hand, if you parse the documents on the Nutch side and run on Hadoop with many machines, doing that job via Map/Reduce "may" be a good choice for you.

2013/9/10 adfel70 <adfe...@gmail.com>

> Hi
>
> What are the pros and cons of both use cases?
> 1. use nutch to crawl file system + parse files + perform other data
> manipulation and eventually index to solr.
> 2. use solr dataimporthandlers and plugins in order to perform this task.
>
> Note that I have tens of millions of docs which I need to handle the first
> time, and then delta imports of around 100k docs per day.
> Each doc may be up to 100mb.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/using-tika-inside-SOLR-vs-using-nutch-tp4089120.html
> Sent from the Solr - User mailing list archive at Nabble.com.