If you have tens of millions of documents to parse and you do that job inside Solr, it will put a significant load on Solr. If many queries also hit your Solr node, consider that CPU and RAM may not be enough to handle parsing while users are querying your system at the same time.
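For reference, "parsing inside Solr" usually means sending raw files to the ExtractingRequestHandler (Solr Cell), which runs Tika inside the Solr JVM. A minimal sketch of such a request might look like this (the core name, document id, and file name here are placeholders, not values from this thread):

```shell
# Post a PDF to Solr's ExtractingRequestHandler so Tika parses it
# inside the Solr process. "collection1", "doc1", and "mydoc.pdf"
# are hypothetical placeholder values.
curl "http://localhost:8983/solr/collection1/update/extract?literal.id=doc1&commit=true" \
  -F "myfile=@mydoc.pdf"
```

Every such request spends CPU and memory on Tika parsing in the same JVM that serves queries, which is the load trade-off described above.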
Parsing documents in Nutch is batch processing. If you parse inside Solr instead, you avoid the step of sending parsed documents from Nutch to Solr. On the other hand, if you parse the documents on the Nutch side and run on Hadoop with many machines, doing that job via Map/Reduce "may" be a good choice for you.

2013/9/10 adfel70 <adfe...@gmail.com>

> Hi
>
> What are the pros and cons of both use cases?
> 1. use nutch to crawl file system + parse files + perform other data
> manipulation and eventually index to solr.
> 2. use solr dataimporthandlers and plugins in order to perform this task.
>
> Note that I have tens of millions of docs which I need to handle the first
> time, and then delta imports of around 100k docs per day.
> Each doc may be up to 100mb.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/using-tika-inside-SOLR-vs-using-nutch-tp4089120.html
> Sent from the Solr - User mailing list archive at Nabble.com.