A Hadoop cluster of low-end machines (2 cores, 2 GB RAM) can process up to ~15 URLs per second per node, provided it uses a parsing fetcher, is properly configured, and has evenly distributed fetch lists. Because of the limited memory, each such machine runs only a single mapper and a single reducer.
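
To turn that figure into a rough machine count, here is a minimal sketch in Python. It assumes the ~15 URLs per second per node above as a best-case rate and ignores job start-up overhead, politeness delays, and network limits, so treat the result as a lower bound. The function name `nodes_needed` and its parameters are hypothetical, for illustration only.

```python
import math


def nodes_needed(total_urls, deadline_hours, urls_per_sec_per_node=15.0):
    """Estimate how many nodes are needed to fetch total_urls within deadline_hours.

    Assumption: every node sustains urls_per_sec_per_node for the whole run,
    which is the best case quoted above, so the estimate is optimistic.
    """
    if total_urls < 0 or deadline_hours <= 0 or urls_per_sec_per_node <= 0:
        raise ValueError("total_urls must be >= 0; the other arguments must be > 0")
    # Aggregate rate the whole cluster has to sustain, in URLs per second.
    required_rate = total_urls / (deadline_hours * 3600.0)
    # Round up to whole machines; a cluster needs at least one node.
    return max(1, math.ceil(required_rate / urls_per_sec_per_node))


if __name__ == "__main__":
    # Example: 10 million URLs in 24 hours at the best-case rate.
    print(nodes_needed(10_000_000, 24))
```

For 10 million URLs in 24 hours the required rate is about 116 URLs per second, which comes to 8 nodes at the best-case rate. A real crawl will likely need more.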

On Friday 30 March 2012 11:25:35 ashish vyas wrote:
> @ Christoph: Thanks for replying. I would try with more nodes/larger url
> set to see how much improvement in processing time i get from cluster.
>
> @mapreduce-mailing-community: It would be great if anybody can help me with
> Nutch benchmark on small cluster since it would help me in determining no.
> of machines i would need for my application to scale up.
>
> Regards:
> Ashish Vyas
>
> On Fri, Mar 30, 2012 at 2:16 PM, Christoph Schmitz <[email protected]> wrote:
> > Hi Ashish,
> >
> > IMHO your numbers (2 machines, 10 URLs) are way too small to outweigh the
> > natural overhead that occurs with a distributed computation (distributing
> > the program code, coordinating the distributed file system, making sure
> > everybody is starting and stopping, etc.). Also, if you're web crawling,
> > the bottleneck might not even be the processing capacity of your
> > machines, but rather some network component on the way between you and
> > the web.
> >
> > I'm not aware of any Hadoop or Nutch benchmarks, but once you use larger
> > data and/or CPU intensive computations, you should actually see a more or
> > less linear increase in throughput with more machines.
> >
> > Regards,
> > Christoph
> >
> > -----Original Message-----
> > From: ashish vyas [mailto:[email protected]]
> > Sent: Friday, 30 March 2012 10:30
> > To: [email protected]
> > Subject: Performance improvement-Cluster vs Pseudo
> >
> > Hi,
> >
> > I have setup hadoop cluster (2 node cluster) and I am running Nutch
> > crawl on it. I am trying to compare results and improvement in processing
> > time when I crawl with 10 URL's and depth 2. When I am running the crawl
> > on cluster its taking more time than pseudo cluster which in turn is
> > taking more time than standalone nutch crawl.
> >
> > I am just wondering that after running Nutch on hadoop cluster
> > processing time should come down logically since that's why hadoop has
> > evolved out of Nutch project. Please let me know if there is any
> > benchmark test for pseudo vs cluster and why Nutch crawl is taking more
> > time on cluster.
> >
> > Please let me know if you need more info.
> >
> > Regards:
> > Ashish Vyas

--
Markus Jelsma - CTO - Openindex
