A Hadoop cluster of low-end machines (2 cores, 2GB RAM) can process up to
~15 URLs per second per node, given a parsing fetcher, proper configuration,
and evenly distributed fetch lists. Because of the limited memory, each such
machine runs only a single mapper and a single reducer.
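
For rough sizing, the per-node figure can be turned into a node count. The
sketch below is only back-of-the-envelope arithmetic: it assumes throughput
scales linearly with the number of nodes and ignores job startup overhead,
politeness delays, and network limits. The function name and the example
workload (10 million URLs in 24 hours) are made up for illustration.

```python
import math

# Upper bound quoted above for a low-end node (2 cores, 2GB RAM).
URLS_PER_SEC_PER_NODE = 15.0


def nodes_needed(total_urls, deadline_hours, per_node_rate=URLS_PER_SEC_PER_NODE):
    """Estimate how many nodes are needed to process total_urls in time.

    Hypothetical helper: assumes linear scaling and no fixed overhead.
    """
    if deadline_hours <= 0 or per_node_rate <= 0:
        raise ValueError("deadline_hours and per_node_rate must be positive")
    if total_urls <= 0:
        return 0
    # Cluster-wide throughput required to meet the deadline, in URLs/second.
    required_rate = total_urls / (deadline_hours * 3600)
    # Round up to whole nodes; at least one node is always needed.
    return max(1, math.ceil(required_rate / per_node_rate))


if __name__ == "__main__":
    # Illustrative workload only: 10 million URLs within 24 hours.
    print(nodes_needed(10_000_000, 24))
```

Since ~15 URLs per second is an upper bound, treat the result as a lower
bound on cluster size and measure on your own hardware before committing.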

On Friday 30 March 2012 11:25:35 ashish vyas wrote:
> @Christoph: Thanks for replying. I will try with more nodes and a larger
> URL set to see how much improvement in processing time I get from the
> cluster.
> 
> @mapreduce-mailing-community: It would be great if anybody could help me
> with a Nutch benchmark on a small cluster, since it would help me determine
> the number of machines I would need for my application to scale up.
> 
> Regards:
> Ashish Vyas
> On Fri, Mar 30, 2012 at 2:16 PM, Christoph Schmitz <[email protected]> wrote:
> > Hi Ashish,
> > 
> > IMHO your numbers (2 machines, 10 URLs) are way too small to outweigh the
> > natural overhead that occurs with a distributed computation (distributing
> > the program code, coordinating the distributed file system, making sure
> > everybody is starting and stopping, etc.). Also, if you're web crawling,
> > the bottleneck might not even be the processing capacity of your
> > machines, but rather some network component on the way between you and
> > the web.
> > 
> > I'm not aware of any Hadoop or Nutch benchmarks, but once you use larger
> > data and/or CPU intensive computations, you should actually see a more or
> > less linear increase in throughput with more machines.
> > 
> > Regards,
> > Christoph
> > 
> > -----Original Message-----
> > From: ashish vyas [mailto:[email protected]]
> > Sent: Friday, 30 March 2012 10:30
> > To: [email protected]
> > Subject: Performance improvement-Cluster vs Pseudo
> > 
> > Hi,
> > 
> > I have set up a Hadoop cluster (2 nodes) and I am running a Nutch crawl
> > on it. I am trying to compare results and the improvement in processing
> > time when I crawl with 10 URLs and depth 2. When I run the crawl on the
> > cluster, it takes more time than on a pseudo-distributed setup, which in
> > turn takes more time than a standalone Nutch crawl.
> > 
> > I would expect processing time to come down when running Nutch on a
> > Hadoop cluster, since that is why Hadoop evolved out of the Nutch
> > project. Please let me know if there is any benchmark test for pseudo
> > vs. cluster, and why the Nutch crawl is taking more time on the cluster.
> > 
> > Please let me know if you need more info.
> > 
> > Regards:
> > Ashish Vyas

-- 
Markus Jelsma - CTO - Openindex
