Re: Performance improvement-Cluster vs Pseudo
Christoph Schmitz 2012-03-30, 08:46
IMHO your numbers (2 machines, 10 URLs) are way too small to outweigh the natural overhead that occurs with a distributed computation (distributing the program code, coordinating the distributed file system, making sure everybody is starting and stopping, etc.). Also, if you're web crawling, the bottleneck might not even be the processing capacity of your machines, but rather some network component on the way between you and the web.
I'm not aware of any Hadoop or Nutch benchmarks, but once you use larger data and/or CPU intensive computations, you should actually see a more or less linear increase in throughput with more machines.
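To make the overhead argument concrete, here is a minimal back-of-the-envelope sketch. It assumes a very simple model (a fixed coordination overhead plus work that divides evenly across machines), and all numbers in it are made up for illustration; they are not measurements from Hadoop or Nutch. The names estimated_runtime and break_even_work are hypothetical helpers, not part of any library.

```python
def estimated_runtime(work_s: float, nodes: int, overhead_s: float) -> float:
    """Wall-clock estimate: fixed overhead plus evenly divided work.

    Assumes perfect parallelism, which real jobs only approximate.
    """
    return overhead_s + work_s / nodes


def break_even_work(nodes: int, overhead_s: float) -> float:
    """Amount of work (in seconds) at which the cluster matches one machine.

    Solves overhead_s + w / nodes == w for w; requires nodes > 1.
    """
    if nodes < 2:
        raise ValueError("need at least 2 nodes to compare against standalone")
    return overhead_s * nodes / (nodes - 1)


# Hypothetical figures: 60 s of cluster overhead, none for standalone.
CLUSTER_OVERHEAD_S = 60.0

small_job_s = 20.0      # e.g. a tiny crawl (assumed value)
large_job_s = 36000.0   # e.g. ten hours of processing (assumed value)

for label, work in [("small", small_job_s), ("large", large_job_s)]:
    standalone = estimated_runtime(work, nodes=1, overhead_s=0.0)
    cluster = estimated_runtime(work, nodes=2, overhead_s=CLUSTER_OVERHEAD_S)
    print(f"{label}: standalone={standalone:.0f}s, 2-node cluster={cluster:.0f}s")

print("break-even work for 2 nodes:", break_even_work(2, CLUSTER_OVERHEAD_S), "s")
```

Under these assumed numbers the small job is slower on the cluster, while the large job finishes in roughly half the time, which is the pattern described above. The model ignores network limits entirely, so for a crawl the real gain may be smaller.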
From: ashish vyas [mailto:[EMAIL PROTECTED]]
Sent: Friday, 30 March 2012 10:30
To: [EMAIL PROTECTED]
Subject: Performance improvement-Cluster vs Pseudo
I have set up a Hadoop cluster (2 nodes) and I am running a Nutch crawl on it. I am trying to compare results and the improvement in processing time when I crawl 10 URLs at depth 2. When I run the crawl on the cluster, it takes more time than on the pseudo-distributed cluster, which in turn takes more time than a standalone Nutch crawl.
I would have expected processing time to come down when running Nutch on a Hadoop cluster, since that is why Hadoop evolved out of the Nutch project. Please let me know if there is any benchmark test for pseudo vs cluster, and why the Nutch crawl is taking more time on the cluster.
Please let me know if you need more info.