Hadoop >> mail # user >> Web crawler in hadoop - unresponsive after a while

Re: Web crawler in hadoop - unresponsive after a while
Hi Aishwarya
        You don't necessarily need the intermediate output to debug this issue. If there is an error or exception, you can get it directly from your job logs. In your case the job becomes unresponsive, so to troubleshoot further you can add log statements to your program, rerun it, and identify from the logs the records that cause the problem.
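As a rough sketch of what such log statements could look like in a shell mapper: the version below writes a START and an END line to stderr around every fetch, so the task's stderr log shows which site was in flight when the task stalled. The helper names `fetch` and `crawl`, the `--run` flag, the 30-second timeout, and the single try are my assumptions for illustration, not part of the original script. The `reporter:status:` prefix is the Hadoop Streaming convention for status updates sent over stderr.

```shell
#!/bin/sh
# Sketch of a Hadoop Streaming mapper that logs each site before and after fetching.

# fetch HOST: hypothetical helper wrapping wget. The timeout and the number of
# tries are illustrative values, not taken from the original script.
fetch() {
  wget -q -O - --timeout=30 --tries=1 "http://$1" 2>&1
}

# crawl: read one host per line from stdin, log progress to stderr,
# and write each fetched page to stdout as a single line.
crawl() {
  while read -r line; do
    echo "START $line" >&2
    # Hadoop Streaming treats stderr lines with this prefix as status updates.
    echo "reporter:status:fetching $line" >&2
    result=$(fetch "$line")
    status=$?
    echo "END $line exit=$status" >&2
    # Flatten newlines (an assumption) so each site stays one output record.
    printf '%s' "$result" | tr '\n' ' '
    printf '\n'
  done
}

# Run the loop only when called as "mapper.sh --run" (hypothetical name and flag).
if [ "${1:-}" = "--run" ]; then
  crawl
fi
```

A site with a START line but no matching END line is the one the mapper was stuck on. As far as I know, Hadoop kills a task that reports no progress within `mapred.task.timeout` (600 seconds by default), so a short wget timeout with a single try may also keep one slow site from stalling the whole task.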
       You can get the logs directly from the JobTracker web UI at http://<host>:50030/jobtracker.jsp. From your job, drill down to the task; on the right side you will see options to display the task tracker logs.
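Once you have saved a task's stderr log from that page, a small helper along these lines could pick out the problem records. It is only a sketch and assumes the mapper writes lines of the form `START <site>` and `END <site> ...` to stderr; both that log format and the function name `find_unfinished` are hypothetical.

```shell
# find_unfinished LOGFILE: print every site that has a START line
# but no later END line in the given log file.
find_unfinished() {
  awk '
    $1 == "START" { pending[$2] = 1 }       # fetch began
    $1 == "END"   { delete pending[$2] }    # fetch finished
    END { for (site in pending) print site }
  ' "$1" | sort
}
```

For example, `find_unfinished stderr.log` (with `stderr.log` being whatever file you saved the log to) lists the sites whose fetch never completed.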
       I'd also like to add that, since you mentioned a single node, I assume you are running in either standalone or pseudo-distributed mode. These setups are basically meant for development and functional testing. If you are looking for better job performance, you need to leverage the parallel processing power of Hadoop. You need at least a mini cluster for performance benchmarking and for processing relatively large volumes of data.

Hope it helps!

------Original Message------
From: Aishwarya Venkataraman
Subject: Web crawler in hadoop - unresponsive after a while
Sent: Oct 14, 2011 08:20


I am trying to make my web crawling go faster with Hadoop. My mapper just
consists of a single line and my reducer is an IdentityReducer.

while read line; do
  # Fetch the page to stdout; wget's stderr is captured along with it
  result="`wget -O - --timeout=500 http://$line 2>&1`"
  echo $result
done

I am crawling about 50,000 sites, but my mapper always seems to time out
after some time. The crawler just becomes unresponsive, I guess.
I am not able to see which site is causing the problem, as the mapper deletes the
output if the job fails. I am running a single-node Hadoop cluster.
Is this the problem?

Has anyone else had a similar problem? I am not sure why this is
happening. Can I prevent the mapper from deleting intermediate outputs?

I tried running the mapper against 10-20 sites as opposed to 50k sites, and that
worked fine.


Bejoy K S