

Re: Web Crawler in hadoop - Unresponsive after a while
Aishwarya, you should probably ask on the -user list.
Moreover, you should probably just look at and use Nutch, which uses MR under the hood for fetching and other tasks - see http://nutch.apache.org/

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Hadoop - HBase
Hadoop ecosystem search :: http://search-hadoop.com/

>________________________________
>From: Aishwarya Venkataraman <[EMAIL PROTECTED]>
>To: [EMAIL PROTECTED]
>Sent: Thursday, October 13, 2011 8:13 PM
>Subject: Web Crawler in hadoop - Unresponsive after a while
>
>Hello,
>
>I am trying to make my web crawling go faster with Hadoop. My mapper
>consists of a single loop and my reducer is an IdentityReducer.
>
>while read -r line; do
>  # Fetch the page to stdout; wget's own messages are captured as well (2>&1)
>  result="$(wget -O - --timeout=500 "http://$line" 2>&1)"
>  echo "$result"
>done
>
>I am crawling about 50,000 sites, but my mapper always seems to time out
>after some time. I guess the crawler just becomes unresponsive.
>I am not able to see which site is causing the problem because the mapper
>deletes its output if the job fails. I am currently running a single-node
>Hadoop cluster.
>Is this the problem?
>
>Has anyone else had a similar problem? I am not sure why this is
>happening. Can I prevent the mapper from deleting intermediate outputs?
>
>I tried running the mapper against 10-20 sites instead of 50k sites and
>that worked fine.
>
>Thanks,
>Aishwarya Venkataraman
>[EMAIL PROTECTED]
>Graduate Student | Department of Computer Science
>University of California, San Diego
>
>
>
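A minimal sketch of a more defensive version of the mapper loop quoted above, under several assumptions that are not from the thread. It assumes the job runs under Hadoop Streaming, where lines of the form reporter:status:... and reporter:counter:... written to stderr are treated as progress reports, and that the task is being killed because one slow site produces no output for longer than the task timeout (600 seconds by default in Hadoop releases of that period). With --timeout=500 and wget's default of 20 tries, a single host could plausibly stall for much longer than that. The FETCH variable, the crawl function, the --run flag, the counter names, and the shorter timeout values are all hypothetical choices made for illustration.

```shell
#!/bin/sh
# Hypothetical hardened streaming mapper; not the original poster's script.
# FETCH is an illustrative knob so the fetch command can be swapped out.
# The short timeout and low retry count are assumed values, not tuned ones.
FETCH=${FETCH:-"wget -q -O - --timeout=30 --tries=2"}

crawl() {
  while IFS= read -r line; do
    # Skip blank input lines
    [ -z "$line" ] && continue
    # A status line on stderr signals to Hadoop Streaming that the task is alive
    echo "reporter:status:fetching $line" >&2
    if result=$($FETCH "http://$line" 2>/dev/null); then
      # Flatten the page to one line so that each site is a single record
      printf '%s\t%s\n' "$line" "$(printf '%s' "$result" | tr '\n\t' '  ')"
    else
      # Count the failure and emit a marker record instead of aborting the task
      echo "reporter:counter:crawler,fetch_failed,1" >&2
      printf '%s\tFETCH_FAILED\n' "$line"
    fi
  done
}

# Read from stdin only when explicitly asked to, e.g. -mapper "crawl.sh --run"
if [ "${1:-}" = "--run" ]; then
  crawl
fi
```

Because every site produces a record, including the ones that fail, the output itself shows which hosts were problematic. On the question of keeping files from failed tasks, Hadoop releases of that era had a keep.failed.task.files job property; whether it behaves as needed on a given version is something to verify rather than assume.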