Re: Web Crawler in hadoop - Unresponsive after a while
You would probably be happier using an industrial-strength crawler.

Check out Bixo.

http://bixolabs.com/about/focused-crawler/

On Thu, Oct 13, 2011 at 5:13 PM, Aishwarya Venkataraman <[EMAIL PROTECTED]> wrote:

> Hello,
>
> I am trying to make my web crawling go faster with Hadoop. My mapper just
> consists of a single line and my reducer is an IdentityReducer:
>
> while read line; do
>   result="`wget -O - --timeout=500 http://$line 2>&1`"
>   echo "$result"
> done
>
> I am crawling about 50,000 sites, but my mapper always seems to time out
> after some time; the crawler just becomes unresponsive, I guess.
> I am not able to see which site is causing the problem, as the mapper
> deletes the output if the job fails. I am currently running a
> single-node Hadoop cluster. Is this the problem?
>
> Did anyone else have a similar problem? I am not sure why this is
> happening. Can I prevent the mapper from deleting intermediate outputs?
>
> I tried running the mapper against 10-20 sites as opposed to 50k, and that
> worked fine.
>
> Thanks,
> Aishwarya Venkataraman
> [EMAIL PROTECTED]
> Graduate Student | Department of Computer Science
> University of California, San Diego
>
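
On the timeout question above: Hadoop Streaming kills a task that produces no output and reports no status for mapred.task.timeout milliseconds (600,000 ms, i.e. 10 minutes, by default), so one slow or hung site can take a whole mapper attempt down. Below is a minimal sketch of a mapper script that reports progress per URL via the reporter:status convention on stderr; the fetch.sh name, the shorter wget timeout, and the url/byte-count output format are illustrative assumptions, not the original script.

  #!/usr/bin/env bash
  # fetch.sh - hypothetical streaming mapper sketch (not the poster's script).
  # Hadoop Streaming treats stderr lines of the form "reporter:status:<msg>"
  # as status/progress updates, which keeps the task from being killed by
  # mapred.task.timeout while a slow host is being fetched.
  while read -r line; do
    echo "reporter:status:fetching $line" >&2
    # Bound each fetch so a single dead host cannot hang the task for long.
    result=$(wget -O - --timeout=30 --tries=1 "http://$line" 2>&1)
    # Emit url<TAB>bytes-fetched so a problematic site is visible in the output.
    printf '%s\t%s\n' "$line" "${#result}"
  done

Writing the URL being fetched to stderr (which ends up in the task's userlogs) is usually enough to identify which site is causing the hang, even on a single-node cluster.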
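
On keeping the intermediate output: a failed attempt's local work directory can be preserved with keep.failed.task.files, and the task timeout itself can be raised. A hedged sketch of a streaming job submission follows, assuming the 0.20-era contrib jar location and old-style property names (newer releases use mapreduce.task.timeout and mapreduce.task.files.preserve.failedtasks); fetch.sh is the mapper sketched above.

  # Keep failed attempts' work dirs and allow 30 minutes instead of 10
  # before a task that reports no progress is killed.
  hadoop jar "$HADOOP_HOME"/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.task.timeout=1800000 \
    -D keep.failed.task.files=true \
    -input urls.txt \
    -output crawl-out \
    -mapper fetch.sh \
    -file fetch.sh \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer

With keep.failed.task.files set, the failed attempt's files stay under the TaskTracker's local mapred directory, so you can inspect on the node how far the mapper got before it was killed.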