Re: Web Crawler in hadoop - Unresponsive after a while
Ted Dunning 2011-10-14, 18:30
You would probably be happier using an industrial strength crawler.
Check out Bixo. http://bixolabs.com/about/focused-crawler/

On Thu, Oct 13, 2011 at 5:13 PM, Aishwarya Venkataraman <[EMAIL PROTECTED]> wrote:

> Hello,
>
> I trying to make my web crawling go faster with hadoop. My mapper just
> consists of a single line and my reducer is an IdentityReducer
>
> while read line;do
> #result="`wget -O - --timeout=500 http://$line 2>&1`"
> echo $result
> done
>
> I am crawling about 50,000 sites. But my mapper always seems to time out
> after sometime. The crawler just becomes unresponsive I guess.
> I am not able to see which site is causing the problem as mapper deletes
> the output if the job fails. I am running a single node hadoop cluster
> currently. Is this the problem ?
>
> Did anyone else have a similar problem ? I am not sure why this is
> happening. Can I prevent mapper from deleting intermediate outputs ?
>
> I tried running mapper against 10-20 sites as opposed to 50k sites and that
> worked fine.
>
> Thanks,
> Aishwarya Venkataraman
> [EMAIL PROTECTED]
> Graduate Student | Department of Computer Science
> University of California, San Diego
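
If you would rather keep the shell mapper, here is a rough sketch of one way to make it harder to hang. It is not taken from the original post or from Bixo. The assumptions: GNU `timeout` and `wget` are installed on the task node, the function names `fetch_site` and `crawl`, the `FETCH_LIMIT` variable, the `run` argument and the tab-separated output layout are all made up for illustration, and Hadoop Streaming picks up `reporter:status:` lines written to stderr as status updates (as described in the streaming documentation). Each fetch gets a hard time limit and a single try, and the site name goes to stderr before the fetch starts, so the task log should show which site was in flight if the task dies anyway.

```shell
#!/bin/sh
# Sketch of a Hadoop Streaming mapper: one host name per input line on stdin.
# fetch_site, crawl, FETCH_LIMIT and the "run" argument are illustrative names.

FETCH_LIMIT="${FETCH_LIMIT:-30}"   # hard per-site limit in seconds (assumed value)

# Fetch one site. wget gets a single try and a short network timeout, and
# the outer `timeout` kills it if it still runs past FETCH_LIMIT.
fetch_site() {
  timeout "$FETCH_LIMIT" wget -q -O - --tries=1 --timeout=10 "http://$1"
}

crawl() {
  while IFS= read -r line; do
    # Skip blank input lines.
    [ -z "$line" ] && continue
    # Name the site on stderr before fetching, in the reporter:status: form.
    echo "reporter:status:fetching $line" >&2
    body="$(fetch_site "$line" 2>/dev/null)"
    status=$?
    # Emit host, exit status of the fetch, and page size in characters.
    printf '%s\t%s\t%s\n' "$line" "$status" "${#body}"
  done
}

# Only start crawling when invoked as "script run", so the functions can be
# sourced and tried out without touching the network.
if [ "${1:-}" = "run" ]; then
  crawl
fi
```

A failed or killed fetch shows up as a non-zero status in the second column (GNU `timeout` exits with 124 when it had to kill the command), so the slow sites can be listed afterwards instead of failing the whole task.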