Re: Question from a Desperate Java Newbie
I totally obey the robots.txt since I am only fetching RSS feeds :-)
I implemented my crawler with HttpClient and it is working fine.
I often get messages about "Cookie rejected", but am able to fetch news
articles anyway.

I guess the default "java.net" client is the stateful client you mentioned.
Thanks for the tip!!
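
For anyone who wants to stay with the plain java.net classes, here is a minimal sketch of making the cookie behaviour explicit instead of relying on whatever the default happens to be. It installs a CookieManager with CookiePolicy.ACCEPT_NONE, so no cookie is ever stored or sent back. The class name, the helper names and the example.com URL are made up for illustration, and the Set-Cookie header is simulated rather than fetched from a real site.

```java
import java.io.IOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of any library.
final class StatelessFetchSetup {

    // Installs a process-wide cookie handler that refuses every cookie,
    // so java.net HTTP connections opened afterwards stay stateless.
    static CookieManager installCookieFreeHandler() {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_NONE);
        CookieHandler.setDefault(manager);
        return manager;
    }

    // Feeds a simulated Set-Cookie response header to the handler and
    // reports how many cookies ended up in its store.
    static int cookiesStoredAfter(String setCookieValue) {
        CookieManager manager = installCookieFreeHandler();
        Map<String, List<String>> headers =
                Collections.singletonMap("Set-Cookie", Collections.singletonList(setCookieValue));
        try {
            // example.com is a placeholder, no request is actually sent
            manager.put(URI.create("http://example.com/feed.rss"), headers);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return manager.getCookieStore().getCookies().size();
    }

    public static void main(String[] args) {
        System.out.println("cookies stored: " + cookiesStoredAfter("visited=1; Path=/"));
    }
}
```

Calling installCookieFreeHandler() once at startup, before opening any URLConnection, should be enough for a simple feed fetcher.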

Ed

On December 16, 2010 at 2:18 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:

> On 10/12/10 09:08, Edward Choi wrote:
> > I was wrong. It wasn't because of the "read once free" policy. I tried
> > again with Java first and this time it didn't work.
> > I looked it up on Google and found the HttpClient you mentioned. It is the
> > one provided by Apache, right? I guess I will have to try that one now. Thanks!
> >
>
> HttpClient is good. HtmlUnit also has a very good client that can simulate
> a full web browser, cookies included, but that may be overkill.
>
> NYT's read-once policy uses cookies to check that this is your first day
> visiting while not logged in. On later days you get a 302 redirect unless
> you delete the cookie, so stateful clients are bad.
>
> What you may have been hit by is whatever robot trap they have: if you
> generate too much load and don't follow the robots.txt rules, they may
> detect this and push back.
>
>
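
The two points about robot traps above (respect robots.txt, keep the load down) can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions: the parser below looks at "User-agent: *" groups and plain Disallow prefixes only and ignores Allow rules, wildcards and agent-specific groups, so it is not a full robots.txt implementation. The class name, method names and the sample paths are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any library.
final class PoliteCrawlerSketch {

    // Collects the Disallow prefixes from groups that apply to every agent.
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inWildcardGroup = false;
        boolean lastWasUserAgent = false;
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            String line = rawLine;
            int hash = line.indexOf('#');          // strip trailing comments
            if (hash >= 0) line = line.substring(0, hash);
            line = line.trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue;               // blank or malformed line
            String field = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (field.equalsIgnoreCase("user-agent")) {
                // consecutive User-agent lines belong to the same group
                if (!lastWasUserAgent) inWildcardGroup = false;
                if (value.equals("*")) inWildcardGroup = true;
                lastWasUserAgent = true;
            } else {
                lastWasUserAgent = false;
                if (field.equalsIgnoreCase("disallow") && inWildcardGroup && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    // True if no collected Disallow prefix matches the start of the path.
    static boolean isAllowed(String robotsTxt, String path) {
        for (String prefix : disallowedPrefixes(robotsTxt)) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    // Milliseconds still to wait so that requests are at least minGapMillis apart.
    static long millisToWait(long lastRequestMillis, long nowMillis, long minGapMillis) {
        return Math.max(0L, lastRequestMillis + minGapMillis - nowMillis);
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\n";
        System.out.println(isAllowed(robots, "/feeds/news.rss"));
        System.out.println(isAllowed(robots, "/private/x"));
        System.out.println(millisToWait(1000L, 1500L, 2000L));
    }
}
```

A crawler loop would call isAllowed() before each fetch and Thread.sleep(millisToWait(...)) between requests to the same host.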