Re: Question from a Desperate Java Newbie
I totally obey robots.txt since I am only fetching RSS feeds :-)
I implemented my crawler with HttpClient and it is working fine.
I often get "Cookie rejected" messages, but I am able to fetch the news
articles anyway.
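
For reference, here is a minimal sketch of how those cookie warnings can
be silenced, assuming Apache HttpClient 4.x (the cookie-policy parameter
names below are from that version; the feed URL is just a placeholder):

    import org.apache.http.HttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.client.params.ClientPNames;
    import org.apache.http.client.params.CookiePolicy;
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.util.EntityUtils;

    public class FeedFetcher {
        public static void main(String[] args) throws Exception {
            DefaultHttpClient client = new DefaultHttpClient();
            // RSS fetching needs no session state, so ignore cookies
            // entirely; this also silences the "Cookie rejected" warnings.
            client.getParams().setParameter(
                    ClientPNames.COOKIE_POLICY, CookiePolicy.IGNORE_COOKIES);
            HttpGet get = new HttpGet("http://example.com/news.rss"); // placeholder URL
            HttpResponse response = client.execute(get);
            System.out.println(EntityUtils.toString(response.getEntity()));
            client.getConnectionManager().shutdown();
        }
    }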

I guess the default "java.net" client is the stateful client you mentioned.
Thanks for the tip!!

Ed

On December 16, 2010, at 2:18 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:

> On 10/12/10 09:08, Edward Choi wrote:
> > I was wrong. It wasn't because of the "read once free" policy. I tried
> > again with Java first and this time it didn't work.
> > I looked it up on Google and found the HttpClient you mentioned. It is
> > the one provided by Apache, right? I guess I will have to try that one
> > now. Thanks!
> >
>
> HttpClient is good. HtmlUnit has a very good client that can simulate
> a full web browser with cookies, but that may be overkill.
>
> NYT's read-once policy uses cookies to verify that you are on your first
> day of reading and not logged in; on later days you get 302'd unless you
> delete the cookie, so stateful clients are bad here.
>
> What you may have been hit by is whatever robot trap they have: if you
> generate too much load and don't follow the robots.txt rules, they may
> detect this and push back.
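
A minimal sketch of the stateless approach Steve describes, again assuming
Apache HttpClient 4.x (the class and fetch method here are illustrative):
clearing the cookie store before each request means the metering cookie
never survives between fetches, so every request looks like a first visit.

    import org.apache.http.HttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.util.EntityUtils;

    public class StatelessFetcher {
        private final DefaultHttpClient client = new DefaultHttpClient();

        // Drop whatever cookies the previous response set before fetching,
        // so a cookie-based "read once" meter never sees a repeat visit.
        public String fetch(String url) throws Exception {
            client.getCookieStore().clear();
            HttpResponse response = client.execute(new HttpGet(url));
            return EntityUtils.toString(response.getEntity());
        }
    }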