Hadoop >> mail # user >> Question from a Desperate Java Newbie


Re: Question from a Desperate Java Newbie
Not exactly what you may want, but could you try using an HTTP client
in Java? Some of them can automatically follow redirects, manage
cookies, etc.

Thanks
hemanth
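
A minimal sketch of the suggestion above, assuming Java 11 or later, where the JDK ships its own java.net.http.HttpClient. The class name CookieAwareFetcher and its two helper methods are hypothetical names chosen for illustration; third-party HTTP clients offer comparable settings. The client is given a cookie store and told to follow redirects, so cookies set by one response are sent back on the next hop.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper class, not part of any library.
public class CookieAwareFetcher {

    // Build a client that remembers cookies between requests and follows
    // redirects (Redirect.NORMAL follows everything except HTTPS -> HTTP).
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager(null, CookiePolicy.ACCEPT_ALL))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    // Fetch a URL and return the body of the final response as a string.
    static String fetch(HttpClient client, String url)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java CookieAwareFetcher <url>");
            return;
        }
        System.out.println(fetch(newClient(), args[0]));
    }
}
```

The client caps how many redirects it will follow, so a long chain like the one described in the question may need that limit raised.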

On Thu, Dec 9, 2010 at 4:35 PM, edward choi <[EMAIL PROTECTED]> wrote:
> Excuse me for asking a general Java question here.
> I tried to find a Java mailing list through Google, but none of them were active.
>
> There is a problem that's been driving me crazy for a while.
>
> I am trying to download webpages from the New York Times.
> With Java's URL.openStream(), I can't get past the login requirement.
> But with C++ socket programming (using read() and write()), I can download
> any webpage just fine.
>
> The interesting thing is that with C++, I get redirected about 10 times. Below
> are the headers of the first redirect response when I try to
> download
> "
> http://www.nytimes.com/glogin?URI=http://www.nytimes.com:80/2010/12/09/world/asia/09military.html&OQ=_rQ3D1Q26hp&OP=47c049a1Q2FVQ3EY6VQ5Dks9akk5Q27VQ27Q2AFQ2AVFQ27VQ2AtVQ3EkahQ5DVQ3F9Q5CQ3FVQ2AtbQ5ChQ5C5Q3Fa!Q2BN5bh
> "
>
> HTTP/1.1 302 Moved Temporarily
> Server: Sun-ONE-Web-Server/6.1
> Date: Thu, 09 Dec 2010 08:42:35 GMT
> Content-type: text/html
> Set-cookie: RMID=0b5d4aea392d4d00967bfaf1; expires=Friday, 09-Dec-2011 08:42:35 GMT; path=/; domain=.nytimes.com
> Set-cookie: NYT_GR=4d009b2b-yJ4V047ooAmPtGcvASTmng; path=/; domain=.nytimes.com
> Set-cookie: NYT-S=0Mzh9PJwQ663rDXrmvxADeHJOGvJvXmRaJdeFz9JchiAJK89nlVaR7bsV.Ynx4rkFI; expires=Saturday, 08-Jan-2011 08:42:35 GMT; path=/; domain=.nytimes.com
> Set-cookie: NYT-Pref=hppznw|^creator|NYTD.Cookies; path=/; domain=.nytimes.com
> Location: http://www.nytimes.com:80/2010/12/09/world/asia/09military.html?_r=1&hp
> Expires: Thu, 01 Dec 1994 16:00:00 GMT
> Cache-control: no-cache
> Pragma: no-cache
> Connection: close
>
> But with Java, I get redirected only once, to an https:// webpage, and it's a
> dead end. Below is the result of java.net.URLConnection.getHeaderFields():
>
> HTTP/1.1 301 Moved Permanently,
> Date: Thu, 09 Dec 2010 10:50:53 GMT,
> Content-type: text/html,
> Content-length: 0,
> Location: https://myaccount.nytimes.com/auth/login?URI=/2010/12/09/world/asia/09military.html&OQ=_rQ3D5Q26hp&REFUSE_COOKIE_ERROR=SHOW_ERROR,
> Server: Sun-ONE-Web-Server/6.1,
>
> There is a clear difference between the two. I don't know why, and it's been
> driving me crazy.
> My guess is that the C++ write() function can create some kind of cookie by
> itself, but Java's URL.openStream() can't.
>
> Am I right? Or can anyone explain this for me?
>
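
One possible reading of the two header dumps quoted above: the REFUSE_COOKIE_ERROR parameter in the Java redirect suggests the server expected the cookies from its first 302 response to be sent back, and URL.openStream() sends none unless a cookie handler has been installed. write() does not create cookies on its own either; a socket program only sends them if it copies the Set-cookie values into a Cookie request header itself. The sketch below shows, without touching the network, how java.net.CookieManager turns Set-cookie response headers into the Cookie values for the next request. The class and method names are hypothetical, and the expires attributes are left out as an assumption of the example, because those dates are long past and the cookie store would discard the cookies.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class, not part of any library.
public class CookieReplayDemo {

    // Store the Set-Cookie headers received from firstUrl, then return the
    // Cookie header values the store would attach to a request for nextUrl.
    static List<String> cookiesFor(String firstUrl,
                                   List<String> setCookieHeaders,
                                   String nextUrl) {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        Map<String, List<String>> responseHeaders = new HashMap<>();
        responseHeaders.put("Set-Cookie", setCookieHeaders);
        try {
            manager.put(URI.create(firstUrl), responseHeaders);
            Map<String, List<String>> requestHeaders =
                    manager.get(URI.create(nextUrl), new HashMap<>());
            List<String> cookies = requestHeaders.get("Cookie");
            return cookies == null ? List.of() : cookies;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Two of the cookies from the quoted 302 response, minus "expires".
        List<String> fromServer = List.of(
                "RMID=0b5d4aea392d4d00967bfaf1; path=/; domain=.nytimes.com",
                "NYT_GR=4d009b2b-yJ4V047ooAmPtGcvASTmng; path=/; domain=.nytimes.com");
        // The redirect target taken from the quoted Location header.
        String next =
                "http://www.nytimes.com:80/2010/12/09/world/asia/09military.html?_r=1&hp";
        System.out.println(
                cookiesFor("http://www.nytimes.com/glogin", fromServer, next));
    }
}
```

Installing such a manager globally with CookieHandler.setDefault(...) makes URL.openStream() and HttpURLConnection use it, though HttpURLConnection still does not follow a redirect that switches between http and https.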