Hadoop >> mail # user >> Question from a Desperate Java Newbie


Question from a Desperate Java Newbie
Excuse me for asking a general Java question here.
I tried to find a Java mailing list through Google, but none of them were active.

There is a problem that's been driving me crazy for a while.

I am trying to download webpages from the New York Times.
With Java's URL.openStream(), I can't get past the login requirement.
But with C++ socket programming (using read() and write()), I can download
any webpage just fine.

Interestingly, with C++ I get redirected about 10 times. Below is the
header of the first redirect response I get when I try to download
"http://www.nytimes.com/glogin?URI=http://www.nytimes.com:80/2010/12/09/world/asia/09military.html&OQ=_rQ3D1Q26hp&OP=47c049a1Q2FVQ3EY6VQ5Dks9akk5Q27VQ27Q2AFQ2AVFQ27VQ2AtVQ3EkahQ5DVQ3F9Q5CQ3FVQ2AtbQ5ChQ5C5Q3Fa!Q2BN5bh"

HTTP/1.1 302 Moved Temporarily
Server: Sun-ONE-Web-Server/6.1
Date: Thu, 09 Dec 2010 08:42:35 GMT
Content-type: text/html
Set-cookie: RMID=0b5d4aea392d4d00967bfaf1; expires=Friday, 09-Dec-2011 08:42:35 GMT; path=/; domain=.nytimes.com
Set-cookie: NYT_GR=4d009b2b-yJ4V047ooAmPtGcvASTmng; path=/; domain=.nytimes.com
Set-cookie: NYT-S=0Mzh9PJwQ663rDXrmvxADeHJOGvJvXmRaJdeFz9JchiAJK89nlVaR7bsV.Ynx4rkFI; expires=Saturday, 08-Jan-2011 08:42:35 GMT; path=/; domain=.nytimes.com
Set-cookie: NYT-Pref=hppznw|^creator|NYTD.Cookies; path=/; domain=.nytimes.com
Location: http://www.nytimes.com:80/2010/12/09/world/asia/09military.html?_r=1&hp
Expires: Thu, 01 Dec 1994 16:00:00 GMT
Cache-control: no-cache
Pragma: no-cache
Connection: close

But with Java, I get redirected only once, to an https:// page, and it's a
dead end. Below is the result of java.net.URLConnection.getHeaderFields():

HTTP/1.1 301 Moved Permanently,
Date: Thu, 09 Dec 2010 10:50:53 GMT,
Content-type: text/html,
Content-length: 0,
Location: https://myaccount.nytimes.com/auth/login?URI=/2010/12/09/world/asia/09military.html&OQ=_rQ3D5Q26hp&REFUSE_COOKIE_ERROR=SHOW_ERROR,
Server: Sun-ONE-Web-Server/6.1,
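
For reference, here is a minimal sketch of how a header dump like the one above could be produced. It is not the exact code I ran: the class name HeaderDump and the format() helper are made up for illustration, and the URL is taken from the command line. getHeaderFields() returns a map in which the status line is stored under a null key.

```java
import java.net.URI;
import java.net.URLConnection;
import java.util.List;
import java.util.Map;

public class HeaderDump {
    // Turns the map returned by URLConnection.getHeaderFields() into
    // one "Name: value" line per header value. The status line is stored
    // under a null key, so it is printed without a name.
    static String format(Map<String, List<String>> headers) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            for (String value : entry.getValue()) {
                if (entry.getKey() == null) {
                    out.append(value);
                } else {
                    out.append(entry.getKey()).append(": ").append(value);
                }
                out.append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java HeaderDump <url>");
            return;
        }
        // Opens the connection and prints the headers of the response
        // that Java ended up with after its own redirect handling.
        URLConnection conn = URI.create(args[0]).toURL().openConnection();
        System.out.print(format(conn.getHeaderFields()));
    }
}
```

Run it as "java HeaderDump <url>" with the article URL to reproduce the kind of output shown above.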

There is a clear difference between the two responses. I don't know why,
and it's been driving me crazy.
My guess is that the C++ write() function can create some kind of cookie by
itself, but Java's URL.openStream() can't.

Am I right? Or can anyone explain this to me?
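
One way to test the cookie guess from the Java side might look like the sketch below. It rests on two things I believe are true of the standard library: no cookie handler is installed by default, so Set-cookie headers are dropped between redirects unless a java.net.CookieManager is set, and HttpURLConnection does not follow a redirect that switches from http to https. The class name CookieFetch and the helper installCookieManager() are made up for illustration, and I have not confirmed that this is enough to get past the nytimes.com login.

```java
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URI;

public class CookieFetch {
    // Installs a process-wide, in-memory cookie store. Once it is set,
    // HttpURLConnection saves Set-cookie headers from each response and
    // sends matching cookies back on later requests, including redirects.
    static CookieManager installCookieManager() {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);
        return manager;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java CookieFetch <url>");
            return;
        }
        CookieManager manager = installCookieManager();

        HttpURLConnection conn =
                (HttpURLConnection) URI.create(args[0]).toURL().openConnection();
        // Redirects are followed by default; this makes it explicit.
        // A redirect from http to https is still not followed.
        conn.setInstanceFollowRedirects(true);

        System.out.println(conn.getResponseCode() + " " + conn.getURL());
        // Shows which cookies the server handed out along the way.
        System.out.println(manager.getCookieStore().getCookies());
    }
}
```

If the final status code is 200 and the cookie list contains entries such as RMID and NYT-S, that would support the idea that missing cookies are the difference between the two programs.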