Re: Question from a Desperate Java Newbie
edward choi 2010-12-10, 07:33
Are you talking about java.net.HttpURLConnection?
If so, I've already tried using that with the getInputStream() method, but still no luck.

I actually got an interesting answer from Aardvark, which said that the NY Times has a policy called "Read once for free". So apparently I crawled with the C++ application first and blew my chance to crawl with the Java application. The answerer was not sure about this policy, but I think it makes sense, because today I tried the Java crawler first and it worked just fine.

Ed

2010/12/10 Hemanth Yamijala <[EMAIL PROTECTED]>

> Not exactly what you may want - but could you try using an HTTP client
> in Java? Some of them have the ability to automatically follow
> redirects, manage cookies etc.
>
> Thanks
> hemanth
>
> On Thu, Dec 9, 2010 at 4:35 PM, edward choi <[EMAIL PROTECTED]> wrote:
> > Excuse me for asking a general Java question here.
> > I tried to find a Java mailing list from Google but none of them were active.
> >
> > There is a problem that's been driving me crazy for a while.
> >
> > I am trying to download webpages from the New York Times.
> > With Java URL.openStream(), I can't get past the login requirement.
> > But with C++ socket programming (using read() and write()), I can download
> > any webpage just fine.
> >
> > The interesting thing is that with C++, I get redirected like 10 times. Below is
> > the header of the first redirected webpage when I try to download
> > "
> > http://www.nytimes.com/glogin?URI=http://www.nytimes.com:80/2010/12/09/world/asia/09military.html&OQ=_rQ3D1Q26hp&OP=47c049a1Q2FVQ3EY6VQ5Dks9akk5Q27VQ27Q2AFQ2AVFQ27VQ2AtVQ3EkahQ5DVQ3F9Q5CQ3FVQ2AtbQ5ChQ5C5Q3Fa!Q2BN5bh
> > "
> >
> > HTTP/1.1 302 Moved Temporarily
> > Server: Sun-ONE-Web-Server/6.1
> > Date: Thu, 09 Dec 2010 08:42:35 GMT
> > Content-type: text/html
> > Set-cookie: RMID=0b5d4aea392d4d00967bfaf1; expires=Friday, 09-Dec-2011 08:42:35 GMT; path=/; domain=.nytimes.com
> > Set-cookie: NYT_GR=4d009b2b-yJ4V047ooAmPtGcvASTmng; path=/; domain=.nytimes.com
> > Set-cookie: NYT-S=0Mzh9PJwQ663rDXrmvxADeHJOGvJvXmRaJdeFz9JchiAJK89nlVaR7bsV.Ynx4rkFI; expires=Saturday, 08-Jan-2011 08:42:35 GMT; path=/; domain=.nytimes.com
> > Set-cookie: NYT-Pref=hppznw|^creator|NYTD.Cookies; path=/; domain=.nytimes.com
> > Location: http://www.nytimes.com:80/2010/12/09/world/asia/09military.html?_r=1&hp
> > Expires: Thu, 01 Dec 1994 16:00:00 GMT
> > Cache-control: no-cache
> > Pragma: no-cache
> > Connection: close
> >
> > But with Java, I get redirected only once to an https:// webpage and it's a
> > dead end. Below is the result of java.net.URLConnection.getHeaderFields():
> >
> > HTTP/1.1 301 Moved Permanently,
> > Date: Thu, 09 Dec 2010 10:50:53 GMT,
> > Content-type: text/html,
> > Content-length: 0,
> > Location: https://myaccount.nytimes.com/auth/login?URI=/2010/12/09/world/asia/09military.html&OQ=_rQ3D5Q26hp&REFUSE_COOKIE_ERROR=SHOW_ERROR,
> > Server: Sun-ONE-Web-Server/6.1,
> >
> > There is a clear difference between the two. I don't know why and it's been
> > driving me crazy.
> > My guess is that the C++ write() function can create some kind of cookie by
> > itself, but Java URL.openStream() can't.
> >
> > Am I right? Or can anyone explain this for me?
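
For readers following the suggestion above about an HTTP client that follows redirects and manages cookies, here is a minimal sketch using only java.net classes. It is not the code from this thread and is not confirmed to get past any particular site's login flow. The class name CookieAwareFetcher, the method fetch, and the maxRedirects parameter are hypothetical names chosen for illustration. It assumes Java 9 or later (for InputStream.readAllBytes) and a UTF-8 response body. As far as I know, HttpURLConnection does not automatically follow a redirect that switches between http:// and https://, so the sketch follows redirects by hand and lets a CookieManager replay any Set-Cookie values on each hop.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class CookieAwareFetcher {

    /** Fetches startUrl, following up to maxRedirects redirects while keeping cookies. */
    static String fetch(String startUrl, int maxRedirects) throws IOException {
        // Install a process-wide in-memory cookie jar once, so that Set-Cookie
        // values from one response are sent back on later java.net requests.
        if (CookieHandler.getDefault() == null) {
            CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
        }

        URL url = new URL(startUrl);
        for (int hop = 0; hop <= maxRedirects; hop++) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // Handle redirects manually so that protocol changes are followed too.
            conn.setInstanceFollowRedirects(false);

            int status = conn.getResponseCode();
            boolean redirect = status == 301 || status == 302 || status == 303
                    || status == 307 || status == 308;
            if (redirect) {
                String location = conn.getHeaderField("Location");
                conn.disconnect();
                if (location == null) {
                    throw new IOException("Redirect without a Location header");
                }
                url = new URL(url, location); // also resolves relative Location values
                continue;
            }

            // Assumption: the body is UTF-8 text.
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
        throw new IOException("Gave up after " + maxRedirects + " redirects");
    }
}
```

A call such as CookieAwareFetcher.fetch(someUrl, 10) would then return the final page body, or throw an IOException if the redirect chain never ends. Whether a cookie jar alone is enough depends on the server; it only addresses the cookie and redirect difference visible in the two header dumps.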