|
edward choi
2010-12-09, 11:05
Hemanth Yamijala
2010-12-10, 02:33
Santosh Borse
2010-12-10, 04:50
edward choi
2010-12-10, 07:29
edward choi
2010-12-10, 07:33
Edward Choi
2010-12-10, 09:08
Steve Loughran
2010-12-15, 17:18
edward choi
2010-12-16, 06:14
|
-
Question from a Desperate Java Newbie
edward choi 2010-12-09, 11:05
Excuse me for asking a general Java question here. I tried to find a Java mailing list through Google, but none of the ones I found were active.

There is a problem that has been driving me crazy for a while. I am trying to download web pages from the New York Times. With Java's URL.openStream(), I can't get past the login requirement. But with C++ socket programming (using read() and write()), I can download any web page just fine.

The interesting thing is that with C++ I get redirected about 10 times. Below is the header of the first redirect response when I try to download
"http://www.nytimes.com/glogin?URI=http://www.nytimes.com:80/2010/12/09/world/asia/09military.html&OQ=_rQ3D1Q26hp&OP=47c049a1Q2FVQ3EY6VQ5Dks9akk5Q27VQ27Q2AFQ2AVFQ27VQ2AtVQ3EkahQ5DVQ3F9Q5CQ3FVQ2AtbQ5ChQ5C5Q3Fa!Q2BN5bh":

    HTTP/1.1 302 Moved Temporarily
    Server: Sun-ONE-Web-Server/6.1
    Date: Thu, 09 Dec 2010 08:42:35 GMT
    Content-type: text/html
    Set-cookie: RMID=0b5d4aea392d4d00967bfaf1; expires=Friday, 09-Dec-2011 08:42:35 GMT; path=/; domain=.nytimes.com
    Set-cookie: NYT_GR=4d009b2b-yJ4V047ooAmPtGcvASTmng; path=/; domain=.nytimes.com
    Set-cookie: NYT-S=0Mzh9PJwQ663rDXrmvxADeHJOGvJvXmRaJdeFz9JchiAJK89nlVaR7bsV.Ynx4rkFI; expires=Saturday, 08-Jan-2011 08:42:35 GMT; path=/; domain=.nytimes.com
    Set-cookie: NYT-Pref=hppznw|^creator|NYTD.Cookies; path=/; domain=.nytimes.com
    Location: http://www.nytimes.com:80/2010/12/09/world/asia/09military.html?_r=1&hp
    Expires: Thu, 01 Dec 1994 16:00:00 GMT
    Cache-control: no-cache
    Pragma: no-cache
    Connection: close

But with Java, I get redirected only once, to an https:// page, and it's a dead end. Below is the result of java.net.URLConnection.getHeaderFields():

    HTTP/1.1 301 Moved Permanently,
    Date: Thu, 09 Dec 2010 10:50:53 GMT,
    Content-type: text/html,
    Content-length: 0,
    Location: https://myaccount.nytimes.com/auth/login?URI=/2010/12/09/world/asia/09military.html&OQ=_rQ3D5Q26hp&REFUSE_COOKIE_ERROR=SHOW_ERROR,
    Server: Sun-ONE-Web-Server/6.1,

There is a clear difference between the two. I don't know why, and it's been driving me crazy.
My guess is that the C++ write() function can create some kind of cookie by itself, but Java's URL.openStream() can't. Am I right? Or can anyone explain this to me?
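On the cookie guess: a socket write() only sends the bytes the program passes to it, so it cannot create a cookie by itself. Cookies arrive in the Set-cookie response headers shown above, and a client only behaves statefully if it copies them into a Cookie header on its next request. The sketch below shows that single step in Java. The names CookieEcho and cookieHeader are hypothetical, and the helper deliberately ignores expiry, path, and domain rules, so treat it as an illustration rather than a complete cookie implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: builds one Cookie request header value from Set-cookie values. */
final class CookieEcho {

    static String cookieHeader(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        for (String value : setCookieValues) {
            // Only the leading name=value pair is sent back to the server;
            // attributes such as expires, path and domain stay on the client.
            int semicolon = value.indexOf(';');
            String pair = (semicolon >= 0 ? value.substring(0, semicolon) : value).trim();
            if (!pair.isEmpty()) {
                pairs.add(pair);
            }
        }
        return String.join("; ", pairs);
    }

    public static void main(String[] args) {
        // Two of the Set-cookie values from the 302 response quoted in the question.
        List<String> fromResponse = Arrays.asList(
            "RMID=0b5d4aea392d4d00967bfaf1; expires=Friday, 09-Dec-2011 08:42:35 GMT; path=/; domain=.nytimes.com",
            "NYT_GR=4d009b2b-yJ4V047ooAmPtGcvASTmng; path=/; domain=.nytimes.com");
        System.out.println("Cookie: " + cookieHeader(fromResponse));
    }
}
```

A client that never sends such a header looks to the server like one that refuses cookies, which may be what the REFUSE_COOKIE_ERROR parameter in the Java trace reflects; that reading is a guess from the URL, not something the thread confirms.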
-
Re: Question from a Desperate Java Newbie
Hemanth Yamijala 2010-12-10, 02:33
Not exactly what you may want, but could you try using an HTTP client in Java? Some of them can automatically follow redirects, manage cookies, etc.

Thanks
hemanth
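As one illustration of a client that follows redirects and manages cookies, the JDK's own java.net classes can be configured to do both. This is a minimal sketch, not the specific client suggested in the thread: the class name StatefulFetch, the example.com URL, and the User-Agent string are placeholders. In a plain standalone JVM no CookieHandler is normally installed, so HttpURLConnection neither stores nor returns cookies until one is set.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical wrapper around the JDK's HTTP support. */
final class StatefulFetch {

    /** Installs a JVM-wide, in-memory cookie store that accepts every cookie. */
    static void enableCookies() {
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
    }

    /** Prepares a GET request; nothing is sent until the response is read. */
    static HttpURLConnection prepare(String url) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setInstanceFollowRedirects(true);                        // same-protocol redirects only
            conn.setRequestProperty("User-Agent", "example-crawler/0.1"); // placeholder value
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        enableCookies();
        HttpURLConnection conn = prepare(args.length > 0 ? args[0] : "http://example.com/");
        try (InputStream in = conn.getInputStream()) {
            // getURL() reports the final URL after any redirects that were followed.
            System.out.println(conn.getResponseCode() + " " + conn.getURL());
        }
    }
}
```

Dedicated HTTP client libraries offer finer control over cookie policy and redirect limits; the sketch only shows that the two features are separate switches that both have to be turned on.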
-
RE: Question from a Desperate Java Newbie
Santosh Borse 2010-12-10, 04:50
You can use the open-source wget as well.
-
Re: Question from a Desperate Java Newbie
edward choi 2010-12-10, 07:29
I would, but I am trying to integrate the crawler with Hadoop, so I wanted to write it in Java :-)
-
Re: Question from a Desperate Java Newbie
edward choi 2010-12-10, 07:33
Are you talking about java.net.HttpURLConnection? If so, I've already tried that with getInputStream(), but still no luck.

I actually got an interesting answer from Aardvark, which said that the NY Times has a policy called "Read once for free". So apparently I crawled with the C++ application first and blew my chance to crawl with the Java application. The answerer was not sure about this policy, but I think it makes sense, because today I tried the Java crawler first and it worked just fine.

Ed
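One detail that may matter for the HttpURLConnection attempt: the standard JDK implementation follows redirects automatically only when the protocol stays the same, so a Location header that points from http:// to https:// (as in the Java trace in the original question) is returned to the caller instead of being followed. The sketch below handles each hop by hand. RedirectSteps, resolve, changesProtocol, and follow are hypothetical names, and the loop does not carry cookies unless a CookieHandler has been installed separately.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical helper that walks a redirect chain one hop at a time. */
final class RedirectSteps {

    /** Resolves a Location value (absolute or relative) against the current URL. */
    static URI resolve(URI current, String location) {
        // Throws IllegalArgumentException if the server sent characters a URI may not contain.
        return current.resolve(location);
    }

    /** True for hops such as http -> https, which HttpURLConnection will not follow itself. */
    static boolean changesProtocol(URI from, URI to) {
        return !from.getScheme().equalsIgnoreCase(to.getScheme());
    }

    /** Follows up to maxHops redirects and returns the last URL reached. */
    static URI follow(URI start, int maxHops) throws IOException {
        URI current = start;
        for (int hop = 0; hop < maxHops; hop++) {
            HttpURLConnection conn = (HttpURLConnection) current.toURL().openConnection();
            conn.setInstanceFollowRedirects(false);   // handle every hop ourselves
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (status < 300 || status >= 400 || location == null) {
                return current;                       // not a redirect: stop here
            }
            URI next = resolve(current, location);
            System.out.println(status + " -> " + next
                + (changesProtocol(current, next) ? " (protocol change)" : ""));
            current = next;
        }
        throw new IOException("gave up after " + maxHops + " redirects");
    }

    public static void main(String[] args) throws IOException {
        URI start = URI.create(args.length > 0 ? args[0] : "http://example.com/");
        System.out.println("final: " + follow(start, 10));
    }
}
```

Printing each hop this way makes it easier to compare the Java chain with the roughly ten redirects seen from the C++ program.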
-
Re: Question from a Desperate Java Newbie
Edward Choi 2010-12-10, 09:08
I was wrong. It wasn't because of the "read once free" policy. I tried with Java first again, and this time it didn't work.

I searched Google and found the HTTP client you mentioned. It is the one provided by Apache, right? I guess I will have to try that one now. Thanks!
-
Re: Question from a Desperate Java Newbie
Steve Loughran 2010-12-15, 17:18
HttpClient is good. HtmlUnit also has a very good client that can simulate a full web browser, cookies included, but that may be overkill.

NYT's read-once policy uses cookies to verify that this is your first day visiting without being logged in; on later days you get 302'd unless you delete the cookie, so stateful clients are bad.

What you may have been hit by is whatever robot trap they have: if you generate too much load and don't follow the robots.txt rules, they may detect this and push back.
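The robots.txt point can be sketched as a check that runs before each fetch. The version below is deliberately simplified and rests on stated assumptions: it reads only Disallow lines in groups addressed to "User-agent: *", uses plain prefix matching, and ignores Allow lines, wildcards, and groups that list several user agents. RobotsCheck and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical, simplified robots.txt reader (wildcard user-agent groups only). */
final class RobotsCheck {

    /** Collects the Disallow path prefixes that apply to every crawler. */
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inWildcardGroup = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.replaceAll("#.*", "").trim();   // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;                                     // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inWildcardGroup = value.equals("*");
            } else if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                prefixes.add(value);
            }
        }
        return prefixes;
    }

    /** A path is allowed when no collected prefix matches its start. */
    static boolean isAllowed(String path, List<String> prefixes) {
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up rules for illustration; not the real nytimes.com robots.txt.
        String sample = "User-agent: *\nDisallow: /private/\n";
        List<String> rules = disallowedPrefixes(sample);
        System.out.println(isAllowed("/feeds/world.xml", rules));
        System.out.println(isAllowed("/private/page.html", rules));
    }
}
```

Pausing between requests (for example with Thread.sleep) addresses the load half of the advice; a suitable delay depends on the site and is not stated in the thread.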
-
Re: Question from a Desperate Java Newbie
edward choi 2010-12-16, 06:14
I fully obey robots.txt, since I am only fetching RSS feeds :-)

I implemented my crawler with HttpClient and it is working fine. I often get "Cookie rejected" messages, but I am able to fetch news articles anyway. I guess the default "java.net" client is the stateful client you mentioned. Thanks for the tip!!

Ed
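The thread does not say why HttpClient reported "Cookie rejected". One common cause in cookie-handling libraries is a Domain attribute that fails the domain-match rule for the host that sent it. The sketch below illustrates that rule with the JDK's java.net.HttpCookie.domainMatches, which implements the RFC 2965 check; Apache HttpClient applies its own validation, which may differ, so this is an assumption about the kind of check involved rather than a diagnosis. CookieDomainCheck and accepts are hypothetical names.

```java
import java.net.HttpCookie;

/** Hypothetical wrapper illustrating cookie domain matching with the JDK. */
final class CookieDomainCheck {

    /** True when a cookie with this Domain attribute would match the requesting host. */
    static boolean accepts(String cookieDomain, String requestHost) {
        return HttpCookie.domainMatches(cookieDomain, requestHost);
    }

    public static void main(String[] args) {
        // Matches: the host is one label below the cookie domain.
        System.out.println(accepts(".nytimes.com", "www.nytimes.com"));
        // No match: a different site entirely.
        System.out.println(accepts(".nytimes.com", "www.example.com"));
        // No match under RFC 2965: the host prefix "a.b" itself contains a dot.
        System.out.println(accepts(".nytimes.com", "a.b.nytimes.com"));
    }
}
```

A rejected cookie is simply not stored, which would be consistent with the articles still being fetched.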