Hadoop, mail # user - Question from a Desperate Java Newbie


Re: Question from a Desperate Java Newbie
edward choi 2010-12-10, 07:29
I would, but I am trying to integrate the crawler with Hadoop, so I wanted
to write it in Java :-)

2010/12/10 Santosh Borse <[EMAIL PROTECTED]>

> You can use open source wget as well.
>
> -----Original Message-----
> From: Hemanth Yamijala [mailto:[EMAIL PROTECTED]]
> Sent: Friday, December 10, 2010 8:04 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Question from a Desperate Java Newbie
>
> Not exactly what you may want, but could you try using an HTTP client
> in Java? Some of them can automatically follow redirects, manage
> cookies, etc.
>
> Thanks
> hemanth
>
> On Thu, Dec 9, 2010 at 4:35 PM, edward choi <[EMAIL PROTECTED]> wrote:
> > Excuse me for asking a general Java question here.
> > I tried to find a Java mailing list through Google, but none of them
> > were active.
> >
> > There is a problem that's been driving me crazy for a while.
> >
> > I am trying to download webpages from New York Times.
> > With Java URL.openStream(), I can't get past the login requirement.
> > But with C++ socket programming (using read() and write()), I can
> > download any webpage just fine.
> >
> > Interestingly, with C++ I get redirected about 10 times. Below is the
> > header of the first redirect response when I try to download
> > "http://www.nytimes.com/glogin?URI=http://www.nytimes.com:80/2010/12/09/world/asia/09military.html&OQ=_rQ3D1Q26hp&OP=47c049a1Q2FVQ3EY6VQ5Dks9akk5Q27VQ27Q2AFQ2AVFQ27VQ2AtVQ3EkahQ5DVQ3F9Q5CQ3FVQ2AtbQ5ChQ5C5Q3Fa!Q2BN5bh"
> >
> > HTTP/1.1 302 Moved Temporarily
> > Server: Sun-ONE-Web-Server/6.1
> > Date: Thu, 09 Dec 2010 08:42:35 GMT
> > Content-type: text/html
> > Set-cookie: RMID=0b5d4aea392d4d00967bfaf1; expires=Friday, 09-Dec-2011 08:42:35 GMT; path=/; domain=.nytimes.com
> > Set-cookie: NYT_GR=4d009b2b-yJ4V047ooAmPtGcvASTmng; path=/; domain=.nytimes.com
> > Set-cookie: NYT-S=0Mzh9PJwQ663rDXrmvxADeHJOGvJvXmRaJdeFz9JchiAJK89nlVaR7bsV.Ynx4rkFI; expires=Saturday, 08-Jan-2011 08:42:35 GMT; path=/; domain=.nytimes.com
> > Set-cookie: NYT-Pref=hppznw|^creator|NYTD.Cookies; path=/; domain=.nytimes.com
> > Location: http://www.nytimes.com:80/2010/12/09/world/asia/09military.html?_r=1&hp
> > Expires: Thu, 01 Dec 1994 16:00:00 GMT
> > Cache-control: no-cache
> > Pragma: no-cache
> > Connection: close
> >
> > But with Java, I get redirected only once, to an https:// page, and it's
> > a dead end. Below is the result of java.net.URLConnection.getHeaderFields():
> >
> > HTTP/1.1 301 Moved Permanently,
> > Date: Thu, 09 Dec 2010 10:50:53 GMT,
> > Content-type: text/html,
> > Content-length: 0,
> > Location: https://myaccount.nytimes.com/auth/login?URI=/2010/12/09/world/asia/09military.html&OQ=_rQ3D5Q26hp&REFUSE_COOKIE_ERROR=SHOW_ERROR,
> > Server: Sun-ONE-Web-Server/6.1,
> >
> > There is a clear difference between the two. I don't know why, and it's
> > been driving me crazy.
> > My guess is that the C++ write() function can create some kind of cookie
> > by itself, but Java's URL.openStream() can't.
> >
> > Am I right? Or can anyone explain this for me?
> >
>
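For anyone who wants to stay in plain Java, here is a minimal sketch of the "HTTP client that follows redirects and manages cookies" idea suggested above, using only java.net. It rests on an assumption the thread does not confirm: that the REFUSE_COOKIE_ERROR parameter in the Java-side Location header means the server expected the earlier Set-cookie values to be sent back. The class name RedirectSketch, the helper names, and the hop limit of 20 are made up for illustration.

```java
import java.io.IOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class RedirectSketch {

    // Hypothetical helper, shown only to illustrate what a raw-socket client
    // has to do by hand: keep the leading name=value pair of each Set-Cookie
    // value and join the pairs into a single Cookie request header.
    static String cookieHeaderFrom(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        for (String value : setCookieValues) {
            String pair = value.split(";", 2)[0].trim();
            if (!pair.isEmpty()) {
                pairs.add(pair);
            }
        }
        return String.join("; ", pairs);
    }

    // Follows redirects one hop at a time so that a switch from http:// to
    // https:// is followed as well. Cookies are stored and replayed by the
    // CookieManager installed in main(), not by this method.
    static HttpURLConnection openFollowingRedirects(String address, int maxHops)
            throws IOException {
        URL url = new URL(address);
        for (int hop = 0; hop <= maxHops; hop++) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setInstanceFollowRedirects(false);
            int status = conn.getResponseCode();
            boolean redirect = status == 301 || status == 302
                    || status == 303 || status == 307;
            String location = conn.getHeaderField("Location");
            if (!redirect || location == null) {
                return conn; // final response; caller reads the body
            }
            conn.disconnect();
            url = new URL(url, location); // also resolves relative Location values
        }
        throw new IOException("Gave up after " + maxHops + " redirects");
    }

    public static void main(String[] args) throws IOException {
        // Install a JVM-wide cookie store; ACCEPT_ALL is an assumption chosen
        // for the sketch, a stricter policy may be more appropriate.
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
        HttpURLConnection conn = openFollowingRedirects(args[0], 20);
        System.out.println(conn.getResponseCode() + " " + conn.getURL());
    }
}
```

Two points on the design. Without a CookieHandler installed, URL.openStream() sends no cookies at all, which would fit the guess in the question, except that the missing piece is on the Java side: write() on a socket sends only the bytes the program hands to it. The redirect loop is manual because HttpURLConnection's built-in redirect following does not, as far as I know, cross from http to https.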