Hadoop >> mail # user >> Question from a Desperate Java Newbie


Re: Question from a Desperate Java Newbie
I would, but I am trying to integrate the crawler with Hadoop, so I wanted
to write in Java :-)

2010/12/10 Santosh Borse <[EMAIL PROTECTED]>

> You can use open source wget as well.
>
> -----Original Message-----
> From: Hemanth Yamijala [mailto:[EMAIL PROTECTED]]
> Sent: Friday, December 10, 2010 8:04 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Question from a Desperate Java Newbie
>
> Not exactly what you may want, but could you try using an HTTP client
> in Java? Some of them can automatically follow redirects, manage
> cookies, etc.
>
> Thanks
> hemanth
>
> On Thu, Dec 9, 2010 at 4:35 PM, edward choi <[EMAIL PROTECTED]> wrote:
> > Excuse me for asking a general Java question here.
> > I tried to find a Java mailing list through Google, but none of them were
> > active.
> >
> > There is a problem that's been driving me crazy for a while.
> >
> > I am trying to download webpages from the New York Times.
> > With Java URL.openStream(), I can't get past the login requirement.
> > But with C++ socket programming (using read() and write()), I can
> > download any webpage just fine.
> >
> > The interesting thing is that with C++, I get redirected about 10 times.
> > Below is the header of the first redirect response when I try to download
> > "http://www.nytimes.com/glogin?URI=http://www.nytimes.com:80/2010/12/09/world/asia/09military.html&OQ=_rQ3D1Q26hp&OP=47c049a1Q2FVQ3EY6VQ5Dks9akk5Q27VQ27Q2AFQ2AVFQ27VQ2AtVQ3EkahQ5DVQ3F9Q5CQ3FVQ2AtbQ5ChQ5C5Q3Fa!Q2BN5bh":
> >
> > HTTP/1.1 302 Moved Temporarily
> > Server: Sun-ONE-Web-Server/6.1
> > Date: Thu, 09 Dec 2010 08:42:35 GMT
> > Content-type: text/html
> > Set-cookie: RMID=0b5d4aea392d4d00967bfaf1; expires=Friday, 09-Dec-2011 08:42:35 GMT; path=/; domain=.nytimes.com
> > Set-cookie: NYT_GR=4d009b2b-yJ4V047ooAmPtGcvASTmng; path=/; domain=.nytimes.com
> > Set-cookie: NYT-S=0Mzh9PJwQ663rDXrmvxADeHJOGvJvXmRaJdeFz9JchiAJK89nlVaR7bsV.Ynx4rkFI; expires=Saturday, 08-Jan-2011 08:42:35 GMT; path=/; domain=.nytimes.com
> > Set-cookie: NYT-Pref=hppznw|^creator|NYTD.Cookies; path=/; domain=.nytimes.com
> > Location: http://www.nytimes.com:80/2010/12/09/world/asia/09military.html?_r=1&hp
> > Expires: Thu, 01 Dec 1994 16:00:00 GMT
> > Cache-control: no-cache
> > Pragma: no-cache
> > Connection: close
> >
> > But with Java, I get redirected only once, to an https:// page, and it's
> > a dead end. Below is the result of java.net.URLConnection.getHeaderFields():
> >
> > HTTP/1.1 301 Moved Permanently,
> > Date: Thu, 09 Dec 2010 10:50:53 GMT,
> > Content-type: text/html,
> > Content-length: 0,
> > Location: https://myaccount.nytimes.com/auth/login?URI=/2010/12/09/world/asia/09military.html&OQ=_rQ3D5Q26hp&REFUSE_COOKIE_ERROR=SHOW_ERROR,
> > Server: Sun-ONE-Web-Server/6.1,
> >
> > There is a clear difference between the two. I don't know why, and it's
> > been driving me crazy.
> > My guess is that the C++ write() function can create some kind of cookie
> > by itself, but Java's URL.openStream() can't.
> >
> > Am I right? Or can anyone explain this to me?
> >
>
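
A possible sketch of the "HTTP client in Java" suggestion above, using only the JDK. It rests on two points: URL.openStream() keeps no cookies unless a default CookieHandler has been installed, and HttpURLConnection does not automatically follow a redirect that switches protocol (http to https). The REFUSE_COOKIE_ERROR parameter in the quoted Location header suggests the server expected the Set-cookie values to be sent back, but that is a guess. The class name CookieAwareFetcher, the helper names, and the MAX_REDIRECTS cap are made up for illustration, and this has not been tried against nytimes.com.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper: follows redirects by hand while the JDK stores cookies.
class CookieAwareFetcher {

    // Arbitrary cap (an assumption) so a redirect loop cannot run forever.
    private static final int MAX_REDIRECTS = 20;

    /** True for the 3xx status codes that normally carry a Location header. */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    /** Resolves a Location value (absolute or relative) against the current URL. */
    static URL resolve(URL current, String location) throws IOException {
        return new URL(current, location);
    }

    /** Opens the final page, following redirects and replaying stored cookies. */
    static InputStream open(String start) throws IOException {
        // Install a JVM-wide in-memory cookie store; HttpURLConnection uses it
        // to record Set-Cookie headers and send them back on later requests.
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));

        URL url = new URL(start);
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // Handle redirects ourselves so http -> https hops are followed too.
            conn.setInstanceFollowRedirects(false);
            int status = conn.getResponseCode();
            if (!isRedirect(status)) {
                return conn.getInputStream();
            }
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                throw new IOException("Redirect without Location header from " + url);
            }
            url = resolve(url, location);
        }
        throw new IOException("More than " + MAX_REDIRECTS + " redirects");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java CookieAwareFetcher <url>");
            return;
        }
        try (InputStream in = open(args[0])) {
            // readAllBytes() needs Java 9 or later.
            System.out.println("Read " + in.readAllBytes().length + " bytes");
        }
    }
}
```

Because this uses plain java.net classes, it should drop into a Hadoop job without extra dependencies. A dedicated client library such as Apache HttpClient, which does the same bookkeeping, is another option.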