Hadoop >> mail # user >> Is it possible to write file output in Map phase once and write another file output in Reduce phase?


Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?
Excuse me, but could I ask one more question?
Can I run Bixo on a cluster other than Amazon EC2?
I am already running a Hadoop cluster of my own, so I'd like to run Bixo
on top of my cluster.
But I don't see how to do that on Bixo's "Getting Started" page.
All I see are "running locally", "running locally with Eclipse", and
"running in Amazon EC2".

Ed

On December 11, 2010, at 4:34 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> If you are only loading articles at that rate, I would suggest that a
> simple Java, Perl, or Ruby program would be MUCH easier to write and
> debug than a full-on map-reduce program.
>
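For volumes this low, Ted's "simple program" really can be a single short class. Below is a minimal sketch in Java, assuming a feeds.txt file with one RSS URL per line and a local articles/ output directory (both names are illustrative, not from the thread):

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Minimal single-process fetcher: read feed URLs, save each response
// to a local file. No job scheduling, no map-reduce overhead.
public class SimpleFeedFetcher {
    public static void main(String[] args) throws IOException {
        List<String> feeds = Files.readAllLines(Paths.get("feeds.txt")); // assumed input: one URL per line
        Path outDir = Files.createDirectories(Paths.get("articles"));
        int i = 0;
        for (String feed : feeds) {
            try (InputStream in = new URL(feed).openStream()) {
                // Save the raw feed; parsing/extraction can happen later.
                Files.copy(in, outDir.resolve("feed-" + (i++) + ".xml"),
                           StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException e) {
                System.err.println("Failed to fetch " + feed + ": " + e.getMessage());
            }
        }
    }
}

A cron entry (or any scheduler) can run this every 30 minutes, which is exactly the kind of setup Ted is contrasting with a full map-reduce job.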
> 2010/12/10 Edward Choi <[EMAIL PROTECTED]>
>
> > Thanks for the advice. But my plan is to crawl news RSS feeds every 30
> > minutes, so I'd be downloading at most 5 to 10 news articles per map
> > task (since news isn't published that often). So I guess I won't have
> > to worry too much about the crawling delay.
> > I thought it would be a good idea to build a dictionary during the
> > crawling process, because I will need a dictionary to calculate tf-idf
> > and I didn't want to go through the whole repository every time a news
> > article is added.
> > If I crawl and build the dictionary at the same time, all I need to do
> > is merge the new entries (which are generated every 30 minutes) with
> > the existing dictionary, which I guess will be computationally cheap.
> >
> > Ed
> >
> > From mp2893's iPhone
> >
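The incremental merge Edward describes is indeed cheap: folding a batch of new (term, frequency) pairs into the existing dictionary is linear in the size of the batch, independent of how large the accumulated dictionary has grown. A sketch, assuming the dictionary is held as a term-to-count map (names are illustrative):

import java.util.Map;

// Fold the counts from the latest 30-minute crawl into the existing
// dictionary. Cost is proportional to the new batch only, so it stays
// cheap no matter how big the accumulated dictionary gets.
public class DictionaryMerge {
    public static void mergeInto(Map<String, Long> dictionary,
                                 Map<String, Long> newCounts) {
        for (Map.Entry<String, Long> e : newCounts.entrySet()) {
            dictionary.merge(e.getKey(), e.getValue(), Long::sum);
        }
    }
}

The merged dictionary can then supply the corpus-wide statistics for tf-idf without rescanning the whole repository after each crawl.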
> > On 2010. 12. 11., at 3:42 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >
> > > Regarding the idea of doing word counts during the crawl, I think you
> > > are motivated by the best of principles (read input only once), but
> > > in practice, you will be doing many small crawls and saving the
> > > content.  Word counting should probably not be tied too closely to
> > > the crawl because the crawl can be delayed arbitrarily.  Better to
> > > have a good content repository that is updated as often as crawls
> > > complete and run other processing against the repository whenever it
> > > seems like a good idea.
> > >
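With Ted's decoupling, word counting becomes an ordinary job pointed at the repository directory, run on whatever schedule suits, independent of crawl timing. A sketch using the standard Hadoop word-count pattern; the /crawl/* paths are assumptions for illustration:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A word count run as a separate job over the content repository,
// on its own schedule, rather than inside the crawl itself.
public class RepositoryWordCount {
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "repository word count");
        job.setJarByClass(RepositoryWordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/crawl/repository"));  // assumed repository path
        FileOutputFormat.setOutputPath(job, new Path("/crawl/wordcounts"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}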
> > > 2010/12/10 Edward Choi <[EMAIL PROTECTED]>
> > >
> > >> Thanks for the tip. I guess it's a slightly different project from
> > >> Nutch. My understanding is that while Nutch tries to implement a
> > >> whole web search package, Bixo focuses on the crawling part. I should
> > >> look into both projects more deeply. Thanks again!!
> > >>
> > >> Ed
> > >>
> > >> From mp2893's iPhone
> > >>
> > >> On 2010. 12. 11., at 1:15 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> > >>
> > >>> That is definitely possible, but may not be very desirable.
> > >>>
> > >>> Take a look at the Bixo project for a full-scale crawler.  There is
> > >>> a lot of subtlety in the fetching of URLs due to the varying quality
> > >>> of different sites and the interaction with crawl choking due to
> > >>> robots.txt considerations.
> > >>>
> > >>> http://bixo.101tec.com/
> > >>>
> > >>> On Thu, Dec 9, 2010 at 11:27 PM, edward choi <[EMAIL PROTECTED]> wrote:
> > >>>
> > >>>> So my design is:
> > >>>> Map phase ==> crawl news articles, process text, write the result
> > >>>> to a file.
> > >>>>      II
> > >>>>      II     pass (term, term_frequency) pairs to the Reducer
> > >>>>      II
> > >>>>      V
> > >>>> Reduce phase ==> merge the (term, term_frequency) pairs and create
> > >>>> a dictionary
> > >>>>
> > >>>> Is this at all possible? Or is it inherently impossible due to the
> > >>>> structure of Hadoop?
> > >>>>
> > >>
> >
>
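As for the original question in the subject line: yes, this is possible in Hadoop. One common way (a sketch of one approach, not the only one) is MultipleOutputs: the mapper writes the processed article text to a named side output during the map phase while emitting (term, 1) pairs, and the reducer's normal output becomes the dictionary. Writing directly to HDFS from the mapper via the FileSystem API would also work. Class, field, and path names below are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Map phase writes one file (the processed articles, via a named
// output); reduce phase writes another (the merged dictionary, via
// the job's normal output).
public class CrawlDictionaryJob {
    public static class CrawlMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private MultipleOutputs<Text, IntWritable> mos;
        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String article = value.toString();  // assume one article's text per record
            // Map-phase file output: save the processed article text
            // to the "articles" named output.
            mos.write("articles", new Text(key.toString()), new Text(article));
            // Normal map output: (term, 1) pairs for the reducer.
            for (String term : article.toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) context.write(new Text(term), ONE);
            }
        }
        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            mos.close();  // flush the side files
        }
    }
    public static class DictionaryReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text term, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(term, new IntWritable(sum));  // reduce-phase output: the dictionary
        }
    }
    // In the driver, the named output must be registered, e.g.:
    // MultipleOutputs.addNamedOutput(job, "articles",
    //     TextOutputFormat.class, Text.class, Text.class);
}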