Hadoop, mail # user - Is it possible to write file output in Map phase once and write another file output in Reduce phase?


Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?
edward choi 2010-12-11, 09:32
Excuse me, but could I ask one more question?
Can I run Bixo on a cluster other than Amazon EC2?
I am already running a Hadoop cluster of my own, so I'd like to run Bixo
on top of my cluster.
But I don't see how to do that on Bixo's "Getting Started" page.
All I see are "running locally", "running locally with Eclipse", and
"running in Amazon EC2".

Ed

On December 11, 2010 at 4:34 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> If you are only loading articles at that rate, I would suggest that a
> simple Java, Perl, or Ruby program would be MUCH easier to write and
> debug than a full-on map-reduce program.
>
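A minimal sketch of the kind of standalone fetcher Ted is suggesting, with no MapReduce involved. It assumes Java 11+ for java.net.http, and the feed URLs are hypothetical placeholders:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class SimpleFeedFetcher {
    public static void main(String[] args) throws Exception {
        List<String> feeds = List.of(
                "http://example.com/news/rss",    // hypothetical feed
                "http://example.org/world/rss");  // hypothetical feed
        HttpClient client = HttpClient.newHttpClient();
        for (String feed : feeds) {
            HttpRequest request =
                    HttpRequest.newBuilder(URI.create(feed)).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            // Parse the RSS XML and store each new article; omitted here.
            System.out.println(feed + " -> " + response.body().length() + " bytes");
        }
    }
}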
> 2010/12/10 Edward Choi <[EMAIL PROTECTED]>
>
> > Thanks for the advice. But my plan is to crawl news RSS feeds every 30
> > minutes, so I'd be downloading at most 5 to 10 news articles per map
> > task (since news isn't published that often). So I guess I won't have
> > to worry too much about the crawling delay.
> > I thought it would be a good idea to build a dictionary during the
> > crawling process, because I will need a dictionary to calculate tf-idf
> > and I didn't want to have to go through the whole repository every time
> > a news article is added.
> > If I crawl and build the dictionary at the same time, all I need to do
> > to maintain the dictionary is to merge the new entries (which are
> > generated every 30 minutes) into the existing dictionary, which I guess
> > will be computationally cheap.
> >
> > Ed
> >
> > From mp2893's iPhone
> >
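The merge Ed describes really is cheap: each 30-minute batch only touches the terms it actually contains, and tf-idf(t, d) = tf(t, d) * log(N / df(t)) only needs the accumulated document frequencies df(t) from that dictionary. A minimal in-memory sketch of the fold (the map-based representation is an assumption; on a cluster the same fold could be a small merge job):

import java.util.HashMap;
import java.util.Map;

public class DictionaryMerger {
    // Running dictionary: term -> document frequency across all crawls.
    private final Map<String, Long> dictionary = new HashMap<>();

    // Fold in the counts produced by one 30-minute crawl. Cost is
    // proportional to the size of the delta, not the whole repository.
    public void merge(Map<String, Long> delta) {
        for (Map.Entry<String, Long> e : delta.entrySet()) {
            dictionary.merge(e.getKey(), e.getValue(), Long::sum);
        }
    }

    public long frequency(String term) {
        return dictionary.getOrDefault(term, 0L);
    }
}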
> > On Dec 11, 2010, at 3:42 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >
> > > Regarding the idea of doing word counts during the crawl, I think
> > > you are motivated by the best of principles (read input only once),
> > > but in practice, you will be doing many small crawls and saving the
> > > content.  Word counting should probably not be tied too closely to
> > > the crawl because the crawl can be delayed arbitrarily.  Better to
> > > have a good content repository that is updated as often as crawls
> > > complete and run other processing against the repository whenever it
> > > seems like a good idea.
> > >
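A sketch of the decoupling Ted recommends, using the Hadoop FileSystem API: each finished crawl is published into a content repository directory, and word-count or tf-idf jobs scan the repository on their own schedule. The paths here are hypothetical:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CrawlPublisher {
    public static void publish(Path crawlOutput) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        // One directory per completed crawl; readers just list the parent.
        Path repoDir = new Path("/repository/news/" + System.currentTimeMillis());
        if (!fs.rename(crawlOutput, repoDir)) {   // atomic publish on HDFS
            throw new IOException("publish failed for " + crawlOutput);
        }
        // Word-count and tf-idf jobs scan /repository/news/* whenever they
        // run; a delayed crawl simply means one fewer subdirectory.
    }
}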
> > > 2010/12/10 Edward Choi <[EMAIL PROTECTED]>
> > >
> > >> Thanks for the tip. I guess it's a slightly different project from
> > >> Nutch. My understanding is that while Nutch tries to implement a
> > >> whole web search package, Bixo focuses on the crawling part. I
> > >> should look into both projects more deeply. Thanks again!!
> > >>
> > >> Ed
> > >>
> > >> From mp2893's iPhone
> > >>
> > >> On Dec 11, 2010, at 1:15 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> > >>
> > >>> That is definitely possible, but may not be very desirable.
> > >>>
> > >>> Take a look at the Bixo project for a full-scale crawler.  There
> > >>> is a lot of subtlety in the fetching of URLs due to the varying
> > >>> quality of different sites and the interaction with crawl choking
> > >>> due to robots.txt considerations.
> > >>>
> > >>> http://bixo.101tec.com/
> > >>>
> > >>> On Thu, Dec 9, 2010 at 11:27 PM, edward choi <[EMAIL PROTECTED]> wrote:
> > >>>
> > >>>> So my design is:
> > >>>>
> > >>>> Map phase ==> crawl news articles, process text, write the result
> > >>>> to a file.
> > >>>>      ||
> > >>>>      ||     pass (term, term_frequency) pairs to the Reducer
> > >>>>      ||
> > >>>>      V
> > >>>> Reduce phase ==> merge the (term, term_frequency) pairs and
> > >>>> create a dictionary
> > >>>>
> > >>>> Is this at all possible? Or is it inherently impossible due to
> > >>>> the structure of Hadoop?
> > >>>>
> > >>
> >
>
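To the question in the subject line: writing one file output from the map phase and another from the reduce phase is possible, and MultipleOutputs (in Hadoop's org.apache.hadoop.mapreduce.lib.output package) is one way to do it. The sketch below is an illustration under assumed input formats, tokenization, and output names, not anything posted in the thread: the mapper writes processed article text to a named "articles" output and emits (term, 1) pairs, while the reducer writes the merged dictionary to the job's regular output.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class CrawlAndCount {

    public static class CrawlMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private MultipleOutputs<Text, LongWritable> mos;
        private final LongWritable one = new LongWritable(1);

        @Override
        protected void setup(Context ctx) {
            mos = new MultipleOutputs<>(ctx);
        }

        @Override
        protected void map(LongWritable key, Text article, Context ctx)
                throws IOException, InterruptedException {
            // "articles" is the extra map-side file output.
            mos.write("articles", key, article);
            for (String term : article.toString().toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    ctx.write(new Text(term), one);  // goes to the reducer
                }
            }
        }

        @Override
        protected void cleanup(Context ctx)
                throws IOException, InterruptedException {
            mos.close();
        }
    }

    public static class DictReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text term, Iterable<LongWritable> counts,
                              Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) sum += c.get();
            ctx.write(term, new LongWritable(sum));  // the dictionary
        }
    }

    public static void configure(Job job) {
        // Register the map-side named output alongside the job's normal
        // output configuration (set elsewhere).
        MultipleOutputs.addNamedOutput(job, "articles",
                TextOutputFormat.class, LongWritable.class, Text.class);
    }
}

The map-side files land under the job's output directory with the named-output prefix, so the processed articles and the merged dictionary come out of a single job, as the quoted diagram asks.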