Hadoop, mail # user - Is it possible to write file output in Map phase once and write another file output in Reduce phase?


edward choi 2010-12-10, 07:27
Hi,

I'm trying to crawl a large number of news sites.
My plan is to create a file that lists every news RSS feed URL together with
the path where the crawled articles should be saved.
It would look like this:

nytimes_nation,    /user/hadoop/nytimes
nytimes_sports,    /user/hadoop/nytimes
latimes_world,      /user/hadoop/latimes
latimes_nation,     /user/hadoop/latimes
...
...
...

Each mapper would get a single line, crawl the assigned URL, process the
text, and save the result.
So the crawling itself does not need a Reduce phase.

But I'd also like to build a dictionary at the same time.
This could definitely take advantage of the Reduce phase. Each mapper can
emit output as "KEY: term, VALUE: term_frequency".
The Reducers can then merge the pairs and create a dictionary. (Of course
I would be using many Reducers, so the dictionary would be partitioned.)
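
To make the dictionary step concrete, here is a minimal sketch in plain Java. It does not use the Hadoop API at all; it only imitates the idea of mappers counting terms and several reducers each merging their own share. The class name, the method names, the tokenization rule, and the hash-based partitioning are all my own assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch (no Hadoop API) of the dictionary step; all names are hypothetical.
class TermDictionarySketch {

    // "Map" side: count how often each term occurs in one article's text.
    // Assumption: terms are lowercase runs of letters and digits.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Pick the reducer for a term by hashing it (an assumed partitioning rule).
    static int partitionFor(String term, int numReducers) {
        return (term.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // "Reduce" side: each partition sums the frequencies of the terms routed to it.
    static List<Map<String, Integer>> buildDictionary(List<String> articles, int numReducers) {
        List<Map<String, Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new TreeMap<>());
        }
        for (String article : articles) {
            for (Map.Entry<String, Integer> e : termFrequencies(article).entrySet()) {
                partitions.get(partitionFor(e.getKey(), numReducers))
                          .merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> articles = Arrays.asList("Rain hits the coast", "The coast road reopens");
        List<Map<String, Integer>> parts = buildDictionary(articles, 2);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("partition " + i + ": " + parts.get(i));
        }
    }
}
```

Every term lands in exactly one partition, so the pieces together form the whole dictionary without overlap.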

I know that I can do this by creating two separate jobs (one for crawling,
the other for making the dictionary), but I'd like to do this in one pass.

So my design is:
Map phase ==> crawl news articles, process text, write the result to a file.
        ||
        ||     pass (term, term_frequency) pairs to the Reducer
        ||
        V
Reduce phase ==> merge the (term, term_frequency) pairs and create a dictionary
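
The flow above can be sketched roughly as follows, again in plain Java rather than with the Hadoop API. The map step saves the processed text as a side effect and returns (term, 1) pairs; the reduce step sums them. The crawl itself is left out: the `text` argument stands in for an already crawled and processed article, and the class name, method names, and the `.txt` file naming are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch (no Hadoop API) of the one-pass design; all names are hypothetical.
class OnePassSketch {

    // Map step: write the article to <saveDir>/<feedName>.txt as a side effect,
    // then return one (term, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String feedName, String text, Path saveDir) {
        try {
            Files.createDirectories(saveDir);
            Files.write(saveDir.resolve(feedName + ".txt"),
                        text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(term, 1));
            }
        }
        return pairs;
    }

    // Reduce step: sum the counts for each term to form the dictionary.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> dictionary = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            dictionary.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return dictionary;
    }

    public static void main(String[] args) {
        Path saveDir = Paths.get(System.getProperty("java.io.tmpdir"), "onepass-demo");
        List<Map.Entry<String, Integer>> pairs =
                map("nytimes_nation", "storm hits coast storm", saveDir);
        System.out.println(reduce(pairs));
    }
}
```

The point of the sketch is only that the map step produces two things at once: a saved file and a stream of pairs for the reducer.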

Is this at all possible? Or is it inherently impossible due to the structure
of Hadoop?
If it's possible, could anyone tell me how to do it?

Ed.