Hadoop, mail # user - Re: Applications creates bigger output than input?


Re: Applications creates bigger output than input?
Ted Dunning 2011-04-30, 20:32
Cooccurrence analysis is commonly used in recommendations.  These jobs
produce large intermediate outputs.
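
To make the blow-up concrete, here is a minimal sketch in plain Python (not
Mahout's implementation; the function name and the sample baskets are
hypothetical). It assumes the simplest form of cooccurrence counting, where
every pair of distinct items in a basket is emitted as one intermediate
record, so a basket of n items yields n*(n-1)/2 records.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_pairs(baskets):
    """Yield one (item_a, item_b) record per pair of distinct items in each basket."""
    for basket in baskets:
        # Sort and de-duplicate so that each pair has one canonical order.
        for a, b in combinations(sorted(set(basket)), 2):
            yield (a, b)


# Hypothetical sample input: three small "baskets" of items.
baskets = [
    ["milk", "bread", "eggs", "butter", "jam"],
    ["bread", "butter", "jam", "tea"],
    ["milk", "eggs"],
]

input_items = sum(len(b) for b in baskets)   # number of input records
pairs = list(cooccurrence_pairs(baskets))    # intermediate records
counts = Counter(pairs)                      # what a reduce step would aggregate

print("input items:", input_items)
print("intermediate pairs:", len(pairs))
print("most common pairs:", counts.most_common(3))
```

The intermediate record count grows quadratically with basket size, which is
why the effect is much stronger on real data than on this toy input.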

Come on over to the Mahout project if you would like to talk to a bunch of
people who work on these problems.

On Fri, Apr 29, 2011 at 9:31 PM, elton sky <[EMAIL PROTECTED]> wrote:

> Thank you for the suggestions:
>
> Weblog analysis, market basket analysis and generating a search index.
>
> I guess these applications need more reduces than maps to handle the
> large intermediate output, don't they? Besides, the input split for each
> map should be smaller than usual, because the spill and merge workload on
> the map's local disk is heavy.
>
> -Elton
>
> On Sat, Apr 30, 2011 at 11:22 AM, Owen O'Malley <[EMAIL PROTECTED]>
> wrote:
>
> > On Fri, Apr 29, 2011 at 5:02 AM, elton sky <[EMAIL PROTECTED]> wrote:
> >
> > > For benchmarking purposes, I am looking for some non-trivial, real-life
> > > applications which create *bigger* output than their input. A trivial
> > > example I can think of is a cross join...
> > >
> >
> > As you say, almost all cross join jobs have that property. The other case
> > that almost always fits into that category is generating an index. For
> > example, if your input is a corpus of documents and you want to generate
> > the list of documents that contain each word, the output (and especially
> > the shuffle data) is much larger than the input.
> >
> > -- Owen
> >
>
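
Owen's indexing example quoted above can be sketched in plain Python. This is
only an illustration of the map/shuffle/reduce shape, not Hadoop code; the
function names and the sample documents are hypothetical. The map step emits
one (word, doc_id) record per word occurrence, so the shuffle carries a
document id alongside every word of the input.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (word, doc_id) record for every word occurrence in the document."""
    for word in text.lower().split():
        yield word, doc_id


def build_index(docs):
    """Return the simulated shuffle records and the resulting inverted index."""
    shuffle = []
    for doc_id, text in docs.items():
        shuffle.extend(map_phase(doc_id, text))

    # Reduce step: group document ids by word.
    postings = defaultdict(set)
    for word, doc_id in shuffle:
        postings[word].add(doc_id)
    index = {word: sorted(ids) for word, ids in postings.items()}
    return shuffle, index


# Hypothetical sample corpus.
docs = {
    "d1": "hadoop stores big data",
    "d2": "hadoop processes big data",
    "d3": "mahout runs on hadoop",
}

shuffle, index = build_index(docs)
print("shuffle records:", len(shuffle))
print("hadoop ->", index["hadoop"])
```

Each shuffle record repeats the document id for a single word, which is one
way to see why the shuffle data tends to outgrow the raw text.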