

Re: Applications creates bigger output than input?
Cooccurrence analysis is commonly used in recommenders, and it produces
large intermediate outputs.
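
To make the blow-up concrete, here is a minimal Python sketch (not Mahout's
implementation; the basket data and the function name cooccurrence_pairs are
made up for illustration). It emits one record per unordered item pair in
each basket, the way a map step might, so a basket of n items yields
n*(n-1)/2 intermediate records.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_pairs(baskets):
    """Map-like step: emit every unordered item pair within each basket."""
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical form.
        for a, b in combinations(sorted(set(basket)), 2):
            yield (a, b)


# Hypothetical input: two small baskets, 7 item records in total.
baskets = [
    ["milk", "bread", "eggs", "jam"],
    ["milk", "bread", "tea"],
]

pairs = list(cooccurrence_pairs(baskets))
counts = Counter(pairs)  # reduce-like step: count each pair

input_records = sum(len(b) for b in baskets)
intermediate_records = len(pairs)
print(input_records, intermediate_records)
```

The gap widens quickly with basket size, since the pair count grows
quadratically while the input grows linearly.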

Come on over to the Mahout project if you would like to talk to a bunch of
people who work on these problems.

On Fri, Apr 29, 2011 at 9:31 PM, elton sky <[EMAIL PROTECTED]> wrote:

> Thank you for the suggestions:
>
> Weblog analysis, market basket analysis and generating search index.
>
> I guess these applications need more reduces than maps to handle the
> large intermediate output, don't they? Besides, the input split for each
> map should be smaller than usual, because the spill and merge workload on
> the map's local disk is heavy.
>
> -Elton
>
> On Sat, Apr 30, 2011 at 11:22 AM, Owen O'Malley <[EMAIL PROTECTED]>
> wrote:
>
> > On Fri, Apr 29, 2011 at 5:02 AM, elton sky <[EMAIL PROTECTED]>
> wrote:
> >
> > > For benchmarking purposes, I am looking for some non-trivial, real-life
> > > applications which create *bigger* output than their input. A trivial
> > > example I can think of is a cross join...
> > >
> >
> > As you say, almost all cross join jobs have that property. The other case
> > that almost always fits into that category is generating an index. For
> > example, if your input is a corpus of documents and you want to generate
> > the list of documents that contain each word, the output (and especially
> > the shuffle data) is much larger than the input.
> >
> > -- Owen
> >
>
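
The index case Owen describes above can be sketched in a few lines of Python.
This is only an illustration under assumed names (map_postings, build_index)
and toy documents, not Hadoop code: every word occurrence becomes a
(word, doc_id) record, so the shuffle carries the document id once per word.

```python
from collections import defaultdict


def map_postings(doc_id, text):
    """Map-like step: emit one (word, doc_id) record per word occurrence."""
    for word in text.lower().split():
        yield word, doc_id


def build_index(docs):
    """Reduce-like step: collect the sorted, unique doc ids for each word."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word, d in map_postings(doc_id, text):
            postings[word].add(d)
    return {word: sorted(ids) for word, ids in postings.items()}


# Hypothetical corpus of two tiny documents.
docs = {"d1": "the cat sat on the mat", "d2": "the dog sat"}

shuffle = [rec for doc_id, text in docs.items()
           for rec in map_postings(doc_id, text)]
input_bytes = sum(len(text) for text in docs.values())
shuffle_bytes = sum(len(word) + len(doc_id) for word, doc_id in shuffle)

index = build_index(docs)
print(input_bytes, shuffle_bytes, index["the"])
```

With real document ids (URLs or long keys) the per-record overhead is much
larger than in this toy corpus.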