Hadoop, mail # user - Re: Applications creates bigger output than input?


Owen O'Malley 2011-04-30, 01:22
elton sky 2011-04-30, 04:31
Re: Applications creates bigger output than input?
Ted Dunning 2011-04-30, 20:32
Co-occurrence analysis is commonly used in recommendations.  These jobs
produce large intermediate outputs.
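
As a rough illustration of why the intermediates grow, here is a minimal
sketch in plain Python (not Mahout's actual implementation; the function
name and the sample baskets are made up for the example). Each basket of n
distinct items emits n*(n-1)/2 pair records, so the intermediate data grows
quadratically with basket size.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_pairs(baskets):
    """Map-side step (hypothetical): emit one record per unordered item pair."""
    for basket in baskets:
        # Deduplicate and sort so that (a, b) and (b, a) become the same key.
        for a, b in combinations(sorted(set(basket)), 2):
            yield (a, b)


# Illustrative input only: three small shopping baskets.
baskets = [
    ["milk", "bread", "eggs", "butter"],
    ["milk", "bread", "jam"],
    ["eggs", "butter", "jam", "tea"],
]

input_records = sum(len(b) for b in baskets)
pairs = list(cooccurrence_pairs(baskets))  # the "large intermediate"
counts = Counter(pairs)                    # reduce-side step: count each pair

print("input item records:", input_records)
print("intermediate pair records:", len(pairs))
```

Even with baskets this small the pair records outnumber the input items, and
the gap widens quickly as baskets get longer.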

Come on over to the Mahout project if you would like to talk to a bunch of
people who work on these problems.

On Fri, Apr 29, 2011 at 9:31 PM, elton sky <[EMAIL PROTECTED]> wrote:

> Thank you for the suggestions:
>
> Weblog analysis, market basket analysis and generating search index.
>
> I guess these applications need more reduces than maps to handle the
> large intermediate output, don't they? Besides, the input split for each map
> should be smaller than usual, because the workload for spill and merge on
> the map's local disk is heavy.
>
> -Elton
>
> On Sat, Apr 30, 2011 at 11:22 AM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
>
> > On Fri, Apr 29, 2011 at 5:02 AM, elton sky <[EMAIL PROTECTED]> wrote:
> >
> > > For benchmarking purposes, I am looking for some non-trivial, real-life
> > > applications that create *bigger* output than their input. A trivial
> > > example I can think of is a cross join...
> > >
> >
> > As you say, almost all cross join jobs have that property. The other case
> > that almost always fits into that category is generating an index. For
> > example, if your input is a corpus of documents and you want to generate
> > the list of documents that contain each word, the output (and especially
> > the shuffle data) is much larger than the input.
> >
> > -- Owen
> >
>
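
The index case Owen describes in the quoted message can be sketched in a few
lines of plain Python. This is only an illustration under assumptions, not
Hadoop code: the function names, the two-document corpus, and the byte
accounting (key length plus value length per record) are invented for the
example. Every word occurrence becomes a (word, doc_id) record, so the
shuffle carries a document ID alongside every word.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step (hypothetical): emit one (word, doc_id) record per occurrence."""
    for word in text.lower().split():
        yield word, doc_id


def build_index(corpus):
    """Simulate map, shuffle, and reduce in memory for a dict of documents."""
    shuffle = []
    for doc_id, text in corpus.items():
        shuffle.extend(map_document(doc_id, text))
    postings = defaultdict(set)
    for word, doc_id in shuffle:  # reduce step: group document IDs by word
        postings[word].add(doc_id)
    index = {word: sorted(ids) for word, ids in postings.items()}
    return shuffle, index


# Illustrative corpus only.
corpus = {"d1": "to be or not to be", "d2": "to do is to be"}
shuffle, index = build_index(corpus)

# Crude size estimate: characters of input text vs. characters shuffled.
input_bytes = sum(len(text) for text in corpus.values())
shuffle_bytes = sum(len(word) + len(doc_id) for word, doc_id in shuffle)

print("input bytes:", input_bytes)
print("shuffle bytes:", shuffle_bytes)
print(index["be"])
```

With real document IDs (URLs or long keys) and positional information per
occurrence, the ratio is far larger than in this toy corpus.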
Steve Loughran 2011-05-02, 17:10
elton sky 2011-05-19, 08:06
Niels Basjes 2011-05-19, 12:57
Robert Evans 2011-05-19, 14:57
elton sky 2011-05-21, 04:57