-
Re: Applications creates bigger output than input?
Owen O'Malley 2011-04-30, 01:22
On Fri, Apr 29, 2011 at 5:02 AM, elton sky <[EMAIL PROTECTED]> wrote:
> For my benchmark purpose, I am looking for some non-trivial, real life applications which creates *bigger* output than its input. Trivial example I can think about is cross join...
As you say, almost all cross join jobs have that property. The other case that almost always fits into that category is generating an index. For example, if your input is a corpus of documents and you want to generate the list of documents that contain each word, the output (and especially the shuffle data) is much larger than the input.
-- Owen
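A minimal sketch of the index case described above, assuming a toy two-document corpus and hypothetical helper names (`map_document`, `build_index`); it is not taken from the thread and does not use any Hadoop API. It only illustrates why the (word, document) pairs sent through the shuffle can add up to more bytes than the raw input, since every word is re-emitted together with a document id.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Emit one (word, doc_id) pair per distinct word, like a map task would."""
    return [(word, doc_id) for word in sorted(set(text.lower().split()))]


def build_index(docs):
    """Group the map output by word, standing in for shuffle plus reduce."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for word, emitted_id in map_document(doc_id, text):
            index[word].append(emitted_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical toy corpus, chosen only for illustration.
docs = {
    "doc1": "hadoop map reduce",
    "doc2": "map output shuffle",
}

pairs = [p for doc_id, text in docs.items() for p in map_document(doc_id, text)]
index = build_index(docs)

# Rough size comparison: characters read versus characters shuffled.
input_chars = sum(len(text) for text in docs.values())
shuffle_chars = sum(len(word) + len(doc_id) for word, doc_id in pairs)
print(index["map"], input_chars, shuffle_chars)
```

On a real corpus the ratio depends on document length, key encoding and whether positions are stored, so the numbers here should be read as an illustration only.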
-
Re: Applications creates bigger output than input?
elton sky 2011-04-30, 04:31
Thank you for the suggestions:
Weblog analysis, market basket analysis and generating a search index.
I guess these applications need more reduces than maps to handle the large intermediate output, don't they? Besides, the input split for each map should be smaller than usual, because the spill and merge workload on the map's local disk is heavy.
-Elton
-
Re: Applications creates bigger output than input?
Ted Dunning 2011-04-30, 20:32
Cooccurrence analysis is commonly used in recommendations. These jobs produce large intermediates.
Come on over to the Mahout project if you would like to talk to a bunch of people who work on these problems.
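A minimal sketch of why cooccurrence analysis produces large intermediates, assuming made-up baskets and hypothetical function names (`cooccurrence_pairs`, `count_cooccurrences`); it is not Mahout's implementation. A record with k distinct items emits k*(k-1)/2 pairs, so the intermediate data grows quadratically with record size.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_pairs(basket):
    """Emit every unordered pair of distinct items in one record."""
    items = sorted(set(basket))
    return list(combinations(items, 2))


def count_cooccurrences(baskets):
    """Sum the pair counts over all records, as a reduce step would."""
    counts = Counter()
    for basket in baskets:
        counts.update(cooccurrence_pairs(basket))
    return counts


# Hypothetical input records, chosen only for illustration.
baskets = [
    ["milk", "bread", "eggs", "tea"],
    ["bread", "tea"],
]

counts = count_cooccurrences(baskets)
print(len(cooccurrence_pairs(baskets[0])), counts[("bread", "tea")])
```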
-
Re: Applications creates bigger output than input?
Steve Loughran 2011-05-02, 17:10
On 30/04/2011 05:31, elton sky wrote:
> Thank you for suggestions:
>
> Weblog analysis, market basket analysis and generating search index.
>
> I guess for these applications we need more reduces than maps, for handling large intermediate output, isn't it. Besides, the input split for map should be smaller than usual, because the workload for spill and merge on map's local disk is heavy.
Any form of rendering can generate very large images.
See: http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf
-
Re: Applications creates bigger output than input?
elton sky 2011-05-19, 08:06
Hello,
I am picking up this topic again, because what I am looking for is something that is not CPU bound. Augmenting data for ETL and generating an index are good examples. Neither of them requires much CPU time on the map side. The main bottleneck for them is shuffle and merge.
Market basket analysis is CPU intensive in the map phase, because it samples all possible combinations of items.
I am still looking for more applications that create bigger output and are not CPU bound.
Any further ideas? I would appreciate them.
-
Re: Applications creates bigger output than input?
Niels Basjes 2011-05-19, 12:57
Something I've seen in the past is code that takes the input "something" and outputs
"s"
"so"
"som"
"some"
"somet"
"someth"
"somethi"
"somethin"
"something"
So the number of output records is the same as the length of the input text.
Niels
--
Kind regards,
Niels Basjes
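A minimal sketch of the prefix expansion Niels describes, assuming a hypothetical function name `prefixes`; the original code he saw is not shown in the thread. For an input of length n it emits n records totalling n*(n+1)/2 characters, so the output grows quadratically with key length while the per-record CPU work stays small.

```python
def prefixes(text):
    """Return every non-empty prefix of text, shortest first."""
    return [text[:i] for i in range(1, len(text) + 1)]


out = prefixes("something")
total_chars = sum(len(p) for p in out)
print(len(out), total_chars)
```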
-
Re: Applications creates bigger output than input?
Robert Evans 2011-05-19, 14:57
I'm not sure if this has been mentioned or not, but in machine learning with text-based documents, the first stage is often a glorified word count. Except much of the time they will do n-grams. So
Map Input:
"Hello this is a test"
Map Output:
"Hello"
"this"
"is"
"a"
"test"
"Hello" "this"
"this" "is"
"is" "a"
"a" "test"
...
You may also be extracting all kinds of other features from the text, but the tokenization/n-gram step is not that CPU intensive.
--Bobby Evans
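A minimal sketch of the n-gram map step in Bobby's example, assuming whitespace tokenization and hypothetical names (`ngrams`, `map_ngrams`, `max_n`); real feature extraction pipelines will differ. A line of t tokens emits t unigrams plus t-1 bigrams, and each further n-gram order adds roughly another t records, which is where the extra output comes from.

```python
def ngrams(tokens, n):
    """Return all contiguous runs of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def map_ngrams(line, max_n=2):
    """Emit every n-gram of the line for n = 1 .. max_n."""
    tokens = line.split()
    emitted = []
    for n in range(1, max_n + 1):
        emitted.extend(ngrams(tokens, n))
    return emitted


records = map_ngrams("Hello this is a test")
for record in records:
    print(" ".join(record))
```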
-
Re: Applications creates bigger output than input?
elton sky 2011-05-21, 04:57
Thanks Robert, Niels
Yes, I think text manipulation, especially n-gram generation, is a good application for me.
Cheers