Re: Applications creates bigger output than input?
Something I've seen in the past is code that has the input
   "something"
and outputs
   "s"
   "so"
   "som"
   "some"
   "somet"
   "someth"
   "somethi"
   "somethin"
   "something"

So the number of output records equals the length of the input text, and an
input of n characters produces n(n+1)/2 characters of output.
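
A minimal sketch of that pattern, in plain Python rather than against the Hadoop API. The names prefixes and map_lines are made up for illustration and are not from any real job; reading records from stdin is likewise an assumption, in the style of a streaming mapper.

```python
import sys


def prefixes(record):
    """Yield every non-empty prefix of record, shortest first."""
    for end in range(1, len(record) + 1):
        yield record[:end]


def map_lines(lines):
    """Hypothetical map step: emit all prefixes of each input line."""
    for line in lines:
        text = line.rstrip("\n")
        # One output record per input character, so the output
        # volume grows quadratically with the record length.
        yield from prefixes(text)


if __name__ == "__main__":
    # Assumed usage: records arrive on stdin, one per line.
    for out in map_lines(sys.stdin):
        print(out)
```

The map step does almost no computation per record, so the cost is dominated by writing and moving the expanded output.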

Niels

2011/5/19 elton sky <[EMAIL PROTECTED]>:
> Hello,
> I am picking this topic up again, because what I am looking for is something
> that is not CPU bound. Augmenting data for ETL and generating an index are
> good examples. Neither of them requires much CPU time on the map side. The
> main bottleneck for them is shuffle and merge.
>
> Market basket analysis is CPU intensive in the map phase, because it samples
> all possible combinations of items.
>
> I am still looking for more applications that create bigger output and are
> not CPU bound.
> Any further ideas? I would appreciate them.
>
>
> On Tue, May 3, 2011 at 3:10 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:
>
>> On 30/04/2011 05:31, elton sky wrote:
>>
>>> Thank you for the suggestions:
>>>
>>> Weblog analysis, market basket analysis and generating a search index.
>>>
>>> I guess for these applications we need more reduces than maps, to handle
>>> the large intermediate output, don't we? Besides, the input split for each
>>> map should be smaller than usual, because the workload for spill and merge
>>> on the map's local disk is heavy.
>>>
>>
>> any form of rendering can generate very large images
>>
>> see: http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf
>>
>>
>>
>

--
Kind regards,

Niels Basjes