Seems like you want to misuse Hadoop but maybe I still don't understand
The standard way would be to split your files across multiple map tasks. Each
map task could profit from data locality. Do part of the worker's job in the
mapper and then use a reducer to aggregate all the results (which could be
another part of your worker). That way you would be able to parallelise
your worker logic on a file. You seem to be avoiding a reducer in order to
lessen the network traffic. That's a valid concern, but reducers do have their
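
To make the split, map, then aggregate idea above concrete, here is a minimal sketch in plain Java with no Hadoop dependency. Everything in it is an assumption for illustration: the class name `MapThenReduceSketch`, the word-counting "worker", and the in-memory splits. In a real job, each `mapSplit` call would correspond to a separate map task running near its data, and `reduce` to a single reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MapThenReduceSketch {

    // "Map" step: the part of the worker that can run on one split alone.
    // The hypothetical worker here just counts whitespace-separated words.
    public static long mapSplit(List<String> split) {
        long words = 0;
        for (String line : split) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                words += trimmed.split("\\s+").length;
            }
        }
        return words;
    }

    // "Reduce" step: aggregates the partial results of all map calls.
    public static long reduce(List<Long> partials) {
        long total = 0;
        for (long partial : partials) {
            total += partial;
        }
        return total;
    }

    // Cuts the input into consecutive splits of at most linesPerSplit lines.
    public static List<List<String>> split(List<String> lines, int linesPerSplit) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += linesPerSplit) {
            int end = Math.min(i + linesPerSplit, lines.size());
            splits.add(lines.subList(i, end));
        }
        return splits;
    }

    // Runs the map step on every split, then reduces the partial counts.
    public static long run(List<String> lines, int linesPerSplit) {
        List<Long> partials = new ArrayList<>();
        for (List<String> s : split(lines, linesPerSplit)) {
            partials.add(mapSplit(s));
        }
        return reduce(partials);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b c", "d e", "", "f");
        System.out.println(run(lines, 2)); // prints 6
    }
}
```

Only the small partial counts travel to the reduce step, so the aggregation adds little network traffic compared with moving the file itself.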
On Mon, Aug 13, 2012 at 5:53 PM, Matthias Kricke <
[EMAIL PROTECTED]> wrote:
> @Bejoy KS: Thanks for your advice.
> @Bertrand: It is parallelisable, this is just a test case. In later cases
> there will be a lot of big files which should be processed completely each
> in one map step. We want to minimize the overhead of network traffic. The
> idea is to execute some worker (could be different stuff, e.g. wordcount,
> linecount, translation etc) at the node where the file is situated.
> If I get it right so far, we need to do several things. First, the chunk size
> should be as big as the file. Then the file is on a single node of the
> Hadoop cluster, am I right? And we need to
> set the file to non-splittable.
> Do you have any more advice? Anyway, thanks so far!
> 2012/8/13 Bertrand Dechoux <[EMAIL PROTECTED]>
>> It was almost what I was getting at but I was not sure about your
>> Basically, Hadoop is only adding overhead due to the way your job is
>> Now the question is: why do you need a single mapper? Is your need truly
>> not 'parallelisable'?
>> On Mon, Aug 13, 2012 at 4:49 PM, Bejoy KS <[EMAIL PROTECTED]> wrote:
>>> Hi Matthias
>>> When a MapReduce program is run, there are some extra steps such as
>>> checking the input and output dirs, calculating input splits, the
>>> JobTracker (JT) assigning a TaskTracker (TT) to execute the task, etc.
>>> If your file is non-splittable, then one map task per file will be
>>> generated irrespective of the number of HDFS blocks. Some blocks will
>>> then be on a different node than the one where the map task is executed,
>>> so time will be spent here on the network transfer.
>>> In your case MR would be an overhead, as your file is non-splittable, hence
>>> no parallelism, and there is also the overhead of copying blocks to the map
>>> task node.
>>> Bejoy KS
>>> *From: * Matthias Kricke <[EMAIL PROTECTED]>
>>> *Sender: * [EMAIL PROTECTED]
>>> *Date: *Mon, 13 Aug 2012 16:33:06 +0200
>>> *To: *<[EMAIL PROTECTED]>
>>> *ReplyTo: * [EMAIL PROTECTED]
>>> *Subject: *Re: how to enhance job start up speed?
>>> OK, I'll try to clarify:
>>> 1) The worker is the logic inside my mapper and is the same in both cases.
>>> 2) I have two cases. In the first one I use Hadoop to execute my worker
>>> and in the second one I execute my worker without Hadoop (simple read of the
>>> Now I measured, for both cases, the time the worker and
>>> the surroundings need (so I have two values for each case). The worker took
>>> the same time in both cases for the same input (this is expected). But the
>>> surroundings took 17% more time when using Hadoop.
>>> 3) ~3 GB.
>>> I want to know how to reduce this difference and where it comes from.
>>> I hope that helps. If not, feel free to ask again :)
>>> P.S. Just for your information, I did the same test with Hypertable as
>>> I got:
>>> * worker without anything: 15% overhead
>>> * worker with Hadoop: 32% overhead
>>> * worker with Hypertable: 53% overhead
>>> Remark: overhead is measured as a share of the whole process time, e.g.
>>> with Hypertable the surroundings use 53% of the whole process time, while
>>> the worker uses 47%.
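
The overhead definition in the remark above can be written out as a small calculation. This is a sketch under stated assumptions: the class and method names are hypothetical, and it assumes "overhead" means the surroundings' share of the total (worker plus surroundings) time, as the 53%/47% example suggests. Read that way, the 15% and 32% figures differ by 17 percentage points, which may be the 17% mentioned earlier in the message.

```java
public class OverheadShare {

    // Overhead as a percentage of the whole process time
    // (worker time plus surroundings time, in the same unit).
    public static double overheadPercent(double workerTime, double surroundingsTime) {
        return 100.0 * surroundingsTime / (workerTime + surroundingsTime);
    }

    // Converts an overhead share (percent of the whole) into
    // surroundings time per unit of worker time.
    public static double surroundingsPerWorker(double overheadPercent) {
        return overheadPercent / (100.0 - overheadPercent);
    }

    public static void main(String[] args) {
        // Shares reported in the message: no framework, Hadoop, Hypertable.
        double[] reported = {15.0, 32.0, 53.0};
        for (double share : reported) {
            System.out.printf("%.0f%% overhead = %.2f x worker time%n",
                    share, surroundingsPerWorker(share));
        }
    }
}
```

Expressed relative to the worker rather than the whole process, the same measurements look larger: a 53% share means the surroundings take longer than the worker itself.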
>>> 2012/8/13 Bertrand Dechoux <[EMAIL PROTECTED]>
>>>> I am not sure I understand, and I guess I am not the only one.