It was almost what I was getting at, but I was not sure about your problem.
Basically, Hadoop is only adding overhead due to the way your job is set up:
a single non-splittable file means a single mapper and no parallelism.
Now the question is: why do you need a single mapper? Is your need truly
sequential, or could the input be split across several mappers?
On Mon, Aug 13, 2012 at 4:49 PM, Bejoy KS <[EMAIL PROTECTED]> wrote:
> Hi Matthias
> When a MapReduce program is used there are some extra steps, like
> checking the input and output dirs, calculating input splits, the JT
> assigning a TT to execute the task, etc.
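> For illustration, a minimal driver sketch (class and path names here are
> made up); all of the steps above happen inside waitForCompletion():
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>
> public class SingleFileDriver {
>     public static void main(String[] args) throws Exception {
>         Job job = new Job(new Configuration(), "single-file-worker");
>         job.setJarByClass(SingleFileDriver.class);
>         job.setMapperClass(WorkerMapper.class);   // hypothetical mapper, sketched further down
>         job.setNumReduceTasks(0);                 // map-only, as in this test scenario
>         job.setOutputKeyClass(Text.class);
>         job.setOutputValueClass(NullWritable.class);
>         // Submission is where the extra steps happen: the output dir is checked,
>         // input splits are calculated, and the JT assigns a TT to run the task.
>         FileInputFormat.addInputPath(job, new Path(args[0]));
>         FileOutputFormat.setOutputPath(job, new Path(args[1]));
>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>     }
> }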
> If your file is non-splittable, then one map task per file will be
> generated irrespective of the number of HDFS blocks. Now some blocks will
> be on a different node than the node where the map task is executed, so time
> will be spent there on network transfer.
> In your case MR would be an overhead, as your file is non-splittable, hence
> no parallelism, and there is also the overhead of copying blocks to the map
> task node.
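> For reference, "non-splittable" here usually means an InputFormat whose
> isSplitable() returns false; a minimal sketch with the new API, class name
> made up:
>
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.mapreduce.JobContext;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
>
> // The whole file becomes a single split, so exactly one map task is created
> // no matter how many HDFS blocks the file spans; blocks that are not local
> // to that task's node are then fetched over the network.
> public class NonSplittableTextInputFormat extends TextInputFormat {
>     @Override
>     protected boolean isSplitable(JobContext context, Path file) {
>         return false;
>     }
> }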
> Bejoy KS
> Sent from handheld, please excuse typos.
> *From: * Matthias Kricke <[EMAIL PROTECTED]>
> *Sender: * [EMAIL PROTECTED]
> *Date: *Mon, 13 Aug 2012 16:33:06 +0200
> *To: *<[EMAIL PROTECTED]>
> *ReplyTo: * [EMAIL PROTECTED]
> *Subject: *Re: how to enhance job start up speed?
> OK, I'll try to clarify:
> 1) The worker is the logic inside my mapper, and it is the same in both cases.
> 2) I have two cases. In the first one I use Hadoop to execute my worker,
> and in the second one I execute my worker without Hadoop (a simple read of the
> input file). Now I measured, for both cases, the time the worker and
> the surroundings need (so I have two values for each case). The worker took
> the same time in both cases for the same input (this is expected). But the
> surroundings took 17% more time when using Hadoop.
> 3) ~ 3GB.
> I want to know how to reduce this difference and where it comes from.
> I hope that helped? If not, feel free to ask again :)
> P.S. Just for your information, I did the same test with Hypertable as
> well. I got:
> * worker without anything: 15% overhead
> * worker with hadoop: 32% overhead
> * worker with hypertable: 53% overhead
> Remark: overhead was measured in comparison to the worker, e.g. Hypertable
> uses 53% of the whole process time, while the worker uses 47%.
> 2012/8/13 Bertrand Dechoux <[EMAIL PROTECTED]>
>> I am not sure I understand, and I guess I am not the only one.
>> 1) What's a worker in your context? Only the logic inside your Mapper or
>> something else?
>> 2) You should clarify your cases. You seem to have two cases, but both are
>> stated as overhead, so I am assuming there is a baseline? Hadoop vs. sequential,
>> so the sequential case does not use Hadoop?
>> 3) What is the size of the file?
>> On Mon, Aug 13, 2012 at 1:51 PM, Matthias Kricke <
>> [EMAIL PROTECTED]> wrote:
>>> Hello all,
>>> I'm using CDH3u3.
>>> If I want to process one file, set to non-splittable, Hadoop starts one
>>> Mapper and no Reducer (that's OK for this test scenario). The Mapper
>>> goes through a configuration step where some variables for the worker
>>> inside the mapper are initialized.
>>> Now the Mapper gives me K,V-pairs, which are lines of an input file. I
>>> process the V with the worker.
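>>> Roughly, the structure is like the sketch below (Worker and its process()
>>> method are placeholder names, not the real code):
>>>
>>> import java.io.IOException;
>>> import org.apache.hadoop.io.LongWritable;
>>> import org.apache.hadoop.io.NullWritable;
>>> import org.apache.hadoop.io.Text;
>>> import org.apache.hadoop.mapreduce.Mapper;
>>>
>>> public class WorkerMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
>>>     private Worker worker;  // placeholder for the actual worker logic
>>>
>>>     @Override
>>>     protected void setup(Context context) {
>>>         // the "configuration step": initialize the worker's variables once per map task
>>>         worker = new Worker(context.getConfiguration());
>>>     }
>>>
>>>     @Override
>>>     protected void map(LongWritable key, Text value, Context context)
>>>             throws IOException, InterruptedException {
>>>         // key = byte offset in the file, value = one line of the input
>>>         String result = worker.process(value.toString());
>>>         context.write(new Text(result), NullWritable.get());
>>>     }
>>> }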
>>> When I compare the run time of Hadoop to the run time of the same
>>> process run sequentially, I get:
>>> worker time --> same in both cases
>>> case: mapper --> overhead of ~32% relative to the worker process (same for a bigger
>>> chunk size)
>>> case: sequential --> overhead of ~15% relative to the worker process
>>> It shouldn't be that much slower; since the file is non-splittable, the mapper
>>> will be executed where the data is stored by HDFS, won't it?
>>> Where does that 17% go? How can I reduce it? Does Hadoop spend the extra
>>> time reading or streaming the data out of HDFS?
>>> I would appreciate your help,