Aleksandar Stupar 2010-04-08, 12:55
Hi all,
As I understand it, Hadoop is mainly used for tasks that take a long time to execute. I'm considering using Hadoop for a task whose lower bound in distributed execution is around 5 to 10 seconds, and I'm wondering what the overhead of using Hadoop would be.
Does anyone have an idea? Any link where I can find this out?
Thanks, Aleksandar.
Jeff Zhang 2010-04-08, 14:37
By default, Hadoop creates a new JVM process for each task, which will be the major cost in my opinion. You can customize the configuration to let the TaskTracker reuse the JVM, which eliminates the overhead to some extent.
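A minimal sketch of turning on JVM reuse for a single job from the command line. It assumes the Hadoop 0.20/1.x property name mapred.job.reuse.jvm.num.tasks (where -1 means no limit on tasks per JVM) and a job driver that goes through ToolRunner/GenericOptionsParser so that -D options are honored. The jar name, class name, and paths are hypothetical placeholders.

```python
import shlex

# Property name as used in Hadoop 0.20/1.x; check it against your version.
JVM_REUSE_PROPERTY = "mapred.job.reuse.jvm.num.tasks"


def hadoop_jar_command(jar, main_class, job_args, tasks_per_jvm=-1):
    """Build a `hadoop jar` argument list that sets JVM reuse for one job.

    tasks_per_jvm: -1 for unlimited reuse, or a positive task count per JVM.
    """
    if tasks_per_jvm == 0 or tasks_per_jvm < -1:
        raise ValueError("tasks_per_jvm must be -1 or a positive integer")
    # Generic -D options have to come before the job's own arguments.
    return [
        "hadoop", "jar", jar, main_class,
        "-D", f"{JVM_REUSE_PROPERTY}={tasks_per_jvm}",
        *job_args,
    ]


if __name__ == "__main__":
    # Hypothetical jar, class, and paths, for illustration only.
    cmd = hadoop_jar_command("myjob.jar", "com.example.MyJob", ["in", "out"])
    print(shlex.join(cmd))
```

The same property can instead be set once in mapred-site.xml if every job on the cluster should reuse JVMs.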
-- Best Regards
Jeff Zhang
Rajesh Balamohan 2010-04-08, 14:50
If there are too many short-duration jobs, you might want to keep an eye on the JobTracker and tweak the number of heartbeats processed per second and the outofbandheartbeat option. Otherwise the JobTracker might be bombarded with events.
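A rough sketch of what those two settings could look like as a mapred-site.xml fragment, generated here with a small helper. The property names mapred.heartbeats.in.second and mapreduce.tasktracker.outofband.heartbeat are the ones I know from the Hadoop 1.x mapred-default.xml and may be named differently, or be missing, in other versions. The values are placeholders, not recommendations.

```python
import xml.etree.ElementTree as ET


def property_fragment(settings):
    """Render a {name: value} mapping as <property> elements, one per line."""
    lines = []
    for name, value in settings.items():
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
        lines.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(lines)


# Assumed property names (Hadoop 1.x); the values are illustrative only.
HEARTBEAT_SETTINGS = {
    "mapred.heartbeats.in.second": 100,
    "mapreduce.tasktracker.outofband.heartbeat": "true",
}

if __name__ == "__main__":
    # Paste the output inside <configuration> in mapred-site.xml.
    print(property_fragment(HEARTBEAT_SETTINGS))
```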
-- ~Rajesh.B
Patrick Angeles 2010-04-08, 14:51
Packaging the job and config and sending them to the JobTracker and the various nodes also adds a few seconds of overhead.
Edward Capriolo 2010-04-08, 15:28
All jobs make entries in a jobhistory directory on the task tracker. As of now the jobhistory directory has some limitations: with ext3 you hit the maximum number of files in a directory at 32k; with xfs or ext4 there is no theoretical limit, but Hadoop itself will bog down if the directory gets too large.
If you want to do this, enable JVM reuse as mentioned above to shorten job start times. Also be prepared to write some shell scripts to handle cleanup tasks.
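As one possible shape for such a cleanup task, here is a small sketch that prunes old files from a directory by modification time. It is written in Python rather than shell, and the function name, the directory path, and the 30-day retention are all assumptions for illustration. It defaults to a dry run so nothing is deleted until you have checked the list.

```python
import os
import time


def prune_old_files(directory, max_age_days, dry_run=True):
    """Find regular files in `directory` older than `max_age_days`.

    Returns the sorted list of matching paths. Files are only deleted
    when dry_run is False. Subdirectories and symlinks are left alone.
    """
    cutoff = time.time() - max_age_days * 86400
    matched = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                matched.append(entry.path)
                if not dry_run:
                    os.remove(entry.path)
    return sorted(matched)


if __name__ == "__main__":
    # Hypothetical location; point this at your actual job history directory.
    for path in prune_old_files("/var/log/hadoop/history", max_age_days=30):
        print("would remove", path)
```

Run from cron with dry_run=False once the output looks right.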
Edward
Aleksandar Stupar 2010-04-09, 06:14
Thank you very much for all the answers.
I will definitely try using Hadoop. I hope the results will be good.
Kind regards, Aleksandar Stupar.