-
How do I synchronize Hadoop jobs?
W.P. McNeill 2012-02-15, 19:23
Say I have two Hadoop jobs, A and B, that can be run in parallel. I have another job, C, that takes the output of both A and B as input. I want to run A and B at the same time, wait until both have finished, and then launch C. What is the best way to do this?
I know the answer if I've got a single Java client program that launches A, B, and C. But what if I don't have the option to launch all of them from a single Java program? (Say I've got a much more complicated system with many steps happening between A-B and C.) How do I synchronize between jobs, make sure there are no race conditions, etc.? Is this what ZooKeeper is for?
-
Re: How do I synchronize Hadoop jobs?
John Armstrong 2012-02-15, 19:26
Actually, I think this is what Oozie is for. It seems to leap out as a great example of a forked workflow.
hth
-
Re: How do I synchronize Hadoop jobs?
Alejandro Abdelnur 2012-02-15, 19:28
You can use Oozie for that: write a workflow job that forks A & B and then joins before C.
Thanks.
Alejandro
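For readers who want to see the fork-and-join control flow described above in concrete form, here is a minimal sketch in plain Java. It is not Oozie's API and it does not submit real Hadoop jobs: the class ForkJoinSketch and the helper runJob are hypothetical stand-ins, and each "job" is just a function call that records when it finished. It only illustrates the ordering guarantee, namely that C starts after both A and B have completed.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only: plain Java tasks stand in for Hadoop jobs A, B, and C.
class ForkJoinSketch {

    // Hypothetical stand-in for submitting a job and blocking until it finishes.
    static String runJob(String name, List<String> log) {
        log.add(name + " finished");
        return name + "-output";
    }

    // Forks A and B, joins on both, then runs C with both outputs.
    static List<String> runWorkflow() {
        List<String> log = new CopyOnWriteArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Fork: A and B are submitted to separate threads and may run concurrently.
            CompletableFuture<String> a =
                    CompletableFuture.supplyAsync(() -> runJob("A", log), pool);
            CompletableFuture<String> b =
                    CompletableFuture.supplyAsync(() -> runJob("B", log), pool);

            // Join: thenCombine fires only after both A and B have completed.
            CompletableFuture<String> c = a.thenCombine(b, (outA, outB) -> {
                log.add("C started with " + outA + " and " + outB);
                return runJob("C", log);
            });

            // Block until C is done; throws CompletionException if A, B, or C failed.
            c.join();
        } finally {
            pool.shutdown();
        }
        return log;
    }

    public static void main(String[] args) {
        runWorkflow().forEach(System.out::println);
    }
}
```

The relative order of A and B in the log can vary from run to run, but the entries for C always come last. A workflow engine gives the same guarantee across separate processes, which is the case the original question asks about.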
-
Re: How do I synchronize Hadoop jobs?
bejoy.hadoop@... 2012-02-15, 19:28
Hi McNeill, have a look at Oozie. It is meant for workflow management in Hadoop and can serve your purpose.
Regards Bejoy K S
-
Re: How do I synchronize Hadoop jobs?
Bharath Mundlapudi 2012-02-15, 21:29
Or you could use job chaining in MR.
http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
-Bharath
-
Re: How do I synchronize Hadoop jobs?
Bharath Mundlapudi 2012-02-15, 21:31
For complex workflows, Oozie (or Azkaban) is indeed the answer.
-Bharath