-
Calling one MR job within another MR job
Stuti Awasthi 2012-04-04, 10:34
Hi all,
We have a use case in which I start a first job, MR1, with File1.txt as input, and from within this job call a second job, MR2, with File2.txt as input. So:

MRjob1{
    Map(){
        MRJob2(File2.txt)
    }
}

MRJob2{
    Processing....
}

My questions are: is this kind of approach possible, and what are the implications from a performance perspective?

Regards,
Stuti Awasthi
HCL Comnet Systems and Services Ltd
F-8/9 Basement, Sec-3, Noida.
________________________________
::DISCLAIMER::
-----------------------------------------------------------------------------------------------------------------------
The contents of this e-mail and any attachment(s) are confidential and intended for the named recipient(s) only. It shall not attach any liability on the originator or HCL or its affiliates. Any views or opinions presented in this email are solely those of the author and may not necessarily reflect the opinions of HCL or its affiliates. Any form of reproduction, dissemination, copying, disclosure, modification, distribution and / or publication of this message without the prior written consent of the author of this e-mail is strictly prohibited. If you have received this email in error please delete it and notify the sender immediately. Before opening any mail and attachments please check them for viruses and defect.
-----------------------------------------------------------------------------------------------------------------------
-
Re: Calling one MR job within another MR job
Ashwanth Kumar 2012-04-04, 10:42
Have you tried using Oozie <http://incubator.apache.org/oozie/>?

-- Ashwanth Kumar / ashwanthkumar.in
-
RE: Calling one MR job within another MR job
Stuti Awasthi 2012-04-04, 10:49
Hi Ashwanth,
No, I have not tried Oozie. I want to achieve this using plain Java MapReduce jobs. Any ideas?
-
Re: Calling one MR job within another MR job
Ashwanth Kumar 2012-04-04, 10:51
I have not tried doing this myself, but looking into the Oozie source code should give you some ideas: Oozie uses something called LauncherMapper, which launches other MR jobs.

-- Ashwanth Kumar / ashwanthkumar.in
-
Re: Calling one MR job within another MR job
Ashwanth Kumar 2012-04-04, 11:03
Have you tried using JobConf / JobClient for starting new jobs? Also see http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining on job chaining.

-- Ashwanth Kumar / ashwanthkumar.in
-
RE: Calling one MR job within another MR job
Ravi teja ch n v 2012-04-04, 11:05
Hi Stuti,
If you want MRjob2 to run after MRjob1, i.e. a job dependency,
you can use the JobControl API, which lets you manage the dependencies.
Calling another job from a Mapper is not a good idea.
Thanks,
Ravi Teja
-
RE: Calling one MR job within another MR job
Stuti Awasthi 2012-04-04, 11:55
Hi Ravi,
There is no job dependency, so I cannot use chained MR jobs or JobControl as you suggested. I have two relatively big files. I start processing with File1 as the input to the MR1 job, and this processing needs to look up data from File2. One way to do this is to loop through File2 and fetch the data. The other way is to pass File2 to an MR2 job for parallel processing.
The second option is what is prompting me to call an MR2 job from inside the MR1 job. I am sure this is a common problem that people face. What is the best way to resolve this kind of issue?
Thanks
-
RE: Calling one MR job within another MR job
Stuti Awasthi 2012-04-04, 12:01
Hi Ashwanth,
My scenario is not solved by chaining jobs, since in chaining the output of one MR job is the input to the next MR job. Nor can I use the JobControl API, as it tells Job1 to wait until Job2 is complete. In my scenario, the processing of each line of File1 depends on simultaneous processing of File2.
-
RE: Calling one MR job within another MR job
Ravi teja ch n v 2012-04-04, 12:31
Hi Stuti,
In that case, you can run the job on the dependent file (File2) first, and then run the job on File1.
The mapper of the second job can then use the already processed output.
I guess this will solve the problem you have mentioned.
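A minimal sketch of this two-stage ordering, written as plain Java rather than Hadoop API code. The class and method names (TwoStageSketch, summarizeFile2, processFile1) and the "key,value" line format are assumptions made for illustration only; the point is that the driver runs the File2 stage to completion and only then runs the File1 stage, so no job is started from inside a mapper.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; this simulates the ordering, not the Hadoop API.
final class TwoStageSketch {

    // Stage 1 (stands in for the job over File2): reduce File2 to one value per key.
    // Here the "processed output" is simply a count of File2 records per key.
    static Map<String, Integer> summarizeFile2(List<String> file2Lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : file2Lines) {
            String key = line.split(",", 2)[0];
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    // Stage 2 (stands in for the job over File1): each File1 record reads the
    // already processed File2 output instead of triggering new work on File2.
    static List<String> processFile1(List<String> file1Lines, Map<String, Integer> summary) {
        List<String> out = new ArrayList<>();
        for (String line : file1Lines) {
            String key = line.split(",", 2)[0];
            out.add(key + "\t" + summary.getOrDefault(key, 0));
        }
        return out;
    }

    // The driver fixes the order: stage 1 completes before stage 2 starts.
    static List<String> run(List<String> file1Lines, List<String> file2Lines) {
        return processFile1(file1Lines, summarizeFile2(file2Lines));
    }
}
```

In a real cluster the stage 1 output would be written to HDFS and read back by the second job; the in-memory map above only stands in for that hand-off.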
Thanks,
Ravi Teja
-
RE: Calling one MR job within another MR job
Devaraj k 2012-04-04, 12:39
Hi Stuti,
If you want to deal with different types of files in the map phase, you can use the org.apache.hadoop.mapred.lib.MultipleInputs API (different input formats and mappers per input), and the output of those mappers can be of the same type. After the map phase, the partitioner can send the map outputs from File1 and File2 that belong together (according to your business need) to the same reducer. You can then compare them in the reduce phase. If you describe the scenario in more detail, people may be able to help you better.
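The flow described above amounts to a reduce-side join. Below is a rough sketch of the idea in plain Java, not Hadoop code: the class ReduceSideJoinSketch, the "1:" / "2:" source tags, and the "key,value" line format are all assumptions for illustration. Each input gets its own map step that tags records with their source, records are grouped by key (the shuffle), and the reduce step sees both files' records for one key together.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical names; a simulation of the data flow, not the MultipleInputs API.
final class ReduceSideJoinSketch {

    // Map phase: emit (key, tag + value). The tag records which file a value came from.
    // Assumes every line has the form "key,value".
    static void map(String tag, List<String> lines, Map<String, List<String>> shuffle) {
        for (String line : lines) {
            String[] kv = line.split(",", 2);
            shuffle.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(tag + kv[1]);
        }
    }

    // Reduce phase: for one key, separate the values by source and pair them up.
    static List<String> reduce(String key, List<String> taggedValues) {
        List<String> fromFile1 = new ArrayList<>();
        List<String> fromFile2 = new ArrayList<>();
        for (String v : taggedValues) {
            (v.startsWith("1:") ? fromFile1 : fromFile2).add(v.substring(2));
        }
        List<String> out = new ArrayList<>();
        for (String a : fromFile1) {
            for (String b : fromFile2) {
                out.add(key + "," + a + "," + b);
            }
        }
        return out;
    }

    // Driver: both inputs go through the map phase, then every key is reduced once.
    static List<String> join(List<String> file1Lines, List<String> file2Lines) {
        Map<String, List<String>> shuffle = new TreeMap<>(); // sorted by key, like a shuffle
        map("1:", file1Lines, shuffle);
        map("2:", file2Lines, shuffle);
        List<String> out = new ArrayList<>();
        shuffle.forEach((key, values) -> out.addAll(reduce(key, values)));
        return out;
    }
}
```

Because both files are read in a single job, neither has to fit in memory as a whole and no second job is needed; only the values for one key are held together in the reducer.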
Thanks
Devaraj
-
Re: Calling one MR job within another MR job
praveenesh kumar 2012-04-04, 12:41
Try looking into the distributed cache; maybe it solves your problem?
Regards, Praveenesh
-
RE: Calling one MR job within another MR job
jagatsingh@... 2012-04-04, 12:47
Hello Stuti
From the way you have explained it, it seems we could think about caching File2 on the nodes ahead of time.
-- As an aside, replicated joins in Pig are handled in the same way: one of the files to be joined (File2) is cached in memory by the tasks that process File1.
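A small sketch of that replicated (map-side) join idea in plain Java, under the assumption that File2 fits in each task's memory and has unique keys. The names ReplicatedJoinSketch, loadLookup and mapFile1, and the "key,value" line format, are hypothetical; in Hadoop the lookup would typically be loaded from a cached copy of File2 when the task starts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names; plain Java standing in for a mapper with a cached side file.
final class ReplicatedJoinSketch {

    // Done once per task: load the cached copy of File2 into an in-memory map.
    // Assumes "key,value" lines; if a key repeats, the last value wins.
    static Map<String, String> loadLookup(List<String> file2Lines) {
        Map<String, String> lookup = new HashMap<>();
        for (String line : file2Lines) {
            String[] kv = line.split(",", 2);
            lookup.put(kv[0], kv[1]);
        }
        return lookup;
    }

    // Done per File1 record: a single hash lookup replaces a scan of File2.
    // Records without a match are dropped (inner-join behaviour).
    static List<String> mapFile1(List<String> file1Lines, Map<String, String> lookup) {
        List<String> out = new ArrayList<>();
        for (String line : file1Lines) {
            String[] kv = line.split(",", 2);
            String match = lookup.get(kv[0]);
            if (match != null) {
                out.add(kv[0] + "," + kv[1] + "," + match);
            }
        }
        return out;
    }
}
```

This only works while the cached file is small enough to hold in memory on every node; for two genuinely large files a join in the reduce phase is usually the safer choice.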
Regards
Jagat
-
Re: Calling one MR job within another MR job
Praveen Kumar K J V S 2012-04-04, 13:12
Dear Stuti, As per the mail chain I uderstand you want to do SetJoin on two sets File1 and File2 with some join finction F(F1,F2). On this assumption, please find my reply below: Set join is not simple and that too if input the input is very large. It essestially does a cartesian product between the two sets F1 and F2 and filter out the required data based on some function F(F1, F2). What i mean is say you have two files each with 10Lakh lines, then to perform a set join you essentially do 100Lakh operations and filter phase works on these 100Lakh results to filter out the required ones. Hence such a problem being exponentially inreasing in input size, it is helpful if you know how to Set-Join funciton works. having such insight is helpful. Though I have to admit, that these kind of problems are still under active reasearch, please refer links below for more detail: 1. http://www.youtube.com/watch?v=kiuUGXWRzPA - google tech talks 2. http://www.slideshare.net/rvernica/efficient-parallel-setsimilarity-joins-using-mapreduce-sigmod-2010-slides 3. http://research.microsoft.com/apps/pubs/default.aspx?id=76165 4. http://www.slideshare.net/ydn/4-similarity-joinshadoopsummit2010@Distributed cahce: Its not great if you have huge files. By Default you have to size limit of 10GB as the max size for a distributed file @launching jobs inside a mapper: Not a great idea, because for every key value you will launch a job and so essentially you will end up launching very huge number of jobs. Absolutely No No. A bug in production can bring down the cluster. Also its difficult to track all these jobs. Thank, Praveen On Wed, Apr 4, 2012 at 6:17 PM, <[EMAIL PROTECTED]> wrote: > Hello Stuti > > The way you have explained it seems we can think about caching the file2 > already in nodes. > > -- Just out of context , In the same way replicated joins are being > handled in Pig in which one file (file2) to be joined is cached in the > memory by file1. 
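To make the cost argument in this message concrete, here is a minimal sketch in plain Java (no Hadoop dependency). It compares a naive set join, which checks every pair, with a hash-based join in the spirit of the replicated join mentioned in the quoted reply, where the smaller file is held in memory. The class name `JoinCostSketch`, the sample records, and the choice of F as simple equality on the first comma-separated field are all assumptions for illustration; a general similarity function F cannot always be served by a hash lookup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JoinCostSketch {
    // Number of pair comparisons made by the most recent naiveJoin call.
    public static long comparisons = 0;

    // Join key: the first comma-separated field of a line (an assumption).
    static String key(String line) {
        return line.split(",", 2)[0];
    }

    // Naive set join: tests every (left, right) pair, so |left| * |right| checks.
    public static List<String> naiveJoin(List<String> left, List<String> right) {
        comparisons = 0;
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                comparisons++;
                if (key(l).equals(key(r))) {
                    out.add(l + "|" + r);
                }
            }
        }
        return out;
    }

    // Hash join: index the smaller side in memory once, then probe it per record.
    public static List<String> hashJoin(List<String> left, List<String> right) {
        Map<String, List<String>> index = new HashMap<>();
        for (String r : right) {
            index.computeIfAbsent(key(r), k -> new ArrayList<>()).add(r);
        }
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (String r : index.getOrDefault(key(l), List.of())) {
                out.add(l + "|" + r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> file1 = List.of("a,1", "b,2", "c,3");
        List<String> file2 = List.of("b,x", "c,y", "d,z");
        System.out.println(naiveJoin(file1, file2));
        System.out.println("pair comparisons: " + comparisons);
        System.out.println(hashJoin(file1, file2));
    }
}
```

The naive version does one comparison per pair, which is where the 10^12 figure for two million-line files comes from. The hash version only works while the indexed side fits in memory, which is the same limitation raised for the distributed cache.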
-
RE: Calling one MR job within another MR job
Stuti Awasthi 2012-04-05, 06:07
Thanks everyone,

From this discussion, I got 2 main opinions:

1. Do not call one MR job from inside another MR job.
2. The distributed cache can be used (but it is not good for very large files).

I want to design the system so that I can do the processing efficiently. Suppose I run an MR job to process File2 first and store its data in key-value format in HDFS. Once this job is complete, I start another MR job to process File1. Each input line of File1 will need some data from the output of the first MR job.

1. The normal way is: for each input line of the 2nd MR job, loop through the contents of the output from MR job 1 and get the relevant data for processing.
2. Since I have stored the output of File2 in key-value format, can I directly get the value for a specific key?

So I want to know: if I have output1 in key-value format in HDFS, and I run a separate job with a different input file that needs to access data from output1 by key, can we achieve that without looping through output1?

Thanks
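The question of fetching a value by key without looping over the whole output can be sketched as follows. This is a minimal plain-Java illustration, assuming the first job's output is kept sorted by key so that a binary search replaces the full scan. The class name `SortedLookupSketch` and the in-memory lists are hypothetical stand-ins for data stored in HDFS. Hadoop's `MapFile` format is built on a similar sorted-keys-plus-index idea, though nobody in this thread proposes it and the sketch does not use its API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SortedLookupSketch {
    // Parallel lists: keys in sorted order, values at the same positions.
    private final List<String> keys = new ArrayList<>();
    private final List<String> values = new ArrayList<>();

    // Sort the key-value pairs once, as the first job would when writing them.
    public SortedLookupSketch(Map<String, String> output1) {
        for (Map.Entry<String, String> e : new TreeMap<>(output1).entrySet()) {
            keys.add(e.getKey());
            values.add(e.getValue());
        }
    }

    // Binary search over the sorted keys: O(log n) per lookup, no full scan.
    public String get(String key) {
        int pos = Collections.binarySearch(keys, key);
        return pos >= 0 ? values.get(pos) : null;
    }

    public static void main(String[] args) {
        SortedLookupSketch output1 =
            new SortedLookupSketch(Map.of("k2", "v2", "k1", "v1", "k3", "v3"));
        System.out.println(output1.get("k2"));
        System.out.println(output1.get("missing"));
    }
}
```

Option 1 in the message costs one pass over output1 per input line, while a sorted lookup costs a logarithmic number of steps per line. On a real cluster each lookup would still involve file reads, so the gain depends on how the data is laid out and cached.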
-
Re: Calling one MR job within another MR job
Ashwanth Kumar 2012-04-05, 06:58
What I understand is that you are looking up a value based on a key. I guess you should look at a key-value datastore (like Voldemort). But again, accessing the datastore for each key in the 2nd MR job would be a costly operation, which might require additional tuning of the datastore.

PS - I am not sure if this is a good practice.

-- Ashwanth Kumar / ashwanthkumar.in
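One common way to soften the per-key access cost mentioned above is to put a small in-memory cache in front of the datastore, so that repeated keys do not trigger repeated round trips. The following is a minimal sketch under stated assumptions: `FakeStore` is a hypothetical stand-in that only counts simulated round trips and is not the Voldemort client API, and the LRU policy and capacity are illustrative choices, not a recommendation from the thread.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class CachedStoreSketch {
    // Hypothetical stand-in for a remote key-value store; counts round trips.
    public static class FakeStore {
        public final Map<String, String> data = new HashMap<>();
        public int roundTrips = 0;

        public String get(String key) {
            roundTrips++; // each call models one network round trip
            return data.get(key);
        }
    }

    private final FakeStore store;
    private final Map<String, String> cache;

    public CachedStoreSketch(FakeStore store, final int capacity) {
        this.store = store;
        // Access-ordered LinkedHashMap that evicts the least recently used entry.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > capacity;
            }
        };
    }

    // Serve from the cache when possible; otherwise ask the store and remember it.
    public String get(String key) {
        String value = cache.get(key);
        if (value == null) {
            value = store.get(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        store.data.put("a", "1");
        store.data.put("b", "2");
        CachedStoreSketch cached = new CachedStoreSketch(store, 2);
        for (String k : new String[] {"a", "a", "b", "a"}) {
            cached.get(k);
        }
        System.out.println("round trips: " + store.roundTrips);
    }
}
```

A cache like this only helps when keys repeat within a task. If every input line asks for a different key, each lookup still reaches the store, which is the costly case described in the message.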