-
How can I set the mapper number for pig script? Sheng Guo 2012-06-23, 02:27
Hi everyone,
Sorry to bother you. How can I configure the number of mappers for my Pig script?
Thanks a lot!
Sheng
-
Re: How can I set the mapper number for pig script? Jagat Singh 2012-06-23, 04:31
The number of mappers is selected automatically by Hadoop, based on the input size.
-
Re: How can I set the mapper number for pig script? Sheng Guo 2012-06-23, 07:30
I know it is set automatically. But I have a large data set, and I want it to allocate more mappers around midnight, so that more computing resources can be used to speed the job up.
Any suggestions?
Thanks,
-
Re: How can I set the mapper number for pig script? Stan Rosenberg 2012-06-23, 15:13
On Sat, Jun 23, 2012 at 3:30 AM, Sheng Guo <[EMAIL PROTECTED]> wrote:
> I know it is automatically set. But I have a large data set, I want it
> allocate more mappers during midnight so that more computing resource could
> be used to speed up.
> Any suggestions?

Pig uses CombineInputFormat by default, which attempts to combine a set of physical input splits into one logical input split. I use the following setting to control the number of mappers in some of my benchmarking scripts:

-- combine up to this many bytes into a composite input split, i.e., per mapper
SET pig.maxCombinedSplitSize 250000000;

Note that the absolute minimum split size is constrained by the smallest block size in your input set.
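As a rough illustration of the sizing arithmetic behind pig.maxCombinedSplitSize, the sketch below estimates how many mappers a job gets for a given combined split size, and conversely which split size to set for a target mapper count. The function names and the 100 GB input are hypothetical. The estimate assumes one mapper per combined split and ignores data locality and block boundaries, so the real count will differ somewhat.

```python
def estimate_mappers(total_input_bytes, max_combined_split_bytes):
    """Estimate mappers as ceil(total / split size), using integer math."""
    return (total_input_bytes + max_combined_split_bytes - 1) // max_combined_split_bytes


def split_size_for(total_input_bytes, target_mappers):
    """Split size (in bytes) that yields roughly target_mappers mappers."""
    return (total_input_bytes + target_mappers - 1) // target_mappers


total = 100 * 10**9  # hypothetical 100 GB input

# Mappers with the 250000000-byte setting from the message above.
print(estimate_mappers(total, 250_000_000))  # 400

# Value to try for pig.maxCombinedSplitSize when aiming at about 1000 mappers.
print(split_size_for(total, 1000))  # 100000000
```

A smaller value gives more mappers, but only down to the block size of the input, as noted above.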
-
Re: How can I set the mapper number for pig script? Scott Foster 2012-06-23, 16:40
You can also turn off split combination completely, and then the number of mappers will equal the number of blocks:

SET pig.noSplitCombination false;

Adding mappers may not make your process run faster, since the time to read the data may be less than the overhead of creating a new JVM for each map task.

scott.
-
Re: How can I set the mapper number for pig script? Sheng Guo 2012-06-23, 20:48
Thanks for all your help.
My Pig script may include some CPU-intensive work such as NLP processing, so it would help to have multiple mappers running. Correct me if I am wrong.
Thanks,
Sheng
-
Re: How can I set the mapper number for pig script? Yang 2012-06-23, 21:58
Hi Sheng,

I had exactly the same problem as you did.

Right now, with Hadoop 0.20 and above, you can't do it anymore, because the new mapreduce.lib.input.FileInputFormat dropped the original mapred.map.tasks control that was used to compute the goalSize in the getSplits() method. The old mapred.FileInputFormat class had this control.

I submitted https://issues.apache.org/jira/browse/HADOOP-8503 to add this control back.

Because Pig actually compiles some Hadoop classes into its own jar, including this FileInputFormat class, you could work around this by patching your own Hadoop jar, building Pig with that jar, and then using your rebuilt Pig in production. You need to make sure to use the full Pig jar instead of pig-withouthadoop.jar.

You can also achieve part of the same goal by setting mapred.max.split.size, but this is rather inflexible: if your Pig script generates several MR jobs, the same split size will hold for all of them, which may not be ideal if one stage consumes a lot more input data than another.

Yang
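The difference Yang describes can be sketched as follows. To my understanding, the old mapred.FileInputFormat derives a goal size from the requested number of map tasks, while the new mapreduce.lib.input.FileInputFormat uses a configured maximum split size in its place. The Python below paraphrases that logic for illustration only and is not the Hadoop source. The function names and the sample sizes are assumptions.

```python
MB = 1024 * 1024


def old_split_size(total_size, requested_maps, block_size, min_size=1):
    """Old API sketch: the requested map count sets a goal size per split."""
    goal_size = total_size // max(requested_maps, 1)
    return max(min_size, min(goal_size, block_size))


def new_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """New API sketch: no goal size, only a configurable maximum split size."""
    return max(min_size, min(max_size, block_size))


total = 1024 * MB  # hypothetical 1 GiB input
block = 64 * MB    # hypothetical 64 MiB block size

# Old API: asking for 64 maps shrinks each split to 16 MiB.
print(old_split_size(total, 64, block) // MB)  # 16

# New API: the split stays at the block size unless a maximum is configured.
print(new_split_size(block) // MB)  # 64
print(new_split_size(block, max_size=16 * MB) // MB)  # 16
```

In both sketches the split size never exceeds the block size, so these controls can raise the mapper count above the block count but cannot lower it.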
-
Re: How can I set the mapper number for pig script? John Meagher 2012-06-23, 23:15
Another option is either to reduce the block size of the input data, or to disable the combine input format and split the data into more files.
-
Re: How can I set the mapper number for pig script? Scott Foster 2012-06-26, 23:47
You are right that if you have a CPU-intensive mapper, having more mappers will help. As suggested, you can reduce the block size of the files you are processing and disable split combination, and you'll end up with more mappers in your job.

One correction to my previous email: the setting to turn off split combination is:

SET pig.noSplitCombination true;

scott.
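To make the effect of block size concrete, here is a minimal sketch that counts mappers when split combination is off, assuming one mapper per block and non-empty files. The function name and the file sizes are hypothetical, and the real count can be off by one per file because Hadoop allows the last split to be slightly larger than the others.

```python
MB = 1024 * 1024


def mappers_without_combination(file_sizes, block_size):
    """One mapper per block: sum of ceil(size / block_size) over all files."""
    return sum((size + block_size - 1) // block_size for size in file_sizes)


files = [200 * MB, 10 * MB, 64 * MB]  # hypothetical input files

# 64 MiB blocks: 4 + 1 + 1 blocks.
print(mappers_without_combination(files, 64 * MB))  # 6

# 32 MiB blocks: 7 + 1 + 2 blocks, so more mappers for the same data.
print(mappers_without_combination(files, 32 * MB))  # 10
```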