|
Hans Uhlig
2012-03-11, 04:00
WangRamon
2012-03-11, 04:05
Hans Uhlig
2012-03-11, 04:08
Harsh J
2012-03-11, 04:41
Hans Uhlig
2012-03-11, 05:54
Harsh J
2012-03-11, 07:50
Hans Uhlig
2012-03-11, 08:06
Harsh J
2012-03-11, 13:38
Harsh J
2012-03-11, 13:39
George Datskos
2012-03-13, 02:02
|
-
Mapper Record Spillage (Hans Uhlig, 2012-03-11, 04:00)
I am attempting to speed up a mapping process whose input is GZIP-compressed CSV files. The files range from 1 to 2 GB, and I am running on a cluster where each node has a total of 32 GB of memory available. I have tried setting mapred.map.child.jvm.opts to -Xmx4096mb and io.sort.mb to 2048 to accommodate the size, but I keep getting Java heap errors or other memory-related problems. My row count per mapper is several orders of magnitude below the Integer.MAX_VALUE limit, and the box is NOT using anywhere close to its full memory allotment. How can I specify that this map task can have 3-4 GB of memory for the collection, partition, and sort process without constantly spilling records to disk?
Hans Uhlig 2012-03-11, 04:00
-
RE: Mapper Record Spillage (WangRamon, 2012-03-11, 04:05)
How many map/reduce task slots do you have on each node? If the total is 10, you will use 10 * 4096 MB of memory when all tasks are running, which is more than the 32 GB of total memory you have on each node.
WangRamon 2012-03-11, 04:05
-
Re: Mapper Record Spillage (Hans Uhlig, 2012-03-11, 04:08)
I am attempting to specify this for a single job during its creation/submission, not via the general configuration. I am using the new API, so I am adding the values to the conf passed into new Job().
Hans Uhlig 2012-03-11, 04:08
-
Re: Mapper Record Spillage (Harsh J, 2012-03-11, 04:41)
Hans,
It's possible you have a typo: mapred.map.child.jvm.opts. No such property exists. Perhaps you wanted "mapred.map.child.java.opts"?
Additionally, the computation you need is this: (# of map slots on a TT * per-map-task heap requirement) should be less than (total RAM - 2 to 3 GB). With your 4 GB requirement, I guess you can support a maximum of 6-7 slots per machine, and that is without counting the heap requirements of reducers running in parallel.
Harsh J 2012-03-11, 04:41
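The slot arithmetic in the reply above can be checked with a few lines of Java. This is only a sketch: the class and method names (SlotBudget, maxMapSlots, fits) are made up for illustration and are not part of Hadoop, and the 2 to 3 GB reserve is simply the figure suggested in the thread for the OS and daemons.

```java
// Illustrative helper, not a Hadoop API: checks the rule of thumb
//   slots * heapPerTask < totalRam - reserved
final class SlotBudget {

    /** Largest slot count n such that n * heapPerTaskGb is strictly less than (totalRamGb - reservedGb). */
    static int maxMapSlots(int totalRamGb, int reservedGb, int heapPerTaskGb) {
        int usableGb = totalRamGb - reservedGb;
        if (usableGb <= 0 || heapPerTaskGb <= 0) {
            return 0;
        }
        // Integer division on (usable - 1) enforces the strict "<" from the thread.
        return (usableGb - 1) / heapPerTaskGb;
    }

    /** True when the given number of slots stays inside the budget. */
    static boolean fits(int slots, int heapPerTaskGb, int totalRamGb, int reservedGb) {
        return (long) slots * heapPerTaskGb < (long) totalRamGb - reservedGb;
    }

    public static void main(String[] args) {
        // 32 GB node, 4 GB heap per map task, 2 or 3 GB held back.
        System.out.println(maxMapSlots(32, 2, 4)); // 7
        System.out.println(maxMapSlots(32, 3, 4)); // 7
        // The 10-slot case raised earlier in the thread: 40 GB of heap on a 32 GB node.
        System.out.println(fits(10, 4, 32, 2));    // false
    }
}
```

Seven is the upper bound under these assumptions; the lower figure of 6 mentioned in the thread leaves extra room, for example for reducers running at the same time.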
-
Re: Mapper Record Spillage (Hans Uhlig, 2012-03-11, 05:54)
That was a typo in my email, not in the configuration. Is the memory reserved for the tasks when the task tracker starts? You seem to be suggesting that I need to set the memory to be the same for all map tasks. Is there no way to override it for a single map task?
Hans Uhlig 2012-03-11, 05:54
-
Re: Mapper Record Spillage (Harsh J, 2012-03-11, 07:50)
Hans,
You can change the memory requirements for the tasks of a single job, but not for a single task inside that job.
Briefly, this is how the 0.20 framework works by default: a TT has a notion only of "slots" and carries a maximum _number_ of slots it may run simultaneously. It does not know what each task occupying a slot will demand in resource terms. Your job then supplies the number of map tasks and, as configuration, the amount of memory required per map task in general. The TTs then merely start the task JVMs with the provided heap configuration.
Harsh J 2012-03-11, 07:50
-
Re: Mapper Record Spillage (Hans Uhlig, 2012-03-11, 08:06)
If that is the case, then these two lines should provide more than enough memory on a virtually unused cluster:
job.getConfiguration().setInt("io.sort.mb", 2048);
job.getConfiguration().set("mapred.map.child.java.opts", "-Xmx3072M");
A conversion from 1 GB of CSV text to binary primitives should then fit easily, but Java still throws a heap error even when there is 25 GB of memory free.
Hans Uhlig 2012-03-11, 08:06
-
Re: Mapper Record Spillage (Harsh J, 2012-03-11, 13:38)
Hans,
I don't think io.sort.mb can support a full 2048 value (it builds one array of that size, and the JVM may not allow that). Can you lower it to 2000 ± 100 and try again?
Harsh J 2012-03-11, 13:38
-
Re: Mapper Record Spillage (Harsh J, 2012-03-11, 13:39)
(Er, not sure how that ± got in there. I meant to type -100, lowered further if it continues to show problems.)
Harsh J 2012-03-11, 13:39
-
Re: Mapper Record Spillage (George Datskos, 2012-03-13, 02:02)
Actually, if you set {io.sort.mb} to 2048, your map tasks will always fail. The maximum {io.sort.mb} is hard-coded to 2047. This means that if you think you've set 2048 and your tasks aren't failing, you probably haven't actually changed io.sort.mb. Double-check which configuration settings the JobTracker actually saw by looking at:
$ hadoop fs -cat hdfs://<JOB_OUTPUT_DIR>/_logs/history/*.xml | grep io.sort.mb
George
George Datskos 2012-03-13, 02:02
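The 2047 ceiling discussed in the last few replies follows from Java's int arithmetic: a single byte array is indexed by an int, and 2048 MB expressed in bytes no longer fits in one. The sketch below illustrates this. The class SortBufferCheck and its methods are hypothetical names, and the 11-bit mask is an assumption modeled on the hard-coded limit described in the thread, not a copy of the framework's code.

```java
// Illustrative only: shows why 2047 is the largest io.sort.mb value
// whose size in bytes still fits in a Java int.
final class SortBufferCheck {

    /** 0x7FF == 2047, the largest value that fits in 11 bits (assumed limit). */
    static final int MAX_SORT_MB = 0x7FF;

    /** True for 0..2047; false for negative values and anything of 2048 or more. */
    static boolean isValidSortMb(int sortMb) {
        return (sortMb & MAX_SORT_MB) == sortMb;
    }

    /** Buffer size in bytes using int arithmetic, the way an array length is computed. */
    static int bufferBytesAsInt(int sortMb) {
        return sortMb << 20; // multiply by 1,048,576; overflows for sortMb >= 2048
    }

    public static void main(String[] args) {
        for (int mb : new int[] {2047, 2048}) {
            System.out.println(mb + " MB: valid=" + isValidSortMb(mb)
                    + ", bytes as int=" + bufferBytesAsInt(mb));
        }
        // 2047 MB: valid=true, bytes as int=2146435072
        // 2048 MB: valid=false, bytes as int=-2147483648
    }
}
```

Even at 2047, the buffer would take up about two thirds of the -Xmx3072M heap from the earlier message, so a noticeably smaller value leaves more headroom for the mapper's own objects.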
|