MapReduce >> mail # user >> Mapper Record Spillage

Re: Mapper Record Spillage
Actually, if you set io.sort.mb to 2048, your map tasks will always
fail, because the maximum io.sort.mb is hard-coded to 2047. So if
you think you've set 2048 and your tasks aren't failing, then you
probably haven't actually changed io.sort.mb. Double-check which
configuration settings the JobTracker actually saw by looking at

$ hadoop fs -cat hdfs://<JOB_OUTPUT_DIR>/_logs/history/*.xml | grep
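
To illustrate why a limit of 2047 is a natural one, here is a minimal sketch, not the Hadoop source. It assumes the limit is enforced with an 11-bit mask and that the buffer size in bytes is held in an int (a Java array length must be an int). The class and method names are hypothetical.

```java
// Sketch only: mirrors the kind of range check described above.
// SortMbCheck, isValid and toBytes are hypothetical names, not Hadoop APIs.
class SortMbCheck {
    // Largest value that fits in 11 bits; 2047 is the cap quoted in this thread.
    static final int MAX_SORT_MB = 0x7FF;

    // True only when sortMb lies between 0 and 2047 inclusive.
    static boolean isValid(int sortMb) {
        return (sortMb & MAX_SORT_MB) == sortMb;
    }

    // Converts megabytes to bytes using int arithmetic, which can overflow.
    static int toBytes(int sortMb) {
        return sortMb << 20;
    }

    public static void main(String[] args) {
        for (int mb : new int[] {2000, 2047, 2048}) {
            System.out.println(mb + " MB -> valid=" + isValid(mb)
                    + ", bytes as int=" + toBytes(mb));
        }
    }
}
```

With 2048 the shift overflows to a negative int, so a single byte array of that size cannot be requested no matter how large -Xmx is.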

On 2012/03/11 22:38, Harsh J wrote:
> Hans,
> I don't think io.sort.mb can support a whole 2048 value (it builds one
> array with the size, and JVM may not be allowing that). Can you lower
> it to 2000 ± 100 and try again?
> On Sun, Mar 11, 2012 at 1:36 PM, Hans Uhlig<[EMAIL PROTECTED]>  wrote:
>> If that is the case, then these two lines should provide more than enough
>> memory on a virtually unused cluster:
>> job.getConfiguration().setInt("io.sort.mb", 2048);
>> job.getConfiguration().set("mapred.map.child.java.opts", "-Xmx3072M");
>> A conversion from 1 GB of CSV text to binary primitives should then fit
>> easily, but Java still throws a heap error even when there is 25 GB of
>> memory free.
>> On Sat, Mar 10, 2012 at 11:50 PM, Harsh J<[EMAIL PROTECTED]>  wrote:
>>> Hans,
>>> You can change memory requirements for tasks of a single job, but not
>>> of a single task inside that job.
>>> This is briefly how the 0.20 framework (by default) works: TT has
>>> notions only of "slots", and carries a maximum _number_ of
>>> simultaneous slots it may run. It does not know what each task,
>>> occupying one slot, would demand in resource terms. Your job then
>>> supplies a # of map tasks, and amount of memory required per map task
>>> in general, as a configuration. TTs then merely start the task JVMs
>>> with the provided heap configuration.
>>> On Sun, Mar 11, 2012 at 11:24 AM, Hans Uhlig<[EMAIL PROTECTED]>  wrote:
>>>> That was a typo in my email, not in the configuration. Is the memory
>>>> reserved for the tasks when the task tracker starts? You seem to be
>>>> suggesting that I need to set the memory to be the same for all map
>>>> tasks. Is there no way to override for a single map task?
>>>> On Sat, Mar 10, 2012 at 8:41 PM, Harsh J<[EMAIL PROTECTED]>  wrote:
>>>>> Hans,
>>>>> It's possible you have a typo: mapred.map.child.jvm.opts -
>>>>> no such property exists. Perhaps you wanted
>>>>> "mapred.map.child.java.opts"?
>>>>> Additionally, the computation you need to do is: (# of map slots on a
>>>>> TT * per-map-task heap requirement) should be less than (Total RAM -
>>>>> 2 to 3 GB). With your 4 GB requirement, I guess you can support a max of
>>>>> 6-7 slots per machine (i.e., not counting reducer heap requirements in
>>>>> parallel).
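
The slot arithmetic above can be sketched as follows. This is a rough illustration, assuming whole-GB figures and the 32 GB nodes mentioned in the original question. The class name, method name and the "reserved" parameter (the 2 to 3 GB left for the OS and daemons) are hypothetical.

```java
// Sketch only: estimates how many map slots fit on one TaskTracker node.
// SlotEstimate and maxMapSlots are hypothetical names, not Hadoop APIs.
class SlotEstimate {
    // Largest slot count such that slots * heapPerTaskGb does not exceed
    // totalRamGb - reservedGb. Reducer heap is not accounted for.
    static int maxMapSlots(int totalRamGb, int reservedGb, int heapPerTaskGb) {
        if (heapPerTaskGb <= 0) {
            throw new IllegalArgumentException("heapPerTaskGb must be positive");
        }
        int usableGb = totalRamGb - reservedGb;
        return usableGb <= 0 ? 0 : usableGb / heapPerTaskGb;
    }

    public static void main(String[] args) {
        // 32 GB node, 4 GB per map task, reserving 2 GB or 3 GB.
        System.out.println("reserve 2 GB: " + maxMapSlots(32, 2, 4) + " slots");
        System.out.println("reserve 3 GB: " + maxMapSlots(32, 3, 4) + " slots");
    }
}
```

Both cases come out at 7 slots, the upper end of the 6-7 estimate; leaving room for reducers running in parallel would push the number lower.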
>>>>> On Sun, Mar 11, 2012 at 9:30 AM, Hans Uhlig<[EMAIL PROTECTED]>  wrote:
>>>>>> I am attempting to speed up a mapping process whose input is
>>>>>> GZIP-compressed CSV files. The files range from 1-2 GB, and I am
>>>>>> running on a cluster where each node has a total of 32 GB of memory
>>>>>> available to use. I have attempted to tweak mapred.map.child.jvm.opts
>>>>>> with -Xmx4096mb and io.sort.mb to 2048 to accommodate the size, but I
>>>>>> keep getting Java heap errors or other memory-related problems. My row
>>>>>> count per mapper is below the Integer.MAX_VALUE limit by several
>>>>>> orders of magnitude, and the box is NOT using anywhere close to its
>>>>>> full memory allotment. How can I specify that this map task can have
>>>>>> 3-4 GB of memory for the collection, partition and sort process
>>>>>> without constantly spilling records to disk?
>>>>> --
>>>>> Harsh J
>>> --
>>> Harsh J