MapReduce, mail # user - Mapper Record Spillage


Re: Mapper Record Spillage
Hans Uhlig 2012-03-11, 08:06
If that is the case, then these two lines should provide more than enough
memory, on a virtually unused cluster:

job.getConfiguration().setInt("io.sort.mb", 2048);
job.getConfiguration().set("mapred.map.child.java.opts", "-Xmx3072M");

With these settings, a conversion from 1GB of CSV text to binary primitives
should fit easily, but Java still throws a heap error even when there is
25 GB of memory free.
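
As a rough sanity check on the two settings above: the io.sort.mb buffer is allocated inside the map task's JVM heap, and the JVM cannot grow past its -Xmx limit regardless of how much memory is free on the node. The sketch below only does that subtraction. The class and method names are hypothetical, chosen for illustration, and it is not a statement of how much heap the mapper code itself needs.

```java
public class HeapBudget {

    // Heap left over for everything other than the sort buffer, in MB.
    // Both values are per map task JVM.
    static int headroomMb(int heapMb, int sortBufferMb) {
        return heapMb - sortBufferMb;
    }

    public static void main(String[] args) {
        int heapMb = 3072;       // from -Xmx3072M
        int sortBufferMb = 2048; // from io.sort.mb

        System.out.println("Heap outside the sort buffer: "
                + headroomMb(heapMb, sortBufferMb) + " MB");
    }
}
```

Whatever the mapper holds in memory while converting records has to fit in that remainder, so a heap error is possible even on an otherwise idle node.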

On Sat, Mar 10, 2012 at 11:50 PM, Harsh J <[EMAIL PROTECTED]> wrote:

> Hans,
>
> You can change memory requirements for tasks of a single job, but not
> of a single task inside that job.
>
> This is briefly how the 0.20 framework (by default) works: the
> TaskTracker (TT) has notions only of "slots", and carries a maximum
> _number_ of simultaneous slots it may run. It does not know what each
> task, occupying one slot, would demand in resource terms. Your job then
> supplies a # of map tasks, and the amount of memory required per map
> task in general, as configuration. TTs then merely start the task JVMs
> with the provided heap configuration.
>
> On Sun, Mar 11, 2012 at 11:24 AM, Hans Uhlig <[EMAIL PROTECTED]> wrote:
> > That was a typo in my email, not in the configuration. Is the memory
> > reserved for the tasks when the task tracker starts? You seem to be
> > suggesting that I need to set the memory to be the same for all map
> > tasks. Is there no way to override it for a single map task?
> >
> >
> > On Sat, Mar 10, 2012 at 8:41 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> >>
> >> Hans,
> >>
> >> It's possible you have a typo: mapred.map.child.jvm.opts. No such
> >> property exists. Perhaps you wanted
> >> "mapred.map.child.java.opts"?
> >>
> >> Additionally, the computation you need to do is: (# of map slots on a
> >> TT * per-map-task heap requirement) should be less than (Total RAM -
> >> 2/3 GB). With your 4 GB requirement, I guess you can support a max of
> >> 6-7 slots per machine (i.e. not counting reducer heap requirements in
> >> parallel).
> >>
> >> On Sun, Mar 11, 2012 at 9:30 AM, Hans Uhlig <[EMAIL PROTECTED]> wrote:
> >> > I am attempting to speed up a mapping process whose input is GZIP
> >> > compressed CSV files. The files range from 1-2GB, and I am running on
> >> > a cluster where each node has a total of 32GB memory available to use.
> >> > I have attempted to tweak mapred.map.child.jvm.opts with -Xmx4096mb and
> >> > io.sort.mb to 2048 to accommodate the size, but I keep getting Java heap
> >> > errors or other memory-related problems. My row count per mapper is
> >> > below the Integer.MAX_VALUE limit by several orders of magnitude, and
> >> > the box is NOT using anywhere close to its full memory allotment. How
> >> > can I specify that this map task can have 3-4 GB of memory for the
> >> > collection, partition and sort process without constantly spilling
> >> > records to disk?
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
> >
>
>
>
> --
> Harsh J
>
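
The slot arithmetic from the quoted reply above (map slots * per-task heap must stay below total RAM minus a reserve) can be sketched as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, and "2/3 GB" is read here as "2 or 3 GB" held back for the OS and the Hadoop daemons. Reducer heap is ignored, as in the original estimate.

```java
public class SlotBudget {

    // Largest slot count s such that s * heapPerTaskGb < totalRamGb - reservedGb.
    // All values are whole gigabytes; heapPerTaskGb must be positive.
    static int maxMapSlots(int totalRamGb, int heapPerTaskGb, int reservedGb) {
        int usableGb = totalRamGb - reservedGb;
        // For integers, s * h < u is the same as s <= (u - 1) / h.
        int slots = (usableGb - 1) / heapPerTaskGb;
        return Math.max(slots, 0);
    }

    public static void main(String[] args) {
        int totalRamGb = 32;   // per-node memory stated in the thread
        int heapPerTaskGb = 4; // per-map-task heap stated in the thread

        // Assumed reserve of 2 GB or 3 GB (one reading of "2/3 GB").
        for (int reservedGb = 2; reservedGb <= 3; reservedGb++) {
            System.out.println("reserve " + reservedGb + " GB -> max map slots: "
                    + maxMapSlots(totalRamGb, heapPerTaskGb, reservedGb));
        }
    }
}
```

Under these assumptions the result lands at the upper end of the 6-7 slots estimated in the thread. Budgeting heap for reducers running in parallel would lower it.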