Pig, mail # user - Re: too many memory spills


Norbert Burger 2013-03-08, 02:47
I thought Todd Lipcon's Hadoop Summit presentation [1] had some good info
on this topic.

[1] http://www.slideshare.net/cloudera/mr-perf

Norbert

On Thu, Mar 7, 2013 at 7:25 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:

> You can do a few things here:
>
>    1. Increase mapred.child.java.opts to a higher number (the default is
>    200 MB). You will have to do this while making sure that (# of MR
>    slots per node × mapred.child.java.opts + 387 MB) stays under 4 GB.
>    Maybe you want to stay under 3.5 GB, based on whatever else is running
>    on those nodes.
>    2. Increase "mapred.job.shuffle.input.buffer.percent" to have more heap
>    available for the shuffle.
>    3. Set mapred.inmem.merge.threshold to 0
>    and mapred.job.reduce.input.buffer.percent to 0.8.
>
> You will have to play around with these to see what works for your needs.
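>
> As a rough illustration of the math in (1): with 6 task slots per node
> and a 512 MB heap per task, 6 × 512 MB + 387 MB is roughly 3.4 GB, which
> stays under the 3.5 GB ceiling. And a minimal sketch of setting these
> properties per-script in Pig (the slot count, heap size, and shuffle
> buffer value here are illustrative assumptions, not recommendations):
>
>     -- illustrative values only; tune against your own workload
>     set mapred.child.java.opts '-Xmx512m';
>     set mapred.job.shuffle.input.buffer.percent '0.90';
>     set mapred.inmem.merge.threshold '0';
>     set mapred.job.reduce.input.buffer.percent '0.8';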
>
> You can additionally refer to "Hadoop: The Definitive Guide" for tips on
> config tuning.
>
> On Thu, Mar 7, 2013 at 1:01 PM, Panshul Whisper <[EMAIL PROTECTED]> wrote:
>
> > Hello Prashant,
> >
> > I have a CDH installation and by default memory allocated to each task
> > tracker is 387 MB.
> > And yes these spills are happening on Map and Reduce side.
> >
> > I still haven't solved this problem...
> >
> > Suggestions are welcome.
> >
> > Thanking You,
> >
> > Regards,
> >
> >
> > On Thu, Mar 7, 2013 at 9:05 AM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
> >
> > > Are these spills happening on map or reduce side? What is the memory
> > > allocated to each TaskTracker?
> > >
> > > On Wed, Mar 6, 2013 at 6:28 AM, Panshul Whisper <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hello,
> > > >
> > > > I have a 9 GB file with approximately 109.5 million records.
> > > > I execute a Pig script on this file that does the following (a sketch
> > > > follows the list):
> > > > 1. Group by on a field of the file
> > > > 2. Count the number of records in every group
> > > > 3. Store the result in a CSV file using the normal PigStorage(",")
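> > > >
> > > > A minimal Pig sketch of the script described above (the relation
> > > > names, paths, and grouping field are hypothetical; the real script
> > > > may differ):
> > > >
> > > >     -- load the file; the grouping field is assumed to be the first column
> > > >     recs = LOAD 'input' USING PigStorage(',');
> > > >     grps = GROUP recs BY $0;
> > > >     cnts = FOREACH grps GENERATE group, COUNT(recs);
> > > >     STORE cnts INTO 'output' USING PigStorage(',');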
> > > >
> > > > The job completes successfully, but the job details show a lot of
> > > > memory spills. *Out of 109.5 million records, it shows approximately
> > > > 48 million records spilled.*
> > > >
> > > > I am executing it on a *4-node cluster, each node with a dual-core
> > > > processor and 4 GB of RAM*.
> > > >
> > > > How can I minimize the number of record spills? It makes the
> > > > execution really slow once the spilling starts.
> > > >
> > > > Any suggestions are welcome.
> > > >
> > > > Thanking You,
> > > >
> > > > --
> > > > Regards,
> > > > Ouch Whisper
> > > > 010101010101
> > > >
> > >
> >
> >
> >
> > --
> > Regards,
> > Ouch Whisper
> > 010101010101
> >
>