I have a CDH installation, and by default the memory allocated to each
TaskTracker is 387 MB.
And yes, these spills are happening on both the map and the reduce side.
I still haven't solved this problem...
Suggestions are welcome.
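In case it helps others hitting the same issue: map-side spills happen when the in-memory sort buffer (io.sort.mb, 100 MB by default in Hadoop 1.x / CDH) fills up, so one common mitigation is to enlarge that buffer and the task heap that holds it. A rough sketch of the settings I am experimenting with, set directly from the Pig script (the exact values are guesses for 4 GB nodes, not tested recommendations):

```pig
-- Raise the task JVM heap so a larger sort buffer fits inside it
set mapred.child.java.opts '-Xmx1024m';

-- Enlarge the map-side sort buffer (default 100 MB) and let it fill
-- further before a background spill to disk is triggered
set io.sort.mb 512;
set io.sort.spill.percent 0.90;

-- Fewer parallel reduce tasks leaves more heap per task on small nodes
set default_parallel 4;
```

Note that io.sort.mb must fit inside the task heap set by mapred.child.java.opts, and with 4 GB of RAM and two cores per node there is only room for a few such tasks running concurrently.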
On Thu, Mar 7, 2013 at 9:05 AM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
> Are these spills happening on map or reduce side? What is the memory
> allocated to each TaskTracker?
> On Wed, Mar 6, 2013 at 6:28 AM, Panshul Whisper <[EMAIL PROTECTED]> wrote:
> > Hello,
> > I have a file of size 9GB and having approximately 109.5 million records.
> > I execute a pig script on this file that is doing:
> > 1. Group by on a field of the file
> > 2. Count number of records in every group
> > 3. Store the result in a CSV file using normal PigStorage(",")
> > The job completes successfully, but the job details show a lot of
> > spills. *Out of 109.5 million records, approximately 48 million
> > records are spilled.*
> > I am executing it on a *4-node cluster, each node with a dual-core
> > processor and 4 GB of RAM*.
> > How can I minimize the number of record spills? Execution becomes
> > very slow once the spilling starts.
> > Any suggestions are welcome.
> > Thanking You,
> > --
> > Regards,
> > Ouch Whisper
> > 010101010101