Are these spills happening on map or reduce side? What is the memory
allocated to each TaskTracker?
On Wed, Mar 6, 2013 at 6:28 AM, Panshul Whisper <[EMAIL PROTECTED]> wrote:
> I have a 9 GB file with approximately 109.5 million records.
> I execute a pig script on this file that is doing:
> 1. Group by on a field of the file
> 2. Count number of records in every group
> 3. Store the result in a CSV file using normal PigStorage(",")
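>
> The three steps above correspond roughly to a script like the following
> (the input path, output path, and field position are assumptions, not my
> actual script):
>
> ```pig
> -- load the 9 GB comma-separated file (schema assumed)
> records = LOAD 'input/data.csv' USING PigStorage(',');
> -- 1. group by a field of the file (here the first field, $0)
> grouped = GROUP records BY $0;
> -- 2. count the records in every group
> counts  = FOREACH grouped GENERATE group, COUNT(records);
> -- 3. store the result as CSV
> STORE counts INTO 'output/counts' USING PigStorage(',');
> ```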
> The job is completed successfully but the job details show a lot of memory
> spills. *Out of 109.5 million records, it shows approximately 48 million
> records spilled.*
> I am executing it on a *4-node cluster, each node with a dual-core
> processor and 4 GB RAM*.
> How can I minimize the number of record spills? Execution slows down
> considerably once spilling starts.
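>
> Map-side spills happen when the sort buffer fills before the map output
> is written, so the buffer settings are the usual knobs. A sketch of the
> Hadoop 1.x properties commonly tuned for this (the values shown are
> illustrative assumptions, not recommendations for this cluster):
>
> ```xml
> <!-- mapred-site.xml -->
> <property>
>   <name>io.sort.mb</name>
>   <value>256</value>          <!-- larger sort buffer: fewer spill passes -->
> </property>
> <property>
>   <name>io.sort.spill.percent</name>
>   <value>0.90</value>         <!-- fraction of the buffer that triggers a spill -->
> </property>
> ```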
> Any suggestions are welcome.
> Thanking You,
> Ouch Whisper