I have a 9 GB file containing approximately 109.5 million records.
I run a Pig script on this file that does the following (a sketch of the script is shown after the list):
1. Group the records by one field of the file
2. Count the number of records in each group
3. Store the result in a CSV file using plain PigStorage(",")
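For reference, the script is essentially the following. This is a minimal sketch; the relation names, field names, and paths are placeholders, not the actual ones from my job.

```pig
-- Load the 9 GB input; 'key' stands in for the field I group on,
-- 'rest' for the remaining columns.
records = LOAD 'input/bigfile' USING PigStorage(',') AS (key:chararray, rest:chararray);

-- Group by the field and count records per group.
grouped = GROUP records BY key;
counts  = FOREACH grouped GENERATE group, COUNT(records);

-- Write the (group, count) pairs out as CSV.
STORE counts INTO 'output/counts' USING PigStorage(',');
```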
The job completes successfully, but the job details show a lot of memory spills: out of 109.5 million records, it reports approximately 48 million spilled records. I am running it on a 4-node cluster, each node with a dual-core processor.
How can I minimize the number of record spills? Execution becomes very slow once the spilling starts.
Any suggestions are welcome.