HDFS >> mail # user >> too many memory spills


I have a 9 GB file containing approximately 109.5 million records.
I run a Pig script on this file that does the following:
1. Group by on a field of the file
2. Count number of records in every group
3. Store the result in a CSV file using normal PigStorage(",")
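The steps above correspond to a script of roughly this shape (a minimal sketch; the relation names, the field name `key`, the schema, and the input/output paths are assumptions, since the post does not include the actual script):

```pig
-- Hypothetical paths and schema: the original post does not give them.
records = LOAD 'input/data' USING PigStorage(',')
          AS (key:chararray, value:chararray);

-- Step 1: group by a field of the file
grouped = GROUP records BY key;

-- Step 2: count the records in every group
counts  = FOREACH grouped GENERATE group, COUNT(records);

-- Step 3: store the result as CSV with PigStorage(",")
STORE counts INTO 'output/counts' USING PigStorage(',');
```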

The job completes successfully, but the job details show a large number of
memory spills: out of 109.5 million records, approximately 48 million are
reported as spilled.

I am running it on a 4-node cluster; each node has a dual-core processor and
4 GB of RAM.

How can I minimize the number of record spills? Execution becomes very slow
once spilling starts.

Any suggestions are welcome.

Thank you,

Ouch Whisper