Pig >> mail # user >> 0.9.1 out of memory problem


RE: 0.9.1 out of memory problem
Nested DISTINCT is a killer. See

https://mail-archives.apache.org/mod_mbox/pig-user/201201.mbox/%[EMAIL PROTECTED]%3E

for a discussion of a simple workaround that worked for me.
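
For readers who cannot open the archive link, here is a rough sketch of one common way to avoid a nested DISTINCT. It is not necessarily the workaround from the linked thread. It assumes the aliases dump, wset and bySite from the script quoted below; the aliases pairs, uniq, uniqBySite, duidCounts, sums and joined, and the AS field names, are invented here for illustration.

```pig
-- Deduplicate (site, duid) pairs at the top level, so the work is done
-- by an ordinary distributed DISTINCT rather than inside one
-- reducer-side bag per site.
pairs      = FOREACH dump GENERATE site, duid;
uniq       = DISTINCT pairs;
uniqBySite = GROUP uniq BY site;
duidCounts = FOREACH uniqBySite GENERATE group AS site,
                 COUNT(uniq.duid) AS nduids;

-- Compute the sums separately, with no nested block.
sums = FOREACH bySite GENERATE group AS site,
           SUM(wset.replicas) AS sum_replicas,
           SUM(wset.nbfiles)  AS sum_nbfiles,
           SUM(wset.rnbfiles) AS sum_rnbfiles,
           SUM(wset.length)   AS sum_length,
           SUM(wset.rlength)  AS sum_rlength;

-- Bring the two per-site results back together.
-- Note: an inner JOIN drops rows whose site is null, unlike GROUP.
joined = JOIN duidCounts BY site, sums BY site;
report = FOREACH joined GENERATE duidCounts::site, duidCounts::nduids,
             sums::sum_replicas, sums::sum_nbfiles, sums::sum_rnbfiles,
             sums::sum_length, sums::sum_rlength;

STORE report INTO 'testfile.out';
```

This costs extra MapReduce jobs for the DISTINCT and the JOIN, but no single reducer call has to hold and deduplicate every duid for a large site.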

William F Dowling
Senior Technologist
Thomson Reuters
-----Original Message-----
From: Mario Lassnig [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, January 18, 2012 5:07 PM
To: [EMAIL PROTECTED]
Subject: 0.9.1 out of memory problem

Hello,

I'm having an out of memory problem that seems rather weird to me.
Perhaps you can help me.

Here's what I do:

dump = LOAD '/user/accounting/dump_2012-01-05.lst' AS (
ts:chararray,
duid:chararray,
owner:chararray,
hidden:chararray,
lgroup:chararray,
nbfiles:long,
length:long,
replicas:long,
provenance:chararray,
state:chararray,
campaign:chararray,
rlength:long,
rnbfiles:long,
rowner:chararray,
rgroup:chararray,
rarchived:chararray,
rsuspicious:chararray,
name:chararray,
ami:chararray,
site:chararray
);

wset = FOREACH dump GENERATE site, duid, replicas, nbfiles, rnbfiles,
length, rlength;

bySite = GROUP wset BY site;

report = FOREACH bySite {
   duids = DISTINCT wset.duid;
   GENERATE group, COUNT(duids), SUM(wset.replicas), SUM(wset.nbfiles),
SUM(wset.rnbfiles), SUM(wset.length), SUM(wset.rlength);
};

STORE report INTO 'testfile.out';

So far, nothing special. The dump file is about 5 GB, with ~500 million
lines. The whole STORE process runs for about 2 minutes until it reaches
the last reducer, which dies like this:

2012-01-18 22:45:42,461 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2012-01-18 22:45:42,706 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=SHUFFLE, sessionId=
2012-01-18 22:45:42,976 INFO org.apache.hadoop.mapred.ReduceTask: ShuffleRamManager: MemoryLimit=668126400, MaxSingleShuffleLimit=167031600
2012-01-18 22:45:42,982 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Thread started: Thread for merging on-disk files
2012-01-18 22:45:42,982 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Thread waiting: Thread for merging on-disk files
2012-01-18 22:45:42,983 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Thread started: Thread for merging in memory files
2012-01-18 22:45:42,983 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Need another 89 map output(s) where 0 is already in progress
2012-01-18 22:45:42,984 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Thread started: Thread for polling Map Completion Events
2012-01-18 22:45:42,984 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2012-01-18 22:45:47,986 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 5 outputs (0 slow hosts and0 dup hosts)
2012-01-18 22:45:48,091 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 1 outputs (0 slow hosts and18 dup hosts)
2012-01-18 22:45:48,294 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 1 outputs (0 slow hosts and14 dup hosts)
2012-01-18 22:45:48,336 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 1 outputs (0 slow hosts and15 dup hosts)
2012-01-18 22:45:48,368 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 1 outputs (0 slow hosts and13 dup hosts)
2012-01-18 22:45:48,592 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 1 outputs (0 slow hosts and15 dup hosts)
2012-01-18 22:45:48,636 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 1 outputs (0 slow hosts and12 dup hosts)
2012-01-18 22:45:48,774 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 1 outputs (0 slow hosts and13 dup hosts)
2012-01-18 22:45:48,796 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 1 outputs (0 slow hosts and11 dup hosts)
2012-01-18 22:45:48,827 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 1 outputs (0 slow hosts and8 dup hosts)
2012-01-18 22:45:48,848 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 1 outputs (0 slow hosts and9 dup hosts)
2012-0