Pig, mail # user - 0.9.1 out of memory problem


Mario Lassnig 2012-01-18, 22:07
RE: 0.9.1 out of memory problem
william.dowling@... 2012-01-18, 22:14
Nested DISTINCT is a killer. See

https://mail-archives.apache.org/mod_mbox/pig-user/201201.mbox/%[EMAIL PROTECTED]%3E

for a discussion of a simple workaround that worked for me.
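The linked discussion is not reproduced here, so the following is only a sketch of one common way to avoid a nested DISTINCT, not necessarily the workaround from that thread. It assumes the `wset` relation from the script quoted below; the aliases `siteDuid`, `distinctDuids`, `duidBySite`, `duidCounts`, `sums`, `joined` and `report2` are hypothetical names introduced for illustration. The idea is to run DISTINCT as a top-level operation, so the distinct duids per site are not materialized in a single bag inside one reducer.

```pig
-- Sketch under the assumptions above; wset is (site, duid, replicas, nbfiles,
-- rnbfiles, length, rlength) as in the quoted script.

-- 1. Count distinct duids per site with a top-level DISTINCT.
siteDuid      = FOREACH wset GENERATE site, duid;
distinctDuids = DISTINCT siteDuid;
duidBySite    = GROUP distinctDuids BY site;
duidCounts    = FOREACH duidBySite GENERATE group AS site,
                    COUNT(distinctDuids) AS nduids;

-- 2. Compute the sums separately. SUM is algebraic, so Pig should be able
--    to use the combiner for this step.
bySite = GROUP wset BY site;
sums   = FOREACH bySite GENERATE group AS site,
             SUM(wset.replicas) AS replicas, SUM(wset.nbfiles) AS nbfiles,
             SUM(wset.rnbfiles) AS rnbfiles, SUM(wset.length) AS length,
             SUM(wset.rlength) AS rlength;

-- 3. Join the two per-site results back together.
joined  = JOIN duidCounts BY site, sums BY site;
report2 = FOREACH joined GENERATE duidCounts::site, duidCounts::nduids,
              sums::replicas, sums::nbfiles, sums::rnbfiles,
              sums::length, sums::rlength;

STORE report2 INTO 'testfile.out';
```

Note that the JOIN drops rows whose site is null, which the original single-GROUP version would have reported as a group of their own.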

William F Dowling
Senior Technologist
Thomson Reuters
-----Original Message-----
From: Mario Lassnig [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, January 18, 2012 5:07 PM
To: [EMAIL PROTECTED]
Subject: 0.9.1 out of memory problem

Hello,

I'm having an out of memory problem that seems rather weird to me.
Perhaps you can help me.

Here's what I do:

dump = LOAD '/user/accounting/dump_2012-01-05.lst' AS (
ts:chararray,
duid:chararray,
owner:chararray,
hidden:chararray,
lgroup:chararray,
nbfiles:long,
length:long,
replicas:long,
provenance:chararray,
state:chararray,
campaign:chararray,
rlength:long,
rnbfiles:long,
rowner:chararray,
rgroup:chararray,
rarchived:chararray,
rsuspicious:chararray,
name:chararray,
ami:chararray,
site:chararray
);

wset = FOREACH dump GENERATE site, duid, replicas, nbfiles, rnbfiles,
    length, rlength;

bySite = GROUP wset BY site;

report = FOREACH bySite {
    duids = DISTINCT wset.duid;
    GENERATE group, COUNT(duids), SUM(wset.replicas), SUM(wset.nbfiles),
        SUM(wset.rnbfiles), SUM(wset.length), SUM(wset.rlength);
};

STORE report INTO 'testfile.out';

So far, nothing special. The dump file is about 5 GB with ~500 million
lines.
The whole STORE process runs for about 2 minutes until it reaches the
last reducer, which dies like this:

2012-01-18 22:45:42,461 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2012-01-18 22:45:42,706 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=SHUFFLE, sessionId=
2012-01-18 22:45:42,976 INFO org.apache.hadoop.mapred.ReduceTask: ShuffleRamManager: MemoryLimit=668126400, MaxSingleShuffleLimit=167031600
2012-01-18 22:45:42,982 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Thread started: Thread for merging on-disk files
2012-01-18 22:45:42,982 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Thread waiting: Thread for merging on-disk files
2012-01-18 22:45:42,983 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Thread started: Thread for merging in memory files
2012-01-18 22:45:42,983 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Need another 89 map output(s) where 0 is already in progress
2012-01-18 22:45:42,984 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Thread started: Thread for polling Map Completion Events
2012-01-18 22:45:42,984 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2012-01-18 22:45:47,986 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 5 outputs (0 slow hosts and0 dup hosts)
2012-01-18 22:45:48,091 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 1 outputs (0 slow hosts and18 dup hosts)
2012-01-18 22:45:48,294 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 1 outputs (0 slow hosts and14 dup hosts)
2012-01-18 22:45:48,336 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 1 outputs (0 slow hosts and15 dup hosts)
2012-01-18 22:45:48,368 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 1 outputs (0 slow hosts and13 dup hosts)
2012-01-18 22:45:48,592 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 1 outputs (0 slow hosts and15 dup hosts)
2012-01-18 22:45:48,636 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 1 outputs (0 slow hosts and12 dup hosts)
2012-01-18 22:45:48,774 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 1 outputs (0 slow hosts and13 dup hosts)
2012-01-18 22:45:48,796 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 1 outputs (0 slow hosts and11 dup hosts)
2012-01-18 22:45:48,827 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 1 outputs (0 slow hosts and8 dup hosts)
2012-01-18 22:45:48,848 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201201170946_0084_r_000000_0 Scheduled 1 outputs (0 slow hosts and9 dup hosts)
2012-0
Mario Lassnig 2012-01-18, 22:30
Jonathan Coveney 2012-01-20, 07:31