-
running bigger pig jobs on amazon ec2
jr 2010-12-08, 14:09
Hi guys, I'm having some trouble finishing jobs that run smoothly on a smaller dataset but always fail at 99% if I try to run the job on the whole set. I can see a few killed map and a few killed reduce tasks, but quite a lot of failed reduce tasks that all show the same exception at the end. Here is what I have in the logs:
2010-12-08 08:44:56,127 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201012080810_0003_m_000009_1'
2010-12-08 08:45:08,152 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201012080810_0003_r_000000_0: Got 1 new map-outputs
2010-12-08 08:45:13,103 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201012080810_0003_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2010-12-08 08:45:13,241 INFO org.apache.hadoop.mapred.ReduceTask: header: attempt_201012080810_0003_m_000003_0, compressed len: 3488519, decompressed len: 3488515
2010-12-08 08:45:13,241 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 3488515 bytes (3488519 raw bytes) into RAM from attempt_201012080810_0003_m_000003_0
2010-12-08 08:45:13,348 INFO org.apache.pig.impl.util.SpillableMemoryManager: low memory handler called (Collection threshold exceeded) init = 5439488(5312K) used = 78403496(76565K) committed = 101908480(99520K) max = 139853824(136576K)
2010-12-08 08:45:13,404 INFO org.apache.hadoop.mapred.ReduceTask: Read 3488515 bytes from map-output for attempt_201012080810_0003_m_000003_0
2010-12-08 08:45:13,405 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_201012080810_0003_m_000003_0 -> (142, 21) from ip-10-98-71-195.ec2.internal
2010-12-08 08:45:14,241 INFO org.apache.hadoop.mapred.ReduceTask: GetMapEventsThread exiting
2010-12-08 08:45:14,241 INFO org.apache.hadoop.mapred.ReduceTask: getMapsEventsThread joined.
2010-12-08 08:45:14,242 INFO org.apache.hadoop.mapred.ReduceTask: Closed ram manager
2010-12-08 08:45:14,253 INFO org.apache.hadoop.mapred.ReduceTask: Interleaved on-disk merge complete: 2 files left.
2010-12-08 08:45:14,254 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 64 files left.
2010-12-08 08:45:14,312 INFO org.apache.hadoop.mapred.Merger: Merging 64 sorted segments
2010-12-08 08:45:14,313 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 64 segments left of total size: 82947024 bytes
2010-12-08 08:45:15,389 INFO org.apache.hadoop.mapred.ReduceTask: Merged 64 segments, 82947024 bytes to disk to satisfy reduce memory limit
2010-12-08 08:45:15,390 INFO org.apache.hadoop.mapred.ReduceTask: Merging 3 files, 214514578 bytes from disk
2010-12-08 08:45:15,392 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
2010-12-08 08:45:15,392 INFO org.apache.hadoop.mapred.Merger: Merging 3 sorted segments
2010-12-08 08:45:15,397 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 214514566 bytes
2010-12-08 08:45:15,489 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl library
2010-12-08 08:45:15,522 INFO com.hadoop.compression.lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 3e7c9dcf0ea0acbde146cb22b236978b344c5525]
2010-12-08 08:45:15,530 INFO com.twitter.elephantbird.pig.load.LzoBaseRegexLoader: LzoBaseRegexLoader created.
2010-12-08 08:45:15,534 INFO com.twitter.elephantbird.pig.load.LzoBaseRegexLoader: LzoBaseRegexLoader created.
2010-12-08 08:45:15,544 INFO com.twitter.elephantbird.pig.load.LzoBaseRegexLoader: LzoBaseRegexLoader created.
2010-12-08 08:45:15,562 INFO com.twitter.elephantbird.pig.load.LzoBaseRegexLoader: LzoBaseRegexLoader created.
2010-12-08 08:45:15,564 INFO com.twitter.elephantbird.pig.load.LzoBaseRegexLoader: LzoBaseRegexLoader created.
2010-12-08 08:45:15,568 INFO com.twitter.elephantbird.pig.load.LzoBaseRegexLoader: LzoBaseRegexLoader created.
2010-12-08 08:45:37,233 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.98.99.197:50010
2010-12-08 08:45:37,235 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_8615551403563164366_3938
2010-12-08 08:45:43,251 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.98.99.197:50010
2010-12-08 08:45:43,251 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_4074920756844442310_4023
2010-12-08 08:45:49,282 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.100.226.63:50010
2010-12-08 08:45:49,282 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-681320892856427804_4034
2010-12-08 08:45:55,292 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.99.26.80:50010
2010-12-08 08:45:55,292 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6999793088579291779_4039
2010-12-08 08:46:01,294 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2812)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2076)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2262)
2010-12-08 08:46:01,294 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6999793088579291779_4039 bad datanode[1] nodes == null
2010-12-08 08:46:01,294 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/temp664356070/tmp-973959386/_temporary/_attempt_201012080810_0003_r_000000_0/winrar/output/extlink/2010-04/2010-04-00000" - Aborting...
2010-12-08 08:46:01,656 WARN org.apache.hadoop.mapred.TaskTracker: Error running child org.apache.pig.backend.executionengine.ExecException: ERROR 2135: Received error from s
-
Re: running bigger pig jobs on amazon ec2
Ashutosh Chauhan 2010-12-08, 17:11
From the logs it looks like the issue is not with Pig but with your HDFS. Either your HDFS is running out of space, or some (or all) nodes in your cluster can't talk to each other (a network issue?).
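One way to tell a single unreachable datanode from a cluster-wide problem is to tally which addresses appear in the "firstBadLink" messages. The following is only a minimal sketch: the function name count_bad_links is made up for illustration, and the sample text is the four DFSClient lines from the log posted above.

```python
import re
from collections import Counter

# Pattern for the DFSClient message that appears in the posted task log.
BAD_LINK = re.compile(r"Bad connect ack with firstBadLink (\d+\.\d+\.\d+\.\d+):(\d+)")

def count_bad_links(log_text):
    """Return a Counter mapping datanode IP -> number of failed pipeline setups."""
    return Counter(match.group(1) for match in BAD_LINK.finditer(log_text))

# The four "Bad connect ack" lines copied from the log in this thread.
sample = """\
2010-12-08 08:45:37,233 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.98.99.197:50010
2010-12-08 08:45:43,251 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.98.99.197:50010
2010-12-08 08:45:49,282 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.100.226.63:50010
2010-12-08 08:45:55,292 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.99.26.80:50010
"""

bad_links = count_bad_links(sample)
for ip, failures in bad_links.most_common():
    print(ip, failures)
```

In this excerpt three different datanodes show up within about 20 seconds, which hints at a cluster-wide condition more than at one broken host, although a single reducer's log is too small a sample to be sure.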
Ashutosh
-
Re: running bigger pig jobs on amazon ec2
jr 2010-12-10, 10:53
Hello Ashutosh,
I'm running entirely on Amazon EC2, and while I get those errors, I still seem to be able to access HDFS using "hadoop fs" :/
regards, Johannes
-
Re: running bigger pig jobs on amazon ec2
Dmitriy Ryaboy 2010-12-12, 11:18
Johannes, I wonder if something is putting enough pressure on the datanodes that they are unable to ack all the write requests fast enough, causing many tasks to give up due to what amounts to tcp throughput collapse.
The logs certainly seem to indicate something unhealthy happening at the DFS level. Bunch of questions below... I am stabbing in the dark here, as I don't run clusters in EC2.
Do you have any stats on the network traffic in your cluster while this is happening?
Same, but for disk/cpu utilization and similar metrics on the data nodes?
I am curious why there's a loader being instantiated in the reducer. Can you send along a relevant portion of the explain plan?
How many map tasks and reduce tasks are you running?
How big is the cluster?
Is the storefunc you are using doing something like writing multiple files?
When running a cluster in EC2, what are you using for storage? S3, EBS...?
D
-
Re: running bigger pig jobs on amazon ec2
Johannes Rußek 2010-12-14, 14:47
Hello Dmitriy,
Thanks for the helpful questions. I'll gather all the relevant information when I kick off another run. What I can answer already:
The nodes have 4 CPUs each and are running with a load above 19 and about 40-50% iowait.
The cluster has 20 nodes, one of which is the namenode.
The storage is just a temporary HDFS created on the "local" disks when the cluster is started each month.
Yes, in fact I'm using a storefunc that writes multiple files (one for each "primary" key I have in the output).
I will send you the rest of the answers as soon as I have gathered the needed information. Thanks! Johannes
-
Re: running bigger pig jobs on amazon ec2
Dmitriy Ryaboy 2010-12-15, 02:05
Johannes, I strongly suspect it's the number of files you are trying to write at the same time. lsof output might help determine this with a greater degree of certainty, but it seems extremely likely (likely enough that I guessed it...). What's the cardinality of the primary key? Can you avoid writing such a large number of files?
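To get a feel for why key cardinality matters here, a back-of-the-envelope estimate can help. This is a rough sketch under stated assumptions, not a measurement: the function name is made up, the figures for reduce slots (2 per node) and keys per reducer (100) are hypothetical, 19 datanodes assumes the namenode host stores no blocks, and the limit of 256 is an assumed value for the per-datanode transfer-thread cap (the dfs.datanode.max.xcievers setting), which should be checked against the actual cluster config. Replication 3 is the HDFS default.

```python
def concurrent_block_writers(reduce_slots, files_per_reducer, replication, datanodes):
    """Rough average number of simultaneous block-write threads per datanode."""
    # Every file a reducer keeps open is one HDFS output stream.
    open_streams = reduce_slots * files_per_reducer
    # While a block is being written, its pipeline occupies roughly one
    # receiver thread on each of `replication` datanodes.
    return open_streams * replication / datanodes

# Hypothetical inputs (see the assumptions listed above).
reduce_slots = 19 * 2      # assumed: 2 reduce slots on each of 19 worker nodes
files_per_reducer = 100    # assumed: 100 distinct "primary" keys per reducer
assumed_limit = 256        # assumed per-datanode cap; verify in your config

per_node = concurrent_block_writers(reduce_slots, files_per_reducer, 3, 19)
print("estimated writers per datanode:", per_node)
print("over assumed limit:", per_node > assumed_limit)
```

With these made-up inputs the estimate lands well above the assumed cap, which would be consistent with datanodes refusing new write pipelines. Plugging in the real key cardinality and slot counts should show whether bucketing several keys into one output file brings the number down far enough.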
D