Re: Data loss on EMR cluster running Hadoop and Hive
Michael Segel 2012-09-04, 16:43
Next time, try reading and writing to S3 directly from your hive job.
Not sure why the block was bad... What did the AWS folks have to say?
-Mike
On Sep 4, 2012, at 11:30 AM, Max Hansmire <[EMAIL PROTECTED]> wrote:
> I ran into an issue yesterday where one of the blocks on HDFS seems to
> have gone away. I would appreciate any help that you can provide.
>
> I am running Hadoop on Amazon's Elastic Map Reduce (EMR). I am running
> hadoop version 0.20.205 and hive version 0.8.1.
>
> I have a hive table that is written out in the reduce step of a map
> reduce job created by hive. This step completed with no errors, but
> the next map-reduce job that tries to read it failed with the
> following error message.
>
> "Caused by: java.io.IOException: No live nodes contain current block"
>
> I ran hadoop fs -cat on the same file and got the same error.
>
> Looking more closely at the data and name node logs, I see this error
> for the same problem block. It is in the name node when trying to read
> the data.
>
> 2012-09-03 11:56:05,054 WARN
> org.apache.hadoop.hdfs.server.datanode.DataNode
> (org.apache.hadoop.hdfs.server.datanode.DataXceiver@4a7cdff0):
> DatanodeRegistration(10.193.39.159:9200,
> storageID=DS-2147477684-10.193.39.159-9200-1346659207926,
> infoPort=9102, ipcPort=9201):sendBlock() : Offset 134217727 and
> length 1 don't match block blk_-7100869813617535842_5426 ( blockLen
> 120152064 )
> 2012-09-03 11:56:05,054 WARN
> org.apache.hadoop.hdfs.server.datanode.DataNode
> (org.apache.hadoop.hdfs.server.datanode.DataXceiver@4a7cdff0):
> DatanodeRegistration(10.193.39.159:9200,
> storageID=DS-2147477684-10.193.39.159-9200-1346659207926,
> infoPort=9102, ipcPort=9201):Got exception while serving
> blk_-7100869813617535842_5426 to /10.96.57.112:
> java.io.IOException: Offset 134217727 and length 1 don't match block
> blk_-7100869813617535842_5426 ( blockLen 120152064 )
> at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:141)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:189)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
> at java.lang.Thread.run(Thread.java:662)
>
> 2012-09-03 11:56:05,054 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode
> (org.apache.hadoop.hdfs.server.datanode.DataXceiver@4a7cdff0):
> DatanodeRegistration(10.193.39.159:9200,
> storageID=DS-2147477684-10.193.39.159-9200-1346659207926,
> infoPort=9102, ipcPort=9201):DataXceiver
> java.io.IOException: Offset 134217727 and length 1 don't match block
> blk_-7100869813617535842_5426 ( blockLen 120152064 )
> at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:141)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:189)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
> at java.lang.Thread.run(Thread.java:662)
>
> Unfortunately the EMR cluster that had the data on it has since been
> terminated. I have access to the logs, but I can't run an fsck. I can
> provide more detailed stack traces etc. if you think it would be
> helpful. Rerunning my process by re-generating the corrupted block
> resolved the issue.
>
> Would really appreciate if anyone has a reasonable explanation of what
> happened and how to avoid in the future.
>
> Max
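For anyone reading the numbers in the quoted log, here is a rough Python sketch of the arithmetic. The values are copied from the log lines; `range_fits` is a hypothetical helper modeled only on the wording of the error message, not taken from the Hadoop source, and the 128 MiB figure is an inference from the offset rather than a block size stated anywhere in this thread.

```python
MIB = 1024 * 1024

def range_fits(offset, length, block_len):
    """Hypothetical bounds check: does [offset, offset + length) lie inside the replica?"""
    return offset >= 0 and offset + length <= block_len

# Values copied from the quoted DataNode log lines.
offset = 134217727      # "Offset 134217727"
length = 1              # "length 1"
block_len = 120152064   # "blockLen 120152064"

# The reader asked for the byte that would be the last one of a block this long.
expected_len = offset + length

# How many bytes short the replica on the datanode is of that expectation.
missing = expected_len - block_len

print("request fits in replica:", range_fits(offset, length, block_len))
print("length the reader expected:", expected_len, "bytes =", expected_len // MIB, "MiB")
print("replica is short by:", missing, "bytes")
```

So the request was for the final byte of what the reader took to be a 134217728-byte block, while the replica on 10.193.39.159 held only 120152064 bytes. That is consistent with a replica shorter than the length the reader expected, but the logs quoted here do not show why.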
Re: Data loss on EMR cluster running Hadoop and Hive
Max Hansmire 2012-09-04, 17:08
Since the next step reads the file with a Map-Reduce job, I am not sure it makes sense in terms of performance to put the file on S3. I have not tested it, but my suspicion is that local disk reads on HDFS would outperform reading and writing the file to S3.
This is a bad block on HDFS and not the underlying filesystem. I thought that HDFS was supposed to be tolerant of native file system failures.
Max