Re: Different exception handling on corrupt GZip file reading
Aaron Kimball 2010-04-15, 16:28
If you ever wonder "why doesn't Hadoop do _REASONABLE_THING_X_", the answer
is usually one of:

* Somebody made a mistake the first time it got written
* Nobody needed quite that corner case before
* Maybe people thought that was useful, but didn't know how to fix it, or
  were too lazy to contribute the code :)

In any case, it's just code -- there's likely not an ideological reason that
some feature is missing. I'd strongly encourage you to file a ticket on JIRA
and post your code as a patch. Then we can help you clean it up and get it in
there for everyone.

- Aaron

On Fri, Apr 9, 2010 at 6:34 AM, Richard Weber <[EMAIL PROTECTED]> wrote:

> Maybe this is a "dumb" question. In our situation, we process a ton of log
> files, all gzipped. Some of those files may be truncated for a variety of
> reasons, resulting in a corrupted gzip file.
>
> Now using the default TextInputFormat and LineRecordReader, Hadoop will
> happily churn along until it hits a corrupted file. Once it hits the file,
> it throws exceptions, tries to restart on that file and ultimately fails. I
> originally tried using the Skipped Records feature, but these exceptions are
> happening at the IO level, not record level.
>
> My solution has been to just make a new SafeTextInputFormat and
> SafeLineRecordReader class. The only difference between these classes and
> the non-safe classes is that it has a try {} block in the nextKeyValue() fn
> when it does the readLine. If an exception occurs, then the file is closed
> out.
>
> My question really boils down to: Is there a reason this isn't in the Hadoop
> library to start with? Even if there was a flag to raise the exception, or
> just let it keep flowing with bad input data.
>
> It's really more of a gripe that I need to reimplement the above 2 classes
> just to have a try catch block, and then to make sure I use these classes
> for my input format.
>
> Thanks
>
> --Rick
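
For anyone reading along, the idea Rick describes (wrap the line read in a try block and close out the file when an IO-level exception occurs) can be sketched without Hadoop, using only java.util.zip. This is not Rick's SafeLineRecordReader and not Hadoop API: the class SafeGzipLines and the methods readLinesTolerant and gzip are hypothetical names made up for illustration. How many lines come back before the failure depends on stream buffering, so the sketch only promises a prefix of the file's lines.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-alone sketch; none of these names exist in Hadoop.
class SafeGzipLines {

    // Reads lines from a gzip stream. An IOException (for example the
    // EOFException raised on a truncated file) ends the input early
    // instead of propagating; try-with-resources closes the stream.
    public static List<String> readLinesTolerant(InputStream raw) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(raw), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // Corrupt or truncated data: log it and treat it as end of file.
            System.err.println("Stopping early on bad gzip input: " + e);
        }
        return lines;
    }

    // Test helper: gzip a string in memory.
    public static byte[] gzip(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 50000; i++) {
            sb.append("line ").append(i).append('\n');
        }
        byte[] good = gzip(sb.toString());
        // Simulate a truncated log file by dropping the second half.
        byte[] cut = Arrays.copyOf(good, good.length / 2);

        System.out.println("intact:    "
                + readLinesTolerant(new ByteArrayInputStream(good)).size() + " lines");
        System.out.println("truncated: "
                + readLinesTolerant(new ByteArrayInputStream(cut)).size() + " lines");
    }
}
```

Swallowing the exception silently drops whatever followed the corruption, which is presumably why a flag to choose between raising and skipping, as Rick suggests, would be the safer shape for a real patch.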