Hadoop >> mail # general >> Different exception handling on corrupt GZip file reading


Re: Different exception handling on corrupt GZip file reading
If you ever wonder "why doesn't Hadoop do _REASONABLE_THING_X_", the answer
is usually one of:

* Somebody made a mistake the first time it got written
* Nobody needed quite that corner case before
* Maybe people thought that was useful, but didn't know how to fix it, or
were too lazy to contribute the code :)

In any case, it's just code -- there's likely not an ideological reason that
some feature is missing. I'd strongly encourage you to file a ticket on JIRA
and post your code as a patch. Then we can help you clean it up and get it
in there for everyone.

- Aaron
On Fri, Apr 9, 2010 at 6:34 AM, Richard Weber <[EMAIL PROTECTED]> wrote:

> Maybe this is a "dumb" question.  In our situation, we process a ton of log
> files, all gzipped.  Some of those files may be truncated for a variety of
> reasons, resulting in a corrupted gzip file.
>
> Now using the default TextInputFormat and LineRecordReader, Hadoop will
> happily churn along until it hits a corrupted file.  Once it hits the file,
> it throws exceptions, tries to restart on that file and ultimately fails.
> I originally tried using the Skipped Records feature, but these exceptions
> are happening at the IO level, not the record level.
>
> My solution has been to just make a new SafeTextInputFormat and
> SafeLineRecordReader class.  The only difference between these classes and
> the non-safe classes is that the reader has a try {} block in the
> nextKeyValue() function where it does the readLine.  If an exception
> occurs, then the file is closed out.
>
> My question really boils down to: Is there a reason this isn't in the
> Hadoop library to start with?  Even if there was a flag to raise the
> exception, or just let it keep flowing with bad input data.
>
> It's really more of a gripe that I need to reimplement the above 2 classes
> just to have a try catch block, and then to make sure I use these classes
> for my input format.
>
> Thanks
>
> --Rick
>
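
For anyone finding this thread later: the try/catch idea Rick describes can be illustrated outside Hadoop with nothing but the JDK. The sketch below is not Rick's SafeLineRecordReader and does not use Hadoop's API. The class name SafeGzipLines and its two methods are made up for illustration. It reads lines from a gzip stream and, when decompression fails (for example on a truncated file), logs the error and treats it as end of input. In a real job the same catch would presumably wrap the readLine call inside nextKeyValue() of a copied LineRecordReader, as described above.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical standalone class; not part of Hadoop.
final class SafeGzipLines {

    /**
     * Reads lines from a gzip-compressed stream. If the stream turns out to be
     * truncated or corrupt, the lines returned so far are kept and reading stops.
     */
    public static List<String> readLinesTolerant(InputStream raw) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(raw), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // Corrupt or truncated input: report it and treat it as end of file.
            System.err.println("Stopping early on bad gzip input: " + e);
        }
        return lines;
    }

    /** Test helper: gzip-compresses a string in memory. */
    public static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("line ").append(i).append('\n');
        }
        byte[] full = gzip(sb.toString());
        // Simulate a truncated log file by keeping only the first half of the bytes.
        byte[] truncated = Arrays.copyOf(full, full.length / 2);

        System.out.println("intact:    "
                + readLinesTolerant(new ByteArrayInputStream(full)).size() + " lines");
        System.out.println("truncated: "
                + readLinesTolerant(new ByteArrayInputStream(truncated)).size() + " lines");
    }
}
```

Note that only lines fully returned before the failure survive; anything still sitting in a read buffer when the exception fires is dropped. Swallowing the exception also hides data loss, so a log message or a job counter is probably worth having, which is roughly the "flag" Rick asks about.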