Hadoop >> mail # general >> Different exception handling on corrupt GZip file reading


Re: Different exception handling on corrupt GZip file reading
If you ever wonder "why doesn't Hadoop do _REASONABLE_THING_X_", the answer
is usually one of:

* Somebody made a mistake the first time it got written
* Nobody needed quite that corner case before
* Maybe people thought that was useful, but didn't know how to fix it, or
were too lazy to contribute the code :)

In any case, it's just code -- there's likely not an ideological reason that
some feature is missing. I'd strongly encourage you to file a ticket on JIRA
and post your code as a patch. Then we can help you clean it up and get it
in there for everyone.

- Aaron
On Fri, Apr 9, 2010 at 6:34 AM, Richard Weber <[EMAIL PROTECTED]> wrote:

> Maybe this is a "dumb" question.  In our situation, we process a ton of log
> files, all gzipped.  Some of those files may be truncated for a variety of
> reasons, resulting in a corrupted gzip file.
>
> Now using the default TextInputFormat and LineRecordReader, Hadoop will
> happily churn along until it hits a corrupted file.  Once it hits the file,
> it throws exceptions, tries to restart on that file and ultimately fails.
> I originally tried using the Skipped Records feature, but these exceptions
> are happening at the IO level, not record level.
>
> My solution has been to just make a new SafeTextInputFormat and
> SafeLineRecordReader class.  The only difference between these classes and
> the non-safe classes is that it has a try {} block in the nextKeyValue()
> function when it does the readLine.  If an exception occurs, then the file
> is closed out.
>
> My question really boils down to: Is there a reason this isn't in the
> Hadoop library to start with?  Even if there was a flag to raise the
> exception, or just let it keep flowing with bad input data.
>
> It's really more of a gripe that I need to reimplement the above 2 classes
> just to have a try catch block, and then to make sure I use these classes
> for my input format.
>
> Thanks
>
> --Rick
>
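
For anyone skimming the thread, here is a rough, stand-alone sketch of the pattern Rick describes: wrap the line read in a try block and treat an IO-level error as end of input. It is not Hadoop code and does not use the Hadoop API. The class name SafeGzipLines, the readLines and gzip helpers, and the failOnCorrupt flag are all made up for illustration; the flag stands in for the "raise the exception, or keep flowing" switch asked about above. It uses only java.util.zip from the JDK.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; not part of Hadoop.
class SafeGzipLines {

    // Reads text lines from gzip-compressed bytes.
    // If failOnCorrupt is true, an IOException from the stream is rethrown.
    // If false, the error is logged and the input is treated as finished,
    // so the caller gets whatever lines were read before the failure
    // (possibly none, because of reader buffering).
    static List<String> readLines(byte[] gz, boolean failOnCorrupt) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(gz)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // A truncated gzip stream typically surfaces here as an EOFException.
            if (failOnCorrupt) {
                throw e;
            }
            System.err.println("Corrupt input, stopping early: " + e);
        }
        return lines;
    }

    // Helper for the demo: gzip a string in memory.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] good = gzip("first\nsecond\nthird\n");
        // Simulate a truncated file by cutting the compressed bytes in half.
        byte[] truncated = Arrays.copyOf(good, good.length / 2);

        System.out.println("intact file lines: " + readLines(good, true).size());
        System.out.println("truncated file lines: " + readLines(truncated, false).size());
    }
}
```

In a real job the same try/catch would sit inside the record reader's nextKeyValue(), as described in the quoted message, and the flag would presumably come from the job configuration. How many lines survive from a truncated file depends on buffering, so the sketch makes no promise about partial data.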