Unexpected empty result due to corrupted gz file input to Map?


Re: Unexpected empty result due to corrupted gz file input to Map?
Hi Michael,

gz'ed files cannot be split across maps, so a whole gzip file will be
processed by one mapper. Now, if a gzip file is corrupted, that map
task will keep failing and eventually hadoop will declare the whole job
failed. So, even if you have just one corrupted gzip file, hadoop (and thus
Pig) won't ignore it; the whole job will fail and, as a result, your other
gz files won't be processed either.

In a nutshell, if there is a possibility of corrupted gzip files in your
data, you need to write and run a script to weed out the corrupted files
before launching a pig script.
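
For reference, a minimal sketch of such a check using java.util.zip is below
(the class name GzipCheck, the buffer size, and the command-line interface
are just placeholders; for files already on HDFS you would open the stream
via FileSystem.open() instead of a local FileInputStream):

import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

public class GzipCheck {

    // Returns true if the whole file decompresses cleanly, false if it is
    // truncated/corrupted (e.g. "Unexpected end of ZLIB input stream").
    public static boolean isValidGzip(String path) {
        GZIPInputStream in = null;
        try {
            in = new GZIPInputStream(new FileInputStream(path));
            byte[] buf = new byte[64 * 1024];
            while (in.read(buf) != -1) {
                // discard the decompressed bytes; we only care that the
                // stream can be read to its end without an exception
            }
            return true;
        } catch (IOException e) {
            return false;
        } finally {
            if (in != null) {
                try { in.close(); } catch (IOException ignored) { }
            }
        }
    }

    public static void main(String[] args) {
        for (String path : args) {
            System.out.println(path + "\t" + (isValidGzip(path) ? "OK" : "CORRUPT"));
        }
    }
}

Run it over a local copy of the input files and move aside anything it
reports as CORRUPT before pointing the Pig LOAD at the folder.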

Hope it helps,
Ashutosh

On Sat, Feb 20, 2010 at 22:47, jiang licht <[EMAIL PROTECTED]> wrote:

> I had a pig script which reads a folder of ".gz" files and performs some
> operations on the data.
>
> However, here's a problem. The folder contains some corrupted gz files, and
> this causes the hadoop job to generate an empty result in the end, that is,
> all part-### files are zero bytes long. A non-empty result should be
> expected, though (this was tested by running against at least one good .gz file).
>
> As it turns out, a corrupted .gz input to Map causes hadoop to throw the
> following exception:
>
> "java.io.EOFException: Unexpected end of ZLIB input stream"
>
> My guess is that such corrupted files will not be loaded (since the above
> exception will be thrown), but data from good .gz files still gets loaded.
> Then why is an empty result generated (0-sized part-####)? So, given this
> situation of loading a mix of good and corrupted ".gz" files, how can I
> still get the expected results?
>
> One way might be to write a map/reduce job to detect each such corrupted .gz
> file and exclude it from loading into PIG. So, what is the easiest way to
> test the integrity of a gz file in java, and what package should I use?
> But I am more interested in knowing whether there is a PIG solution, since I
> guess it could ignore such files (but it seems to get caught in trouble). Any thoughts?
>
> Thanks!
>
> Michael
>
>
>