Re: Unexpected empty result due to corrupted gz file input to Map?
Ashutosh Chauhan 2010-02-22, 02:40
Gzipped files cannot be split across maps, so each gzip file is processed
in its entirety by a single mapper. If a gzip file is corrupted, that map
task will keep failing, and Hadoop will eventually declare the whole job
failed. So even a single corrupted gzip file is not ignored by Hadoop (and
thus Pig): the whole job fails, and as a result your other gz files won't
be processed either.
In a nutshell, if there is a possibility of corrupted gzip files in your
data, you need to write and run a script to weed out the corrupted files
before launching the Pig script.
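
As a rough illustration of such a pre-check, here is a minimal sketch in Java
using java.util.zip.GZIPInputStream from the standard library. It assumes the
files are readable from the local filesystem; the class name GzipCheck and the
method isValidGzip are made up for this example. For files sitting in HDFS you
would have to open the stream through Hadoop's FileSystem API instead, which
is not shown here. The idea is simply to decompress each file to the end and
treat any I/O error (including the "Unexpected end of ZLIB input stream"
EOFException) as a sign of corruption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of Pig or Hadoop.
public class GzipCheck {

    /**
     * Returns true if the file can be decompressed all the way to the end.
     * Any IOException (truncated stream, bad header, CRC mismatch in the
     * trailer, or an unreadable file) is reported as "not valid".
     */
    public static boolean isValidGzip(Path file) {
        byte[] buffer = new byte[64 * 1024];
        try (InputStream raw = Files.newInputStream(file);
             InputStream in = new GZIPInputStream(raw)) {
            // Read and discard everything; we only care whether it succeeds.
            while (in.read(buffer) != -1) {
                // nothing to do with the data
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Prints one line per path given on the command line.
    public static void main(String[] args) {
        for (String arg : args) {
            Path p = Paths.get(arg);
            System.out.println((isValidGzip(p) ? "OK      " : "CORRUPT ") + p);
        }
    }
}
```

You could run this over the input folder and move or delete anything reported
as CORRUPT before pointing the Pig script at it. Note that it has to read each
file completely, so it costs roughly one full pass over the data.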
Hope it helps,
On Sat, Feb 20, 2010 at 22:47, jiang licht <[EMAIL PROTECTED]> wrote:
> I had a Pig script which reads a folder of ".gz" files and performs some
> operation on the data.
> However, here's a problem. The folder contains some corrupted gz files, and
> this causes the Hadoop job to generate an empty result in the end, that is,
> all part-### files are zero bytes long, although a non-empty result is
> expected (this was tested by running against at least one good .gz file).
> As it turns out, a corrupted .gz input to Map causes Hadoop to throw the
> following exception:
> "java.io.EOFException: Unexpected end of ZLIB input stream"
> My guess is that such corrupted files will not be loaded (since the above
> exception will be thrown), but data from good .gz files still gets loaded.
> Then why is an empty result generated (0-sized part-####)? So, considering
> this situation of loading mixed good and corrupted ".gz" files, how can I
> still get the expected results?
> One way might be to write a map/reduce job to detect each such corrupted .gz
> file and exclude it from loading into Pig. So, what is the easiest way to
> test the integrity of a gz file in Java, and what package should I use?
> But I am more interested in knowing if there is a Pig solution, since I guess
> it can ignore such files (though it seems to run into trouble)? Any thoughts?