Re: TextInputFormat and Gzip encoding - wordcount displaying binary data
Niels Basjes 2011-03-21, 23:01
2011/3/21 Saptarshi Guha <[EMAIL PROTECTED]>:
> It's frustrating to be dealing with these simple problems (and I know
> the fault is mine, I'm missing something).
> I'm running word count (from 0.20-2) on a gzip file (very small), the
> output has binary characters.
> When I run the same on the ungzipped file, the output is correct ascii.
> I'm using the native gzip library. The command is
> hadoop jar /usr/lib/hadoop-0.20/hadoop-examples-0.20.2-CDH3B4.jar
> wordcount /user/sguha/tmp/o.zip /user/sguha/tmp/o.wc.zip
> (zip is gzip)
No, .zip is "pkzip" and .gz is gzip.
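The applicable Hadoop code actually chooses the decompressor based on the
extension of the filename. Since o.zip does not end in .gz, the gzip codec
is never selected and the compressed bytes are read as if they were text.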
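
To illustrate the idea, here is a rough sketch in Python of suffix-based
codec selection. It is a simplified model, not the actual Hadoop source: the
function name codec_for and the o.gz path are made up for illustration, and
the suffix table lists only the default extensions of three common Hadoop
codecs.

```python
# Simplified model of choosing a decompressor from the filename suffix.
# This is NOT Hadoop's code; the table and function name are illustrative.
CODEC_BY_SUFFIX = {
    ".gz": "GzipCodec",
    ".bz2": "BZip2Codec",
    ".deflate": "DefaultCodec",
}


def codec_for(path):
    """Return the codec name whose suffix matches path, or None."""
    for suffix, codec in CODEC_BY_SUFFIX.items():
        if path.endswith(suffix):
            return codec
    # No suffix matched: the input would be read as raw, uncompressed bytes.
    return None


print(codec_for("/user/sguha/tmp/o.zip"))  # None
print(codec_for("/user/sguha/tmp/o.gz"))   # GzipCodec (hypothetical renamed file)
```

In this model the content of the file is never inspected, only its name, which
is why a gzip file named o.zip would come through undecompressed.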