-
TextInputFormat and Gzip encoding - wordcount displaying binary data
Saptarshi Guha 2011-03-21, 22:47
Hello,
It's frustrating to be dealing with these simple problems (and I know the fault is mine; I'm missing something). I'm running word count (from 0.20-2) on a very small gzip file, and the output has binary characters. When I run the same job on the ungzipped file, the output is correct ASCII.
I'm using the native gzip library. The command is
hadoop jar /usr/lib/hadoop-0.20/hadoop-examples-0.20.2-CDH3B4.jar wordcount /user/sguha/tmp/o.zip /user/sguha/tmp/o.wc.zip
(zip is gzip)
Any ideas?
Thanks SG
-
Re: TextInputFormat and Gzip encoding - wordcount displaying binary data
Niels Basjes 2011-03-21, 23:01
Hi,
2011/3/21 Saptarshi Guha <[EMAIL PROTECTED]>:
> It's frustrating to be dealing with these simple problems (and I know
> the fault is mine, i'm missing something).
> I'm running word count (from 0.20-2) on a gzip file (very small), the
> output has binary characters.
> When I run the same on the ungzipped file, the output is correct ascii.
>
> I'm using the native gzip library. The command is
>
> hadoop jar /usr/lib/hadoop-0.20/hadoop-examples-0.20.2-CDH3B4.jar
> wordcount /user/sguha/tmp/o.zip /user/sguha/tmp/o.wc.zip
>
> (zip is gzip)
No, .zip is "pkzip" and .gz is gzip.
The applicable Hadoop code actually chooses the decompressor based on the extension of the filename.
-- Niels Basjes
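
To illustrate the point above, here is a minimal Python sketch of extension-based codec selection. It is not Hadoop's code: the suffix table and the names codec_for, sniff_format and read_as_text_input are hypothetical, chosen only for illustration. It shows why gzip data stored under a .zip name would be passed through undecompressed, and how the leading bytes reveal the real format.

```python
import gzip
import os

# Hypothetical suffix-to-codec table (an assumption for illustration only).
CODECS_BY_SUFFIX = {".gz": "gzip", ".bz2": "bzip2"}

GZIP_MAGIC = b"\x1f\x8b"      # first two bytes of every gzip stream
PKZIP_MAGIC = b"PK\x03\x04"   # local file header of a pkzip archive


def codec_for(path):
    """Pick a codec from the file name alone; None means 'treat as plain text'."""
    _, suffix = os.path.splitext(path)
    return CODECS_BY_SUFFIX.get(suffix.lower())


def sniff_format(head):
    """Identify the real format from the leading bytes, ignoring the name."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(PKZIP_MAGIC):
        return "pkzip"
    return "unknown"


def read_as_text_input(path, raw):
    """Decompress only when the suffix maps to gzip; otherwise pass bytes through."""
    if codec_for(path) == "gzip":
        return gzip.decompress(raw)
    return raw


# gzip-compressed payload, as in the thread, stored under two different names.
payload = gzip.compress(b"hello hadoop hello\n")
as_zip = read_as_text_input("/user/sguha/tmp/o.zip", payload)  # still compressed bytes
as_gz = read_as_text_input("/user/sguha/tmp/o.gz", payload)    # decompressed text
```

Under these assumptions, renaming the input so that it ends in .gz is what lets the suffix lookup find a decompressor; the file contents do not change.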
-
Re: TextInputFormat and Gzip encoding - wordcount displaying binary data
Saptarshi Guha 2011-03-21, 23:10
True, my naming is ... Hmm, now I know. Thanks.
On Mon, Mar 21, 2011 at 4:01 PM, Niels Basjes <[EMAIL PROTECTED]> wrote:
> Hi,
>
> 2011/3/21 Saptarshi Guha <[EMAIL PROTECTED]>:
>> It's frustrating to be dealing with these simple problems (and I know
>> the fault is mine, i'm missing something).
>> I'm running word count (from 0.20-2) on a gzip file (very small), the
>> output has binary characters.
>> When I run the same on the ungzipped file, the output is correct ascii.
>>
>> I'm using the native gzip library. The command is
>>
>> hadoop jar /usr/lib/hadoop-0.20/hadoop-examples-0.20.2-CDH3B4.jar
>> wordcount /user/sguha/tmp/o.zip /user/sguha/tmp/o.wc.zip
>>
>> (zip is gzip)
>
> No, .zip is "pkzip" and .gz is gzip.
>
> The applicable hadoop code actually chooses the decompressor on the
> extention of the filename.
>
> --
> Niels Basjes
>