Hadoop, mail # user - TextInputFormat and Gzip encoding - wordcount displaying binary data


Re: TextInputFormat and Gzip encoding - wordcount displaying binary data
Saptarshi Guha 2011-03-21, 23:10
True, my naming is
Hmm, now I know.
thanks

On Mon, Mar 21, 2011 at 4:01 PM, Niels Basjes <[EMAIL PROTECTED]> wrote:
> Hi,
>
> 2011/3/21 Saptarshi Guha <[EMAIL PROTECTED]>:
>> It's frustrating to be dealing with these simple problems (and I know
>> the fault is mine, I'm missing something).
>> I'm running word count (from 0.20-2) on a gzip file (very small), the
>> output has binary characters.
>> When I run the same on the ungzipped file, the output is correct ascii.
>>
>> I'm using the native gzip library. The command is
>>
>>  hadoop jar /usr/lib/hadoop-0.20/hadoop-examples-0.20.2-CDH3B4.jar
>> wordcount /user/sguha/tmp/o.zip /user/sguha/tmp/o.wc.zip
>>
>> (zip is gzip)
>
> No, .zip is "pkzip" and .gz is gzip.
>
> The applicable Hadoop code actually chooses the decompressor based on
> the extension of the filename.
>
> --
> Niels Basjes
>
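
For anyone who hits the same thing later, here is a rough sketch of the behaviour Niels describes. It is not Hadoop's code: the suffix table and the names codec_for and sniff_format are made up for illustration, and the "no match means the bytes are read as-is" branch is my assumption about why wordcount printed binary data. The magic-byte values for gzip and pkzip are the standard ones.

```python
import gzip
import io
import zipfile

# Illustrative suffix table only; NOT Hadoop's actual codec list.
SUFFIX_TO_CODEC = {".gz": "gzip", ".bz2": "bzip2"}


def codec_for(path):
    """Return the codec name implied by the filename suffix, or None."""
    for suffix, codec in SUFFIX_TO_CODEC.items():
        if path.endswith(suffix):
            return codec
    # Assumption: with no matching suffix the input is read undecompressed.
    return None


def sniff_format(head):
    """Guess the real container format from the leading magic bytes."""
    if head[:2] == b"\x1f\x8b":        # gzip member header
        return "gzip"
    if head[:4] == b"PK\x03\x04":      # pkzip local file header
        return "pkzip"
    return "unknown"


# Build one small sample of each format in memory.
gz_bytes = gzip.compress(b"hello hadoop\n")
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("o.txt", "hello hadoop\n")
zip_bytes = buf.getvalue()

print(codec_for("/user/sguha/tmp/o.zip"))  # None: suffix not recognised
print(codec_for("/user/sguha/tmp/o.gz"))   # gzip
print(sniff_format(gz_bytes))              # gzip
print(sniff_format(zip_bytes))             # pkzip
```

The point is that the content of the file never enters the first lookup, only its name does. So a gzip file named o.zip gets no decompressor, and renaming it to end in .gz should be enough for the same job to read it as text.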