Re: TextInputFormat and Gzip encoding - wordcount displaying binary data
Hi,

2011/3/21 Saptarshi Guha <[EMAIL PROTECTED]>:
> It's frustrating to be dealing with these simple problems (and I know
> the fault is mine, i'm missing something).
> I'm running word count (from 0.20-2) on a gzip file (very small), the
> output has binary characters.
> When I run the same on the ungzipped file, the output is correct ascii.
>
> I'm using the native gzip library. The command is
>
>  hadoop jar /usr/lib/hadoop-0.20/hadoop-examples-0.20.2-CDH3B4.jar
> wordcount /user/sguha/tmp/o.zip /user/sguha/tmp/o.wc.zip
>
> (zip is gzip)

No, .zip is "pkzip" and .gz is gzip.

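The relevant Hadoop code chooses the decompressor based on the
extension of the filename. A name ending in .zip matches no codec, so
the file is read as plain text without decompression, which is why the
output contains binary characters.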
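
To make the extension-based choice concrete, here is a minimal, self-contained sketch. It is not Hadoop's code: the class name ExtensionCodecLookup and its suffix table are hypothetical, and the three suffixes are assumed to mirror the codecs a default 0.20 install registers. It only illustrates the lookup: no matching suffix means no decompressor.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for an extension-based codec lookup.
// It never inspects file contents, only the file name.
class ExtensionCodecLookup {
    // Assumed suffix table; a real cluster's list depends on its configuration.
    private static final Map<String, String> SUFFIX_TO_CODEC = new LinkedHashMap<>();
    static {
        SUFFIX_TO_CODEC.put(".gz", "gzip");
        SUFFIX_TO_CODEC.put(".bz2", "bzip2");
        SUFFIX_TO_CODEC.put(".deflate", "deflate");
    }

    // Returns the codec name for a file name, or null when no suffix matches.
    static String codecFor(String fileName) {
        for (Map.Entry<String, String> entry : SUFFIX_TO_CODEC.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"o.zip", "o.gz"}) {
            String codec = codecFor(name);
            System.out.println(name + " -> "
                + (codec == null ? "no codec, read as plain text" : codec));
        }
    }
}
```

Under these assumptions, o.zip gets no codec even if its contents are gzip data, while the same bytes stored as o.gz would be routed to the gzip decompressor. Renaming the input so that it ends in .gz should therefore be enough if the file really is gzip-compressed.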

--
Niels Basjes