HDFS >> mail # user >> Check compression codec of an HDFS file


Re: Check compression codec of an HDFS file
If you're looking for header/contents-based inspection, you could
download the file and run the Linux utility 'file' on it; it should
tell you the format.
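For instance (a sketch; the HDFS path below is hypothetical, and the live part of the example just demonstrates 'file' on a locally created gzip file):

```shell
# Fetch the HDFS file locally first (hypothetical path), then sniff it:
#   hdfs dfs -get /user/alex/part-00000 ./part-00000
#   file ./part-00000
# Demonstrated here on a locally created gzip file:
printf 'some data' | gzip > ./sample.gz
file ./sample.gz
# file reports something like: "sample.gz: gzip compressed data, ..."
```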

I don't know about Snappy (AFAIK we don't have Snappy frame/container
format support in Hadoop yet, although upstream Snappy issue 34 seems
resolved now), but gzip files can be identified simply by the magic
sequence in their header bytes.
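Concretely, gzip's magic number is the two bytes 0x1f 0x8b (per RFC 1952), so looking at the first two bytes of the stream is enough (the HDFS path below is hypothetical; the live line demonstrates the same check on locally gzipped data):

```shell
# Against a real HDFS file (hypothetical path):
#   hdfs dfs -cat /user/alex/part-00000 | head -c 2 | od -An -tx1
# Demonstrated on locally gzipped data; a gzip stream shows the
# two magic bytes 1f 8b:
printf 'x' | gzip | head -c 2 | od -An -tx1
```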

If it's sequence files you are looking to analyse, a simple way is to
read the first few hundred bytes, which should contain the codec class
name. Programmatically you can use
https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/io/SequenceFile.Reader.html#getCompressionCodec()
for sequence files.
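From a shell, that header inspection can be sketched like this (the HDFS path is hypothetical; the live line fabricates a SequenceFile-like byte stream purely to show how the codec class name surfaces among the printable characters):

```shell
# A SequenceFile begins with the magic bytes "SEQ", a version byte, the
# key/value class names, and (when compressed) the codec class name --
# all readable as plain text in the first few hundred bytes:
#   hdfs dfs -cat /user/alex/part-00000 | head -c 300 | tr -cd '[:print:]'
# Demonstrated on a fabricated header-like stream; for a
# Snappy-compressed file you would spot SnappyCodec in the output:
printf 'SEQ\006\001org.apache.hadoop.io.compress.SnappyCodec' \
  | head -c 300 | tr -cd '[:print:]'
```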

On Thu, Dec 5, 2013 at 5:10 AM, alex bohr <[EMAIL PROTECTED]> wrote:
> What's the best way to check the compression codec that an HDFS file was
> written with?
>
> We use both Gzip and Snappy compression so I want a way to determine how a
> specific file is compressed.
>
> The closest I found is getCodec, but that relies on the file name suffix
> ... which doesn't exist, since Reducers typically don't add a suffix to
> the filenames they create.
>
> Thanks

--
Harsh J