Hive, mail # user - external table or gz compressed file

Panshul Whisper 2013-05-02, 13:00
Re: external table or gz compressed file
Sanjay Subramanian 2013-05-02, 17:33

Hive can read gz files out of the box, with no additional configuration.
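
As a minimal sketch of what "out of the box" looks like here: an external table defined over a directory of gzip-compressed text files. The table name, columns, delimiter, and bucket path are all made up for illustration; the only point is that no compression settings appear, because Hadoop selects the gzip codec from the .gz file extension when the text files are read.

```sql
-- Hypothetical table, columns, and S3 location; adjust to your own data.
-- LOCATION points at the directory that holds the *.gz files.
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
  log_time STRING,
  host     STRING,
  message  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3n://your-bucket/logs/';

-- No compression-related SET statements are needed for reading.
SELECT host, COUNT(*) AS hits
FROM web_logs
GROUP BY host;
```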

If you want Hive to write compressed output files (say gz), add the following at the beginning of your Hive SQL:
SET hive.exec.compress.output=true;
-- sets 16 reducers, so a query with a reduce stage writes at most 16 gzip files
SET mapred.reduce.tasks=16;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
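
To show where those settings sit relative to the query, here is a rough sketch of a complete script. The output directory and the web_logs table with its host column are hypothetical placeholders, not anything from the original question.

```sql
-- Compression settings go first, before the query they should affect.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapred.reduce.tasks=16;

-- Hypothetical output path and source table.
-- The GROUP BY forces a reduce stage, so the reducer count above
-- bounds the number of compressed files written to the directory.
INSERT OVERWRITE DIRECTORY '/tmp/web_logs_by_host'
SELECT host, COUNT(*) AS hits
FROM web_logs
GROUP BY host;
```

A map-only query (no GROUP BY, JOIN, or ORDER BY) has no reduce stage, so in that case the number of output files follows the number of map tasks rather than mapred.reduce.tasks.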
Side note (may or may not be relevant to you, but nevertheless):
You may know that GZIP is not splittable. Unless you have a definite reason to use GZIP (for example, multiple lines in a log file together constitute one logical object or record), I would recommend LZO.
A little bit of plumbing is required since LZO no longer ships with Hadoop out of the box, but it's pretty straightforward. Remember to run the LZO indexer on your output so that the LZO files can be split going forward.
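
A sketch of what the LZO variant might look like, assuming the separately installed hadoop-lzo library (with its native LZO dependencies) is present on every node and its codecs are registered in io.compression.codecs. The class names below come from that library; the table, columns, and paths are hypothetical.

```sql
-- Write query output as .lzo files instead of .gz.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

-- Hypothetical table for reading the .lzo files back.
-- The LZO-aware input format is what lets indexed files be split.
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs_lzo (
  log_time STRING,
  host     STRING,
  message  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
  INPUTFORMAT  'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/data/web_logs_lzo';

-- The index is built outside Hive once the files exist, for example
-- (the jar path is a placeholder):
--   hadoop jar /path/to/hadoop-lzo.jar \
--     com.hadoop.compression.lzo.DistributedLzoIndexer /data/web_logs_lzo
```

Without the index files, each .lzo file is still readable but is processed by a single mapper, the same as a .gz file.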


From: Panshul Whisper <[EMAIL PROTECTED]>
Date: Thursday, May 2, 2013 6:00 AM
Subject: external table or gz compressed file


Can somebody please explain to me, or point me in the right direction on, how Hive handles gz compressed files if I create an external table pointing to a .gz compressed file stored on AWS S3?
Does Hive copy the file to HDFS and decompress it before using it?
Or does it use the file directly?
If we use an uncompressed file stored on S3, does Hive still copy the file to HDFS, or does it read records directly from S3?

Please help me understand how this works.

Thanking You,

Ouch Whisper
