-
Hive compression with external table
Krishna Rao 2012-11-05, 15:57
Hi all,
I'm looking into finding a suitable format to store data in HDFS, so that it's available for processing by Hive. Ideally I would like to satisfy the following:
1. store the data in a format that is readable by multiple Hadoop projects (e.g. Pig, Mahout, etc.), not just Hive
2. work with a Hive external table
3. store data in a compressed format that is splittable
(1) is a requirement because Hive isn't appropriate for all the problems that we want to throw at Hadoop.
(2) is really more of a consequence of (1). Ideally we want the data stored in some open format that is compressed in HDFS. This way we can just point Hive, Pig, Mahout, etc at it depending on the problem.
(3) is obviously so it plays well with Hadoop.
Gzip is no good because it is not splittable. Snappy looked promising, but it is splittable only if used with a non-external Hive table. LZO also looked promising, but I wonder about whether it is future proof given the licencing issues surrounding it.
So far, the only solution I could find that satisfies all the above seems to be bzip2 compression, but concerns about its performance make me wary about choosing it.
Is bzip2 the only option I have? Or have I missed some other compression option?
Cheers,
Krishna
-
Re: Hive compression with external table
Edward Capriolo 2012-11-05, 16:04
Compression is a confusing issue. Sequence files that are in block format are always splittable, regardless of which compression codec is chosen for the blocks. The Programming Hive book has an entire section dedicated to the permutations of compression options.
Edward
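As a rough illustration of why block format is splittable whatever the codec, here is a toy model in Python. It is only a sketch, not the real SequenceFile layout: the `SYNC` marker value, the 4-byte length prefix, and the helper names `write_blocks` and `read_split` are all made up for the example, and zlib and bz2 stand in for Snappy because they ship with the standard library. The idea it shows is that each block is compressed independently and preceded by a marker, so a reader handed an arbitrary byte range can skip to the next marker and start decompressing there.

```python
import bz2
import struct
import zlib

# Hypothetical 16-byte marker written before every block (toy value, not Hadoop's).
SYNC = b"\x00toy-sync-mark\xff\x00"


def write_blocks(records, compress, block_size=3):
    """Compress records in independent blocks, each prefixed by SYNC and a length."""
    out = bytearray()
    for i in range(0, len(records), block_size):
        payload = compress("\n".join(records[i:i + block_size]).encode("utf-8"))
        out += SYNC + struct.pack(">I", len(payload)) + payload
    return bytes(out)


def read_split(data, start, end, decompress):
    """Return the records of every block whose marker begins in [start, end)."""
    records = []
    pos = data.find(SYNC, start)  # skip forward to the first marker in the split
    while pos != -1 and pos < end:
        header_end = pos + len(SYNC) + 4
        (length,) = struct.unpack(">I", data[pos + len(SYNC):header_end])
        payload = data[header_end:header_end + length]
        records.extend(decompress(payload).decode("utf-8").split("\n"))
        pos = data.find(SYNC, header_end + length)
    return records


records = [f"row-{n}" for n in range(10)]
for compress, decompress in ((zlib.compress, zlib.decompress),
                             (bz2.compress, bz2.decompress)):
    data = write_blocks(records, compress)
    middle = len(data) // 2  # arbitrary split point, usually mid-block
    first = read_split(data, 0, middle, decompress)
    second = read_split(data, middle, len(data), decompress)
    # Two independent readers together recover every record exactly once.
    assert first + second == records
```

The same reader works for both codecs because splitting depends only on the markers between blocks, never on the compression inside a block. A single gzip stream over the whole file has no such restart points, which is why it cannot be split.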
-
Re: Hive compression with external table
Krishna Rao 2012-11-06, 09:50
Thanks for the reply. Compressed sequence files might work. However, it's not clear to me whether sequence files can be read through an external table.
-
Re: Hive compression with external table
Bejoy KS 2012-11-06, 17:22
Hi Krishna
Sequence files with Snappy compression would be my recommendation as well. They can be read through managed as well as external tables.
There is no difference in storage formats between managed and external tables. The files can also be consumed directly by MapReduce or Pig.
Regards
Bejoy KS