Krishna Rao 2012-11-05, 15:57
Edward Capriolo 2012-11-05, 16:04
Thanks for the reply. Sequence files with compression might work.
However, it's not clear to me whether it's possible to read Sequence files
using an external table.
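For what it's worth, Hive does allow an external table to be declared over
existing SequenceFiles with STORED AS SEQUENCEFILE and a LOCATION clause. A
minimal sketch; the table name, columns, and HDFS path below are made up for
illustration:

    -- External table pointing at an existing HDFS directory of SequenceFiles.
    -- Table name, columns, and LOCATION are hypothetical.
    CREATE EXTERNAL TABLE web_logs (
        log_ts   STRING,
        user_id  BIGINT,
        url      STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS SEQUENCEFILE
    LOCATION '/data/raw/web_logs';

Hive reads the files in place (ignoring the SequenceFile keys), so the same
directory remains available to other tools.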
On 5 November 2012 16:04, Edward Capriolo <[EMAIL PROTECTED]> wrote:
> Compression is a confusing issue. Sequence files that are in block
> format are always splittable regardless of which compression codec is
> chosen for the blocks. The Programming Hive book has an entire section
> dedicated to the permutations of compression options.
> On Mon, Nov 5, 2012 at 10:57 AM, Krishna Rao <[EMAIL PROTECTED]> wrote:
> > Hi all,
> > I'm looking into finding a suitable format to store data in HDFS, so that
> > it's available for processing by Hive. Ideally I would like to satisfy the
> > following:
> > 1. store the data in a format that is readable by multiple Hadoop tools
> > (e.g. Pig, Mahout, etc.), not just Hive
> > 2. work with a Hive external table
> > 3. store data in a compressed format that is splittable
> > (1) is a requirement because Hive isn't appropriate for all the problems
> > that we want to throw at Hadoop.
> > (2) is really more of a consequence of (1). Ideally we want the data
> > in some open format that is compressed in HDFS.
> > This way we can just point Hive, Pig, Mahout, etc. at it depending on the
> > problem.
> > (3) is obviously so it plays well with Hadoop.
> > Gzip is no good because it is not splittable. Snappy looked promising, but
> > it is splittable only if used with a non-external Hive table.
> > LZO also looked promising, but I wonder whether it is future-proof
> > given the licensing issues surrounding it.
> > So far, the only solution I could find that satisfies all of the above
> > seems to be bzip2 compression, but concerns about its performance make me
> > wary of choosing it.
> > Is bzip2 the only option I have? Or have I missed some other compression
> > option?
> > Cheers,
> > Krishna
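To make Edward's point above concrete: block-compressed SequenceFile output
can be produced from Hive with a few session settings, and the result stays
splittable because compression is applied per block inside the SequenceFile
container. A minimal sketch, assuming the Snappy codec is installed on the
cluster and using a made-up source table name:

    -- Session settings for block-compressed SequenceFile output
    -- (property names as of Hive/Hadoop of this era).
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.type=BLOCK;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

    -- Write a block-compressed SequenceFile copy of a (hypothetical) table.
    CREATE TABLE web_logs_seq
    STORED AS SEQUENCEFILE
    AS SELECT * FROM web_logs;

Since the files are ordinary SequenceFiles, Pig, Mahout, or plain MapReduce
jobs can read them through the standard SequenceFile input format as well.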
Bejoy KS 2012-11-06, 17:22