I'm looking into finding a suitable format to store data in HDFS, so that
it's available for processing by Hive. Ideally I would like to satisfy the
following requirements:
1. store the data in a format that is readable by multiple Hadoop projects
(e.g. Pig, Mahout), not just Hive
2. work with a Hive external table
3. store data in a compressed format that is splittable
(1) is a requirement because Hive isn't appropriate for all the problems
that we want to throw at Hadoop.
(2) is really more of a consequence of (1). Ideally we want the data stored
in some open format that is compressed in HDFS.
This way we can just point Hive, Pig, Mahout, etc. at it depending on the
problem.
(3) is so that Hadoop can split large files across multiple map tasks
instead of reading each file in a single task.
Gzip is no good because it is not splittable. Snappy looked promising, but
it is splittable only if used with a non-external Hive table.
LZO also looked promising, but I wonder whether it is future-proof
given the licensing issues surrounding it.
So far, the only solution I could find that satisfies all the above seems
to be bzip2 compression, but concerns about its performance make me wary
about choosing it.
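
To get a rough feel for the bzip2 performance concern, here is a minimal Python sketch that compresses the same bytes with gzip and bzip2 and reports size and timing. It is only an illustration: it uses Python's standard library rather than the Hadoop codecs, so the absolute numbers will not match a cluster, and the function name `compare_codecs` and the sample rows are hypothetical, made up for this example.

```python
import bz2
import gzip
import time


def compare_codecs(data: bytes) -> dict:
    """Compress `data` with gzip and bzip2; report sizes and timings."""
    results = {}
    for name, codec in (("gzip", gzip), ("bzip2", bz2)):
        start = time.perf_counter()
        packed = codec.compress(data)
        compress_s = time.perf_counter() - start

        start = time.perf_counter()
        unpacked = codec.decompress(packed)
        decompress_s = time.perf_counter() - start

        results[name] = {
            "compressed_bytes": len(packed),
            "compress_seconds": compress_s,
            "decompress_seconds": decompress_s,
            "round_trip_ok": unpacked == data,
        }
    return results


if __name__ == "__main__":
    # Hypothetical sample: repetitive tab-separated rows standing in for
    # real table data. Replace with a slice of your own files.
    sample = b"".join(
        b"%d\tuser%d\tclick\n" % (i, i % 100) for i in range(200_000)
    )
    for name, stats in compare_codecs(sample).items():
        print(name, stats)
```

Running it on a representative slice of the real data gives a first impression of the size versus CPU trade-off before testing on the cluster itself.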
Is bzip2 the only option I have? Or have I missed some other compression
option?
Edward Capriolo 2012-11-05, 16:04
Krishna Rao 2012-11-06, 09:50
Bejoy KS 2012-11-06, 17:22