IMO, it would be enlightening for Hadoop users to compare Parquet with Trevni
and ORCFile, all of which are relatively new columnar formats for Hadoop.
Do we really need three columnar formats?
On Tue, Mar 12, 2013 at 8:45 AM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> Fellow Hadoopers,
> We'd like to introduce a joint project between Twitter and Cloudera
> engineers -- a new columnar storage format for Hadoop called Parquet.
> We created Parquet to make the advantages of compressed, efficient columnar
> data representation available to any project in the Hadoop ecosystem,
> regardless of the choice of data processing framework, data model, or
> programming language.
> Parquet is built from the ground up with complex nested data structures in
> mind. We adopted the repetition/definition level approach to encoding such
> data structures, as described in Google's Dremel paper; we have found this
> to be a very efficient method of encoding data in non-trivial object
> schemas.
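> To make the level encoding concrete, here is a small hand-worked example
> (a sketch following the rules from the Dremel paper, with illustrative
> names -- not Parquet's actual API):
>
>     // One record against a Dremel-style schema:
>     //   message Record {
>     //     required long id;
>     //     repeated group links { optional string url; }
>     //   }
>     // Record: { id: 1, links: [ {url: "A"}, {url: null}, {url: "B"} ] }
>     public class LevelsDemo {
>         public static void main(String[] args) {
>             // The links.url column becomes a flat stream of triples:
>             String[] url = { "A", null, "B" };
>             // Repetition level: 0 starts a new record; 1 means the value
>             // repeats at the "links" level within the same record.
>             int[] rep = { 0, 1, 1 };
>             // Definition level: 2 means url is present; 1 means the links
>             // entry exists but its optional url is null.
>             int[] def = { 2, 1, 2 };
>             for (int i = 0; i < url.length; i++) {
>                 System.out.printf("value=%s r=%d d=%d%n", url[i], rep[i], def[i]);
>             }
>         }
>     }
>
> Because nulls are captured entirely by the definition levels, no
> placeholder values need to be stored in the column itself.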
> Parquet is built to support very efficient compression and encoding
> schemes. Compression can be specified per column, and the format is
> future-proofed to allow new encodings to be added as they are invented
> and implemented. We separate the concepts of encoding and compression,
> allowing Parquet consumers to implement operators that work directly on
> encoded data without paying the decompression and decoding penalty when
> possible.
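> As a toy illustration of what operating on encoded data buys you,
> consider a dictionary-encoded string column: an equality predicate can
> be rewritten once against the dictionary and then evaluated on the
> integer ids, never materializing the strings (names here are
> illustrative, not Parquet's API):
>
>     import java.util.Arrays;
>
>     public class DictionaryScanDemo {
>         public static void main(String[] args) {
>             // Dictionary built once per column chunk.
>             String[] dictionary = { "us", "fr", "jp" };
>             // The column data itself is just small integer ids.
>             int[] encoded = { 0, 2, 1, 0, 0, 2 };
>             // Translate the predicate country == "fr" to an id once...
>             int target = Arrays.asList(dictionary).indexOf("fr");
>             // ...then scan the ids directly; no per-row string decode.
>             int matches = 0;
>             for (int id : encoded) {
>                 if (id == target) matches++;
>             }
>             System.out.println("matches: " + matches); // prints 1
>         }
>     }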
> Parquet is built to be used by anyone. The Hadoop ecosystem is rich with
> data processing frameworks, and we are not interested in playing favorites.
> We believe that an efficient, well-implemented columnar storage substrate
> should be useful to all frameworks without the cost of extensive and
> difficult-to-set-up dependencies.
> The initial code, available at https://github.com/Parquet, defines the
> format, provides Java building blocks for processing columnar data, and
> implements Hadoop Input/Output Formats, Pig Storers/Loaders, and an example
> of a complex integration -- Input/Output formats that can convert
> Parquet-stored data directly to and from Thrift objects.
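> For plain MapReduce users, consumption might look roughly like the
> sketch below. The class names are assumed from the parquet-hadoop
> module; check the repository for the actual API and for the ReadSupport
> hook that controls how columns are assembled into records:
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.mapreduce.Job;
>     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>     import parquet.hadoop.ParquetInputFormat;
>
>     public class ParquetScanJob {
>         public static void main(String[] args) throws Exception {
>             Job job = new Job(new Configuration(), "parquet-scan");
>             job.setJarByClass(ParquetScanJob.class);
>             // Each mapper receives assembled records; which columns are
>             // read (and how) is driven by the configured ReadSupport.
>             job.setInputFormatClass(ParquetInputFormat.class);
>             FileInputFormat.addInputPath(job, new Path(args[0]));
>             // ... set mapper, reducer, and output format as usual ...
>             System.exit(job.waitForCompletion(true) ? 0 : 1);
>         }
>     }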
> A preview version of Parquet support will be available in Cloudera's Impala.
> Twitter is starting to convert some of its major data sources to Parquet in
> order to take advantage of the compression and deserialization savings.
> Parquet is currently under heavy development; its near-term roadmap includes:
> * Hive SerDes (Criteo)
> * Cascading Taps (Criteo)
> * Support for dictionary encoding, zigzag encoding, and RLE encoding of
> data (Cloudera and Twitter); see the zigzag sketch below
> * Further improvements to Pig support (Twitter)
> Company names in parentheses indicate whose engineers signed up to do the
> work -- others can feel free to jump in too, of course.
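> For a feel of what one of these encodings does, here is a minimal
> zigzag sketch -- the standard trick (also used by Protocol Buffers)
> that maps signed integers to unsigned ones so small magnitudes get
> small codes; this is illustrative, not Parquet's implementation:
>
>     public class ZigZagDemo {
>         // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
>         static int encode(int n) { return (n << 1) ^ (n >> 31); }
>         static int decode(int z) { return (z >>> 1) ^ -(z & 1); }
>
>         public static void main(String[] args) {
>             for (int n : new int[] { 0, -1, 1, -2, 2, -123, 123 }) {
>                 int z = encode(n);
>                 System.out.printf("%4d -> %4d -> %4d%n", n, z, decode(z));
>             }
>         }
>     }
>
> Encodings like this keep the values small and regular, which is exactly
> what makes the downstream RLE and general-purpose compression effective.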
> We've also heard requests to provide an Avro container layer, similar to
> what we do with Thrift. Seeking volunteers!
> We welcome all feedback, patches, and ideas; to foster community
> development, we plan to contribute Parquet to the Apache Incubator when the
> development is further along.
> Nong Li, Julien Le Dem, Marcel Kornacker, Todd Lipcon, Dmitriy Ryaboy,
> Jonathan Coveney, and friends.