Re: Introducing Parquet: efficient columnar storage for Hadoop.
Super excited that this is finally public. The benefits are huge, and
having an (eventually) battle-tested columnar storage format developed for
a diverse set of needs will be awesome.
2013/3/12 Kevin Olson <[EMAIL PROTECTED]>

> Seconded. Parquet looks compelling, but I'm curious why Cloudera
> suddenly switched from espousing future support for Trevni to teaming
> with Twitter on Parquet.
>
> On Tue, Mar 12, 2013 at 11:01 AM, Stan Rosenberg <[EMAIL PROTECTED]> wrote:
>
> > Dmitriy,
> >
> > Please excuse my ignorance. What is/was wrong with Trevni
> > (https://github.com/cutting/trevni)?
> >
> > Thanks,
> >
> > stan
> >
> > On Tue, Mar 12, 2013 at 11:45 AM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> > > Fellow Hadoopers,
> > >
> > > We'd like to introduce a joint project between Twitter and Cloudera
> > > engineers -- a new columnar storage format for Hadoop called Parquet
> > > (http://parquet.github.com).
> > >
> > > We created Parquet to make the advantages of compressed, efficient
> > > columnar data representation available to any project in the Hadoop
> > > ecosystem, regardless of the choice of data processing framework,
> > > data model, or programming language.
> > >
> > > Parquet is built from the ground up with complex nested data
> > > structures in mind. We adopted the repetition/definition level
> > > approach to encoding such data structures, as described in Google's
> > > Dremel paper; we have found this to be a very efficient method of
> > > encoding data in non-trivial object schemas.
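
To make the repetition/definition level idea concrete, here is a minimal
worked example in the Dremel style (the schema is hypothetical, not from
the announcement). Take an optional group "links" containing a repeated
int64 field "backward":

    record 1: links.backward = [10, 30]
    record 2: links is absent

    column links.backward:
      value=10    r=0  d=2   (r=0: starts a new record; d=2: both links
                              and backward are defined)
      value=30    r=1  d=2   (r=1: repeats at the level of backward)
      value=NULL  r=0  d=0   (d=0: links is absent in record 2)

The repetition level records at which repeated ancestor a value starts a
new entry, and the definition level records how many optional/repeated
ancestors are actually present, so the nesting can be reconstructed from
a flat column.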
> > >
> > > Parquet is built to support very efficient compression and encoding
> > > schemes. Parquet allows compression schemes to be specified on a
> > > per-column basis, and is future-proofed to allow adding more
> > > encodings as they are invented and implemented. We separate the
> > > concepts of encoding and compression, allowing Parquet consumers to
> > > implement operators that work directly on encoded data without paying
> > > the decompression and decoding penalty when possible.
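
As a minimal sketch of that separation (the column values here are
hypothetical): dictionary encoding replaces each value with a small
integer ID, and a predicate can be evaluated against the IDs before any
decoding of the values themselves:

    column values : us, fr, us, de, us, fr
    dictionary    : us -> 0, fr -> 1, de -> 2
    encoded IDs   : 0, 1, 0, 2, 0, 1

A predicate like country == 'us' becomes id == 0, evaluated directly on
the encoded stream; compressing the ID stream with a general-purpose
codec is then a separate, per-column-chunk step.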
> > >
> > > Parquet is built to be used by anyone. The Hadoop ecosystem is rich
> > > with data processing frameworks, and we are not interested in playing
> > > favorites. We believe that an efficient, well-implemented columnar
> > > storage substrate should be useful to all frameworks without the cost
> > > of extensive and difficult-to-set-up dependencies.
> > >
> > > The initial code, available at https://github.com/Parquet, defines
> > > the file format, provides Java building blocks for processing
> > > columnar data, and implements Hadoop Input/Output Formats, Pig
> > > Storers/Loaders, and an example of a complex integration --
> > > Input/Output formats that can convert Parquet-stored data directly
> > > to and from Thrift objects.
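
For Pig users, reading and writing Parquet data should look roughly like
the sketch below; the class names parquet.pig.ParquetLoader and
parquet.pig.ParquetStorer are assumptions based on the building blocks
listed above, so check the repository for the actual API:

    -- hypothetical paths; loader/storer class names assumed from the
    -- Pig support described in the announcement
    events = LOAD '/data/events' USING parquet.pig.ParquetLoader();
    STORE events INTO '/data/events_parquet' USING parquet.pig.ParquetStorer();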
> > >
> > > A preview version of Parquet support will be available in Cloudera's
> > > Impala 0.7.
> > >
> > > Twitter is starting to convert some of its major data sources to
> > > Parquet in order to take advantage of the compression and
> > > deserialization savings.
> > >
> > > Parquet is currently under heavy development. Parquet's near-term
> > > roadmap includes:
> > > * Hive SerDes (Criteo)
> > > * Cascading Taps (Criteo)
> > > * Support for dictionary encoding, zigzag encoding, and RLE encoding of
> > > data (Cloudera and Twitter)
> > > * Further improvements to Pig support (Twitter)
> > >
> > > Company names in parentheses indicate whose engineers signed up to
> > > do the work -- others are welcome to jump in too, of course.
> > >
> > > We've also heard requests to provide an Avro container layer,
> > > similar to what we do with Thrift. Seeking volunteers!
> > >
> > > We welcome all feedback, patches, and ideas; to foster community
> > > development, we plan to contribute Parquet to the Apache Incubator
> > > when development is further along.
> > >
> > > Regards,
> > > Nong Li, Julien Le Dem, Marcel Kornacker, Todd Lipcon, Dmitriy Ryaboy,