Pig >> mail # user >> Re: Introducing Parquet: efficient columnar storage for Hadoop.


Stan Rosenberg 2013-03-12, 18:01
Kevin Olson 2013-03-12, 20:06
Jonathan Coveney 2013-03-12, 22:07

Re: Introducing Parquet: efficient columnar storage for Hadoop.
Cloudera has published a blog post [1] about Parquet that seems to answer most of the questions raised in this thread; I would encourage everyone to read it. It specifically discusses the relationship with Trevni:

Parquet is designed to bring efficient columnar storage to Hadoop. Compared to, and learning from, the initial work done toward this goal in Trevni, Parquet includes the following enhancements:

* Efficiently encode nested structures and sparsely populated data based on the Google Dremel definition/repetition levels (see the worked sketch below)
* Provide extensible support for per-column encodings (e.g. delta, run-length, etc.)
* Provide extensibility for storing multiple types of data in column data (e.g. indexes, bloom filters, statistics)
* Offer better write performance by storing metadata at the end of the file

Jarcec

Links:
1: http://blog.cloudera.com/blog/2013/03/introducing-parquet-columnar-storage-for-apache-hadoop/
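
For readers who want the intuition behind the Dremel definition/repetition
levels mentioned above, here is a small worked sketch. It follows the
Dremel paper rather than the Parquet code, and the schema and field names
are made up for illustration:

    message AddressBook {
      required binary owner;
      repeated binary phone;
    }

    record 1: owner = "alice", phone = ["555-1234", "555-5678"]
    record 2: owner = "bob",   phone = []

The "phone" column is stored as (value, repetition level r, definition
level d) entries:

    "555-1234"  r=0, d=1   (r=0: first phone value of a new record)
    "555-5678"  r=1, d=1   (r=1: another value in the same phone list)
    (no value)  r=0, d=0   (d=0: record 2 defines no phone at all)

Two small integers per value are enough to reassemble the nesting on
read, which is why empty or sparsely populated fields stay cheap.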

On Tue, Mar 12, 2013 at 01:06:04PM -0700, Kevin Olson wrote:
> Second on that. Parquet looks compelling, but I'm curious to understand why
> Cloudera suddenly switched from espousing future support for Trevni to
> teaming with Twitter on Parquet.
>
> On Tue, Mar 12, 2013 at 11:01 AM, Stan Rosenberg <[EMAIL PROTECTED]> wrote:
>
> > Dmitriy,
> >
> > Please excuse my ignorance. What is/was wrong with Trevni
> > (https://github.com/cutting/trevni)?
> >
> > Thanks,
> >
> > stan
> >
> > On Tue, Mar 12, 2013 at 11:45 AM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> > > Fellow Hadoopers,
> > >
> > > We'd like to introduce a joint project between Twitter and Cloudera
> > > engineers -- a new columnar storage format for Hadoop called Parquet (
> > > http://parquet.github.com).
> > >
> > > We created Parquet to make the advantages of compressed, efficient
> > > columnar data representation available to any project in the Hadoop
> > > ecosystem, regardless of the choice of data processing framework,
> > > data model, or programming language.
> > >
> > > Parquet is built from the ground up with complex nested data
> > > structures in mind. We adopted the repetition/definition level
> > > approach to encoding such data structures, as described in Google's
> > > Dremel paper; we have found this to be a very efficient method of
> > > encoding data in non-trivial object schemas.
> > >
> > > Parquet is built to support very efficient compression and encoding
> > > schemes. Parquet allows compression schemes to be specified on a
> > > per-column level, and is future-proofed to allow adding more
> > > encodings as they are invented and implemented. We separate the
> > > concepts of encoding and compression, allowing Parquet consumers to
> > > implement operators that work directly on encoded data without
> > > paying a decompression and decoding penalty when possible.
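
A minimal sketch of that last point in plain Java follows. It illustrates
operating directly on encoded data in general, not Parquet's actual
internals or API: a predicate such as an equality count can be answered
from run-length-encoded runs with one comparison per run instead of one
per row.

    import java.util.ArrayList;
    import java.util.List;

    // Illustration only, not Parquet code: shows why an operator that
    // understands the encoding can skip per-row decoding entirely.
    public class RleCountExample {
        // One run of a run-length-encoded column: 'value' repeated 'count' times.
        record Run(long value, long count) {}

        // Run-length encode a column of longs.
        static List<Run> encode(long[] column) {
            List<Run> runs = new ArrayList<>();
            for (long v : column) {
                int last = runs.size() - 1;
                if (last >= 0 && runs.get(last).value() == v) {
                    runs.set(last, new Run(v, runs.get(last).count() + 1));
                } else {
                    runs.add(new Run(v, 1));
                }
            }
            return runs;
        }

        // Count rows equal to 'target' directly on the runs: one
        // comparison per run, no per-row decode.
        static long countEquals(List<Run> runs, long target) {
            long total = 0;
            for (Run r : runs) {
                if (r.value() == target) total += r.count();
            }
            return total;
        }

        public static void main(String[] args) {
            long[] column = {7, 7, 7, 7, 3, 3, 7, 7};
            List<Run> runs = encode(column);           // [(7,4), (3,2), (7,2)]
            System.out.println(countEquals(runs, 7));  // prints 6
        }
    }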
> > >
> > > Parquet is built to be used by anyone. The Hadoop ecosystem is rich
> > > with data processing frameworks, and we are not interested in
> > > playing favorites. We believe that an efficient, well-implemented
> > > columnar storage substrate should be useful to all frameworks
> > > without the cost of extensive and difficult-to-set-up dependencies.
> > >
> > > The initial code, available at https://github.com/Parquet, defines
> > > the file format, provides Java building blocks for processing
> > > columnar data, and implements Hadoop Input/Output Formats, Pig
> > > Storers/Loaders, and an example of a complex integration --
> > > Input/Output formats that can convert Parquet-stored data directly
> > > to and from Thrift objects.
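
To make the Hadoop integration above concrete, here is a hedged sketch
of plugging a Parquet InputFormat into a plain MapReduce job. The
parquet.* class and method names are assumptions based on this
announcement's description, not verified against the initial release;
check https://github.com/Parquet for the actual API.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import parquet.hadoop.ParquetInputFormat;        // assumed class name
    import parquet.hadoop.example.GroupReadSupport;  // assumed class name

    public class ReadParquetJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "read-parquet");
            job.setJarByClass(ReadParquetJob.class);

            // Hand mappers fully assembled records from a Parquet file;
            // the rest of the job is written like any other MR job.
            job.setInputFormatClass(ParquetInputFormat.class);
            ParquetInputFormat.setReadSupportClass(job, GroupReadSupport.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Mapper/Reducer classes would be configured here.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }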
> > >
> > > A preview version of Parquet support will be available in
> > > Cloudera's Impala 0.7.
> > >
> > > Twitter is starting to convert some of its major data sources to
> > > Parquet in order to take advantage of the compression and
> > > deserialization savings.
> > >
> > > Parquet is currently under heavy development. Parquet's near-term roadmap
> > > includes:
> > > * Hive SerDes (Criteo)
> > > * Cascading Taps (Criteo)