Jacques Nadeau 2013-05-01, 19:08
I think that it is likely that ORC and Parquet will experiment with
alternative encoding techniques for compression/performance purposes.
Also, as you point out, a field level encoding may actually be
sub-composed of multiple types of data structures. While these are
fine at the storage layer, it is hard for the execution layer to
directly operate on these variations. When I say container, I am
trying to call out these points of flexibility and clarify that these
explorations are generally outside the domain of what we're initially
focused on for Drill.
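
To illustrate why the execution layer needs separate code for each encoding it operates on directly, here is a minimal Python sketch (not Drill code). The names `rle_encode`, `sum_plain` and `sum_rle` are hypothetical, and the (value, run_length) pair layout is an assumption chosen for clarity, not the layout used by Drill, ORC or Parquet.

```python
from typing import List, Tuple

# Assumed layout for this sketch: one (value, run_length) pair per run.
Run = Tuple[int, int]


def rle_encode(values: List[int]) -> List[Run]:
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs: List[Run] = []
    for v in values:
        if runs and runs[-1][0] == v:
            # Same value as the previous element: extend the current run.
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            # Value changed (or first element): start a new run.
            runs.append((v, 1))
    return runs


def sum_plain(values: List[int]) -> int:
    """Sum over a plain, fully decoded vector: one step per value."""
    total = 0
    for v in values:
        total += v
    return total


def sum_rle(runs: List[Run]) -> int:
    """Sum directly over the runs without decoding: one step per run."""
    total = 0
    for value, length in runs:
        total += value * length
    return total
```

Both functions return the same result, but `sum_rle` only works because it knows the run layout. Every additional encoding variation would need its own version of each operator, which is the cost being weighed above.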
As for your other statement regarding a shift in encoding formats:
ProtoBuf, Thrift and Avro all describe a set of things including APIs,
schemas and serialization formats. I agree that the on-disk format of
these objects is morphing (and thus serialization is changing). In
fact, that is part of what we're betting on with Drill's more
pipelined vectorized model of execution. Exciting times!
On Tue, Apr 30, 2013 at 9:02 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
>> > 2, The in-memory format that supports either ValueVector, RLE or Dict, I
>> > assume RLE or Dict will be leveraging either Orc or Parquet right?
>> Kind of. RLE and Dict are abstractions where a particular operator can take
>> advantage of the nature of that encoding. Parquet and ORC are really
>> container formats as opposed to field level formats.
> Not really. Unless you mean something very specific that I'm missing, they
> are field level formats. ORC relies on the fact that the types are known to
> pick the right encoder for each column. For example, ORC uses RLE for
> integer data. (In fact, because the dictionary encoding includes integer
> data, so do string columns.) In some cases, the ORC writer has a choice of
> encodings, but it is focused on picking the right encoding for a particular
> set of data. For example, if a string column has enough duplicated values
> it will choose a dictionary encoder instead of a direct encoder. But it is
> certainly not the case that ORC is a container format where the choice of
> serialization is an additional choice.
> Unlike RCFile, SequenceFile, TFile, or HFile, it doesn't make sense to
> store ProtoBuf or Writables in an ORC file. One of the amusing
> characteristics of these new file formats is EXACTLY that. In 2 years, I
> would be surprised if anyone is writing new data to files in ProtoBuf,
> Thrift, or Avro. It will be one of these new formats. That is a big change.
> -- Owen
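
The writer-side choice Owen describes, where a string column with enough duplicated values gets a dictionary encoder instead of a direct one, can be sketched roughly as follows. This is an illustration only and not ORC's implementation: `dictionary_encode`, `choose_encoding` and the `max_distinct_ratio` cutoff of 0.5 are all hypothetical, and a real writer may use a different statistic and threshold.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Return (dictionary, ids); each id indexes into the dictionary."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    ids: List[int] = []
    for v in values:
        if v not in index:
            # First time this string is seen: assign the next id.
            index[v] = len(dictionary)
            dictionary.append(v)
        ids.append(index[v])
    return dictionary, ids


def choose_encoding(values: List[str], max_distinct_ratio: float = 0.5) -> tuple:
    """Pick dictionary encoding when the column has enough duplicates.

    max_distinct_ratio is a made-up cutoff for this sketch: the column is
    dictionary encoded when distinct values / total values is at or
    below it, and stored directly otherwise.
    """
    if not values:
        return ("direct", [])
    dictionary, ids = dictionary_encode(values)
    if len(dictionary) / len(values) <= max_distinct_ratio:
        return ("dictionary", dictionary, ids)
    return ("direct", list(values))
```

In the dictionary case the `ids` list is plain integer data, which is why a string column can also end up going through an integer encoder such as RLE.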