Zheng Shao 2010-01-23, 02:20
I noticed that Avro has "skip" functions that can skip a field when deserializing data. This is good for column pruning in most cases, but we might be able to do better in the following case. Let's say we have a query like this:
CREATE TABLE t (col1 STRING, col2 STRING, col3 STRING);
SELECT col2 FROM t WHERE col3 = 'abcde';
We want to read field col3 first and, only if it matches the value we are looking for, then read field col2. Is there any way to "remember" the current position of deserialization, so that we can "resume" from that point?
-- Yours, Zheng
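To make the question concrete, here is a minimal sketch of the "remember and resume" idea. It is written in plain Python against a hand-rolled version of Avro's binary string encoding (a zigzag varint length followed by UTF-8 bytes) and does not use the Avro library. The helper names (write_long, read_long, skip_string, select_col2_where_col3) are hypothetical, and the sketch assumes each row is available as a seekable in-memory buffer.

```python
import io

def write_long(out, n):
    """Avro-style long: zigzag-encode, then emit 7 bits per byte, low bits first."""
    z = (n << 1) ^ (n >> 63)
    while z > 0x7F:
        out.write(bytes([(z & 0x7F) | 0x80]))
        z >>= 7
    out.write(bytes([z]))

def read_long(buf):
    """Inverse of write_long."""
    shift = 0
    z = 0
    while True:
        b = buf.read(1)[0]
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)

def write_string(out, s):
    data = s.encode("utf-8")
    write_long(out, len(data))
    out.write(data)

def read_string(buf):
    return buf.read(read_long(buf)).decode("utf-8")

def skip_string(buf):
    # Read the length prefix, then jump over the payload without decoding it.
    buf.seek(read_long(buf), io.SEEK_CUR)

def encode_row(col1, col2, col3):
    out = io.BytesIO()
    for value in (col1, col2, col3):
        write_string(out, value)
    return out.getvalue()

def select_col2_where_col3(rows, wanted):
    """SELECT col2 FROM t WHERE col3 = wanted, decoding col2 only on a match."""
    result = []
    for raw in rows:
        buf = io.BytesIO(raw)
        skip_string(buf)                 # col1: never needed
        col2_pos = buf.tell()            # "remember" where col2 starts
        skip_string(buf)                 # col2: skipped for now
        if read_string(buf) == wanted:   # col3: the predicate column
            buf.seek(col2_pos)           # "resume" from the remembered position
            result.append(read_string(buf))
    return result
```

The point of the sketch is that col2 is decoded into a string only for rows whose col3 passes the filter; for every other row it costs one length read and one seek.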
-
Re: lazy deserialization?
Philip Zeyliger 2010-01-23, 02:38
Not with any of today's APIs. "SELECT col1, col3 FROM t" is handled easily: you construct a schema that only has those columns, and col2 is skipped at read time.
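As an illustration of the projection case, the sketch below walks the fields in the order they were written and decodes only those named in the projection, skipping the rest. It is plain Python over a hand-rolled length-prefixed string encoding, not the Avro API; WRITER_FIELDS, encode_string and read_projected are hypothetical names, and a real reader would resolve a full reader schema against the writer schema rather than use a set of names.

```python
import io

WRITER_FIELDS = ["col1", "col2", "col3"]  # order in which the data was written

def encode_long(n):
    """Zigzag varint, as used for Avro longs and string lengths."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def read_long(buf):
    shift = 0
    z = 0
    while True:
        b = buf.read(1)[0]
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)

def encode_string(s):
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

def read_projected(raw, wanted):
    """Decode only the fields named in `wanted`; skip all others."""
    buf = io.BytesIO(raw)
    row = {}
    for name in WRITER_FIELDS:
        length = read_long(buf)
        if name in wanted:
            row[name] = buf.read(length).decode("utf-8")
        else:
            buf.seek(length, io.SEEK_CUR)  # skipped at read time
    return row
```

For example, read_projected(raw, {"col1", "col3"}) returns a dict with only col1 and col3; the bytes of col2 are stepped over without being decoded.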
Does Hive have a use case for this that you're interested in? If you don't mind paying the buffer copy, you could probably write a "DeferredFoo" class that doesn't de-serialize certain structures...
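One possible shape for such a deferred wrapper is sketched below, again in plain Python rather than against the Avro API. DeferredString is a hypothetical stand-in for the "DeferredFoo" placeholder: it pays the buffer copy up front by holding the raw bytes of a field, and decodes them only when get() is first called.

```python
import io

def encode_string(s):
    """Length-prefixed UTF-8 string with a zigzag varint length."""
    data = s.encode("utf-8")
    z = len(data) << 1  # zigzag of a non-negative length
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out) + data

def read_long(buf):
    shift = 0
    z = 0
    while True:
        b = buf.read(1)[0]
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)

class DeferredString:
    """Holds a copy of a field's raw bytes and decodes them on first use."""

    def __init__(self, raw):
        self._raw = bytes(raw)  # the buffer copy
        self._value = None
        self.decoded = False

    def get(self):
        if not self.decoded:
            self._value = self._raw.decode("utf-8")
            self.decoded = True
        return self._value

def read_deferred_row(raw, n_fields=3):
    """Slice a row into per-field DeferredString objects without decoding any."""
    buf = io.BytesIO(raw)
    return [DeferredString(buf.read(read_long(buf))) for _ in range(n_fields)]
```

With this, a filter can call row[2].get() to test col3 and call row[1].get() only for matching rows; col1 is never decoded at all. The trade-off is the one named above: every field's bytes are copied whether or not they are ever used.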
-- Philip
-
Re: lazy deserialization?
Scott Carey 2010-01-23, 02:45
The binary decoder needs some work to improve performance, and that work requires some extra buffering (AVRO-327). Once that is done, adding deferred, lazy-load capabilities wouldn't be that intrusive, and I am willing to build them into the Java BinaryDecoder if they are needed.
-Scott