Drill, mail # dev - buffer allocation of cast into var length type


Re: buffer allocation of cast into var length type
Jason Altekruse 2013-12-04, 03:27
Jinfeng,

I did not even think of actually turning integers into ASCII. While I know it
is part of SQL, it seems like such a crazy thing to do in a short-lived
query on a large dataset.

I would take a look at the code we are using for the project operator, that
is the last time I remember discussing passing buffers between different
value vectors. There we used it for simply changing the metadata for a
column where all that the project involved was a column name change, not a
mathematical operation.

In regards to the more involved case where you need to convert an integer to
its ASCII representation, how would the consumer know how big a buffer
to allocate? Would there be a pre-processing step where you determine the
number of digits needed to represent the integers/doubles in base 10? For
integers I guess we could zero-fill them all to the same length, but that
seems like it wouldn't be worth it for the little time we would save by not
scanning through the dataset.
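
A rough sketch of what such a pre-processing pass could look like for
integers, assuming the input is available as a plain Java int array. The
class and method names (AsciiSizing, asciiLength, totalBytes) are
hypothetical and are not part of the Drill code base:

```java
// Hypothetical helper, not Drill code: sizes the data buffer for an
// int -> varchar cast before any bytes are written.
final class AsciiSizing {
    /** Bytes needed to print v in base 10, including a leading '-'. */
    static int asciiLength(int v) {
        if (v == Integer.MIN_VALUE) {
            return 11; // "-2147483648"; this value cannot be negated
        }
        int len = 1;
        if (v < 0) {
            len++;      // room for the sign
            v = -v;
        }
        while (v >= 10) {
            v /= 10;
            len++;
        }
        return len;
    }

    /** First pass over the input: total bytes the data buffer must hold. */
    static long totalBytes(int[] values) {
        long total = 0;
        for (int v : values) {
            total += asciiLength(v);
        }
        return total;
    }
}
```

The cost is one extra scan of the input before the conversion itself, which
is the trade-off raised above.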

Another option is that we could always over-allocate the buffers and then
slice off the excess, but there is no really good way to avoid waste.
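
A minimal sketch of the over-allocate-then-slice idea, using java.nio.ByteBuffer
as a stand-in for whatever buffer type the value vectors actually use. The
names OverAllocatingCast, VarLenResult and castToAscii are assumptions made
for illustration only:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Drill code.
final class OverAllocatingCast {
    // Worst case for a 32-bit int: "-2147483648" is 11 ASCII bytes.
    static final int MAX_INT_ASCII_BYTES = 11;

    /** Trimmed data view plus offsets; value i spans offsets[i]..offsets[i+1]. */
    static final class VarLenResult {
        final ByteBuffer data;
        final int[] offsets;

        VarLenResult(ByteBuffer data, int[] offsets) {
            this.data = data;
            this.offsets = offsets;
        }
    }

    static VarLenResult castToAscii(int[] values) {
        // Allocate for the worst case so no pre-scan is needed.
        ByteBuffer buf = ByteBuffer.allocate(values.length * MAX_INT_ASCII_BYTES);
        int[] offsets = new int[values.length + 1];
        for (int i = 0; i < values.length; i++) {
            buf.put(Integer.toString(values[i]).getBytes(StandardCharsets.US_ASCII));
            offsets[i + 1] = buf.position();
        }
        buf.flip(); // limit = bytes written, position = 0
        // slice() is only a view over the used bytes; the unused tail of the
        // original allocation is still held until the buffer is released.
        return new VarLenResult(buf.slice(), offsets);
    }
}
```

Note that the slice hides the excess rather than freeing it, so the waste
mentioned above remains unless the used bytes are copied into a right-sized
buffer.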

Not sure if we want to open this can of worms, but there is another
possible solution that is related to some thoughts I have around making the
parquet reader faster. It is possible that we might have to break our
design of a single column always being represented by a single buffer.

In cases like this where it is hard to know the final buffer length, it
might be easier to allocate a reasonable guess and then just tack on
another buffer if we guessed wrong. I know that one of the main goals of
value vectors is that they are random access, with minimal overhead for
value extraction, but I think this might be a case where it would be worth
breaking it.

The simple implementation might look like the variable length vectors, with
a metadata buffer sitting in front of the data to describe ranges of values
held in each of the buffers, e.g. values 1-400 are in buffer 1 and 401-1000
are in buffer 2. (I would assume we would never exceed 5 or so buffers, but it
could provide extra flexibility.)
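
One way to picture that range metadata, as a sketch only: the ChunkIndex
class below is hypothetical, uses zero-based indexes (so values 0-399 live in
buffer 0 and 400-999 in buffer 1), and does a linear scan on the assumption
that the buffer count stays small:

```java
// Hypothetical sketch, not Drill code: maps a value index to the buffer
// that holds it, given the end of each buffer's range.
final class ChunkIndex {
    // endExclusive[b] is one past the last value index stored in buffer b.
    private final int[] endExclusive;

    ChunkIndex(int[] endExclusive) {
        this.endExclusive = endExclusive.clone();
    }

    /** Which buffer holds the given value; linear scan over a few entries. */
    int bufferFor(int valueIndex) {
        if (valueIndex >= 0) {
            for (int b = 0; b < endExclusive.length; b++) {
                if (valueIndex < endExclusive[b]) {
                    return b;
                }
            }
        }
        throw new IndexOutOfBoundsException("value index " + valueIndex);
    }

    /** Position of the value relative to the start of its own buffer. */
    int indexWithinBuffer(int valueIndex) {
        int b = bufferFor(valueIndex);
        return b == 0 ? valueIndex : valueIndex - endExclusive[b - 1];
    }
}
```

With new ChunkIndex(new int[] {400, 1000}), index 400 resolves to buffer 1 at
position 0. This lookup is the extra step of indirection that every random
access would pay.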

To prevent the need for an extra step of indirection with each value
extraction, we could change the interfaces on value vectors a bit to make
them expose an iterator, rather than a get(index) method. This would allow
for fetching the first buffer, reading all of its values with the same
overhead as we have now, until we hit the end of the buffer, and then we
could rely on an exception to indicate we ran out of values and at that
time swap to the second buffer.
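
A sketch of such an iterator over fixed-width int values, assuming the
buffers are java.nio.IntBuffer instances held in a list. The class name
MultiBufferIntIterator is hypothetical. Unlike the proposal above, this
version checks hasRemaining() explicitly instead of relying on an exception
to detect the end of a buffer:

```java
import java.nio.IntBuffer;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PrimitiveIterator;

// Hypothetical sketch, not Drill code: reads values sequentially across
// several buffers without a per-value range lookup.
final class MultiBufferIntIterator implements PrimitiveIterator.OfInt {
    private final List<IntBuffer> buffers;
    private int current = 0;
    private IntBuffer active;

    MultiBufferIntIterator(List<IntBuffer> buffers) {
        this.buffers = buffers;
        // duplicate() gives an independent position, so the caller's
        // buffers are not consumed by iteration.
        this.active = buffers.isEmpty() ? null : buffers.get(0).duplicate();
    }

    @Override
    public boolean hasNext() {
        // Swap to the next buffer only when the active one is exhausted.
        while (active != null && !active.hasRemaining()) {
            current++;
            active = current < buffers.size() ? buffers.get(current).duplicate() : null;
        }
        return active != null;
    }

    @Override
    public int nextInt() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return active.get();
    }
}
```

Within a single buffer each read is a plain sequential get; the buffer swap
happens only at the boundaries.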

-Jason
On Tue, Dec 3, 2013 at 8:59 PM, Jinfeng Ni <[EMAIL PROTECTED]> wrote:

> Hi Jason,
>
> Good question.
>
> Actually, some type casts are *binary coercible*, meaning there is no
> need to do any conversion internally. For instance, char --> varchar,
> varchar --> varbinary, etc.
>
> For other cases, some transformation is required, since the binary
> representation of source type is different from the binary representation
> of target type.
> For instance, int -> varchar. The target type needs to keep each digit of
> the integer, while the source type is a 4-byte representation.
>
> I will look into whether it's possible to use the buffer in the output
> value vector directly, without copying into new buffer.
>
>
>
>
>
> On Tue, Dec 3, 2013 at 6:29 PM, Jason Altekruse <[EMAIL PROTECTED]> wrote:
>
> > Hi Jinfeng,
> >
> > This might be a dumb question, but is there any transformation being
> > performed when going from a fixed length type to a variable length type?
> > That is, are the bytes in the buffer coming in going to be the same as
> > the bytes coming out of the cast?
> >
> > I understand that for casts like int-> long we need to add extra space
> > between each value, but is it possible that we could just hand the buffer
> > from one value vector type to the other without copying it into a new
> > buffer?
> >
> > We would still have to create a new buffer with the offsets of the