Pig >> mail # dev >> Semantic cleanup: How to adding two bytearray

Re: Semantic cleanup: How to adding two bytearray
What would happen if the loader is PigStorage? In that case the bytearray
value would actually be a DataByteArray. Will it be cast to double?


On 1/13/11 8:58 PM, "Daniel Dai" <[EMAIL PROTECTED]> wrote:

> One goal of the ongoing semantic cleanup work is to clarify the usage
> of the unknown type.
> In Pig's schema system, the user can define an output schema for a
> LoadFunc/EvalFunc. Pig will propagate that schema to the entire script.
> Defining a schema for a LoadFunc/EvalFunc is optional. If the user doesn't
> define a schema, Pig will mark the fields as bytearray. However, at run
> time the user can feed any data type in. Previously, Pig assumed the
> runtime type for bytearray was DataByteArray, which gave rise to several
> issues (PIG-1277, PIG-999, PIG-1016).
> In 0.9, Pig will treat bytearray as an unknown type. Pig will inspect the
> object to figure out what the real type is at runtime. We've done that
> for all shuffle keys (PIG-1277). However, there are other cases. One
> case is adding two bytearrays. For example,
> -- Assume SomeLoader does not define a schema, but actually feeds Integers
> a = load '1.txt' using SomeLoader() as (a0, a1);
> b = foreach a generate a0+a1;
> In Pig 0.8, the schema system marks a0 and a1 as bytearray. In the case
> of a0+a1, Pig casts both a0 and a1 to double (in TypeCheckingVisitor) and
> marks the output schema for a0+a1 as double. Here is something
> interesting: SomeLoader loads Integer, and we get Double after adding.
> We can change that if we do the following:
> 1. Don't cast bytearray to Double (in TypeCheckingVisitor).
> 2. Change POAdd (and, similarly, all other ExpressionOperators: multiply,
> divide, etc.) to handle bytearray. When the schema for POAdd is
> bytearray, Pig will figure out the data type at runtime and perform the
> addition according to the real type.
> Pro:
> 1. Consistent with the goal of the unknown type cleanup: treat all
> bytearrays as unknown type and, at runtime, inspect the object to find
> the real type.
> Cons:
> 1. Slows down processing, since we need to inspect the object type at
> runtime.
> 2. Brings some indeterminism to the schema system. Before, a0+a1 was
> double, so the downstream schema was clearer.
> Any comments?
> Daniel
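
For readers following the thread, the difference between the 0.8 behaviour
and the proposal in the quoted message can be illustrated with a rough,
standalone Java sketch. This is not Pig code: the class RuntimeAddSketch and
the methods addAsDouble and addByRuntimeType are hypothetical names made up
for illustration, and the handling of mixed or non-numeric operands (an
exception here) is an assumption, not something the thread specifies.

```java
public class RuntimeAddSketch {

    // Roughly the 0.8 behaviour described in the thread: both operands
    // are treated as double, so Integer inputs yield a Double result.
    public static Object addAsDouble(Object a, Object b) {
        return ((Number) a).doubleValue() + ((Number) b).doubleValue();
    }

    // Roughly the proposed behaviour: inspect the runtime type of the
    // operands and add in that type, so Integer inputs yield an Integer.
    // Throwing on mixed or unsupported types is an assumption of this sketch.
    public static Object addByRuntimeType(Object a, Object b) {
        if (a instanceof Integer && b instanceof Integer) {
            return (Integer) a + (Integer) b;
        }
        if (a instanceof Long && b instanceof Long) {
            return (Long) a + (Long) b;
        }
        if (a instanceof Float && b instanceof Float) {
            return (Float) a + (Float) b;
        }
        if (a instanceof Double && b instanceof Double) {
            return (Double) a + (Double) b;
        }
        throw new IllegalArgumentException(
            "Unsupported operand types: " + a.getClass() + ", " + b.getClass());
    }

    public static void main(String[] args) {
        System.out.println(addAsDouble(1, 2));      // prints 3.0
        System.out.println(addByRuntimeType(1, 2)); // prints 3
    }
}
```

The chain of instanceof checks on every call is the per-record cost listed
under "Cons" above, and the Object return type mirrors the second con: the
result type is only known once the data is seen.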