Pig >> mail # dev >> Semantic cleanup: How to adding two bytearray


Julien Le Dem 2011-01-14, 21:57
Re: Semantic cleanup: How to adding two bytearray
What would happen if the loader is PigStorage? The bytearray value would
actually be a DataByteArray. Will it be cast to double in that case?

-Thejas

On 1/13/11 8:58 PM, "Daniel Dai" <[EMAIL PROTECTED]> wrote:

> One goal of the ongoing semantic cleanup work is to clarify the usage of
> the unknown type.
>
> In the Pig schema system, users can define an output schema for a
> LoadFunc/EvalFunc, and Pig will propagate that schema through the entire
> script. Defining a schema for a LoadFunc/EvalFunc is optional. If the user
> doesn't define one, Pig marks the fields as bytearray. However, at
> runtime the user can feed in any data type. Previously, Pig assumed the
> runtime type for bytearray was DataByteArray, which gave rise to several
> issues (PIG-1277, PIG-999, PIG-1016).
>
> In 0.9, Pig will treat bytearray as the unknown type. Pig will inspect the
> object to figure out what the real type is at runtime. We've done that
> for all shuffle keys (PIG-1277). However, there are other cases. One
> case is adding two bytearrays. For example:
>
> -- Assume SomeLoader does not define a schema but actually feeds Integer
> a = load '1.txt' using SomeLoader() as (a0, a1);
> b = foreach a generate a0+a1;
>
> In Pig 0.8, the schema system marks a0 and a1 as bytearray. In the case of
> a0+a1, Pig casts both a0 and a1 to double (in TypeCheckingVisitor) and
> marks the output schema for a0+a1 as double. The interesting part is that
> SomeLoader loads Integer, yet we get Double after adding.
> We can change this if we do the following:
> 1. Don't cast bytearray to double (in TypeCheckingVisitor)
> 2. Change POAdd (and similarly all other ExpressionOperators: multiply,
> divide, etc.) to handle bytearray. When the schema for POAdd is
> bytearray, Pig will figure out the data type at runtime and perform
> the addition according to the real type
>
> Pro:
> 1. Consistent with the goal of the unknown type cleanup: treat all
> bytearrays as unknown type. At runtime, inspect the object to find
> the real type
>
> Cons:
> 1. Slows down processing, since we need to inspect the object type at runtime
> 2. Brings some indeterminism to the schema system. Previously a0+a1 was
> double, so the downstream schema was clearer.
>
> Any comments?
>
> Daniel
>
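
To make the two behaviours in the quoted proposal concrete, here is a minimal sketch in Python (not Pig's Java code). The function names add_cast_to_double and add_runtime_typed are hypothetical, Python bytes stands in for DataByteArray, and parsing raw bytes as a double is an assumption made for illustration only; it is not a statement of what POAdd or TypeCheckingVisitor actually do.

```python
def add_cast_to_double(a, b):
    """0.8-style behaviour as described above: cast both operands to double."""
    return float(a) + float(b)


def _resolve(value):
    """Assumption: raw bytes (standing in for DataByteArray) parse as a double."""
    if isinstance(value, (bytes, bytearray)):
        return float(bytes(value).decode("utf-8"))
    return value


def add_runtime_typed(a, b):
    """Proposed behaviour: inspect the operands at runtime, add by real type."""
    a, b = _resolve(a), _resolve(b)
    if isinstance(a, int) and isinstance(b, int):
        return a + b                  # integers in, integer out
    return float(a) + float(b)        # otherwise widen to double


if __name__ == "__main__":
    print(type(add_cast_to_double(1, 2)).__name__)   # result type under the cast
    print(type(add_runtime_typed(1, 2)).__name__)    # result type under dispatch
```

The sketch also shows both listed cons: every call pays for the isinstance checks, and the result type is no longer known until the values are seen.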