Pig, mail # dev - Semantic cleanup: How to adding two bytearray

RE: Semantic cleanup: How to adding two bytearray
Olga Natkovich 2011-01-14, 21:12
I think the tradeoff between fully dynamic types and static types is convenience (why should I tell you what the type is if the data is properly typed?) versus type safety (what if your data has invalid values?) and performance (dynamic typing would be slower).

My vote is for static typing because I believe type safety (along with a clear schema definition) and performance are more important.


-----Original Message-----
From: Daniel Dai [mailto:[EMAIL PROTECTED]]
Sent: Friday, January 14, 2011 12:12 PM
Subject: Re: Semantic cleanup: How to adding two bytearray

Runtime detection can be done row by row. This will solve the problem in
your sample, though at a small cost in performance.

Requiring a cast before adding is also clean. However, this would break
backward compatibility.
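
To make the row-by-row idea concrete, here is a minimal Java sketch. It is not Pig's POAdd code: the class name, the method names, and the "try integer first, then fall back to double" parsing rule are all assumptions made for illustration. It shows that detecting types per row lets the same expression yield an Integer for one row and a Double for the next, which is the situation raised in the quoted message below.

```java
// Hypothetical sketch of per-row runtime type detection.
// None of these names are Pig APIs; the parsing rule is an assumption.
final class RuntimeAddSketch {

    /** Adds two fields whose real types are only discovered at runtime. */
    static Number add(Object left, Object right) {
        Number l = toNumber(left);
        Number r = toNumber(right);
        // Stay in integer arithmetic only when both operands are integers
        // (overflow is ignored in this sketch).
        if (l instanceof Integer && r instanceof Integer) {
            return Integer.valueOf(l.intValue() + r.intValue());
        }
        // Otherwise widen both operands to double.
        return Double.valueOf(l.doubleValue() + r.doubleValue());
    }

    /** Inspects one field: already-typed numbers pass through, text is parsed. */
    static Number toNumber(Object value) {
        if (value instanceof Number) {
            return (Number) value;
        }
        String text = value.toString().trim();
        try {
            return Integer.valueOf(text);
        } catch (NumberFormatException e) {
            // Not an integer literal; assume it is a double.
            return Double.valueOf(text);
        }
    }

    public static void main(String[] args) {
        // The two rows of the sample file 1.txt from the quoted message.
        String[][] rows = { { "1", "2" }, { "1.1", "2.2" } };
        for (String[] row : rows) {
            Number sum = add(row[0], row[1]);
            System.out.println(sum.getClass().getSimpleName() + ": " + sum);
        }
    }
}
```

Each row pays for the type inspection, and the result type of the column is only known once the data has been read.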

Dmitriy Ryaboy wrote:
> How is runtime detection done? I worry that if 1.txt contains:
> 1, 2
> 1.1, 2.2
> We get into a situation where addition of the fields in the first tuple
> produces integers, and adding the fields of the second tuple produces
> doubles.
> A more invasive but perhaps easier to reason about solution might be to be
> stricter about types, and require bytearrays to be cast to whatever type
> they are supposed to be if you want to add / delete / do non-byte-things to
> them.
> This is a problem if UDFs that output tuples or bags don't specify schemas
> (and specifying schemas of tuples and bags is fairly onerous right now in
> Java). I am not sure what the solution here is, other than finding a clean,
> less onerous way of declaring schemas, fixing up everything in builtin and
> piggybank to only use the new clean sparkly api and document the heck out of
> it.
> D
> On Thu, Jan 13, 2011 at 8:58 PM, Daniel Dai <[EMAIL PROTECTED]> wrote:
>> One goal of the ongoing semantic cleanup work is to clarify the usage of
>> the unknown type.
>> In the Pig schema system, users can define an output schema for a
>> LoadFunc/EvalFunc, and Pig will propagate that schema through the entire
>> script. Defining a schema for a LoadFunc/EvalFunc is optional. If the user
>> doesn't define one, Pig marks the fields as bytearray. However, at runtime
>> the user can feed in any data type.
>> Previously, Pig assumed the runtime type for bytearray was DataByteArray,
>> which gave rise to several issues (PIG-1277, PIG-999, PIG-1016).
>> In 0.9, Pig will treat bytearray as an unknown type. Pig will inspect the
>> object at runtime to figure out what the real type is. We've done that for
>> all shuffle keys (PIG-1277). However, there are other cases. One case is
>> adding two bytearrays. For example,
>> -- Assume SomeLoader does not define a schema but actually feeds Integer
>> a = load '1.txt' using SomeLoader() as (a0, a1);
>> b = foreach a generate a0+a1;
>> In Pig 0.8, the schema system marks a0 and a1 as bytearray. In the case of
>> a0+a1, Pig casts both a0 and a1 to double (in TypeCheckingVisitor) and marks
>> the output schema for a0+a1 as double. Here is the interesting part:
>> SomeLoader loads Integer, and we get Double after adding. We can change this
>> if we do the following:
>> 1. Don't cast bytearray to double (in TypeCheckingVisitor).
>> 2. Change POAdd (and, similarly, all other ExpressionOperators: multiply,
>> divide, etc.) to handle bytearray. When the schema for POAdd is bytearray,
>> Pig will figure out the data type at runtime and perform the addition
>> according to the real type.
>> Pro:
>> 1. Consistent with the goal of the unknown type cleanup: treat all bytearray
>> as unknown type, and inspect the object at runtime to find the real type.
>> Cons:
>> 1. Slows down processing, since we need to inspect the object type at runtime.
>> 2. Introduces some nondeterminism into the schema system. Previously, a0+a1
>> was always double, so the downstream schema was clearer.
>> Any comments?
>> Daniel
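
For contrast, the Pig 0.8 behavior described in the original proposal (cast both operands to double up front, so the output schema is always double) might be sketched in Java as follows. This is only an illustration under assumptions: the class and method names are hypothetical, and the conversion rule stands in for whatever the cast inserted by TypeCheckingVisitor actually does.

```java
// Hypothetical sketch of "cast to double, then add".
// The names are illustrative and are not Pig APIs.
final class CastToDoubleAddSketch {

    /** Converts an untyped field to double, as an up-front cast would. */
    static double toDouble(Object value) {
        if (value instanceof Number) {
            return ((Number) value).doubleValue();
        }
        return Double.parseDouble(value.toString().trim());
    }

    /** Adds two untyped fields; the result type is fixed as Double. */
    static Double add(Object left, Object right) {
        return toDouble(left) + toDouble(right);
    }

    public static void main(String[] args) {
        // The loader fed Integer values, yet the sum comes back as a Double.
        Object sum = add(1, 2);
        System.out.println(sum + " is a " + sum.getClass().getSimpleName());
    }
}
```

The result type never depends on the data, which keeps the downstream schema predictable, at the price of turning integer inputs into doubles.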