Pig, mail # dev - Semantic cleanup: How to adding two bytearray


Re: Semantic cleanup: How to adding two bytearray
Julien Le Dem 2011-01-14, 22:01
I vote for static typing and clear schema definition as well.
If the store implementation does not provide a schema, then the user should.
Julien

On 1/14/11 1:12 PM, "Olga Natkovich" <[EMAIL PROTECTED]> wrote:

I think the tradeoff between fully dynamic types and static types is between convenience (why should I have to tell you the type if the data is properly typed), type-safety (what if your data has invalid values), and performance (dynamic typing would be slower).

My vote is for static typing because I believe the type-safety (and clear schema definition) and performance are more important.

Olga

-----Original Message-----
From: Daniel Dai [mailto:[EMAIL PROTECTED]]
Sent: Friday, January 14, 2011 12:12 PM
To: [EMAIL PROTECTED]
Subject: Re: Semantic cleanup: How to adding two bytearray

Runtime detection can be done row by row. This would solve the problem in
your sample, though it incurs a small performance cost.

Requiring a cast before adding is also clean. However, it would break
backward compatibility.
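The row-by-row detection described here can be sketched roughly as follows. This is a hypothetical Python sketch, not Pig's actual Java implementation; the `interpret` helper and its int-then-double fallback are assumptions for illustration:

```python
def add_bytearrays(left, right):
    """Row-by-row runtime detection: inspect each operand's actual
    value and add according to the real type (hypothetical sketch)."""
    def interpret(raw):
        # Try integer semantics first, then fall back to double.
        s = raw.decode() if isinstance(raw, (bytes, bytearray)) else raw
        try:
            return int(s)
        except ValueError:
            return float(s)
    return interpret(left) + interpret(right)
```

Here `add_bytearrays(b"1", b"2")` yields an integer while `add_bytearrays(b"1.1", b"2.2")` yields a double; the cost is the per-row inspection and parsing.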

Dmitriy Ryaboy wrote:
> How is runtime detection done? I worry that if 1.txt contains:
> 1, 2
> 1.1, 2.2
>
> We get into a situation where addition of the fields in the first tuple
> produces integers, and adding the fields of the second tuple produces
> doubles.
>
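This concern can be made concrete with a small Python sketch (hypothetical; the int-then-double parsing rule is an assumption): with per-row detection, the same expression produces differently typed results across the two rows of 1.txt.

```python
rows = [("1", "2"), ("1.1", "2.2")]  # the two rows of 1.txt

def detect(field):
    # Per-row runtime detection: int if it parses as one, else double.
    try:
        return int(field)
    except ValueError:
        return float(field)

sums = [detect(a) + detect(b) for a, b in rows]
# The first sum is an integer, the second a double, within one relation.
print([type(s).__name__ for s in sums])  # ['int', 'float']
```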
> A more invasive but perhaps easier to reason about solution might be to be
> stricter about types, and require bytearrays to be cast to whatever type
> they are supposed to be if you want to add / delete / do non-byte-things to
> them.
>
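The stricter alternative could look like this as a Python sketch (hypothetical; `strict_add` stands in for the operator's behavior): refuse raw bytearray operands outright, so the script author must cast to a concrete type first.

```python
def strict_add(left, right):
    # Unknown-typed (bytearray) operands are rejected; the user
    # must cast to a concrete type before doing arithmetic.
    if isinstance(left, (bytes, bytearray)) or isinstance(right, (bytes, bytearray)):
        raise TypeError("cast bytearray to a concrete type before adding")
    return left + right
```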
> This is a problem if UDFs that output tuples or bags don't specify schemas
> (and specifying schemas of tuples and bags is fairly onerous right now in
> Java). I am not sure what the solution is here, other than finding a clean,
> less onerous way of declaring schemas, fixing up everything in builtin and
> piggybank to use only the new clean sparkly API, and documenting the heck out
> of it.
>
> D
>
> On Thu, Jan 13, 2011 at 8:58 PM, Daniel Dai <[EMAIL PROTECTED]> wrote:
>
>
>> One goal of the ongoing semantic cleanup work is to clarify the usage of
>> the unknown type.
>>
>> In Pig's schema system, users can define an output schema for a LoadFunc/EvalFunc.
>> Pig will propagate those schemas through the entire script. Defining a schema for
>> a LoadFunc/EvalFunc is optional. If the user doesn't define one, Pig marks
>> the fields bytearray. However, at runtime the user can feed in any data type.
>> Previously, Pig assumed the runtime type for bytearray was DataByteArray, which
>> raised several issues (PIG-1277, PIG-999, PIG-1016).
>>
>> In 0.9, Pig will treat bytearray as an unknown type. Pig will inspect the
>> object at runtime to figure out what the real type is. We've done that for
>> all shuffle keys (PIG-1277). However, there are other cases. One case is
>> adding two bytearrays. For example:
>>
>> a = load '1.txt' using SomeLoader() as (a0, a1); -- assume SomeLoader does
>> not define a schema, but actually feeds Integers
>> b = foreach a generate a0+a1;
>>
>> In Pig 0.8, the schema system marks a0 and a1 as bytearray. In the case of
>> a0+a1, Pig casts both a0 and a1 to double (in TypeCheckingVisitor) and marks
>> the output schema of a0+a1 as double. Here is something interesting:
>> SomeLoader loads Integers, yet we get a Double after adding. We can change this
>> if we do the following:
>> 1. Don't cast bytearray to double (in TypeCheckingVisitor).
>> 2. Change POAdd (and similarly all other ExpressionOperators: multiply, divide,
>> etc.) to handle bytearray. When the schema for POAdd is bytearray, Pig will
>> figure out the data type at runtime and perform the addition according to the
>> real type.
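Step 2 might be sketched as follows (a hypothetical Python sketch; Pig's real POAdd is Java, and the type-promotion rule here is simplified for illustration): when the declared schema is bytearray, the operator dispatches on the operands' runtime types instead of relying on an up-front cast to double.

```python
def poadd_unknown(left, right):
    # Declared type is bytearray, so decide at execution time:
    # two Integers stay Integer; anything else promotes to Double.
    if isinstance(left, int) and isinstance(right, int):
        return left + right
    return float(left) + float(right)
```

This preserves Integer results for Integer inputs, at the price of a per-call type check.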
>>
>> Pros:
>> 1. Consistent with the goal of the unknown-type cleanup: treat every bytearray
>> as an unknown type, and at runtime inspect the object to find the real type.
>>
>> Cons:
>> 1. Slows down processing, since we need to inspect the object type at runtime.
>> 2. Brings some indeterminism to the schema system. Before, a0+a1 was double,
>> so the downstream schema was clearer.