Pig, mail # dev - Semantic cleanup: How to adding two bytearray


Re: Semantic cleanup: How to adding two bytearray
Julien Le Dem 2011-01-14, 21:57
As part of PIG-1480 I've implemented an annotation-based outputSchema definition, similar to what I did for Jython UDFs:
@OutputSchema("relationships:{t:(id1:chararray, id2:chararray, status:chararray)}")
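
For readers unfamiliar with the approach, here is a minimal sketch of how an annotation like this could be declared and read back through reflection. The annotation type, the UDF class, and the helper method below are hypothetical stand-ins written for illustration; they are not the PIG-1480 code.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical stand-in for the annotation described above; not Pig's own class.
@Retention(RetentionPolicy.RUNTIME) // must be retained at runtime to be visible to reflection
@Target(ElementType.TYPE)
@interface OutputSchema {
    String value();
}

// Hypothetical UDF class; a real one would extend Pig's EvalFunc.
@OutputSchema("relationships:{t:(id1:chararray, id2:chararray, status:chararray)}")
class RelationshipsUdf {
}

class OutputSchemaDemo {
    // Returns the declared schema string, or null if the class carries no annotation.
    static String declaredSchema(Class<?> udfClass) {
        OutputSchema annotation = udfClass.getAnnotation(OutputSchema.class);
        return annotation == null ? null : annotation.value();
    }

    public static void main(String[] args) {
        System.out.println(declaredSchema(RelationshipsUdf.class));
        System.out.println(declaredSchema(String.class));
    }
}
```

The string returned by the lookup would then still need to be parsed into a schema object.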

Parsing a schema like this in Pig can be done using org.apache.pig.impl.logicalLayer.parser.QueryParser:
QueryParser parser = new QueryParser(new StringReader("relationships:{t:(id1:chararray, id2:chararray, status:chararray)}"));
outputSchema = parser.TupleSchema();

In trunk you can use:
org.apache.pig.impl.util.Utils.getSchemaFromString(String schemaString)

That could certainly be pulled out as an independent Jira.

Julien

On 1/14/11 12:27 PM, "Scott Carey" <[EMAIL PROTECTED]> wrote:
On 1/13/11 10:54 PM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:

>How is runtime detection done? I worry that if 1.txt contains:
>1, 2
>1.1, 2.2
>
>We get into a situation where addition of the fields in the first tuple
>produces integers, and adding the fields of the second tuple produces
>doubles.
>
>A more invasive but perhaps easier to reason about solution might be to be
>stricter about types, and require bytearrays to be cast to whatever type
>they are supposed to be if you want to add / delete / do non-byte-things to
>them.
>
>This is a problem if UDFs that output tuples or bags don't specify schemas
>(and specifying schemas of tuples and bags is fairly onerous right now in
>Java). I am not sure what the solution here is, other than finding a clean,
>less onerous way of declaring schemas, fixing up everything in builtin and
>piggybank to only use the new clean sparkly api, and documenting the heck
>out of it.

A longer term approach would likely strive to make schema specification of
inputs and outputs for UDFs declarative and restrict the scope of the
unknown.  Building schema data structures procedurally is NotFun(tm).
All languages could support a string based schema representation, and many
could use more type-safe declarations like Java annotations.  I think
there is a long-term opportunity to make Pig's type system easier to work
with and higher performance, but it's no small project.  Pig certainly isn't
alone with these sorts of issues.

>
>D
>
>On Thu, Jan 13, 2011 at 8:58 PM, Daniel Dai <[EMAIL PROTECTED]> wrote:
>
>> One goal of the ongoing semantic cleanup work is to clarify the usage of
>> the unknown type.
>>
>> In Pig's schema system, the user can define an output schema for
>> LoadFunc/EvalFunc. Pig will propagate those schemas to the entire script.
>> Defining a schema for LoadFunc/EvalFunc is optional. If the user doesn't
>> define a schema, Pig will mark the fields as bytearray. However, at runtime
>> the user can feed any data type in. Previously, Pig assumed the runtime
>> type for bytearray was DataByteArray, which gave rise to several issues
>> (PIG-1277, PIG-999, PIG-1016).
>>
>> In 0.9, Pig will treat bytearray as an unknown type. Pig will inspect the
>> object to figure out what the real type is at runtime. We've done that for
>> all shuffle keys (PIG-1277). However, there are other cases. One case is
>> adding two bytearrays. For example,
>>
>> -- Assume SomeLoader does not define a schema but actually feeds Integers
>> a = load '1.txt' using SomeLoader() as (a0, a1);
>> b = foreach a generate a0+a1;
>>
>> In Pig 0.8, the schema system marks a0 and a1 as bytearray. In the case of
>> a0+a1, Pig casts both a0 and a1 to double (in TypeCheckingVisitor) and
>> marks the output schema for a0+a1 as double. Here is something interesting:
>> SomeLoader loads Integers, and we get a Double after adding. We can change
>> this if we do the following:
>> 1. Don't cast bytearray to Double (in TypeCheckingVisitor)
>> 2. Change POAdd (and similarly all other ExpressionOperators: multiply,
>> divide, etc.) to handle bytearray. When the schema for POAdd is bytearray,
>> Pig will figure out the data type at runtime and perform the addition
>> according to the real type
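
The runtime dispatch proposed in step 2 can be pictured with a small, standalone sketch. This is not POAdd; the class and method names are hypothetical, and the rule of widening any non-Integer numeric pair to double is an assumption made only for illustration. It also shows the behaviour raised elsewhere in the thread: the result type follows the values in each row, so rows holding integers and rows holding doubles yield different result types.

```java
class RuntimeAddSketch {
    // Adds two values whose declared type is unknown, choosing the result type
    // from the runtime classes of the operands.
    static Number add(Object left, Object right) {
        if (left instanceof Integer && right instanceof Integer) {
            // Both operands are really integers: keep the result an Integer.
            return Integer.valueOf((Integer) left + (Integer) right);
        }
        if (left instanceof Number && right instanceof Number) {
            // Assumption for this sketch: any other numeric pair widens to double.
            return Double.valueOf(((Number) left).doubleValue() + ((Number) right).doubleValue());
        }
        throw new IllegalArgumentException("cannot add non-numeric values");
    }

    public static void main(String[] args) {
        // A row holding integers and a row holding doubles produce different result types.
        System.out.println(add(1, 2).getClass().getSimpleName());
        System.out.println(add(1.1, 2.2).getClass().getSimpleName());
    }
}
```

By contrast, the 0.8 behaviour described above corresponds to always taking the double branch, regardless of what the loader actually produced.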
>>
>> Pro:
>> 1. Consistent with the goal for unknown type cleanup: treat all bytearray
>> as unknown type. In the runtime, inspect the object to find the real