Pig >> mail # dev >> Semantic cleanup: How to adding two bytearray


RE: Semantic cleanup: How to adding two bytearray
Then the true type is DataByteArray so it would be used.

Olga

-----Original Message-----
From: Thejas M Nair [mailto:[EMAIL PROTECTED]]
Sent: Friday, January 14, 2011 1:01 PM
To: [EMAIL PROTECTED]; Jianyong Dai
Subject: Re: Semantic cleanup: How to adding two bytearray

What would happen if the loader is PigStorage? The bytearray type
would actually be a DataByteArray. Will it be cast to double in that case?

-Thejas

On 1/13/11 8:58 PM, "Daniel Dai" <[EMAIL PROTECTED]> wrote:

> One goal of the ongoing semantic cleanup work is to clarify the usage
> of the unknown type.
>
> In the Pig schema system, the user can define an output schema for a
> LoadFunc/EvalFunc. Pig will propagate that schema through the entire
> script. Defining a schema for a LoadFunc/EvalFunc is optional. If the
> user doesn't define a schema, Pig marks the fields as bytearray.
> However, at runtime the user can feed in any data type. Previously, Pig
> assumed the runtime type for bytearray was DataByteArray, which gave
> rise to several issues (PIG-1277, PIG-999, PIG-1016).
>
> In 0.9, Pig will treat bytearray as an unknown type. Pig will inspect
> the object to figure out what the real type is at runtime. We've done
> that for all shuffle keys (PIG-1277). However, there are other cases.
> One case is adding two bytearrays. For example,
>
> -- Assume SomeLoader does not define a schema, but actually feeds Integer
> a = load '1.txt' using SomeLoader() as (a0, a1);
> b = foreach a generate a0+a1;
>
> In Pig 0.8, the schema system marks a0 and a1 as bytearray. For a0+a1,
> Pig casts both a0 and a1 to double (in TypeCheckingVisitor) and marks
> the output schema for a0+a1 as double. The interesting consequence is
> that SomeLoader loads Integer values, yet we get a Double after adding.
> We can change this if we do the following:
> 1. Don't cast bytearray to double (in TypeCheckingVisitor).
> 2. Change POAdd (and, similarly, all other ExpressionOperators: multiply,
> divide, etc.) to handle bytearray. When the schema for POAdd is
> bytearray, Pig will figure out the data type at runtime and perform
> the addition according to the real type.
>
> Pro:
> 1. Consistent with the goal of the unknown type cleanup: treat all
> bytearrays as unknown type and, at runtime, inspect the object to find
> the real type.
>
> Cons:
> 1. Slows down processing, since we need to inspect the object type at
> runtime.
> 2. Brings some nondeterminism into the schema system. Previously a0+a1
> was always double, so the downstream schema was clearer.
>
> Any comments?
>
> Daniel
>
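
For illustration only, the two behaviours discussed in Daniel's message can be sketched in plain Java. This is not Pig's actual POAdd or TypeCheckingVisitor code: the class RuntimeAdd and the methods addAsDouble and add are hypothetical names, and the widening order used for mixed operands (Integer, then Long, Float, Double) is an assumption the thread does not specify.

```java
// Hypothetical sketch, not Pig source code.
class RuntimeAdd {

    // 0.8-style behaviour as described in the thread: both operands of
    // unknown type are cast to double before adding, so the result is
    // always a Double even if the loader produced Integers.
    static Double addAsDouble(Object left, Object right) {
        if (left == null || right == null) {
            return null; // this sketch simply propagates null
        }
        return ((Number) left).doubleValue() + ((Number) right).doubleValue();
    }

    // Proposed behaviour: inspect the runtime class of each operand and
    // add according to the real type. The widening order is an assumption.
    static Object add(Object left, Object right) {
        if (left == null || right == null) {
            return null; // this sketch simply propagates null
        }
        if (!(left instanceof Number) || !(right instanceof Number)) {
            throw new IllegalArgumentException(
                "Unsupported operand types: " + left.getClass().getName()
                + ", " + right.getClass().getName());
        }
        Number l = (Number) left;
        Number r = (Number) right;
        if (l instanceof Double || r instanceof Double) {
            return l.doubleValue() + r.doubleValue();
        }
        if (l instanceof Float || r instanceof Float) {
            return l.floatValue() + r.floatValue();
        }
        if (l instanceof Long || r instanceof Long) {
            return l.longValue() + r.longValue();
        }
        if (l instanceof Integer && r instanceof Integer) {
            return l.intValue() + r.intValue();
        }
        throw new IllegalArgumentException(
            "Unsupported operand types: " + l.getClass().getName()
            + ", " + r.getClass().getName());
    }

    public static void main(String[] args) {
        // Integer inputs, as in the SomeLoader example.
        System.out.println(addAsDouble(1, 2)); // prints 3.0
        System.out.println(add(1, 2));         // prints 3
    }
}
```

The sketch also makes both listed cons visible: add performs several instanceof checks per call, and its return type is only known once the operands are seen, whereas addAsDouble always yields a Double.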