Pig >> mail # dev >> Semantic cleanup: How to adding two bytearray


Daniel Dai 2011-01-14, 04:58
Dmitriy Ryaboy 2011-01-14, 06:54
Daniel Dai 2011-01-14, 20:11
Olga Natkovich 2011-01-14, 21:12
Re: Semantic cleanup: How to adding two bytearray
I vote for static typing and clear schema definition as well.
If the store implementation does not provide a schema, then the user should.
Julien

On 1/14/11 1:12 PM, "Olga Natkovich" <[EMAIL PROTECTED]> wrote:

I think the tradeoff between fully dynamic types and static types is one of convenience (why should I tell you what the type is if the data is properly typed?) versus type-safety (what if your data has invalid values?) and performance (dynamic typing would be slower).

My vote is for static typing because I believe type-safety (and clear schema definition) and performance are more important.

Olga
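Olga's type-safety point can be made concrete with a small sketch. The following Java snippet (illustrative only, not Pig code; the names are hypothetical) shows how, under dynamic typing, an invalid value only surfaces when its row is actually processed:

```java
public class LateFailure {
    // Under dynamic typing, a malformed field is only discovered when the
    // row containing it is processed, possibly deep into a job.
    static double parseField(String raw) {
        return Double.parseDouble(raw.trim());
    }

    public static void main(String[] args) {
        String[] fields = {"1.5", "2.5", "oops"};
        for (String f : fields) {
            try {
                System.out.println(parseField(f));
            } catch (NumberFormatException e) {
                // With static typing and an up-front cast, this bad value
                // could have been rejected before any work was done.
                System.out.println("invalid value hit at runtime: " + f);
            }
        }
    }
}
```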

-----Original Message-----
From: Daniel Dai [mailto:[EMAIL PROTECTED]]
Sent: Friday, January 14, 2011 12:12 PM
To: [EMAIL PROTECTED]
Subject: Re: Semantic cleanup: How to adding two bytearray

Runtime detection can be done row by row. That would solve the problem in
your example, though at a small performance cost.

Requiring a cast before adding is also clean. However, it would break
backward compatibility.
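To make the row-by-row idea concrete, here is a minimal Java sketch (illustrative names only, not actual Pig internals) of detecting operand types per row and dispatching the addition accordingly:

```java
// Hypothetical sketch of per-row runtime type detection for "+",
// in the spirit of the POAdd change discussed in this thread.
public class RuntimeAdd {
    // Inspect the runtime class of each operand and dispatch accordingly.
    static Number add(Object a, Object b) {
        if (a instanceof Integer && b instanceof Integer) {
            return (Integer) a + (Integer) b;   // result stays Integer
        }
        // Fall back to double arithmetic for any other numeric mix.
        return ((Number) a).doubleValue() + ((Number) b).doubleValue();
    }

    public static void main(String[] args) {
        System.out.println(add(1, 2));      // 3 (an Integer)
        System.out.println(add(1.1, 2.2));  // 3.3000000000000003 (a Double)
    }
}
```

Note the if/else dispatch rather than a ternary: a Java conditional expression with `int` and `double` branches would promote both to `double` and silently lose the Integer result type.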

Dmitriy Ryaboy wrote:
> How is runtime detection done? I worry that if 1.txt contains:
> 1, 2
> 1.1, 2.2
>
> We get into a situation where adding the fields of the first tuple
> produces integers, while adding the fields of the second tuple produces
> doubles.
>
> A more invasive but perhaps easier to reason about solution might be to be
> stricter about types, and require bytearrays to be cast to whatever type
> they are supposed to be if you want to add / delete / do non-byte-things to
> them.
>
> This is a problem if UDFs that output tuples or bags don't specify schemas
> (and specifying schemas of tuples and bags is fairly onerous right now in
> Java). I am not sure what the solution here is, other than finding a clean,
> less onerous way of declaring schemas, fixing up everything in builtin and
> piggybank to only use the new clean sparkly api and document the heck out of
> it.
>
> D
>
> On Thu, Jan 13, 2011 at 8:58 PM, Daniel Dai <[EMAIL PROTECTED]> wrote:
>
>
>> One goal of the ongoing semantic cleanup work is to clarify the usage of
>> the unknown type.
>>
>> In Pig's schema system, users can define an output schema for a
>> LoadFunc/EvalFunc, and Pig propagates those schemas through the entire
>> script. Defining a schema for a LoadFunc/EvalFunc is optional; if the user
>> doesn't define one, Pig marks the fields as bytearray. At runtime, however,
>> the user can feed in any data type. Previously, Pig assumed the runtime
>> type for bytearray was DataByteArray, which gave rise to several issues
>> (PIG-1277, PIG-999, PIG-1016).
>>
>> In 0.9, Pig will treat bytearray as an unknown type: it will inspect the
>> object at runtime to figure out what the real type is. We've done that for
>> all shuffle keys (PIG-1277), but there are other cases. One case is adding
>> two bytearrays. For example:
>>
>> a = load '1.txt' using SomeLoader() as (a0, a1); -- assume SomeLoader does
>> not define a schema but actually feeds Integers
>> b = foreach a generate a0+a1;
>>
>> In Pig 0.8, the schema system marks a0 and a1 as bytearray. In the case of
>> a0+a1, Pig casts both a0 and a1 to double (in TypeCheckingVisitor) and
>> marks the output schema of a0+a1 as double. Something interesting happens
>> here: SomeLoader loads Integers, yet we get a Double after adding. We can
>> change this if we do the following:
>> 1. Don't cast bytearray to double (in TypeCheckingVisitor).
>> 2. Change POAdd (and similarly all other ExpressionOperators: multiply,
>> divide, etc.) to handle bytearray. When the schema for POAdd is bytearray,
>> Pig will figure out the data type at runtime and perform the addition
>> according to the real type.
>>
>> Pro:
>> 1. Consistent with the goal of the unknown-type cleanup: treat all
>> bytearrays as unknown type, and at runtime inspect the object to find the
>> real type.
>>
>> Cons:
>> 1. Slows down processing, since we need to inspect object types at runtime.
>> 2. Introduces some indeterminism into the schema system. Before, a0+a1 was
>> always double, so the downstream schema was clearer.
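Con 2 and Dmitriy's mixed-row example can be sketched in Java (illustrative only, not Pig code): under per-row detection the same "+" yields an Integer for one row and a Double for the next, whereas the Pig 0.8 up-front cast to double keeps the result type fixed.

```java
import java.util.List;

// Illustrative sketch of schema indeterminism: with per-row type
// detection, the result type of a0+a1 varies row by row.
public class SchemaIndeterminism {
    // Per-row detection: parse each raw field to the narrowest type.
    static Number detect(String raw) {
        try {
            return Integer.valueOf(raw.trim());
        } catch (NumberFormatException e) {
            return Double.valueOf(raw.trim());
        }
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
            new String[] {"1", "2"},
            new String[] {"1.1", "2.2"});
        for (String[] r : rows) {
            Number a = detect(r[0]), b = detect(r[1]);
            Number sum;
            if (a instanceof Integer && b instanceof Integer) {
                sum = (Integer) a + (Integer) b;             // Integer row
            } else {
                sum = a.doubleValue() + b.doubleValue();     // Double row
            }
            System.out.println(sum + " : " + sum.getClass().getSimpleName());
            // prints "3 : Integer" then "3.3000000000000003 : Double"
        }
        // With an up-front cast to double (the Pig 0.8 behavior), both rows
        // would produce a Double, and the downstream schema stays fixed.
    }
}
```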
Scott Carey 2011-01-14, 20:27
Dmitriy Ryaboy 2011-01-14, 21:34
Dmitriy Ryaboy 2011-01-14, 21:35
Alan Gates 2011-01-14, 22:00
Dmitriy Ryaboy 2011-01-14, 22:15
Julien Le Dem 2011-01-14, 22:40
Julien Le Dem 2011-01-14, 21:57
Thejas M Nair 2011-01-14, 21:00
Olga Natkovich 2011-01-14, 21:16