Re: Semantic cleanup: How to adding two bytearray
Maps are sometimes used to represent JSON or similar data structures.
The resulting Pig objects are Maps with String keys whose values are one of String, Number, Map, or Bag (applied recursively).
Julien
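
As a rough illustration of the shape described above, here is a minimal Python sketch that converts a parsed JSON value into nested structures of that form. Python only stands in for Pig's data model here: `to_pig_like` is a hypothetical helper, a dict models a Pig Map, and a list of one-field tuples models a Bag. None of this is Pig's actual API.

```python
import json

def to_pig_like(value):
    """Convert a parsed JSON value into a Pig-like structure (illustrative only)."""
    if isinstance(value, dict):
        # JSON object -> Map with string keys; values are converted recursively.
        return {str(k): to_pig_like(v) for k, v in value.items()}
    if isinstance(value, list):
        # JSON array -> Bag, modeled here as a list of one-field tuples.
        return [(to_pig_like(item),) for item in value]
    # Strings, numbers, and any other scalar pass through unchanged.
    return value

record = to_pig_like(json.loads(
    '{"name": "a", "score": 1.5, "tags": ["x", "y"], "geo": {"lat": 2}}'
))
```

The point of the sketch is that a single map ends up holding values of several different types, which is the heterogeneous case the typing discussion below has to handle.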

On 1/14/11 2:15 PM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:

fwiw most of our maps wind up being mixes of string->double and
string->string.  Sometimes string->map and string->bag. Having non-string
keys would really help us but I know that was pulled for a reason.

D

On Fri, Jan 14, 2011 at 2:00 PM, Alan Gates <[EMAIL PROTECTED]> wrote:

> I think the big win of static typing is that from examining the script
> alone you can know the output:
>
> A = load 'bla' using BinStorage();
> B = foreach A generate $0 + $1;
>
> With static typing $0 and $1 will both be viewed as bytearrays and thus
> will be cast to doubles, regardless of how BinStorage actually instantiated
> them.  With dynamic types we cannot know the answers without knowing the
> data that is fed through.
>
> The downside of the static typing case is that we explicitly allow unknown
> types in maps:
>
> A = load 'bla' using AvroStorage(); -- assume bla has a schema of m:map
>                                     -- and that m has two keys, k1 and k2
>                                     -- both with integer values
> B = foreach  A generate m#k1 + m#k2;
>
> Using static types, B.$0 will be a double, even though the underlying types
> are ints.  Users will not see that as intuitive even though the semantic is
> clear.  In the dynamic model proposed by Daniel, B.$0 will be an int.
>
> We are mitigating this case by allowing typed maps (where the value type of
> the map is declarable) in 0.9.  But maps with heterogeneous value types will
> still suffer from this issue.
>
> I vote for static types for several reasons:
>
> 1) I like being able to know the output of the script by examining the
> script alone.  It provides a clear semantic that we can explain to users.
> 2) It's less of a maintenance cost, as the need to deal with dynamic type
> discovery is confined to the cast operator.  If we go full-out dynamic types,
> every expression operator has to be able to manage dynamism for byte arrays.
> 3) In my experience almost all maps are string->string so once we allow
> typed maps I suspect people will start using them heavily.
>
> I'm not sure there's a performance gain either way, since in both cases we
> have to manage the case where we think something is a bytearray and it turns
> out to be something else.
>
> Alan.
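
The contrast in Alan's message can be modeled with a small Python sketch. It is not Pig code, and `add_static` and `add_dynamic` are hypothetical names: the first mimics the static rule (an untyped operand is always cast to double), the second mimics the dynamic proposal (operands keep their run-time type).

```python
def add_static(a, b):
    # Static model: untyped (bytearray) operands are always cast to double,
    # so the result type is known from the script alone.
    return float(a) + float(b)

def add_dynamic(a, b):
    # Dynamic model: operands keep whatever type they have at run time.
    return a + b

# Map with integer values, as in the AvroStorage example above.
m = {"k1": 1, "k2": 2}

static_result = add_static(m["k1"], m["k2"])    # a float (double)
dynamic_result = add_dynamic(m["k1"], m["k2"])  # an int
```

Under the static rule the result is always a double regardless of the data, while under the dynamic rule the result type depends on what the loader produced.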
>
>
>
> On Jan 14, 2011, at 1:34 PM, Dmitriy Ryaboy wrote:
>
>> Agreed with what Scott said about procedurally building schemas, and what
>> Olga said about static typing.
>>
>> Daniel, I am not sure what you mean about run-time typing on a row by row
>> basis.  Certainly winding up with columns that are sometimes doubles,
>> sometimes floats, and sometimes ints can only lead to unexpected bugs?
>>
>> I know Yahoo went through a lot of pain with the LoadStore rework in 0.7
>> (heck I am still dealing with it), but seems like breaking compatibility in
>> a minor way in order to clean up semantics is ok given that we had a
>> "stable" version in between. I don't think conversion would be too onerous,
>> especially if declaring schemas is simplified.
>>
>> We can just say that odd versions can break apis and even can't :).
>>
>> D
>>
>> On Fri, Jan 14, 2011 at 12:27 PM, Scott Carey <[EMAIL PROTECTED]> wrote:
>>
>>
>>>
>>> On 1/13/11 10:54 PM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:
>>>
>>>> How is runtime detection done? I worry that if 1.txt contains:
>>>> 1, 2
>>>> 1.1, 2.2
>>>>
>>>> We get into a situation where addition of the fields in the first tuple
>>>> produces integers, and adding the fields of the second tuple produces
>>>> doubles.
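
The concern above can be reproduced with a minimal Python sketch. The `detect` function is a hypothetical stand-in for per-row run-time type detection (try integer first, then fall back to floating point); it is an assumption for illustration, not how Pig actually detects types.

```python
def detect(field):
    # Hypothetical run-time detection: try int first, then fall back to float.
    text = field.strip()
    try:
        return int(text)
    except ValueError:
        return float(text)

# Contents of 1.txt from the message above.
rows = ["1, 2", "1.1, 2.2"]

sums = []
for row in rows:
    left, right = (detect(f) for f in row.split(","))
    sums.append(left + right)

# The same output column now holds an int for row one and a float for row two.
result_types = [type(s).__name__ for s in sums]
```

With detection done row by row, the type of the output column varies with the data, which is the situation the message warns about.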
>>>>
>>>> A more invasive but perhaps easier to reason about solution might be to be
>>>> stricter about types, and require bytearrays to be cast to whatever type