Pig >> mail # dev >> Semantic cleanup: How to adding two bytearray

Re: Semantic cleanup: How to adding two bytearray
Maps are sometimes used to represent JSON or similar data structures.
The resulting Pig objects are Maps with String keys, where each value is a String, Number, Map, or Bag (nested recursively).
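
As a rough illustration of that shape (a Python sketch, not Pig code; the sample document and the `describe` helper are made up for this example), a parsed JSON object ends up as a map with string keys whose values are strings, numbers, nested maps, or bags:

```python
import json

# Hypothetical sample document; not taken from any real dataset.
doc = json.loads('{"name": "a", "score": 1.5, "tags": ["x", "y"], "geo": {"lat": 10}}')

def describe(value):
    """Name the Pig-style kind a parsed JSON value would correspond to."""
    if isinstance(value, str):
        return "String"
    if isinstance(value, bool):
        return "other"  # booleans are outside the four kinds listed above
    if isinstance(value, (int, float)):
        return "Number"
    if isinstance(value, dict):
        return "Map"
    if isinstance(value, list):
        return "Bag"
    return "other"

kinds = {key: describe(value) for key, value in doc.items()}
print(kinds)
```

The value kinds differ from key to key, which is why a single declared value type cannot describe such a map.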

On 1/14/11 2:15 PM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:

fwiw most of our maps wind up being mixes of string->double and
string->string.  Sometimes string->map and string->bag. Having non-string
keys would really help us but I know that was pulled for a reason.


On Fri, Jan 14, 2011 at 2:00 PM, Alan Gates <[EMAIL PROTECTED]> wrote:

> I think the big win of static typing is that from examining the script
> alone you can know the output:
> A = load 'bla' using BinStorage();
> B = foreach A generate $0 + $1;
> With static typing $0 and $1 will both be viewed as bytearrays and thus
> will be cast to doubles, regardless of how BinStorage actually instantiated
> them.  With dynamic types we cannot know the answers without knowing the
> data that is fed through.
> The downside of the static typing case is that we explicitly allow unknown
> types in maps:
> A = load 'bla' using AvroStorage(); -- assume bla has a schema of m:map
>                                     -- and that m has two keys, k1 and k2,
>                                     -- both with integer values
> B = foreach A generate m#'k1' + m#'k2';
> Using static types, B.$0 will be a double, even though the underlying types
> are ints.  Users will not see that as intuitive even though the semantic is
> clear.  In the dynamic model proposed by Daniel, B.$0 will be an int.
> We are mitigating this case by allowing typed maps (where the value type of
> the map is declarable) in 0.9.  But maps with heterogeneous value types will
> still suffer from this issue.
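
To make the contrast concrete, here is a toy Python model (not Pig internals; `static_add` and `dynamic_add` are names invented for this sketch) of adding two map values whose type the script does not declare. Under the static rule the operands are treated as bytearrays and cast to double; under the dynamic rule the runtime type of each value decides the result.

```python
def static_add(a, b):
    # Static rule: unknown-typed operands are always cast to double,
    # so the result type is known from the script alone.
    return float(a) + float(b)

def dynamic_add(a, b):
    # Dynamic rule: the result type follows whatever the data turns out to be.
    return a + b

m = {"k1": 1, "k2": 2}  # both values happen to be ints

s = static_add(m["k1"], m["k2"])
d = dynamic_add(m["k1"], m["k2"])
print(type(s).__name__, s)
print(type(d).__name__, d)
```

In this model the static result is always a double, while the dynamic result is an int only because the data happened to hold ints.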
> I vote for static types for several reasons:
> 1) I like being able to know the output of the script by examining the
> script alone.  It provides a clear semantic that we can explain to users.
> 2) It's less of a maintenance cost, as the need to deal with dynamic type
> discovery is confined to the cast operator.  If we go all out on dynamic
> types, every expression operator has to be able to manage dynamism for byte arrays.
> 3) In my experience almost all maps are string->string so once we allow
> typed maps I suspect people will start using them heavily.
> I'm not sure there's a performance gain either way, since in both cases we
> have to manage the case where we think something is a bytearray and it turns
> out to be something else.
> Alan.
> On Jan 14, 2011, at 1:34 PM, Dmitriy Ryaboy wrote:
>> Agreed with what Scott said about procedurally building schemas, and what
>> Olga said about static typing.
>> Daniel, I am not sure what you mean about run-time typing on a row by row
>> basis.  Certainly winding up with columns that are sometimes doubles,
>> sometimes floats, and sometimes ints can only lead to unexpected bugs?
>> I know Yahoo went through a lot of pain with the LoadStore rework in 0.7
>> (heck I am still dealing with it), but seems like breaking compatibility in
>> a minor way in order to clean up semantics is ok given that we had a
>> "stable" version in between. I don't think conversion would be too onerous,
>> especially if declaring schemas is simplified.
>> We can just say that odd versions can break apis and even can't :).
>> D
>> On Fri, Jan 14, 2011 at 12:27 PM, Scott Carey <[EMAIL PROTECTED]> wrote:
>>> On 1/13/11 10:54 PM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:
>>>> How is runtime detection done? I worry that if 1.txt contains:
>>>> 1, 2
>>>> 1.1, 2.2
>>>> We get into a situation where addition of the fields in the first tuple
>>>> produces integers, and adding the fields of the second tuple produces
>>>> doubles.
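
The concern can be reproduced with a small Python sketch (an assumed model of per-row detection, not how Pig actually implements it; `detect` is a hypothetical helper): each field is parsed as an int when possible and as a double otherwise, so the two rows of 1.txt produce sums of different types.

```python
def detect(field):
    # Assumed detection rule: try int first, fall back to double.
    text = field.strip()
    try:
        return int(text)
    except ValueError:
        return float(text)

rows = ["1, 2", "1.1, 2.2"]  # contents of 1.txt from the message above

sums = []
for row in rows:
    left, right = (detect(f) for f in row.split(","))
    sums.append(left + right)

print([type(s).__name__ for s in sums])
```

Under this assumed rule, the same output column holds an int for the first row and a double for the second.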
>>>> A more invasive but perhaps easier to reason about solution might be to be
>>>> stricter about types, and require bytearrays to be cast to whatever type