-
Semantic cleanup: How to adding two bytearray
Daniel Dai 2011-01-14, 04:58

One goal of the ongoing semantic cleanup work is to clarify the usage of the unknown type.

In the Pig schema system, a user can define the output schema for a LoadFunc/EvalFunc. Pig propagates those schemas through the entire script. Defining a schema for a LoadFunc/EvalFunc is optional. If the user doesn't define a schema, Pig marks the fields as bytearray. At run time, however, the user can feed in any data type. Previously, Pig assumed the runtime type of a bytearray was DataByteArray, which gave rise to several issues (PIG-1277, PIG-999, PIG-1016).

In 0.9, Pig will treat bytearray as an unknown type and inspect the object at runtime to figure out what the real type is. We've done that for all shuffle keys (PIG-1277). However, there are other cases. One case is adding two bytearrays. For example:

    a = load '1.txt' using SomeLoader() as (a0, a1); -- Assume SomeLoader does not define a schema, but actually feeds Integer
    b = foreach a generate a0+a1;

In Pig 0.8, the schema system marks a0 and a1 as bytearray. For a0+a1, Pig casts both a0 and a1 to double (in TypeCheckingVisitor) and marks the output schema of a0+a1 as double. Here is something interesting: SomeLoader loads Integer, and we get Double after adding. We can change this if we do the following:

1. Don't cast bytearray to double (in TypeCheckingVisitor).
2. Change POAdd (and similarly all other ExpressionOperators: multiply, divide, etc.) to handle bytearray. When the schema for POAdd is bytearray, Pig will figure out the data type at runtime and perform the addition according to the real type.

Pro:
1. Consistent with the goal of the unknown type cleanup: treat all bytearrays as unknown type, and inspect the object at runtime to find the real type.

Cons:
1. Slows down processing, since we need to inspect the object type at runtime.
2. Brings some indeterminism to the schema system. Previously a0+a1 was double, so the downstream schema was clearer.

Any comments?

Daniel
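The difference between the two behaviours in this proposal can be sketched in plain Java. This is only an illustration under stated assumptions: the class and method names (`DynamicAddSketch`, `addAsDouble`, `addByRuntimeType`) are hypothetical and are not Pig's POAdd or TypeCheckingVisitor, and the operands are modelled as ordinary `java.lang.Number` objects handed over by a loader.

```java
class DynamicAddSketch {
    // Models the Pig 0.8 behaviour described above: operands of unknown type
    // are converted to double before adding, so the result is always a Double.
    static Object addAsDouble(Object a, Object b) {
        return ((Number) a).doubleValue() + ((Number) b).doubleValue();
    }

    // Models the proposed behaviour: inspect the runtime type of the operands
    // and add in that type. Only a few cases are handled in this sketch.
    static Object addByRuntimeType(Object a, Object b) {
        if (a instanceof Integer && b instanceof Integer) {
            return (Integer) a + (Integer) b;
        }
        if (a instanceof Long && b instanceof Long) {
            return (Long) a + (Long) b;
        }
        if (a instanceof Number && b instanceof Number) {
            // Fallback for mixed or floating-point operands.
            return ((Number) a).doubleValue() + ((Number) b).doubleValue();
        }
        throw new IllegalArgumentException("operands are not numeric");
    }

    public static void main(String[] args) {
        System.out.println(addAsDouble(1, 2));      // prints 3.0
        System.out.println(addByRuntimeType(1, 2)); // prints 3
    }
}
```

The `instanceof` checks run once per call, which is where the per-row inspection cost listed under "Cons" would come from.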
-
Re: Semantic cleanup: How to adding two bytearray
Dmitriy Ryaboy 2011-01-14, 06:54

How is runtime detection done? I worry that if 1.txt contains:

    1, 2
    1.1, 2.2

we get into a situation where addition of the fields in the first tuple produces integers, and adding the fields of the second tuple produces doubles.

A more invasive but perhaps easier to reason about solution might be to be stricter about types, and require bytearrays to be cast to whatever type they are supposed to be if you want to add / delete / do non-byte-things to them.

This is a problem if UDFs that output tuples or bags don't specify schemas (and specifying schemas of tuples and bags is fairly onerous right now in Java). I am not sure what the solution here is, other than finding a clean, less onerous way of declaring schemas, fixing up everything in builtin and piggybank to only use the new clean sparkly api and document the heck out of it.

D
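The mixed-type concern raised here can be made concrete with a small Java sketch. It assumes a hypothetical loader that parses each field to the narrowest numeric type it fits; `RowByRowTypingSketch`, `parseField`, `add`, and `sumRows` are made-up names for illustration, not Pig APIs, and real loaders may behave differently.

```java
import java.util.ArrayList;
import java.util.List;

class RowByRowTypingSketch {
    // Hypothetical loader behaviour: try Integer first, fall back to Double.
    static Number parseField(String raw) {
        String s = raw.trim();
        try {
            return Integer.valueOf(s);
        } catch (NumberFormatException e) {
            return Double.valueOf(s);
        }
    }

    // Row-by-row addition: the result type follows the operand types.
    static Number add(Number a, Number b) {
        if (a instanceof Integer && b instanceof Integer) {
            return a.intValue() + b.intValue();
        }
        return a.doubleValue() + b.doubleValue();
    }

    // Adds the two comma-separated fields of every row.
    static List<Number> sumRows(String[] rows) {
        List<Number> sums = new ArrayList<>();
        for (String row : rows) {
            String[] fields = row.split(",");
            sums.add(add(parseField(fields[0]), parseField(fields[1])));
        }
        return sums;
    }

    public static void main(String[] args) {
        // The same output column ends up holding two different Java types.
        for (Number n : sumRows(new String[] {"1, 2", "1.1, 2.2"})) {
            System.out.println(n.getClass().getSimpleName());
        }
        // prints Integer, then Double
    }
}
```

Under these assumptions the single output column holds an Integer for the first row and a Double for the second, which is the situation the message worries about.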
-
Re: Semantic cleanup: How to adding two bytearray
Daniel Dai 2011-01-14, 20:11

Runtime detection can be done row by row. This will solve the problem in your sample, though at some cost in performance.

Requiring a cast before adding is also clean. However, this would break backward compatibility.
-
Re: Semantic cleanup: How to adding two bytearray
Scott Carey 2011-01-14, 20:27

On 1/13/11 10:54 PM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:

> This is a problem if UDFs that output tuples or bags don't specify schemas
> (and specifying schemas of tuples and bags is fairly onerous right now in
> Java). I am not sure what the solution here is, other than finding a clean,
> less onerous way of declaring schemas, fixing up everything in builtin and
> piggybank to only use the new clean sparkly api and document the heck out of
> it.

A longer term approach would likely strive to make schema specification of inputs and outputs for UDFs declarative and restrict the scope of the unknown. Building schema data structures procedurally is NotFun(tm). All languages could support a string-based schema representation, and many could use more type-safe declarations like Java annotations. I think there is a long-term opportunity to make Pig's type system easier to work with and higher performance, but it's no small project. Pig certainly isn't alone with these sorts of issues.
-
Re: Semantic cleanup: How to adding two bytearray
Thejas M Nair 2011-01-14, 21:00

What would happen in case the loader is PigStorage? The bytearray type would actually be a DataByteArray. Will it be cast to double in that case?

-Thejas
-
RE: Semantic cleanup: How to adding two bytearray
Olga Natkovich 2011-01-14, 21:12

I think the tradeoff between fully dynamic types and static types is between convenience (why should I tell you what the type is if the data is properly typed) on one side, and type-safety (what if your data has invalid values) and performance (dynamic typing would be slower) on the other.

My vote is for static typing because I believe type-safety (and clear schema definition) and performance are more important.

Olga
-
RE: Semantic cleanup: How to adding two bytearray
Olga Natkovich 2011-01-14, 21:16

Then the true type is DataByteArray so it would be used.

Olga

-----Original Message-----
From: Thejas M Nair [mailto:[EMAIL PROTECTED]]
Sent: Friday, January 14, 2011 1:01 PM
To: [EMAIL PROTECTED]; Jianyong Dai
Subject: Re: Semantic cleanup: How to adding two bytearray

What would happen in case the loader is PigStorage? The bytearray type would actually be a DataByteArray. Will it be cast to double in that case?

-Thejas
-
Re: Semantic cleanup: How to adding two bytearray
Dmitriy Ryaboy 2011-01-14, 21:34

Agreed with what Scott said about procedurally building schemas, and what Olga said about static typing.

Daniel, I am not sure what you mean about run-time typing on a row by row basis. Certainly winding up with columns that are sometimes doubles, sometimes floats, and sometimes ints can only lead to unexpected bugs?

I know Yahoo went through a lot of pain with the LoadStore rework in 0.7 (heck, I am still dealing with it), but it seems like breaking compatibility in a minor way in order to clean up semantics is ok given that we had a "stable" version in between. I don't think conversion would be too onerous, especially if declaring schemas is simplified.

We can just say that odd versions can break APIs and even ones can't :).

D
-
Re: Semantic cleanup: How to adding two bytearray
Dmitriy Ryaboy 2011-01-14, 21:35

> Certainly winding up with columns that are sometimes doubles, sometimes
> floats, and sometimes ints can only lead to unexpected bugs?

As opposed to expected bugs I guess... :-)
-
Re: Semantic cleanup: How to adding two bytearray
Julien Le Dem 2011-01-14, 21:57

As part of PIG-1480 I've implemented an annotation-based outputSchema definition similar to what I've done for Jython UDFs:

    @OutputSchema("relationships:{t:(id1:chararray, id2:chararray, status:chararray)}")

Parsing a schema like this in Pig can be done using org.apache.pig.impl.logicalLayer.parser.QueryParser:

    QueryParser parser = new QueryParser(new StringReader("relationships:{t:(id1:chararray, id2:chararray, status:chararray)}"));
    outputSchema = parser.TupleSchema();

In trunk you can use:

    org.apache.pig.impl.util.Utils.getSchemaFromString(String schemaString)

That could certainly be pulled out as an independent Jira.

Julien
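The declarative, annotation-based approach described in this message can be sketched without any Pig dependency. The `OutputSchema` annotation below is a local stand-in defined inside the example, not the one from PIG-1480, and `RelationshipsUdf` and `declaredSchema` are hypothetical names. The sketch only shows how a schema string attached to a class can be read back by reflection; it does not parse the string into a Pig schema.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

class OutputSchemaSketch {
    // Local stand-in annotation; retained at runtime so reflection can see it.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface OutputSchema {
        String value();
    }

    // A hypothetical UDF class that declares its output schema as a string.
    @OutputSchema("relationships:{t:(id1:chararray, id2:chararray, status:chararray)}")
    static class RelationshipsUdf {
    }

    // Returns the declared schema string, or null if the class has none.
    static String declaredSchema(Class<?> udfClass) {
        OutputSchema annotation = udfClass.getAnnotation(OutputSchema.class);
        return annotation == null ? null : annotation.value();
    }

    public static void main(String[] args) {
        System.out.println(declaredSchema(RelationshipsUdf.class));
        System.out.println(declaredSchema(String.class)); // prints null
    }
}
```

In a real setup the returned string would then be handed to a schema parser such as the ones named in the message.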
-
Re: Semantic cleanup: How to adding two bytearray
Alan Gates 2011-01-14, 22:00

I think the big win of static typing is that from examining the script alone you can know the output:

    A = load 'bla' using BinStorage();
    B = foreach A generate $0 + $1;

With static typing $0 and $1 will both be viewed as bytearrays and thus will be cast to doubles, regardless of how BinStorage actually instantiated them. With dynamic types we cannot know the answers without knowing the data that is fed through.

The downside of the static typing case is that we explicitly allow unknown types in maps:

    A = load 'bla' using AvroStorage();
    -- assume bla has a schema of m:map
    -- and that m has two keys, k1 and k2
    -- both with integer values
    B = foreach A generate m#'k1' + m#'k2';

Using static types, B.$0 will be a double, even though the underlying types are ints. Users will not see that as intuitive even though the semantic is clear. In the dynamic model proposed by Daniel, B.$0 will be an int. We are mitigating this case by allowing typed maps (where the value type of the map is declarable) in 0.9. But maps with heterogeneous value types will still suffer from this issue.

I vote for static types for several reasons:

1) I like being able to know the output of the script by examining the script alone. It provides a clear semantic that we can explain to users.
2) It's less of a maintenance cost, as the need to deal with dynamic type discovery is confined to the cast operator. If we go full out dynamic types every expression operator has to be able to manage dynamism for byte arrays.
3) In my experience almost all maps are string->string, so once we allow typed maps I suspect people will start using them heavily.

I'm not sure there's a performance gain either way, since in both cases we have to manage the case where we think something is a bytearray and it turns out to be something else.

Alan.
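The "know the output from the script alone" argument can be illustrated with a tiny result-type resolver that looks only at declared types, never at data. This is a simplified sketch, not Pig's TypeCheckingVisitor: `StaticTypingSketch`, `PigType`, and `resultOfAdd` are hypothetical, and the rules for one untyped operand and for mixed numeric operands are assumptions made for the example. Only the "two bytearrays become double" rule comes from the thread.

```java
class StaticTypingSketch {
    // Declaration order doubles as widening order for the numeric types.
    enum PigType { BYTEARRAY, INTEGER, LONG, FLOAT, DOUBLE }

    // Decides the declared type of (left + right) without seeing any data.
    static PigType resultOfAdd(PigType left, PigType right) {
        if (left == PigType.BYTEARRAY && right == PigType.BYTEARRAY) {
            // Rule stated in the thread: two untyped operands are cast to double.
            return PigType.DOUBLE;
        }
        // Assumption for this sketch: an untyped operand follows the typed one.
        if (left == PigType.BYTEARRAY) {
            return right;
        }
        if (right == PigType.BYTEARRAY) {
            return left;
        }
        // Assumption for this sketch: otherwise pick the wider numeric type.
        return left.compareTo(right) >= 0 ? left : right;
    }

    public static void main(String[] args) {
        System.out.println(resultOfAdd(PigType.BYTEARRAY, PigType.BYTEARRAY)); // prints DOUBLE
        System.out.println(resultOfAdd(PigType.INTEGER, PigType.LONG));        // prints LONG
    }
}
```

Because the answer depends only on the declared types, every downstream operator can know its input schema before any row is read, which is the property the static-typing votes in this thread are after.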
-
Re: Semantic cleanup: How to adding two bytearray
Julien Le Dem 2011-01-14, 22:01
I vote for static typing and clear schema definition as well.
If the store implementation does not provide a schema, then the user should.

Julien

On 1/14/11 1:12 PM, "Olga Natkovich" <[EMAIL PROTECTED]> wrote:

I think the tradeoff between fully dynamic types and static types is between convenience (why should I tell you what the type is if the data is properly typed) on one hand, and type-safety (what if your data has invalid values) and performance (dynamic typing would be slower) on the other. My vote is for static typing because I believe the type-safety (and clear schema definition) and performance are more important.

Olga

-----Original Message-----
From: Daniel Dai [mailto:[EMAIL PROTECTED]]
Sent: Friday, January 14, 2011 12:12 PM
To: [EMAIL PROTECTED]
Subject: Re: Semantic cleanup: How to adding two bytearray

Runtime detection can be done row by row. This will solve the problem in your sample, though it costs a little performance. Requiring casting before adding is also clean. However, this would break backward compatibility.
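
The "require casting before adding" option mentioned in the quoted message can be sketched as follows. This is a hedged Python illustration, not Pig's behavior: the `Unknown` wrapper, `cast_int`, and `strict_add` are hypothetical names standing in for an undeclared (bytearray) field, an explicit cast, and an arithmetic operator that refuses uncast operands.

```python
class Unknown:
    """Wraps a value whose type the schema does not declare."""

    def __init__(self, value: object) -> None:
        self.value = value


def cast_int(x: Unknown) -> int:
    """Explicit cast: the only place where the runtime value is inspected."""
    return int(x.value)


def strict_add(a: object, b: object):
    """Refuse arithmetic on operands that have not been cast."""
    if isinstance(a, Unknown) or isinstance(b, Unknown):
        raise TypeError("cast unknown-typed operands before adding")
    return a + b


a0, a1 = Unknown(1), Unknown(2)

# Allowed: both operands carry an explicit type after the cast.
total = strict_add(cast_int(a0), cast_int(a1))

# Rejected: the operands are still of unknown type.
try:
    strict_add(a0, a1)
    rejected = False
except TypeError:
    rejected = True
```

The second call is what breaks backward compatibility: a script that previously added two undeclared fields would now fail until casts are added.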
-
Re: Semantic cleanup: How to adding two bytearray
Dmitriy Ryaboy 2011-01-14, 22:15
fwiw most of our maps wind up being mixes of string->double and
string->string. Sometimes string->map and string->bag. Having non-string keys would really help us but I know that was pulled for a reason.

D
-
Re: Semantic cleanup: How to adding two bytearray
Julien Le Dem 2011-01-14, 22:40
Maps are sometimes used to represent JSON or similar data structures.
The resulting Pig objects are Maps with String keys and values being either: String, Number, Map, Bag (and recursively).

Julien
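
To illustrate why JSON-derived maps do not fit a single declared value type, here is a small Python sketch. It is only an analogy: a Python `dict` stands in for a Pig Map and a `list` for a Bag, and both the sample record and the helper `value_type_names` are hypothetical.

```python
import json


def value_type_names(m: dict) -> set:
    """Return the set of runtime type names found among a map's values."""
    return {type(v).__name__ for v in m.values()}


# Hypothetical JSON record: keys are always strings, value types vary.
record = json.loads(
    '{"name": "pig", "score": 1.5, "tags": ["a", "b"], "extra": {"k": 1}}'
)

types_seen = value_type_names(record)

# A single declared value type only fits when every value shares one type.
fits_typed_map = len(types_seen) == 1

print(sorted(types_seen))  # ['dict', 'float', 'list', 'str']
print(fits_typed_map)      # False
```

A map such as `{"a": "x", "b": "y"}` would pass the same check, which corresponds to the string->string case that typed maps are expected to cover.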