|
Jie Li
2011-12-03, 00:05
Jonathan Coveney
2011-12-03, 00:33
Jie Li
2011-12-03, 00:45
Dmitriy Ryaboy
2011-12-03, 02:16
Jie Li
2011-12-03, 02:42
Dmitriy Ryaboy
2011-12-04, 16:15
Dmitriy Ryaboy
2011-12-04, 16:17
|
-
Early projection and lazy casting
Jie Li 2011-12-03, 00:05
Hi all,

We just figured out that Pig 0.9.1 doesn't drop unnecessary fields as early as possible, which really hurts performance. This is despite http://ofps.oreilly.com/titles/9781449302641/load_and_store_funcs.html#loadfunc_loaderpushdown saying that "As part of its optimizations Pig analyzes Pig Latin scripts and determines what fields in an input it needs at each step in the script. It uses this information to aggressively drop fields it no longer needs."

We also found that Pig casts the data to the types declared in the schema, which is usually unnecessary, as most of those fields are dropped soon afterwards.

To work around this, we have to drop those fields manually and remove the types from the schema, neither of which is an interesting thing to have to do.

Jie
-
Re: Early projection and lazy casting
Jonathan Coveney 2011-12-03, 00:33
In what context? I always thought that Pig generally could drop fields early, but that it doesn't if you do joins. I'd be curious to hear more from someone who knows.
-
Re: Early projection and lazy casting
Jie Li 2011-12-03, 00:45
Why do joins prevent early projection? If anything, a join has the greatest need for it.

Jie
-
Re: Early projection and lazy casting
Dmitriy Ryaboy 2011-12-03, 02:16
Can you provide a script that shows projection not happening? We've observed the opposite (and use that fact extensively).

D
-
Re: Early projection and lazy casting
Jie Li 2011-12-03, 02:42
Sure. The two lines in bold (wrapped in asterisks) just drop the unnecessary fields. Without them Pig would not project, especially for the table lineitem.

lineitem = load '$input/lineitem' USING PigStorage('|') as
    (l_orderkey:long, l_partkey:long, l_suppkey:long, l_linenumber:long,
    l_quantity:double, l_extendedprice:double, l_discount:double, l_tax:double,
    l_returnflag:chararray, l_linestatus:chararray, l_shipdate:chararray,
    l_commitdate:chararray, l_receiptdate:chararray, l_shippingstruct:chararray,
    l_shipmode:chararray, l_comment:chararray);

part = load '$input/part' USING PigStorage('|') as
    (p_partkey:long, p_name:chararray, p_mfgr:chararray, p_brand:chararray,
    p_type:chararray, p_size:long, p_container:chararray, p_retailprice:double,
    p_comment:chararray);

*lineitem = foreach lineitem generate l_partkey, l_quantity, l_extendedprice ;*
part = FILTER part BY p_brand == 'Brand#23' AND p_container == 'MED BOX';
*part = foreach part generate p_partkey;*

COG1 = COGROUP part by p_partkey, lineitem by l_partkey;
COG1 = filter COG1 by COUNT(part) > 0;
COG2 = FOREACH COG1 GENERATE COUNT(part) as count_part, FLATTEN(lineitem),
    0.2 * AVG(lineitem.l_quantity) as l_avg;

COG3 = filter COG2 by l_quantity < l_avg;
COG = foreach COG3 generate (l_extendedprice * count_part) as l_sum;

G1 = group COG ALL;

result = foreach G1 generate SUM(COG.l_sum)/7.0;
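The script in this message shows only the manual projection. A minimal sketch of the other workaround mentioned at the start of the thread (removing the types from the schema so that casts are applied only to the fields that survive projection) might look like the following. The alias lineitem_raw is hypothetical and not part of the original script; fields declared without a type default to bytearray, and the explicit casts reproduce the types from the original schema.

```pig
-- Hypothetical alias lineitem_raw: field names only, no types,
-- so every column is loaded as bytearray and nothing is cast at load time.
lineitem_raw = LOAD '$input/lineitem' USING PigStorage('|') AS (
    l_orderkey, l_partkey, l_suppkey, l_linenumber,
    l_quantity, l_extendedprice, l_discount, l_tax,
    l_returnflag, l_linestatus, l_shipdate, l_commitdate,
    l_receiptdate, l_shippingstruct, l_shipmode, l_comment);

-- Project early and cast only the three columns the rest of the script reads.
lineitem = FOREACH lineitem_raw GENERATE
    (long) l_partkey AS l_partkey,
    (double) l_quantity AS l_quantity,
    (double) l_extendedprice AS l_extendedprice;
```

The rest of the script can then use lineitem unchanged, since the field names and types match the projected version above.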
-
Re: Early projection and lazy casting
Dmitriy Ryaboy 2011-12-04, 16:15
flatten(lineitem) uses all the fields from lineitem, hence no pruning.
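One possible way to make the field usage explicit, sketched here as an untested assumption rather than a confirmed fix, is to project the bag before flattening it, so that the FOREACH names only the columns the later steps read. Whether Pig 0.9.1 then prunes the remaining lineitem columns at load time is not established in this thread; the plans printed by EXPLAIN would have to be checked.

```pig
-- Assumes COG1 is the cogroup from Jie's script (without the two manual projections).
-- The bag projection lineitem.(...) keeps only the two columns used downstream.
COG2 = FOREACH COG1 GENERATE
    COUNT(part) AS count_part,
    FLATTEN(lineitem.(l_quantity, l_extendedprice)) AS (l_quantity, l_extendedprice),
    0.2 * AVG(lineitem.l_quantity) AS l_avg;

-- Print the logical, physical and MapReduce plans for inspection.
EXPLAIN COG2;
```

The explicit AS clause on the FLATTEN keeps the field names l_quantity and l_extendedprice available to the later FILTER and FOREACH steps.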
-
Re: Early projection and lazy casting
Dmitriy Ryaboy 2011-12-04, 16:17
Ah, I see: PIG-1324.
|