-
Early projection and lazy casting
Jie Li 2011-12-03, 00:05
Hi all,

We just found that Pig 0.9.1 doesn't drop unnecessary fields as early as possible, which really hurts performance, even though http://ofps.oreilly.com/titles/9781449302641/load_and_store_funcs.html#loadfunc_loaderpushdown says that "As part of its optimizations Pig analyzes Pig Latin scripts and determines what fields in an input it needs at each step in the script. It uses this information to aggressively drop fields it no longer needs."

We also found that Pig casts the data to the types declared in the schema, which is usually unnecessary because most of those fields are dropped soon afterwards.

To work around both issues, we have to drop those fields manually and remove the types from the schema, neither of which is interesting work.

Jie
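A minimal sketch of the workaround described in this message, not the poster's actual script. The path, aliases and field names are hypothetical. It relies on Pig's documented default that fields declared without a type are bytearray, and it assumes `$input` and `$output` are supplied via `-param`.

```pig
-- Hypothetical input: a '|'-delimited file with four columns.
-- Declaring no types leaves every field as bytearray, so no cast is applied
-- to columns that are about to be dropped.
raw = LOAD '$input/events' USING PigStorage('|') AS (id, name, amount, comment);

-- Project right after the load and cast only the fields that are used later.
slim = FOREACH raw GENERATE (long)id AS id, (double)amount AS amount;

by_id  = GROUP slim BY id;
totals = FOREACH by_id GENERATE group AS id, SUM(slim.amount) AS total;

STORE totals INTO '$output/totals';
```

The explicit FOREACH does by hand what the optimizer is expected to do automatically, so it should be harmless in cases where pruning already works.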
-
Re: Early projection and lazy casting
Jonathan Coveney 2011-12-03, 00:33
In what context? I always thought that it generally could, but that if you do joins it doesn't. Would be curious to know more from someone who knows...
-
Re: Early projection and lazy casting
Jie Li 2011-12-03, 00:45
Why do joins prevent the early projection? Actually join has the greatest need for it.

Jie
-
Re: Early projection and lazy casting
Dmitriy Ryaboy 2011-12-03, 02:16
Can you provide a script that shows projection not happening? We've observed the opposite (and use that fact extensively)

D
-
Re: Early projection and lazy casting
Jie Li 2011-12-03, 02:42
Sure. The two lines marked "manual projection" below (bold in the original mail) just drop the unnecessary fields. Without them Pig does not project, especially for the lineitem table.

lineitem = load '$input/lineitem' USING PigStorage('|') as
    (l_orderkey:long, l_partkey:long, l_suppkey:long, l_linenumber:long,
     l_quantity:double, l_extendedprice:double, l_discount:double, l_tax:double,
     l_returnflag:chararray, l_linestatus:chararray, l_shipdate:chararray,
     l_commitdate:chararray, l_receiptdate:chararray, l_shippingstruct:chararray,
     l_shipmode:chararray, l_comment:chararray);

part = load '$input/part' USING PigStorage('|') as
    (p_partkey:long, p_name:chararray, p_mfgr:chararray, p_brand:chararray,
     p_type:chararray, p_size:long, p_container:chararray, p_retailprice:double,
     p_comment:chararray);

lineitem = foreach lineitem generate l_partkey, l_quantity, l_extendedprice;  -- manual projection
part = FILTER part BY p_brand == 'Brand#23' AND p_container == 'MED BOX';
part = foreach part generate p_partkey;  -- manual projection

COG1 = COGROUP part by p_partkey, lineitem by l_partkey;
COG1 = filter COG1 by COUNT(part) > 0;
COG2 = FOREACH COG1 GENERATE COUNT(part) as count_part, FLATTEN(lineitem),
    0.2 * AVG(lineitem.l_quantity) as l_avg;

COG3 = filter COG2 by l_quantity < l_avg;
COG = foreach COG3 generate (l_extendedprice * count_part) as l_sum;

G1 = group COG ALL;

result = foreach G1 generate SUM(COG.l_sum)/7.0;
-
Re: Early projection and lazy casting
Dmitriy Ryaboy 2011-12-04, 16:15
flatten(lineitem) uses all the fields from lineitem, hence no pruning.
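To illustrate the point about FLATTEN, here is a hedged sketch of how the COG2 statement from the script earlier in the thread could name only the fields that later steps use. It assumes the same aliases (COG1, part, lineitem) and uses Pig's bag projection syntax `bag.(field, field)`. Nobody in the thread confirms that Pig 0.9.1 prunes the remaining lineitem columns at load time with this form, so treat it as something to test rather than a known fix.

```pig
-- Assumes COG1 is the filtered COGROUP of part and lineitem from the script above.
-- Projecting inside the bag before flattening references only two lineitem fields,
-- instead of FLATTEN(lineitem), which references all of them.
COG2 = FOREACH COG1 GENERATE
           COUNT(part) AS count_part,
           FLATTEN(lineitem.(l_quantity, l_extendedprice)),
           0.2 * AVG(lineitem.l_quantity) AS l_avg;
```

The later statements (COG3, COG) refer only to l_quantity, l_extendedprice, count_part and l_avg, so they should not need to change.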
-
Re: Early projection and lazy casting
Dmitriy Ryaboy 2011-12-04, 16:17
Ah I see, PIG-1324..