Pig >> mail # dev >> Early projection and lazy casting


Re: Early projection and lazy casting
Sure. The two statements wrapped in asterisks (bold in the original mail) just
drop the unnecessary fields. Without them Pig does not project early,
especially for the lineitem table.

lineitem = load '$input/lineitem' USING PigStorage('|') as
(l_orderkey:long, l_partkey:long, l_suppkey:long, l_linenumber:long,
l_quantity:double, l_extendedprice:double, l_discount:double, l_tax:double,
l_returnflag:chararray, l_linestatus:chararray, l_shipdate:chararray,
l_commitdate:chararray, l_receiptdate:chararray,l_shippingstruct:chararray,
l_shipmode:chararray, l_comment:chararray);

part = load '$input/part' USING PigStorage('|') as (p_partkey:long,
p_name:chararray, p_mfgr:chararray, p_brand:chararray, p_type:chararray,
p_size:long, p_container:chararray, p_retailprice:double,
p_comment:chararray);

*lineitem = foreach lineitem generate l_partkey, l_quantity,
l_extendedprice ;*
part = FILTER part BY p_brand == 'Brand#23' AND p_container == 'MED BOX';
*part = foreach part generate p_partkey;*

COG1 = COGROUP part by p_partkey, lineitem by l_partkey;
COG1 = filter COG1 by COUNT(part) > 0;
COG2 = FOREACH COG1 GENERATE COUNT(part) as count_part, FLATTEN(lineitem),
0.2 * AVG(lineitem.l_quantity) as l_avg;

COG3 = filter COG2 by l_quantity < l_avg;
COG = foreach COG3 generate (l_extendedprice * count_part) as l_sum;

G1 = group COG ALL;

result = foreach G1 generate SUM(COG.l_sum)/7.0;
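
The script above shows only the manual projection. The other half of the workaround mentioned in the thread, removing the types from the schema, is not shown; a minimal sketch of it might look like the following. It assumes that untyped fields are loaded as bytearray and that casting just the three fields the query uses is enough. The alias lineitem_raw is a hypothetical name introduced here for illustration.

```pig
-- Sketch only: declare no types, so every field stays a bytearray at load time.
-- lineitem_raw is a hypothetical alias, not part of the original script.
lineitem_raw = load '$input/lineitem' USING PigStorage('|') as
(l_orderkey, l_partkey, l_suppkey, l_linenumber,
l_quantity, l_extendedprice, l_discount, l_tax,
l_returnflag, l_linestatus, l_shipdate,
l_commitdate, l_receiptdate, l_shippingstruct,
l_shipmode, l_comment);

-- Project early and cast only the fields the rest of the query needs.
lineitem = foreach lineitem_raw generate
(long)l_partkey as l_partkey,
(double)l_quantity as l_quantity,
(double)l_extendedprice as l_extendedprice;
```

The part relation could be handled the same way, keeping p_brand and p_container untyped until the filter and casting only p_partkey.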

On Fri, Dec 2, 2011 at 9:16 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:

> Can you provide a script that shows projection not happening? We've
> observed the opposite (and use that fact extensively)
>
> D
>
> On Fri, Dec 2, 2011 at 4:05 PM, Jie Li <[EMAIL PROTECTED]> wrote:
> > Hi all,
> >
> > We just figured out that Pig 0.9.1 doesn't drop unnecessary fields as
> > early as possible, which really affects performance, even though
> > http://ofps.oreilly.com/titles/9781449302641/load_and_store_funcs.html#loadfunc_loaderpushdown
> > says that "As part of its optimizations Pig analyzes Pig Latin scripts and
> > determines what fields in an input it needs at each step in the script. It
> > uses this information to aggressively drop fields it no longer needs."
> >
> > We also found that Pig casts the data into the types defined in the schema,
> > which is usually unnecessary, as most of the fields will soon be dropped.
> >
> > To work around these, we have to manually drop those fields and remove the
> > types in the schema, which are really not interesting.
> >
> > Jie
>
>