Pig >> mail # dev >> Early projection and lazy casting


Early projection and lazy casting
Hi all,

We just found that Pig 0.9.1 doesn't drop unneeded fields as early as it
could, which really hurts performance. This is despite
http://ofps.oreilly.com/titles/9781449302641/load_and_store_funcs.html#loadfunc_loaderpushdown
saying that "As part of its optimizations Pig analyzes Pig Latin scripts and
determines what fields in an input it needs at each step in the script. It
uses this information to aggressively drop fields it no longer needs."

We also found that Pig casts the data to the types declared in the schema,
which is usually unnecessary, since most of those fields are dropped soon
afterwards.

To work around these issues, we have to drop those fields manually and remove
the types from the schema, which is really not a satisfying fix.
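
A minimal sketch of that workaround, assuming a hypothetical tab-separated
input with five fields of which only two are needed. The paths, relation
names, field names, and the aggregation are made up for illustration and are
not taken from a real job:

```pig
-- Hypothetical input: declare field names only, with no types, so every
-- field is loaded as a bytearray and nothing is cast up front.
raw = LOAD 'input/events' USING PigStorage('\t')
      AS (user, url, ts, referrer, agent);

-- Manual early projection: keep only the fields the rest of the script uses.
slim = FOREACH raw GENERATE user, ts;

-- Cast explicitly, and only the fields that survived the projection.
typed = FOREACH slim GENERATE (chararray)user AS user, (long)ts AS ts;

grouped = GROUP typed BY user;
counts  = FOREACH grouped GENERATE group AS user,
                                   COUNT(typed) AS n,
                                   MAX(typed.ts) AS last_ts;

STORE counts INTO 'output/user_counts';
```

With a typed schema such as AS (user:chararray, url:chararray, ts:long, ...)
the script would be shorter, but that is the form in which we saw the extra
casting described above.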

Jie