

Storing statistics of input dataset
Hello everyone,

I came across this excellent post about storing column statistics in Hive: http://www.cloudera.com/blog/2012/08/column-statistics-in-hive/

Does Pig gather statistics similar to what Hive does? I think gathering such statistics would be very helpful, not only for a cost-based optimizer but also in other cases, such as knowing the row count or the histogram of the underlying data. In my case, I am working on cube computation for holistic measures, where I need to know the row count; based on it, I can load a sample dataset to determine the partition factor for large groups. I am sure gathering statistics and persisting them would help with other cases/optimizations as well.
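
To be concrete, here is roughly the kind of manual workaround I mean today, as a minimal Pig Latin sketch; the input path, schema, and sampling fraction are placeholders, not from any real dataset:

records   = LOAD 'input_data' USING PigStorage(',') AS (key:chararray, value:long);
grouped   = GROUP records ALL;
row_count = FOREACH grouped GENERATE COUNT(records) AS n;  -- extra pass just to learn the count
DUMP row_count;
-- with the count known, a sampling fraction can be chosen and a sample drawn
sampled   = SAMPLE records 0.01;

If statistics were gathered and persisted, the row count (and histograms) could simply be looked up instead of being recomputed with an extra job on every run.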

If I am right, Pig doesn't use cost-based estimation while optimizing the logical plan; instead, I believe it uses rule-based heuristics (please correct me if I am wrong). Having statistics about the datasets would help provide better optimizations (similar to the join optimization in the blog post). Any thoughts about having such statistics in Pig and implementing an ANALYZE command for gathering them?
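
For illustration only, the statement I have in mind might look something like the following; this is purely hypothetical syntax, modeled loosely on Hive's ANALYZE TABLE, and nothing like it exists in Pig today:

-- hypothetical: compute and persist row count and per-column statistics for a dataset
ANALYZE 'input_data' COMPUTE STATISTICS FOR COLUMNS key, value;

The gathered statistics could then be consulted by the optimizer, and by scripts like the sampling example above.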

Thanks
-- Prasanth Jayachandran
