Pig, mail # dev - Storing statistics of input dataset


Re: Storing statistics of input dataset
Bill Graham 2012-08-07, 03:51
There are a few open JIRAs that are related to refactoring the query plan
code to allow for stats-based runtime optimizations:

https://issues.apache.org/jira/browse/PIG-483
https://issues.apache.org/jira/browse/PIG-2784

If anyone has thoughts or opinions on the suggested design changes, those
JIRAs would be a good place to chime in.
On Mon, Aug 6, 2012 at 5:18 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:

> +1 to that.
>
> We can get stats from the Hive metadata catalog via HCat. Loaders can
> already implement the LoadStatistics interface -- and if HCatLoader
> does this, we can create them via Hive and use that team's great work.
> We should also allow stats to be passed (and modified appropriately)
> through the dag, and instrument intermediate data writers to collect
> stats and send telemetry back for improved flow planning, but that's a
> separate conversation.
>
> D
>
> On Mon, Aug 6, 2012 at 10:35 AM, Alan Gates <[EMAIL PROTECTED]> wrote:
> > Pig does not have a metadata store, so it doesn't store statistics on
> > data.  However, through HCatalog it will have access to the same
> > statistics that Hive stores.
> >
> > As far as using this data to optimize Pig operations, I'd like to rework
> > the backend to start taking advantage of such statistics when available
> > (either from metadata like this or statistics that are generated on the
> > fly as scripts are executed).  I also hope to share as much of this work
> > as possible with Hive so that both can benefit.
> >
> > Alan.
> >
> > On Aug 5, 2012, at 1:12 AM, Prasanth J wrote:
> >
> >> Hello everyone
> >>
> >> Came across this excellent post about storing column statistics in Hive:
> >> http://www.cloudera.com/blog/2012/08/column-statistics-in-hive/
> >>
> >> Does Pig gather statistics similar to what Hive does? I think gathering
> >> such statistics would be very helpful, not only for a cost-based
> >> optimizer but also in other cases, such as knowing the row count or the
> >> histogram of the underlying data. In my case, I am working on cube
> >> computation for holistic measures, where I need to know the row count so
> >> that I can load a sample data set to determine the partition factor for
> >> large groups. I am sure gathering and persisting statistics will help
> >> with other cases and optimizations as well.
> >>
> >> If I am right, Pig doesn't use cost-based estimation while optimizing
> >> the logical plan; instead, I believe it uses rules of thumb (please
> >> correct me if I am wrong). Having statistics about the datasets would
> >> help provide better optimization (similar to the join optimization in
> >> the blog post). Any thoughts about having such statistics in Pig and
> >> implementing an ANALYZE command for gathering statistics?
> >>
> >> Thanks
> >> -- Prasanth Jayachandran
> >>
> >
>

--
*Note that I'm no longer using my Yahoo! email address. Please email me at
[EMAIL PROTECTED] going forward.*
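
To make the stats-based join optimization discussed in this thread concrete, here is a minimal sketch of how a planner might use row-count statistics when they are available and fall back to a default when they are not. It is not Pig code: DatasetStats, choose_join, and the 100 MB memory budget are hypothetical names and values chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetStats:
    """Hypothetical per-input statistics; None means the value is unknown."""
    num_rows: Optional[int] = None
    avg_row_bytes: Optional[int] = None

    def estimated_bytes(self) -> Optional[int]:
        # A size estimate is only possible when both values are known.
        if self.num_rows is None or self.avg_row_bytes is None:
            return None
        return self.num_rows * self.avg_row_bytes


def choose_join(left: DatasetStats, right: DatasetStats,
                memory_budget_bytes: int = 100 * 1024 * 1024) -> str:
    """Pick a join strategy from statistics (assumed policy, not Pig's).

    Returns "replicated" when at least one input is known to fit within
    the memory budget, otherwise the default "shuffle" join.
    """
    sizes = [left.estimated_bytes(), right.estimated_bytes()]
    known = [size for size in sizes if size is not None]
    if any(size <= memory_budget_bytes for size in known):
        return "replicated"
    # No statistics, or both inputs too large: keep the safe default.
    return "shuffle"


if __name__ == "__main__":
    small = DatasetStats(num_rows=1_000, avg_row_bytes=100)
    big = DatasetStats(num_rows=1_000_000_000, avg_row_bytes=200)
    print(choose_join(big, small))
    print(choose_join(big, big))
    print(choose_join(DatasetStats(), DatasetStats()))
```

The point of the sketch is the fallback: when no statistics exist, the planner behaves exactly as a rule-based optimizer would, so adding statistics can only refine the plan.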