Re: Storing statistics of input dataset
Bill Graham 2012-08-07, 03:51
There are a few open JIRAs that are related to refactoring the query plan
code to allow for stats-based runtime optimizations:
If anyone has thoughts/opinions around suggested design changes, those
JIRAs could be a good place to chime in.
On Mon, Aug 6, 2012 at 5:18 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> + 1 to that.
> We can get stats from the Hive metadata catalog via HCat. Loaders can
> already implement the LoadStatistics interface -- and if HCatLoader
> does this, we can create them via Hive and use that team's great work.
> We should also allow stats to be passed (and modified appropriately)
> through the dag, and instrument intermediate data writers to collect
> stats and send telemetry back for improved flow planning, but that's a
> separate conversation.
> On Mon, Aug 6, 2012 at 10:35 AM, Alan Gates <[EMAIL PROTECTED]> wrote:
> > Pig does not have a metadata store, so it doesn't store statistics on
> > data. However, through HCatalog it will have access to the same statistics
> > that Hive stores.
> > As far as using this data to optimize Pig operations, I'd like to rework
> > the backend to start taking advantage of such statistics when available
> > (either from metadata like this or statistics that are generated on the fly
> > as scripts are executed). I also hope to share as much of this work as
> > possible with Hive so that both can benefit.
> > Alan.
> > On Aug 5, 2012, at 1:12 AM, Prasanth J wrote:
> >> Hello everyone
> >> Came across this excellent post about storing column statistics in Hive
> >> Does Pig gather statistics similar to what Hive does? I think gathering
> >> such statistics would be very helpful, not only for a cost-based optimizer
> >> but also in other cases, like knowing the row count or the histogram of
> >> the underlying data. In my case, I am working on cube computation for
> >> holistic measures, where I need to know the row count; based on it I can
> >> load a sample data set to determine the partition factor for large groups.
> >> I am sure gathering statistics and persisting them will help with other
> >> cases/optimizations as well.
> >> If I am right, Pig doesn't use cost-based estimation while optimizing
> >> the logical plan; instead, I believe it uses rules of thumb (please correct
> >> me if I am wrong). Having statistics about the datasets would help provide
> >> better optimization (similar to the join optimization in the blog post).
> >> Any thoughts about having such statistics in Pig and implementing an ANALYZE
> >> command for gathering statistics?
> >> Thanks
> >> -- Prasanth Jayachandran
*Note that I'm no longer using my Yahoo! email address. Please email me at
[EMAIL PROTECTED] going forward.*
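
To make the idea discussed in this thread a little more concrete, here is a rough sketch of how a planner might use input statistics when they are available and fall back to a rule of thumb when they are not. Everything in it is an assumption made for illustration: the class StatsJoinSketch, the InputStats holder, the JoinStrategy enum, and the byte threshold are hypothetical names and are not part of Pig, Hive, or HCatalog. It shows two decisions mentioned above: picking a join strategy from input sizes, and deriving a sampling fraction from a known row count.

```java
import java.util.Optional;

// Hypothetical sketch only: none of these types are Pig, Hive, or HCatalog APIs.
final class StatsJoinSketch {

    /** Statistics for one input; either value may be unknown (null). */
    static final class InputStats {
        final Optional<Long> rowCount;
        final Optional<Long> sizeInBytes;

        InputStats(Long rowCount, Long sizeInBytes) {
            this.rowCount = Optional.ofNullable(rowCount);
            this.sizeInBytes = Optional.ofNullable(sizeInBytes);
        }
    }

    enum JoinStrategy { REPLICATED, SHUFFLE }

    /**
     * Chooses a replicated (map-side) join if at least one input is known to
     * fit under the threshold; otherwise falls back to a shuffle join, which
     * is the safe default when statistics are missing.
     */
    static JoinStrategy chooseJoin(InputStats left, InputStats right, long maxReplicatedBytes) {
        boolean leftFits = left.sizeInBytes.map(s -> s <= maxReplicatedBytes).orElse(false);
        boolean rightFits = right.sizeInBytes.map(s -> s <= maxReplicatedBytes).orElse(false);
        return (leftFits || rightFits) ? JoinStrategy.REPLICATED : JoinStrategy.SHUFFLE;
    }

    /**
     * Returns the fraction of rows to sample so that roughly targetSampleRows
     * rows are read, capped at 1.0. Empty if the row count is unknown or zero.
     */
    static Optional<Double> sampleFraction(InputStats stats, long targetSampleRows) {
        return stats.rowCount
                .filter(n -> n > 0)
                .map(n -> Math.min(1.0, (double) targetSampleRows / n));
    }

    public static void main(String[] args) {
        InputStats small = new InputStats(1_000L, 10L << 20);          // about 10 MB
        InputStats big = new InputStats(1_000_000_000L, 500L << 30);   // about 500 GB
        long threshold = 100L << 20;                                   // assumed 100 MB limit

        System.out.println(chooseJoin(big, small, threshold));
        System.out.println(chooseJoin(big, big, threshold));
        System.out.println(sampleFraction(big, 1_000_000L));
    }
}
```

The point of the Optional fields is that a loader may report only some statistics, or none at all, so each decision needs a sensible default rather than assuming the numbers exist.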