Pig >> mail # dev >> Re: Pig Roadmap


Re: Pig Roadmap
Hi Daniel,
You misunderstood me :) I'm proposing to do this by, let's say, the end
of the year. I'm not proposing it for GSoC 2013; that's a different topic.
I've been working with Oracle, and to improve my SQL knowledge I went
through a lot of resources about the CBO (cost-based optimizer). Using
that knowledge I can help, and we can implement the CBO together. Of
course there are multiple steps involved.
Here are my suggestions:

- I think the CBO is mainly responsible for producing the explain plan.
- There are two ways it could decide on a plan. It can use what it
already has, such as file size, column size, etc., or it can sample the
data to get a histogram, the distinct count of the file, the distinct
count of each column, etc. There's already an issue about sampling. After
that we should gather stats. I think sampling is the better option.
- When a Pig statement is executed, an explain plan must be generated:
for instance, which join method to use and in which order, or whether to
use an index and, if none exists, create it, etc. That way the Pig CBO
can rewrite the statement in the most efficient way. Maybe we can also
add a visual screen so the user can look at the plan. To do that, how can
I read the plan?
- After completing the above items, we can improve the CBO to rewrite the
join methods. For example, when the user specifies a join method, the CBO
calculates the cost of every join method and, based on the result, can
choose a different join method than the one the user specified.
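
To make the last two points concrete, here is a rough sketch in Python
(not Pig code) of sampling join keys and then costing each join method.
Everything in it is an assumption for illustration: the names
sample_stats and choose_join, the memory budget, and the cost formulas
are made up and are not part of Pig. Only the join method names
(default, replicated, skewed) match what Pig already offers.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass
class ColumnStats:
    """Statistics estimated from a sample of join keys (estimates only)."""
    sampled_rows: int
    distinct_in_sample: int
    top_key_fraction: float  # share of sampled rows held by the most common key


def sample_stats(keys, sample_size, seed=0):
    """Draw a uniform random sample of the keys and summarize it."""
    rng = random.Random(seed)
    sample = keys if len(keys) <= sample_size else rng.sample(keys, sample_size)
    counts = Counter(sample)
    top = counts.most_common(1)[0][1] if counts else 0
    fraction = top / len(sample) if sample else 0.0
    return ColumnStats(len(sample), len(counts), fraction)


def choose_join(left_bytes, right_bytes, key_stats,
                memory_budget=100 * 1024 * 1024):
    """Return (cheapest_method, costs). The formulas are hypothetical."""
    total = float(left_bytes + right_bytes)
    costs = {
        # Shuffle both inputs; a hot key overloads one reducer, so the
        # cost grows with the share of rows held by the most common key.
        "default": 2.0 * total * (1.0 + key_stats.top_key_fraction),
        # Extra sampling pass, but hot keys are spread across reducers.
        "skewed": 2.2 * total,
    }
    # A replicated join needs the smaller input to fit in memory.
    if min(left_bytes, right_bytes) <= memory_budget:
        costs["replicated"] = total  # map-side only, no shuffle
    return min(costs, key=costs.get), costs


if __name__ == "__main__":
    stats = sample_stats([1] * 900 + list(range(100)), sample_size=100)
    method, costs = choose_join(10 * 1024**3, 8 * 1024**3, stats)
    print("user asked for 'default', optimizer suggests:", method)
```

The optimizer would compare the returned method with the one the user
gave and could override it when the estimated cost is lower.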

What do you think about these items?

Thanks
Best Regards...

On Mon, Apr 8, 2013 at 9:13 PM, Daniel Dai <[EMAIL PROTECTED]> wrote:

> I'm not aware of any open Jira tickets for that, but we can create
> one easily. I am interested in a cost-based optimizer; however, this
> is a big topic. We will need to figure out how to collect stats, what
> stats to collect, where to store them, and how to use them. I wonder
> whether this could be finished in the GSoC time frame. It seems more
> realistic to get some join improvements done within the time frame,
> such as the fuzzy join you proposed (other join improvements I can
> think of are indexed join, unequal join, and semijoin).
>
> Thanks,
> Daniel
>
> On Sat, Apr 6, 2013 at 2:17 AM, burakkk <[EMAIL PROTECTED]> wrote:
> > Hi,
> > I looked through Pig's roadmap page and I'm interested in working on
> > some of the items. It seems you might be working on these items, but
> > I couldn't find any issues in Jira about them. Is anyone working on
> > them, and if not, how can I contribute? Should I create issues for
> > them, or what should I do?
> >
> > - Statistics for Optimizer
> > - Cost-Based Optimizer Impl.
> > - Runtime Optimizations (Query rewrite)
> >
> >
> > Thanks
> > Best regards...
> >
> > --
> >
> > BURAK ISIKLI | http://burakisikli.wordpress.com
>

--

BURAK ISIKLI | http://burakisikli.wordpress.com