Pig >> mail # dev >> Re: [jira] [Commented] (PIG-1324) Logical Optimizer: Nested column pruning


Re: [jira] [Commented] (PIG-1324) Logical Optimizer: Nested column pruning
Hi Daniel,

Thanks for the example. Does the current pruning run before each
statement, or only once, right after LOAD? From the output I can only see a
single pruning pass for each table.

Implementation aside, are there any semantic issues with the pruning?
For example,

A = load '1.txt' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate COUNT(A);

If we prune A.a1 and A.a2, then A becomes NULL if a0 is NULL. Maybe the
COUNT operator is a little special.

Jie
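
The COUNT question above can be explored with a small Python model. This is
only a sketch: it assumes COUNT skips a tuple whose first field is null (as
Pig's documentation describes) and that a COUNT_STAR-style count includes every
tuple. The sample rows and all function names are hypothetical, and None stands
in for a Pig null.

```python
# Toy rows for '1.txt' as (a0, a1, a2); None models a Pig null (made-up data).
ROWS = [(1, "x", 10), (1, None, 20), (None, "y", 30), (None, "z", 40)]


def count_skip_null_first(bag):
    """Assumed COUNT rule: ignore tuples that are empty or start with a null."""
    return sum(1 for t in bag if len(t) > 0 and t[0] is not None)


def count_star(bag):
    """Assumed COUNT_STAR rule: count every tuple in the bag."""
    return len(bag)


def grouped_counts(rows, keep, count):
    """Group by a0 (column 0), keep only the columns in `keep` inside each
    bag, then apply `count` to every bag."""
    groups = {}
    for row in rows:
        projected = tuple(row[i] for i in keep)
        groups.setdefault(row[0], []).append(projected)
    return {key: count(bag) for key, bag in groups.items()}


unpruned = grouped_counts(ROWS, [0, 1, 2], count_skip_null_first)
only_a0 = grouped_counts(ROWS, [0], count_skip_null_first)
only_a1 = grouped_counts(ROWS, [1], count_skip_null_first)
star_unpruned = grouped_counts(ROWS, [0, 1, 2], count_star)
star_empty = grouped_counts(ROWS, [], count_star)
```

Under these assumptions, pruning a1 and a2 leaves the result unchanged because
a0 is still the first field of each tuple, but a pruning that changes which
column comes first (only_a1) does change the counts. A count that ignores
field values is unaffected by any pruning in this model.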

On Sun, Dec 4, 2011 at 2:40 PM, Daniel Dai (Commented) (JIRA) <
[EMAIL PROTECTED]> wrote:

>
>    [
> https://issues.apache.org/jira/browse/PIG-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162448#comment-13162448]
>
> Daniel Dai commented on PIG-1324:
> ---------------------------------
>
> Hi, Jie,
> It's certainly solvable, but we need a new data structure and algorithm.
> Currently the algorithm works from the bottom up, finding the required input
> columns of each statement. But if an input column is a bag, we don't trace
> into the bag. Here is an example:
>
> A = load '1.txt' as (a0, a1, a2);
> B = filter A by a0==1;
> C = foreach B generate a1;
>
> From the bottom up, we first see that C needs B.a1, and B needs A.a0 (plus
> the A.a1 that C needs), so the loader in A infers that a2 is unnecessary.
> However, in the group-by sample:
>
> A = load '1.txt' as (a0, a1, a2);
> B = group A by a0;
> C = foreach B generate group, SUM(A.a1);
>
> From C, we figure out that the required fields are B.group and B.A, but we
> don't further mark that we only need B.A.a1; the current data structure does
> not support it.
>
> > Logical Optimizer: Nested column pruning
> > ----------------------------------------
> >
> >                 Key: PIG-1324
> >                 URL: https://issues.apache.org/jira/browse/PIG-1324
> >             Project: Pig
> >          Issue Type: Sub-task
> >          Components: impl
> >    Affects Versions: 0.7.0
> >            Reporter: Daniel Dai
> >            Assignee: Daniel Dai
> >
> > Currently, column pruning does not prune sub-fields inside a complex
> > data type. For example:
> > A = load '1.txt' as (a0, a1, a2);
> > B = group A by a0;
> > C = foreach B generate group, SUM(A.a1);
> > Currently, because we group A as a bag and some part of that bag is used
> > in the following statement, none of the fields inside A can be pruned.
> > We should keep track of sub-fields and figure out that a2 is not actually
> > needed.
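
The difference between top-level and nested tracking described in this thread
can be sketched in Python. This is an illustrative model only, not Pig's
optimizer code: field references are written as path tuples, and every name
below (SCHEMA_A, C_USES, required_flat, required_nested, loader_columns,
filter_example) is hypothetical.

```python
SCHEMA_A = ["a0", "a1", "a2"]  # A = load '1.txt' as (a0, a1, a2)
GROUP_KEY = "a0"               # B = group A by a0

# Fields of B used by: C = foreach B generate group, SUM(A.a1)
C_USES = [("group",), ("A", "a1")]


def required_flat(uses):
    """Top-level tracking: record that B.A is used, but not which part."""
    return {(path[0],) for path in uses}


def required_nested(uses):
    """Nested tracking: keep full paths, so B.A.a1 differs from B.A."""
    return set(uses)


def loader_columns(required_b):
    """Map the required fields of B back to the columns of A to load."""
    needed = {GROUP_KEY}  # the group key is needed to build B at all
    for path in required_b:
        if path[0] == "A":
            if len(path) == 1:
                needed.update(SCHEMA_A)  # whole bag marked: nothing prunable
            else:
                needed.add(path[1])      # only the referenced sub-field
    return [col for col in SCHEMA_A if col in needed]


def filter_example():
    """Flat case: B = filter A by a0==1; C = foreach B generate a1."""
    needed_by_c = {"a1"}
    needed_by_b = {"a0"} | needed_by_c  # the filter passes C's needs through
    return [col for col in SCHEMA_A if col in needed_by_b]


flat_cols = loader_columns(required_flat(C_USES))
nested_cols = loader_columns(required_nested(C_USES))
prunable = [col for col in SCHEMA_A if col not in nested_cols]
```

In this model the flat tracker has to load all three columns for the group-by
script, while the path-based tracker finds that a2 can be pruned. For the
filter script, flat tracking is already enough to drop a2.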