Pig, mail # dev - Re: [jira] [Commented] (PIG-1324) Logical Optimizer: Nested column pruning


Jie Li 2011-12-04, 20:50
Hi Daniel,

Thanks for the example. Does the current pruning happen before each
statement, or only right after LOAD? I ask because the output shows just a
single pruning pass for each table.

Implementation aside, is there any semantic issue with the pruning?
For example,

A = load '1.txt' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate COUNT(A);

If we prune A.a1 and A.a2, then a tuple of A consists only of a0, so it is
entirely NULL whenever a0 is NULL. Maybe the COUNT operator is a special case.
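
The concern can be explored with a small Python model. It assumes that COUNT skips any tuple whose first field is null, which is how the Pig documentation describes COUNT (COUNT_STAR counts such tuples too); the thread itself does not state this rule. The helper names `count_like_pig` and `project` and the sample data are hypothetical.

```python
def count_like_pig(bag):
    """Count tuples whose first field is not null (assumed COUNT rule)."""
    return sum(1 for t in bag if len(t) > 0 and t[0] is not None)


def project(bag, keep):
    """Keep only the given column positions of every tuple (models pruning)."""
    return [tuple(t[i] for i in keep) for t in bag]


# Two bags as "group A by a0" might produce them; columns are (a0, a1, a2).
null_key_group = [(None, 1, 2), (None, 3, 4)]   # group whose key a0 is null
other_group = [(7, None, 2), (7, 3, 4)]         # group whose key a0 is 7

for name, bag in [("null key", null_key_group), ("key 7", other_group)]:
    full = count_like_pig(bag)
    only_a0 = count_like_pig(project(bag, [0]))        # a1 and a2 pruned
    without_a0 = count_like_pig(project(bag, [1, 2]))  # a0 pruned instead
    print(name, full, only_a0, without_a0)
```

In this model, pruning a1 and a2 leaves every count unchanged because the first field is still a0, while pruning a0 changes which field is first and can change the result.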

Jie

On Sun, Dec 4, 2011 at 2:40 PM, Daniel Dai (Commented) (JIRA) <
[EMAIL PROTECTED]> wrote:

>
>    [
> https://issues.apache.org/jira/browse/PIG-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162448#comment-13162448]
>
> Daniel Dai commented on PIG-1324:
> ---------------------------------
>
> Hi, Jie,
> It's certainly solvable, but we need a new data structure and algorithm.
> Currently the algorithm works from the bottom up, finding the required input
> columns of each statement. But if an input column is a bag, we don't trace
> into the bag. Here is an example:
>
> A = load '1.txt' as (a0, a1, a2);
> B = filter A by a0==1;
> C = foreach B generate a1;
>
> From the bottom up, we first see that C needs B.a1, and B needs A.a0 (plus
> the A.a1 that C needs), so the loader for A infers that a2 is unnecessary.
> However, in the group-by example:
>
> A = load '1.txt' as (a0, a1, a2);
> B = group A by a0;
> C = foreach B generate group, SUM(A.a1);
>
> From C, we work out the required fields B.group and B.A, but we do not
> further record that we only need B.A.a1; the current data structure does
> not support that.
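
The bottom-up pass described here, and its blind spot for bags, can be sketched as a toy Python model. This is an illustration only, not Pig's optimizer code: the plan encoding (`op`, `input`, `reads`, `schema`), the function name `prunable_columns`, and the `"*"` marker meaning "every field of the bag" are all hypothetical.

```python
def prunable_columns(plan):
    """Return {load alias: columns that no later statement needs}."""
    required = {}  # alias -> set of its output columns needed downstream
    result = {}
    for stmt in reversed(plan):  # walk from the last statement up to the loader
        alias = stmt["alias"]
        downstream = required.get(alias, set())
        if stmt["op"] == "load":
            if "*" in downstream:
                result[alias] = []  # a whole bag was needed: keep everything
            else:
                result[alias] = [c for c in stmt["schema"] if c not in downstream]
            continue
        needs = required.setdefault(stmt["input"], set())
        needs |= stmt["reads"]  # columns this statement itself references
        if stmt["op"] == "filter":
            needs |= downstream  # a filter passes columns through unchanged
        elif stmt["op"] == "group" and stmt["input"] in downstream:
            needs.add("*")  # flat model: the bag is one opaque column


    return result


# A = load ...; B = filter A by a0==1; C = foreach B generate a1;
filter_plan = [
    {"alias": "A", "op": "load", "schema": ["a0", "a1", "a2"]},
    {"alias": "B", "op": "filter", "input": "A", "reads": {"a0"}},
    {"alias": "C", "op": "foreach", "input": "B", "reads": {"a1"}},
]

# A = load ...; B = group A by a0; C = foreach B generate group, SUM(A.a1);
group_plan = [
    {"alias": "A", "op": "load", "schema": ["a0", "a1", "a2"]},
    {"alias": "B", "op": "group", "input": "A", "reads": {"a0"}},
    {"alias": "C", "op": "foreach", "input": "B", "reads": {"group", "A"}},
]

print(prunable_columns(filter_plan))
print(prunable_columns(group_plan))
```

In this model the filter script lets the loader drop a2, while the group-by script prunes nothing, because the requirement on B.A is recorded without any sub-field detail.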
>
> > Logical Optimizer: Nested column pruning
> > ----------------------------------------
> >
> >                 Key: PIG-1324
> >                 URL: https://issues.apache.org/jira/browse/PIG-1324
> >             Project: Pig
> >          Issue Type: Sub-task
> >          Components: impl
> >    Affects Versions: 0.7.0
> >            Reporter: Daniel Dai
> >            Assignee: Daniel Dai
> >
> > Currently, column pruning does not prune sub-fields inside a complex
> > data type. For example:
> > A = load '1.txt' as (a0, a1, a2);
> > B = group A by a0;
> > C = foreach B generate group, SUM(A.a1);
> > Currently, because we group A as a bag and some part of the bag is used
> > in the following statement, none of the fields inside A can be pruned.
> > We should keep track of sub-fields and figure out that a2 is not actually needed.
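
One way to picture the sub-field tracking requested here is to record each required field as a path rather than a single name. The Python sketch below is a hypothetical illustration, not the design adopted for PIG-1324: the function `prunable_with_subfields` and the tuple-based path encoding are assumptions made for the example.

```python
def prunable_with_subfields(schema, group_key, bag_alias, required_paths):
    """Return loader columns that can be pruned for a GROUP followed by a FOREACH.

    schema         -- column names produced by the loader
    group_key      -- column the relation is grouped by
    bag_alias      -- name of the bag in the grouped relation (the input alias)
    required_paths -- paths the FOREACH needs, e.g. ("A", "a1") for A.a1
    """
    needed = {group_key}  # the grouping key is always read
    for path in required_paths:
        if path[0] != bag_alias:
            continue  # e.g. ("group",) is already covered by the key
        if len(path) == 1:
            return []  # the whole bag is used, so nothing can be pruned
        needed.add(path[1])  # only this sub-field of the bag is used
    return [c for c in schema if c not in needed]


schema = ["a0", "a1", "a2"]

# C = foreach B generate group, SUM(A.a1);
print(prunable_with_subfields(schema, "a0", "A", [("group",), ("A", "a1")]))

# A statement that uses the bag as a whole, with no sub-field projection.
print(prunable_with_subfields(schema, "a0", "A", [("group",), ("A",)]))
```

With the path ("A", "a1") recorded, the sketch reports a2 as prunable; when the bag is referenced as a whole, it conservatively keeps every column.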
>