Pig >> mail # user >> Nested foreach for nested bags loses first level data


Re: Nested foreach for nested bags loses first level data
Hi Carlo,

>> It looks like, if I nest a foreach loop inside another foreach, I'm no
longer able to project the first-level fields.

PIG-3581 tried to fix this, but it introduced a regression. In trunk,
targetDate is actually resolved, but date in your filter expression
isn't. I am not entirely sure whether defining a local scalar variable
inside a nested foreach is supposed to be supported or not.

Please see my comment in the JIRA:
https://issues.apache.org/jira/browse/PIG-3581?focusedCommentId=13855935&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13855935
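
If you are on a build where PIG-3581's fix applies (trunk, where targetDate
does resolve), one thing you could try is to drop the local scalar entirely
and inline the ToDate() call in the filter expression. This is an untested
sketch, not something I have run against your data:

```pig
-- Untested sketch: reference targetDate directly in the filter instead of
-- binding a local scalar named date inside the nested foreach.
result = FOREACH crossed {
    filtered    = FILTER events
                  BY DaysBetween(ToDate(d, 'yyyy-MM-dd'),
                                 ToDate(targetDate, 'yyyy-MM-dd')) < 2
                  AND SecondsBetween(ToDate(d, 'yyyy-MM-dd'),
                                     ToDate(targetDate, 'yyyy-MM-dd')) > 0;
    uniqueUsers = DISTINCT filtered.uid;
    GENERATE group AS aid, targetDate AS date, COUNT(uniqueUsers) AS result;
}
```

This only sidesteps the date scalar; it won't help on 0.11 if targetDate
itself fails to resolve there.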

Thanks,
Cheolsoo
On Thu, Dec 19, 2013 at 5:36 AM, Carlo Di Fulco <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I'm writing a script to perform some analytics on a set of events occurring
> in a set of apps.
> I'm using Pig 0.11 and Hadoop 1.3.
>
> Every event contains:
>
> - d: date of the event
> - aid: app id
> - uid: user id
>
> The aim of my script is to calculate, for each application and each day
> in my log, the number of unique users during the previous x days (2 in
> the example code).
>
> After trying various approaches without success, my current script looks
> like this:
>
> ________________________________________________________________
>
> /**
>  * describe events output:
>  *
>  * events: {d: chararray,aid: chararray,uid: chararray}
>  */
>
> eventDates = FOREACH events GENERATE d as targetDate;
> dates      = DISTINCT eventDates;
> crossed    = CROSS (GROUP events BY (aid)), dates;
>
> /**
>  * describe crossed output:
>  *
>  * crossed: {1-7::group: chararray,
>  *           1-7::events: {(d: chararray,aid: chararray,uid: chararray)},
>  *           dates::targetDate: chararray}
>  */
>
> result = FOREACH crossed {
>     date        = ToDate(targetDate, 'yyyy-MM-dd');
>     filtered    = FILTER events
>                   BY DaysBetween(ToDate(d, 'yyyy-MM-dd'), date) < 2
>                   AND SecondsBetween(ToDate(d, 'yyyy-MM-dd'), date) > 0;
>     uniqueUsers = DISTINCT filtered.uid;
>     GENERATE group AS aid, targetDate AS date, COUNT(uniqueUsers) AS result;
> }
>
> describe result;
> dump result;
> ________________________________________________________________
>
> At this point I get the following error:
>
> 2013-12-19 05:20:17,283 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1025:
> <file script.pig, line 46, column 25> Invalid field projection. Projected
> field [targetDate] does not exist in schema:
> d:bytearray,aid:chararray,uid:chararray.
>
> Line 46 is equivalent to:
>
>     date = ToDate(targetDate, 'yyyy-MM-dd');
>
>
> But if I hardcode the date instead of reading it from the "crossed" bag:
>
>    date = ToDate('2013-12-01', 'yyyy-MM-dd');
>
> It actually works.
>
> It looks like, if I nest a foreach loop inside another foreach, I'm no
> longer able to project the first-level fields.
>
> Any idea why this happens? Or perhaps a better way to achieve the same
> result?
>
>
> Forgive any stupidity I may have written; this is my first approach to Pig
> scripting! Any suggestions are highly appreciated.
>
> Thanks and Regards,
> Carlo
>
> --
> Carlo Di Fulco
>