Pig, mail # user - Nested foreach for nested bags looses first level data


Re: Nested foreach for nested bags looses first level data
Cheolsoo Park 2013-12-23, 21:23
Hi Carlo,

>> It looks like if I nest a foreach loop inside another foreach I'm not
able to project any more the first level fields.

PIG-3581 tried to fix this, but it introduced a regression. In trunk,
targetDate is actually resolved, but date in your filter expression
is not. I am not entirely sure whether defining a local scalar variable
inside a nested foreach is supposed to be supported at all.

Please see my comment in the JIRA:
https://issues.apache.org/jira/browse/PIG-3581?focusedCommentId=13855935&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13855935
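
Until that is settled, one way to sidestep the nested scope altogether is to
do the date comparison at the top level, where every field is visible, and
only then group. The following is an untested sketch, not a confirmed fix. It
assumes events is already loaded with the schema shown in the script below,
and the aliases pairs, inWindow, slim and grouped are made up for
illustration. The window condition is copied unchanged from the original
script.

```pig
-- Assumes: events: {d: chararray, aid: chararray, uid: chararray}
eventDates = FOREACH events GENERATE d AS targetDate;
dates      = DISTINCT eventDates;

-- Pair every event with every target date. The filter then runs at the
-- top level, so targetDate does not have to be resolved from inside a
-- nested FILTER.
pairs    = CROSS events, dates;
inWindow = FILTER pairs BY
    DaysBetween(ToDate(events::d, 'yyyy-MM-dd'),
                ToDate(dates::targetDate, 'yyyy-MM-dd')) < 2
    AND SecondsBetween(ToDate(events::d, 'yyyy-MM-dd'),
                       ToDate(dates::targetDate, 'yyyy-MM-dd')) > 0;

-- Drop the relation prefixes so the grouping keys are easy to reference.
slim = FOREACH inWindow GENERATE events::aid       AS aid,
                                 dates::targetDate AS targetDate,
                                 events::uid       AS uid;

-- Count distinct users per (app, target date).
grouped = GROUP slim BY (aid, targetDate);
result  = FOREACH grouped {
    users = DISTINCT slim.uid;
    GENERATE FLATTEN(group) AS (aid, date), COUNT(users) AS result;
}
```

Two caveats: the CROSS of events with dates can get large, and (aid, date)
pairs with no events in the window are simply absent from the output instead
of appearing with a count of 0.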

Thanks,
Cheolsoo
On Thu, Dec 19, 2013 at 5:36 AM, Carlo Di Fulco <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I'm writing a script to perform some analytics on a set of events occurring
> in a set of apps.
> I'm using Pig 0.11 and Hadoop 1.3.
>
> Every event contains:
>
> - d: date of the event
> - aid: app id
> - uid: user id
>
> The aim of my script is to calculate for each application and for each day
> in my log the number of unique users during the previous x days (in the
> example code that is 2).
>
> After trying various approaches with no result, my current script looks
> like:
>
> ________________________________________________________________
>
> /**
>  * describe events output:
>  *
>  * events: {d: chararray,aid: chararray,uid: chararray}
>  */
>
> eventDates = FOREACH events GENERATE d as targetDate;
> dates      = DISTINCT eventDates;
> crossed    = CROSS (GROUP events BY (aid)), dates;
>
> /**
>  * describe crossed output:
>  *
>  * crossed: {1-7::group: chararray,1-7::events: {(d: chararray,aid: chararray,uid: chararray)},dates::targetDate: chararray}
>  */
>
> result = FOREACH crossed {
>     date        = ToDate(targetDate, 'yyyy-MM-dd');
>     filtered    = FILTER events BY DaysBetween(ToDate(d, 'yyyy-MM-dd'), date) < 2
>                                 AND SecondsBetween(ToDate(d, 'yyyy-MM-dd'), date) > 0;
>     uniqueUsers = DISTINCT filtered.uid;
>     GENERATE group as aid, targetDate as date, COUNT(uniqueUsers) as result;
> }
>
> describe result;
> dump result;
> ________________________________________________________________
>
> At this point I get the following error:
>
> 2013-12-19 05:20:17,283 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
> <file script.pig, line 46, column 25> Invalid field projection. Projected field [targetDate] does not exist in schema: d:bytearray,aid:chararray,uid:chararray.
>
> Line 46 is equivalent to:
>
>     date = ToDate(targetDate, 'yyyy-MM-dd');
>
>
> But if I hardcode the date instead of reading it from the "crossed" bag:
>
>    date = ToDate('2013-12-01', 'yyyy-MM-dd');
>
> It actually works.
>
> It looks like if I nest a foreach loop inside another foreach I'm not able
> to project any more the first level fields.
>
> Any idea why this happens? Or is there perhaps a better way to achieve the
> same result?
>
>
> Forgive any stupidity I may have written, this is my first approach to Pig
> scripting! Any suggestion is highly appreciated.
>
> Thanks and Regards,
> Carlo
>
> --
> Carlo Di Fulco
>