Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Pig, mail # user - Problems with union, projection producing unexpected results


+
James Kebinger 2011-02-16, 23:57
+
Jonathan Coveney 2011-02-17, 15:41
+
James Kebinger 2011-02-17, 17:32
Copy link to this message
-
Re: Problems with union, projection producing unexpected results
James Kebinger 2011-02-17, 19:01
https://issues.apache.org/jira/browse/PIG-1859

On Thu, Feb 17, 2011 at 12:32 PM, James Kebinger <[EMAIL PROTECTED]>wrote:

> Interesting, maybe I should file a bug report then?
>
>
> On Thu, Feb 17, 2011 at 10:41 AM, Jonathan Coveney <[EMAIL PROTECTED]>wrote:
>
>> I am glad that you got this in a replicatable form! I have seen this error
>> as well (where the output is just the last value repeated instead of the
>> multiples that you want), but wasn't able to give a concrete example.
>>
>> 2011/2/16 James Kebinger <[EMAIL PROTECTED]>
>>
>> > Hello all, I've been scratching my head over a problem with a pig script
>> > I'm
>> > having, and hoping another set of eyeballs will help. I'm using pig 0.8,
>> in
>> > local mode
>> >
>> > Here's my simplified use case:
>> >
>> > I have a log file with events on pages, and the id of the event can be a
>> > users login or a users numeric id:
>> >
>> > 2010-05-14,james
>> > 2010-05-15,123
>> > 2010-05-15,23
>> > 2010-05-15,456
>> > 2010-05-15,notjames
>> >
>> > So i want to join a set of users on either the login or user id.
>> >
>> > Here's my users:
>> >
>> > 123,james,11
>> > 234,notjames,11
>> > 456,someoneelse,11
>> >
>> >
>> > So I thought I would be clever and load the user list, union it with
>> itself
>> > to generate a relation where each user is represented twice, once by
>> login,
>> > once by id:
>> >
>> > logins = FOREACH users GENERATE LOWER(login) as matching_id, user_id as
>> > actual_user_id;
>> > user_ids = FOREACH users GENERATE user_id as matching_id, user_id as
>> > actual_user_id;
>> > user_id_or_login_lookup = UNION logins, user_ids;
>> >
>> >
>> >
>> > user_id_or_login_lookup: {matching_id: chararray,actual_user_id:
>> chararray}
>> > (123,123)
>> > (234,234)
>> > (456,456)
>> > (james,123)
>> > (notjames,234)
>> > (someoneelse,456)
>> >
>> > Then join on that, by the first column, and project that away, leaving
>> just
>> > the event info and the numeric id.
>> >
>> > views_with_id = JOIN profile_views by viewed_user_id,
>> > user_id_or_login_lookup by matching_id;
>> >
>> > That is not working however. My joined relation looks like this (which
>> is
>> > what I expect)
>> >
>> > views_with_id: {profile_views::date:
>> > chararray,profile_views::viewed_user_id:
>> > chararray,user_id_or_login_lookup::matching_id:
>> > chararray,user_id_or_login_lookup::actual_user_id: chararray}
>> >
>> > (2010-05-15,123,123,123)
>> > (2010-05-15,456,456,456)
>> > (2010-05-14,james,james,123)
>> > (2010-05-15,notjames,notjames,234)
>> >
>> > But when I project as follows: views_with_id_projected = FOREACH
>> > views_with_id GENERATE date, viewed_user_id,
>> > user_id_or_login_lookup::actual_user_id;
>> >
>> > The result is not what I expect
>> >
>> > (2010-05-15,123,123)
>> > (2010-05-15,456,456)
>> > (2010-05-14,james,james)
>> > (2010-05-15,notjames,notjames)
>> >
>> > To be clear, I expect
>> >
>> > (2010-05-15,123,123)
>> > (2010-05-15,456,456)
>> > (2010-05-14,james,123)
>> > (2010-05-15,notjames,456)
>> >
>> >
>> > Can anyone give me a push in the right direction?
>> >
>> > Thanks, James
>> >
>> > Here's my full pig script:
>> >
>> > users = LOAD 'patients-test.txt' USING PigStorage(',') AS
>> > (user_id:chararray, login:chararray, disease_id: chararray);
>> > profile_views = LOAD 'patient-views-test.txt' USING PigStorage(',')
>> > AS(date:
>> > chararray, viewed_user_id:chararray);
>> >
>> > dump users;
>> > dump profile_views;
>> >
>> > -- build a relation so that users are present to join by login or
>> user_id
>> > logins = FOREACH users GENERATE LOWER(login) as matching_id, user_id as
>> > actual_user_id;
>> > user_ids = FOREACH users GENERATE user_id as matching_id, user_id as
>> > actual_user_id;
>> > user_id_or_login_lookup = UNION logins, user_ids;
>> >
>> > dump user_id_or_login_lookup;
>> > describe user_id_or_login_lookup;
>> >
>> > views_with_id = JOIN profile_views by viewed_user_id,
>> > user_id_or_login_lookup by matching_id;
>> >
>> > describe views_with_id;