|
|
-
Problems with union, projection producing unexpected results
James Kebinger 2011-02-16, 23:57
Hello all, I've been scratching my head over a problem with a pig script I'm having, and hoping another set of eyeballs will help. I'm using pig 0.8, in local mode
Here's my simplified use case:
I have a log file with events on pages, and the id of the event can be a users login or a users numeric id:
2010-05-14,james 2010-05-15,123 2010-05-15,23 2010-05-15,456 2010-05-15,notjames
So i want to join a set of users on either the login or user id.
Here's my users:
123,james,11 234,notjames,11 456,someoneelse,11 So I thought I would be clever and load the user list, union it with itself to generate a relation where each user is represented twice, once by login, once by id:
logins = FOREACH users GENERATE LOWER(login) as matching_id, user_id as actual_user_id; user_ids = FOREACH users GENERATE user_id as matching_id, user_id as actual_user_id; user_id_or_login_lookup = UNION logins, user_ids;
user_id_or_login_lookup: {matching_id: chararray,actual_user_id: chararray} (123,123) (234,234) (456,456) (james,123) (notjames,234) (someoneelse,456)
Then join on that, by the first column, and project that away, leaving just the event info and the numeric id.
views_with_id = JOIN profile_views by viewed_user_id, user_id_or_login_lookup by matching_id;
That is not working however. My joined relation looks like this (which is what I expect)
views_with_id: {profile_views::date: chararray,profile_views::viewed_user_id: chararray,user_id_or_login_lookup::matching_id: chararray,user_id_or_login_lookup::actual_user_id: chararray}
(2010-05-15,123,123,123) (2010-05-15,456,456,456) (2010-05-14,james,james,123) (2010-05-15,notjames,notjames,234)
But when I project as follows: views_with_id_projected = FOREACH views_with_id GENERATE date, viewed_user_id, user_id_or_login_lookup::actual_user_id;
The result is not what I expect
(2010-05-15,123,123) (2010-05-15,456,456) (2010-05-14,james,james) (2010-05-15,notjames,notjames)
To be clear, I expect
(2010-05-15,123,123) (2010-05-15,456,456) (2010-05-14,james,123) (2010-05-15,notjames,456) Can anyone give me a push in the right direction?
Thanks, James
Here's my full pig script:
users = LOAD 'patients-test.txt' USING PigStorage(',') AS (user_id:chararray, login:chararray, disease_id: chararray); profile_views = LOAD 'patient-views-test.txt' USING PigStorage(',') AS(date: chararray, viewed_user_id:chararray);
dump users; dump profile_views;
-- build a relation so that users are present to join by login or user_id logins = FOREACH users GENERATE LOWER(login) as matching_id, user_id as actual_user_id; user_ids = FOREACH users GENERATE user_id as matching_id, user_id as actual_user_id; user_id_or_login_lookup = UNION logins, user_ids;
dump user_id_or_login_lookup; describe user_id_or_login_lookup;
views_with_id = JOIN profile_views by viewed_user_id, user_id_or_login_lookup by matching_id;
describe views_with_id; --STORE views_with_id into 'ep-views.txt';
dump views_with_id;
views_with_id_projected = FOREACH views_with_id GENERATE date, viewed_user_id, user_id_or_login_lookup::actual_user_id; dump views_with_id_projected;
+
James Kebinger 2011-02-16, 23:57
-
Re: Problems with union, projection producing unexpected results
Jonathan Coveney 2011-02-17, 15:41
I am glad that you got this in a replicatable form! I have seen this error as well (where the output is just the last value repeated instead of the multiples that you want), but wasn't able to give a concrete example.
2011/2/16 James Kebinger <[EMAIL PROTECTED]>
> Hello all, I've been scratching my head over a problem with a pig script > I'm > having, and hoping another set of eyeballs will help. I'm using pig 0.8, in > local mode > > Here's my simplified use case: > > I have a log file with events on pages, and the id of the event can be a > users login or a users numeric id: > > 2010-05-14,james > 2010-05-15,123 > 2010-05-15,23 > 2010-05-15,456 > 2010-05-15,notjames > > So i want to join a set of users on either the login or user id. > > Here's my users: > > 123,james,11 > 234,notjames,11 > 456,someoneelse,11 > > > So I thought I would be clever and load the user list, union it with itself > to generate a relation where each user is represented twice, once by login, > once by id: > > logins = FOREACH users GENERATE LOWER(login) as matching_id, user_id as > actual_user_id; > user_ids = FOREACH users GENERATE user_id as matching_id, user_id as > actual_user_id; > user_id_or_login_lookup = UNION logins, user_ids; > > > > user_id_or_login_lookup: {matching_id: chararray,actual_user_id: chararray} > (123,123) > (234,234) > (456,456) > (james,123) > (notjames,234) > (someoneelse,456) > > Then join on that, by the first column, and project that away, leaving just > the event info and the numeric id. > > views_with_id = JOIN profile_views by viewed_user_id, > user_id_or_login_lookup by matching_id; > > That is not working however. My joined relation looks like this (which is > what I expect) > > views_with_id: {profile_views::date: > chararray,profile_views::viewed_user_id: > chararray,user_id_or_login_lookup::matching_id: > chararray,user_id_or_login_lookup::actual_user_id: chararray} > > (2010-05-15,123,123,123) > (2010-05-15,456,456,456) > (2010-05-14,james,james,123) > (2010-05-15,notjames,notjames,234) > > But when I project as follows: views_with_id_projected = FOREACH > views_with_id GENERATE date, viewed_user_id, > user_id_or_login_lookup::actual_user_id; > > The result is not what I expect > > (2010-05-15,123,123) > (2010-05-15,456,456) > (2010-05-14,james,james) > (2010-05-15,notjames,notjames) > > To be clear, I expect > > (2010-05-15,123,123) > (2010-05-15,456,456) > (2010-05-14,james,123) > (2010-05-15,notjames,456) > > > Can anyone give me a push in the right direction? > > Thanks, James > > Here's my full pig script: > > users = LOAD 'patients-test.txt' USING PigStorage(',') AS > (user_id:chararray, login:chararray, disease_id: chararray); > profile_views = LOAD 'patient-views-test.txt' USING PigStorage(',') > AS(date: > chararray, viewed_user_id:chararray); > > dump users; > dump profile_views; > > -- build a relation so that users are present to join by login or user_id > logins = FOREACH users GENERATE LOWER(login) as matching_id, user_id as > actual_user_id; > user_ids = FOREACH users GENERATE user_id as matching_id, user_id as > actual_user_id; > user_id_or_login_lookup = UNION logins, user_ids; > > dump user_id_or_login_lookup; > describe user_id_or_login_lookup; > > views_with_id = JOIN profile_views by viewed_user_id, > user_id_or_login_lookup by matching_id; > > describe views_with_id; > --STORE views_with_id into 'ep-views.txt'; > > dump views_with_id; > > views_with_id_projected = FOREACH views_with_id GENERATE date, > viewed_user_id, user_id_or_login_lookup::actual_user_id; > > > dump views_with_id_projected; >
+
Jonathan Coveney 2011-02-17, 15:41
-
Re: Problems with union, projection producing unexpected results
James Kebinger 2011-02-17, 17:32
Interesting, maybe I should file a bug report then?
On Thu, Feb 17, 2011 at 10:41 AM, Jonathan Coveney <[EMAIL PROTECTED]>wrote:
> I am glad that you got this in a replicatable form! I have seen this error > as well (where the output is just the last value repeated instead of the > multiples that you want), but wasn't able to give a concrete example. > > 2011/2/16 James Kebinger <[EMAIL PROTECTED]> > > > Hello all, I've been scratching my head over a problem with a pig script > > I'm > > having, and hoping another set of eyeballs will help. I'm using pig 0.8, > in > > local mode > > > > Here's my simplified use case: > > > > I have a log file with events on pages, and the id of the event can be a > > users login or a users numeric id: > > > > 2010-05-14,james > > 2010-05-15,123 > > 2010-05-15,23 > > 2010-05-15,456 > > 2010-05-15,notjames > > > > So i want to join a set of users on either the login or user id. > > > > Here's my users: > > > > 123,james,11 > > 234,notjames,11 > > 456,someoneelse,11 > > > > > > So I thought I would be clever and load the user list, union it with > itself > > to generate a relation where each user is represented twice, once by > login, > > once by id: > > > > logins = FOREACH users GENERATE LOWER(login) as matching_id, user_id as > > actual_user_id; > > user_ids = FOREACH users GENERATE user_id as matching_id, user_id as > > actual_user_id; > > user_id_or_login_lookup = UNION logins, user_ids; > > > > > > > > user_id_or_login_lookup: {matching_id: chararray,actual_user_id: > chararray} > > (123,123) > > (234,234) > > (456,456) > > (james,123) > > (notjames,234) > > (someoneelse,456) > > > > Then join on that, by the first column, and project that away, leaving > just > > the event info and the numeric id. > > > > views_with_id = JOIN profile_views by viewed_user_id, > > user_id_or_login_lookup by matching_id; > > > > That is not working however. My joined relation looks like this (which is > > what I expect) > > > > views_with_id: {profile_views::date: > > chararray,profile_views::viewed_user_id: > > chararray,user_id_or_login_lookup::matching_id: > > chararray,user_id_or_login_lookup::actual_user_id: chararray} > > > > (2010-05-15,123,123,123) > > (2010-05-15,456,456,456) > > (2010-05-14,james,james,123) > > (2010-05-15,notjames,notjames,234) > > > > But when I project as follows: views_with_id_projected = FOREACH > > views_with_id GENERATE date, viewed_user_id, > > user_id_or_login_lookup::actual_user_id; > > > > The result is not what I expect > > > > (2010-05-15,123,123) > > (2010-05-15,456,456) > > (2010-05-14,james,james) > > (2010-05-15,notjames,notjames) > > > > To be clear, I expect > > > > (2010-05-15,123,123) > > (2010-05-15,456,456) > > (2010-05-14,james,123) > > (2010-05-15,notjames,456) > > > > > > Can anyone give me a push in the right direction? > > > > Thanks, James > > > > Here's my full pig script: > > > > users = LOAD 'patients-test.txt' USING PigStorage(',') AS > > (user_id:chararray, login:chararray, disease_id: chararray); > > profile_views = LOAD 'patient-views-test.txt' USING PigStorage(',') > > AS(date: > > chararray, viewed_user_id:chararray); > > > > dump users; > > dump profile_views; > > > > -- build a relation so that users are present to join by login or user_id > > logins = FOREACH users GENERATE LOWER(login) as matching_id, user_id as > > actual_user_id; > > user_ids = FOREACH users GENERATE user_id as matching_id, user_id as > > actual_user_id; > > user_id_or_login_lookup = UNION logins, user_ids; > > > > dump user_id_or_login_lookup; > > describe user_id_or_login_lookup; > > > > views_with_id = JOIN profile_views by viewed_user_id, > > user_id_or_login_lookup by matching_id; > > > > describe views_with_id; > > --STORE views_with_id into 'ep-views.txt'; > > > > dump views_with_id; > > > > views_with_id_projected = FOREACH views_with_id GENERATE date, > > viewed_user_id, user_id_or_login_lookup::actual_user_id; > > > > > > dump views_with_id_projected; > > >
+
James Kebinger 2011-02-17, 17:32
-
Re: Problems with union, projection producing unexpected results
James Kebinger 2011-02-17, 19:01
https://issues.apache.org/jira/browse/PIG-1859On Thu, Feb 17, 2011 at 12:32 PM, James Kebinger <[EMAIL PROTECTED]>wrote: > Interesting, maybe I should file a bug report then? > > > On Thu, Feb 17, 2011 at 10:41 AM, Jonathan Coveney <[EMAIL PROTECTED]>wrote: > >> I am glad that you got this in a replicatable form! I have seen this error >> as well (where the output is just the last value repeated instead of the >> multiples that you want), but wasn't able to give a concrete example. >> >> 2011/2/16 James Kebinger <[EMAIL PROTECTED]> >> >> > Hello all, I've been scratching my head over a problem with a pig script >> > I'm >> > having, and hoping another set of eyeballs will help. I'm using pig 0.8, >> in >> > local mode >> > >> > Here's my simplified use case: >> > >> > I have a log file with events on pages, and the id of the event can be a >> > users login or a users numeric id: >> > >> > 2010-05-14,james >> > 2010-05-15,123 >> > 2010-05-15,23 >> > 2010-05-15,456 >> > 2010-05-15,notjames >> > >> > So i want to join a set of users on either the login or user id. >> > >> > Here's my users: >> > >> > 123,james,11 >> > 234,notjames,11 >> > 456,someoneelse,11 >> > >> > >> > So I thought I would be clever and load the user list, union it with >> itself >> > to generate a relation where each user is represented twice, once by >> login, >> > once by id: >> > >> > logins = FOREACH users GENERATE LOWER(login) as matching_id, user_id as >> > actual_user_id; >> > user_ids = FOREACH users GENERATE user_id as matching_id, user_id as >> > actual_user_id; >> > user_id_or_login_lookup = UNION logins, user_ids; >> > >> > >> > >> > user_id_or_login_lookup: {matching_id: chararray,actual_user_id: >> chararray} >> > (123,123) >> > (234,234) >> > (456,456) >> > (james,123) >> > (notjames,234) >> > (someoneelse,456) >> > >> > Then join on that, by the first column, and project that away, leaving >> just >> > the event info and the numeric id. >> > >> > views_with_id = JOIN profile_views by viewed_user_id, >> > user_id_or_login_lookup by matching_id; >> > >> > That is not working however. My joined relation looks like this (which >> is >> > what I expect) >> > >> > views_with_id: {profile_views::date: >> > chararray,profile_views::viewed_user_id: >> > chararray,user_id_or_login_lookup::matching_id: >> > chararray,user_id_or_login_lookup::actual_user_id: chararray} >> > >> > (2010-05-15,123,123,123) >> > (2010-05-15,456,456,456) >> > (2010-05-14,james,james,123) >> > (2010-05-15,notjames,notjames,234) >> > >> > But when I project as follows: views_with_id_projected = FOREACH >> > views_with_id GENERATE date, viewed_user_id, >> > user_id_or_login_lookup::actual_user_id; >> > >> > The result is not what I expect >> > >> > (2010-05-15,123,123) >> > (2010-05-15,456,456) >> > (2010-05-14,james,james) >> > (2010-05-15,notjames,notjames) >> > >> > To be clear, I expect >> > >> > (2010-05-15,123,123) >> > (2010-05-15,456,456) >> > (2010-05-14,james,123) >> > (2010-05-15,notjames,456) >> > >> > >> > Can anyone give me a push in the right direction? >> > >> > Thanks, James >> > >> > Here's my full pig script: >> > >> > users = LOAD 'patients-test.txt' USING PigStorage(',') AS >> > (user_id:chararray, login:chararray, disease_id: chararray); >> > profile_views = LOAD 'patient-views-test.txt' USING PigStorage(',') >> > AS(date: >> > chararray, viewed_user_id:chararray); >> > >> > dump users; >> > dump profile_views; >> > >> > -- build a relation so that users are present to join by login or >> user_id >> > logins = FOREACH users GENERATE LOWER(login) as matching_id, user_id as >> > actual_user_id; >> > user_ids = FOREACH users GENERATE user_id as matching_id, user_id as >> > actual_user_id; >> > user_id_or_login_lookup = UNION logins, user_ids; >> > >> > dump user_id_or_login_lookup; >> > describe user_id_or_login_lookup; >> > >> > views_with_id = JOIN profile_views by viewed_user_id, >> > user_id_or_login_lookup by matching_id; >> > >> > describe views_with_id;
+
James Kebinger 2011-02-17, 19:01
|
|