Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig >> mail # user >> Problems with union, projection producing unexpected results


Copy link to this message
-
Re: Problems with union, projection producing unexpected results
Interesting, maybe I should file a bug report then?

On Thu, Feb 17, 2011 at 10:41 AM, Jonathan Coveney <[EMAIL PROTECTED]>wrote:

> I am glad that you got this in a replicatable form! I have seen this error
> as well (where the output is just the last value repeated instead of the
> multiples that you want), but wasn't able to give a concrete example.
>
> 2011/2/16 James Kebinger <[EMAIL PROTECTED]>
>
> > Hello all, I've been scratching my head over a problem with a pig script
> > I'm
> > having, and hoping another set of eyeballs will help. I'm using pig 0.8,
> in
> > local mode
> >
> > Here's my simplified use case:
> >
> > I have a log file with events on pages, and the id of the event can be a
> > users login or a users numeric id:
> >
> > 2010-05-14,james
> > 2010-05-15,123
> > 2010-05-15,23
> > 2010-05-15,456
> > 2010-05-15,notjames
> >
> > So i want to join a set of users on either the login or user id.
> >
> > Here's my users:
> >
> > 123,james,11
> > 234,notjames,11
> > 456,someoneelse,11
> >
> >
> > So I thought I would be clever and load the user list, union it with
> itself
> > to generate a relation where each user is represented twice, once by
> login,
> > once by id:
> >
> > logins = FOREACH users GENERATE LOWER(login) as matching_id, user_id as
> > actual_user_id;
> > user_ids = FOREACH users GENERATE user_id as matching_id, user_id as
> > actual_user_id;
> > user_id_or_login_lookup = UNION logins, user_ids;
> >
> >
> >
> > user_id_or_login_lookup: {matching_id: chararray,actual_user_id:
> chararray}
> > (123,123)
> > (234,234)
> > (456,456)
> > (james,123)
> > (notjames,234)
> > (someoneelse,456)
> >
> > Then join on that, by the first column, and project that away, leaving
> just
> > the event info and the numeric id.
> >
> > views_with_id = JOIN profile_views by viewed_user_id,
> > user_id_or_login_lookup by matching_id;
> >
> > That is not working however. My joined relation looks like this (which is
> > what I expect)
> >
> > views_with_id: {profile_views::date:
> > chararray,profile_views::viewed_user_id:
> > chararray,user_id_or_login_lookup::matching_id:
> > chararray,user_id_or_login_lookup::actual_user_id: chararray}
> >
> > (2010-05-15,123,123,123)
> > (2010-05-15,456,456,456)
> > (2010-05-14,james,james,123)
> > (2010-05-15,notjames,notjames,234)
> >
> > But when I project as follows: views_with_id_projected = FOREACH
> > views_with_id GENERATE date, viewed_user_id,
> > user_id_or_login_lookup::actual_user_id;
> >
> > The result is not what I expect
> >
> > (2010-05-15,123,123)
> > (2010-05-15,456,456)
> > (2010-05-14,james,james)
> > (2010-05-15,notjames,notjames)
> >
> > To be clear, I expect
> >
> > (2010-05-15,123,123)
> > (2010-05-15,456,456)
> > (2010-05-14,james,123)
> > (2010-05-15,notjames,456)
> >
> >
> > Can anyone give me a push in the right direction?
> >
> > Thanks, James
> >
> > Here's my full pig script:
> >
> > users = LOAD 'patients-test.txt' USING PigStorage(',') AS
> > (user_id:chararray, login:chararray, disease_id: chararray);
> > profile_views = LOAD 'patient-views-test.txt' USING PigStorage(',')
> > AS(date:
> > chararray, viewed_user_id:chararray);
> >
> > dump users;
> > dump profile_views;
> >
> > -- build a relation so that users are present to join by login or user_id
> > logins = FOREACH users GENERATE LOWER(login) as matching_id, user_id as
> > actual_user_id;
> > user_ids = FOREACH users GENERATE user_id as matching_id, user_id as
> > actual_user_id;
> > user_id_or_login_lookup = UNION logins, user_ids;
> >
> > dump user_id_or_login_lookup;
> > describe user_id_or_login_lookup;
> >
> > views_with_id = JOIN profile_views by viewed_user_id,
> > user_id_or_login_lookup by matching_id;
> >
> > describe views_with_id;
> > --STORE views_with_id into 'ep-views.txt';
> >
> > dump views_with_id;
> >
> > views_with_id_projected = FOREACH views_with_id GENERATE date,
> > viewed_user_id, user_id_or_login_lookup::actual_user_id;
> >
> >
> > dump views_with_id_projected;
> >
>
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB