Re: None. wtf is None?
Robert Yerex 2012-07-24, 16:51
What's in part-r-00000?
On Tue, Jul 24, 2012 at 9:30 AM, Russell Jurney <[EMAIL PROTECTED]> wrote:

> No. No python UDF.
>
> Russell Jurney http://datasyndrome.com
>
> On Jul 24, 2012, at 6:50 AM, Robert Yerex <[EMAIL PROTECTED]> wrote:
>
> > Python UDF? That would explain the None instead of null
> >
> > On Tue, Jul 24, 2012 at 12:49 AM, Russell Jurney <[EMAIL PROTECTED]> wrote:
> >
> >> Can someone explain this script to me? It is freaking me out. When did Pig
> >> start spitting out 'None' in place of null?
> >>
> >> register /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
> >> register /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
> >> register /me/pig/contrib/piggybank/java/piggybank.jar
> >>
> >> define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
> >>
> >> rmf /tmp/sent_mails
> >> rmf /tmp/replies
> >>
> >> /* Get rid of emails with reply_to, as they confuse everything in mailing
> >> lists. */
> >> avro_emails = load '/me/tmp/thu_emails' using AvroStorage();
> >> clean_emails = filter avro_emails by froms is not null and reply_tos is null;
> >>
> >> /* Treat emails without in_reply_to as sent emails */
> >> combined_emails = foreach clean_emails generate froms, tos, message_id;
> >> sent_mails = foreach combined_emails generate flatten(froms.address) as from,
> >>                                                flatten(tos.address) as to,
> >>                                                message_id;
> >> store sent_mails into '/tmp/sent_mails';
> >>
> >> /* Treat in_reply_tos separately, as our FLATTEN() will filter out the
> >> nulls */
> >> replies = filter clean_emails by in_reply_to is not null;
> >> replies = foreach replies generate flatten(froms.address) as from,
> >>                                    flatten(tos.address) as to,
> >>                                    in_reply_to;
> >> store replies into '/tmp/replies';
> >>
> >> Despite filtering replies to emails that only have the 'in_reply_to'
> >> field... I get the same number of records in both relations I store:
> >>
> >> russell-jurneys-macbook-pro:pig rjurney$ cat /tmp/sent_mails/p*|wc -l
> >> 17431
> >> russell-jurneys-macbook-pro:pig rjurney$ cat /tmp/replies/p*|wc -l
> >> 17431
> >>
> >> Investigating shows me:
> >>
> >> cat /tmp/replies/part-00001
> >>
> >> [EMAIL PROTECTED] [EMAIL PROTECTED] None
> >> [EMAIL PROTECTED] [EMAIL PROTECTED] <[EMAIL PROTECTED]
> >> [EMAIL PROTECTED] [EMAIL PROTECTED] None
> >>
> >> Where did *None* come from? I thought FLATTEN would prune records with
> >> empty columns, and I'm ok with it not but... what operators does None
> >> respond to? It is not null. How do I prune these?
> >> --
> >> Russell Jurney twitter.com/rjurney [EMAIL PROTECTED]
> >> datasyndrome.com
> >
> > --
> > Robert Yerex
> > Data Scientist
> > Civitas Learning
> > www.civitaslearning.com

--
Robert Yerex
Data Scientist
Civitas Learning
www.civitaslearning.com