Re: None. wtf is None?
Russell Jurney 2012-07-24, 14:30
No. No python UDF.
Russell Jurney http://datasyndrome.com

On Jul 24, 2012, at 6:50 AM, Robert Yerex <[EMAIL PROTECTED]> wrote:

> Python UDF? That would explain the None instead of null
>
> On Tue, Jul 24, 2012 at 12:49 AM, Russell Jurney <[EMAIL PROTECTED]> wrote:
>
>> Can someone explain this script to me? It is freaking me out. When did Pig
>> start spitting out 'None' in place of null?
>>
>> register /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
>> register /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
>> register /me/pig/contrib/piggybank/java/piggybank.jar
>>
>> define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>>
>> rmf /tmp/sent_mails
>> rmf /tmp/replies
>>
>> /* Get rid of emails with reply_to, as they confuse everything in mailing
>> lists. */
>> avro_emails = load '/me/tmp/thu_emails' using AvroStorage();
>> clean_emails = filter avro_emails by froms is not null and reply_tos is null;
>>
>> /* Treat emails without in_reply_to as sent emails */
>> combined_emails = foreach clean_emails generate froms, tos, message_id;
>> sent_mails = foreach combined_emails generate flatten(froms.address) as from,
>>                                               flatten(tos.address) as to,
>>                                               message_id;
>> store sent_mails into '/tmp/sent_mails';
>>
>> /* Treat in_reply_tos separately, as our FLATTEN() will filter out the
>> nulls */
>> replies = filter clean_emails by in_reply_to is not null;
>> replies = foreach replies generate flatten(froms.address) as from,
>>                                    flatten(tos.address) as to,
>>                                    in_reply_to;
>> store replies into '/tmp/replies';
>>
>>
>> Despite filtering replies to emails that only have the 'in_reply_to'
>> field... I get the same number of records in both relations I store:
>>
>> russell-jurneys-macbook-pro:pig rjurney$ cat /tmp/sent_mails/p*|wc -l
>> 17431
>> russell-jurneys-macbook-pro:pig rjurney$ cat /tmp/replies/p*|wc -l
>> 17431
>>
>>
>> Investigating shows me:
>>
>> cat /tmp/replies/part-00001
>>
>> [EMAIL PROTECTED] [EMAIL PROTECTED] None
>> [EMAIL PROTECTED] [EMAIL PROTECTED] <[EMAIL PROTECTED]
>> [EMAIL PROTECTED] [EMAIL PROTECTED] None
>>
>>
>> Where did *None* come from? I thought FLATTEN would prune records with
>> empty columns, and I'm ok with it not but... what operators does None
>> respond to? It is not null. How do I prune these?
>> --
>> Russell Jurney twitter.com/rjurney [EMAIL PROTECTED]
>> datasyndrome.com
>
> --
> Robert Yerex
> Data Scientist
> Civitas Learning
> www.civitaslearning.com