None. wtf is None?
Russell Jurney 2012-07-24, 05:49
Can someone explain this script to me? It is freaking me out. When did Pig
start spitting out 'None' in place of null?

    register /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
    register /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
    register /me/pig/contrib/piggybank/java/piggybank.jar

    define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

    rmf /tmp/sent_mails
    rmf /tmp/replies

    /* Get rid of emails with reply_to, as they confuse everything in mailing lists. */
    avro_emails = load '/me/tmp/thu_emails' using AvroStorage();
    clean_emails = filter avro_emails by froms is not null and reply_tos is null;

    /* Treat emails without in_reply_to as sent emails */
    combined_emails = foreach clean_emails generate froms, tos, message_id;
    sent_mails = foreach combined_emails generate flatten(froms.address) as from,
                                                  flatten(tos.address) as to,
                                                  message_id;
    store sent_mails into '/tmp/sent_mails';

    /* Treat in_reply_tos separately, as our FLATTEN() will filter out the nulls */
    replies = filter clean_emails by in_reply_to is not null;
    replies = foreach replies generate flatten(froms.address) as from,
                                       flatten(tos.address) as to,
                                       in_reply_to;
    store replies into '/tmp/replies';

Despite filtering replies down to only the emails that have the 'in_reply_to'
field, I get the same number of records in both relations I store:

    russell-jurneys-macbook-pro:pig rjurney$ cat /tmp/sent_mails/p*|wc -l
    17431
    russell-jurneys-macbook-pro:pig rjurney$ cat /tmp/replies/p*|wc -l
    17431

Investigating shows me:

    cat /tmp/replies/part-00001
    [EMAIL PROTECTED]    [EMAIL PROTECTED]    None
    [EMAIL PROTECTED]    [EMAIL PROTECTED]    <[EMAIL PROTECTED]
    [EMAIL PROTECTED]    [EMAIL PROTECTED]    None

Where did *None* come from? I thought FLATTEN would prune records with empty
columns, and I'm OK with it not doing so, but what operators does None respond
to? It is not null. How do I prune these?

--
Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
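
One possibility the output suggests, though nothing in the post confirms it, is that 'None' is a literal string stored in the Avro data rather than a value Pig produced from a null. A literal string would pass the `is not null` filter, which would explain the identical record counts. The sketch below is a way to test that guess, assuming `in_reply_to` loads as a chararray. The aliases `none_rows` and `none_sample` are hypothetical names introduced only for illustration.

```pig
/* Assumption: in_reply_to is a chararray and some rows hold the text 'None'. */

/* Check the schema AvroStorage reports for the loaded relation. */
describe clean_emails;

/* Look at a few rows whose in_reply_to equals the literal string 'None'. */
none_rows = filter clean_emails by in_reply_to == 'None';
none_sample = limit none_rows 5;
dump none_sample;

/* If such rows exist, drop them together with the true nulls. */
replies = filter clean_emails by in_reply_to is not null and in_reply_to != 'None';
```

If `none_rows` comes back empty, the guess is wrong and the value is something other than a plain string, so the `describe` output is the next place to look.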