Pig >> mail # user >> None. wtf is None?


None. wtf is None?
Can someone explain this script to me? It is freaking me out. When did Pig
start spitting out 'None' in place of null?

register /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
register /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
register /me/pig/contrib/piggybank/java/piggybank.jar

define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

rmf /tmp/sent_mails
rmf /tmp/replies

/* Get rid of emails with reply_to, as they confuse everything in mailing lists. */
avro_emails = load '/me/tmp/thu_emails' using AvroStorage();
clean_emails = filter avro_emails by froms is not null and reply_tos is null;

/* Treat emails without in_reply_to as sent emails */
combined_emails = foreach clean_emails generate froms, tos, message_id;
sent_mails = foreach combined_emails generate flatten(froms.address) as from,
                                              flatten(tos.address) as to,
                                              message_id;
store sent_mails into '/tmp/sent_mails';

/* Treat in_reply_tos separately, as our FLATTEN() will filter out the nulls */
replies = filter clean_emails by in_reply_to is not null;
replies = foreach replies generate flatten(froms.address) as from,
                                   flatten(tos.address) as to,
                                   in_reply_to;
store replies into '/tmp/replies';
Despite filtering replies down to emails that have the 'in_reply_to'
field set, I get the same number of records in both relations I store:

russell-jurneys-macbook-pro:pig rjurney$ cat /tmp/sent_mails/p*|wc -l
   17431
russell-jurneys-macbook-pro:pig rjurney$ cat /tmp/replies/p*|wc -l
   17431
Investigating shows me:

cat /tmp/replies/part-00001

[EMAIL PROTECTED] [EMAIL PROTECTED] None
[EMAIL PROTECTED] [EMAIL PROTECTED] <[EMAIL PROTECTED]
[EMAIL PROTECTED] [EMAIL PROTECTED] None
Where did *None* come from? I thought FLATTEN would prune records with
empty columns, and I'm OK with it not doing so, but what operators does
None respond to? It is not null. How do I prune these?
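
One possible way to prune them, sketched under an assumption that is not confirmed anywhere above: that in_reply_to is a chararray holding the literal four-character string 'None' rather than a true null. That would explain why "is not null" lets those records through. The describe statement is there only to check the field's type first; clean_emails is the relation from the script above.

```pig
/* Check how AvroStorage typed the field before relying on a string comparison. */
describe clean_emails;

/* Assumption: the unwanted values are the literal string 'None', not null.
   Keep only records whose in_reply_to is neither null nor that string. */
replies = filter clean_emails by in_reply_to is not null
                               and in_reply_to != 'None';
```

If describe shows in_reply_to as something other than chararray, the comparison against 'None' would not apply as written.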
--
Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
Alan Gates 2012-07-24, 20:43
Robert Yerex 2012-07-24, 13:50
Russell Jurney 2012-07-24, 14:30
Robert Yerex 2012-07-24, 16:51