Pig >> mail # user >> None. wtf is None?


Russell Jurney 2012-07-24, 05:49
Re: None. wtf is None?
Can you attach a sample of the input data? I'm guessing None came from the input data.

Alan.

On Jul 23, 2012, at 10:49 PM, Russell Jurney wrote:

> Can someone explain this script to me? It is freaking me out. When did Pig
> start spitting out 'None' in place of null?
>
> register /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
> register /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
> register /me/pig/contrib/piggybank/java/piggybank.jar
>
> define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>
> rmf /tmp/sent_mails
> rmf /tmp/replies
>
> /* Get rid of emails with reply_to, as they confuse everything in mailing
> lists. */
> avro_emails = load '/me/tmp/thu_emails' using AvroStorage();
> clean_emails = filter avro_emails by froms is not null and reply_tos is
> null;
>
> /* Treat emails without in_reply_to as sent emails */
> combined_emails = foreach clean_emails generate froms, tos, message_id;
> sent_mails = foreach combined_emails generate flatten(froms.address) as from,
>                                               flatten(tos.address) as to,
>                                               message_id;
> store sent_mails into '/tmp/sent_mails';
>
> /* Treat in_reply_tos separately, as our FLATTEN() will filter out the
> nulls */
> replies = filter clean_emails by in_reply_to is not null;
> replies = foreach replies generate flatten(froms.address) as from,
>                                    flatten(tos.address) as to,
>                                    in_reply_to;
> store replies into '/tmp/replies';
>
>
> Despite filtering replies down to only the emails that have an
> 'in_reply_to' field, I get the same number of records in both relations
> I store:
>
> russell-jurneys-macbook-pro:pig rjurney$ cat /tmp/sent_mails/p*|wc -l
>   17431
> russell-jurneys-macbook-pro:pig rjurney$ cat /tmp/replies/p*|wc -l
>   17431
>
>
> Investigating shows me:
>
> cat /tmp/replies/part-00001
>
> [EMAIL PROTECTED] [EMAIL PROTECTED] None
> [EMAIL PROTECTED] [EMAIL PROTECTED] <[EMAIL PROTECTED]
> [EMAIL PROTECTED] [EMAIL PROTECTED] None
>
>
> Where did *None* come from? I thought FLATTEN would prune records with
> empty columns, and I'm OK if it doesn't, but... what operators does None
> respond to? It is not null. How do I prune these?
> --
> Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
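
If the guess above is right and the input data holds the literal string 'None' rather than a real null, then `is not null` would let those rows through, which would explain the identical record counts. One way to prune them might be to compare against the string explicitly. This is only a sketch: it assumes in_reply_to is a chararray, reuses the clean_emails relation from the quoted script, and replies_only is a hypothetical name.

```pig
/* Sketch only: assumes in_reply_to is a chararray and that missing values
   arrive as the literal string 'None' instead of a Pig null. */
replies_only = filter clean_emails by in_reply_to is not null
                                   and in_reply_to != 'None';
```

Checking a sample of the input, as requested above, would confirm whether 'None' is really stored as a string before relying on this filter.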
Robert Yerex 2012-07-24, 13:50
Russell Jurney 2012-07-24, 14:30
Robert Yerex 2012-07-24, 16:51