Pig >> mail # user >> None. wtf is None?


Russell Jurney 2012-07-24, 05:49
Re: None. wtf is None?
Can you attach a sample of the input data?  I'm guessing None came from the input data.  

Alan.

On Jul 23, 2012, at 10:49 PM, Russell Jurney wrote:

> Can someone explain this script to me? It is freaking me out. When did Pig
> start spitting out 'None' in place of null?
>
> register /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
> register /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
> register /me/pig/contrib/piggybank/java/piggybank.jar
>
> define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>
> rmf /tmp/sent_mails
> rmf /tmp/replies
>
> /* Get rid of emails with reply_to, as they confuse everything in mailing
> lists. */
> avro_emails = load '/me/tmp/thu_emails' using AvroStorage();
> clean_emails = filter avro_emails by froms is not null and reply_tos is
> null;
>
> /* Treat emails without in_reply_to as sent emails */
> combined_emails = foreach clean_emails generate froms, tos, message_id;
> sent_mails = foreach combined_emails generate flatten(froms.address) as from,
>                                               flatten(tos.address) as to,
>                                               message_id;
> store sent_mails into '/tmp/sent_mails';
>
> /* Treat in_reply_tos separately, as our FLATTEN() will filter out the
> nulls */
> replies = filter clean_emails by in_reply_to is not null;
> replies = foreach replies generate flatten(froms.address) as from,
>                                    flatten(tos.address) as to,
>                                    in_reply_to;
> store replies into '/tmp/replies';
>
>
> Despite filtering replies down to only the emails that have the
> 'in_reply_to' field set, I get the same number of records in both
> relations I store:
>
> russell-jurneys-macbook-pro:pig rjurney$ cat /tmp/sent_mails/p*|wc -l
>   17431
> russell-jurneys-macbook-pro:pig rjurney$ cat /tmp/replies/p*|wc -l
>   17431
>
>
> Investigating shows me:
>
> cat /tmp/replies/part-00001
>
> [EMAIL PROTECTED] [EMAIL PROTECTED] None
> [EMAIL PROTECTED] [EMAIL PROTECTED]
> <[EMAIL PROTECTED]
> [EMAIL PROTECTED] [EMAIL PROTECTED] None
>
>
> Where did *None* come from? I thought FLATTEN would prune records with
> empty columns, and I'm OK with it not doing so, but what operators does
> None respond to? It is not null. How do I prune these?
> --
> Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
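
If None does turn out to be a literal string in the input data, as guessed above, one way to prune those rows might look like the following. This is only a sketch: it assumes in_reply_to loads as a chararray and that the stray value is exactly the string 'None'. Neither assumption is confirmed in this thread.

```pig
/* Assumption: in_reply_to is a chararray and the unwanted value is the
   literal string 'None' rather than a real null. */
replies = filter clean_emails by (in_reply_to is not null) and (in_reply_to != 'None');
```

Checking a sample of the raw input first would show whether the value really is a string or something else.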