Pig, mail # user - None. wtf is None?


Re: None. wtf is None?
Robert Yerex 2012-07-24, 16:51
What's in part-r-00000?
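
A quick way to look, sketched in Python. This assumes the store used the default tab-delimited PigStorage, which as far as I know writes a null as an empty field rather than as any word. If that holds, a visible None in the third column would be a literal string in the data, not a null. The function name classify_column and the column index are my own choices for illustration.

```python
import glob
from collections import Counter


def classify_column(lines, index=2, delimiter="\t"):
    """Tally one column of delimited rows as empty, literal 'None', or other.

    Assumption: rows are tab-delimited text as written by the default
    PigStorage, where a null field shows up as an empty string.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        # A missing trailing column is treated the same as an empty one.
        value = fields[index] if index < len(fields) else ""
        if value == "":
            counts["empty"] += 1
        elif value == "None":
            counts["literal_none"] += 1
        else:
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    totals = Counter()
    # Path taken from the script in this thread; adjust as needed.
    for path in sorted(glob.glob("/tmp/replies/part-*")):
        with open(path, encoding="utf-8") as handle:
            totals.update(classify_column(handle))
    print(dict(totals))
```

If literal_none comes back non-zero, the field is probably a chararray holding the text None, so "is not null" would pass it. In that case something like: filter clean_emails by in_reply_to != 'None' might be what prunes those rows, though that is a guess until the input data is checked.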

On Tue, Jul 24, 2012 at 9:30 AM, Russell Jurney <[EMAIL PROTECTED]> wrote:

> No. No python UDF.
>
> Russell Jurney http://datasyndrome.com
>
> On Jul 24, 2012, at 6:50 AM, Robert Yerex
> <[EMAIL PROTECTED]> wrote:
>
> > Python UDF? That would explain the None instead of null
> >
> > On Tue, Jul 24, 2012 at 12:49 AM, Russell Jurney
> > <[EMAIL PROTECTED]> wrote:
> >
> >> Can someone explain this script to me? It is freaking me out. When did Pig
> >> start spitting out 'None' in place of null?
> >>
> >> register /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
> >> register /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
> >> register /me/pig/contrib/piggybank/java/piggybank.jar
> >>
> >> define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
> >>
> >> rmf /tmp/sent_mails
> >> rmf /tmp/replies
> >>
> >> /* Get rid of emails with reply_to, as they confuse everything in mailing
> >> lists. */
> >> avro_emails = load '/me/tmp/thu_emails' using AvroStorage();
> >> clean_emails = filter avro_emails by froms is not null and reply_tos is
> >> null;
> >>
> >> /* Treat emails without in_reply_to as sent emails */
> >> combined_emails = foreach clean_emails generate froms, tos, message_id;
> >> sent_mails = foreach combined_emails generate flatten(froms.address) as from,
> >>                                               flatten(tos.address) as to,
> >>                                               message_id;
> >> store sent_mails into '/tmp/sent_mails';
> >>
> >> /* Treat in_reply_tos separately, as our FLATTEN() will filter out the
> >> nulls */
> >> replies = filter clean_emails by in_reply_to is not null;
> >> replies = foreach replies generate flatten(froms.address) as from,
> >>                                    flatten(tos.address) as to,
> >>                                    in_reply_to;
> >> store replies into '/tmp/replies';
> >>
> >>
> >> Despite filtering replies down to only the emails that have the 'in_reply_to'
> >> field, I get the same number of records in both relations I store:
> >>
> >> russell-jurneys-macbook-pro:pig rjurney$ cat /tmp/sent_mails/p*|wc -l
> >>   17431
> >> russell-jurneys-macbook-pro:pig rjurney$ cat /tmp/replies/p*|wc -l
> >>   17431
> >>
> >>
> >> Investigating shows me:
> >>
> >> cat /tmp/replies/part-00001
> >>
> >> [EMAIL PROTECTED] [EMAIL PROTECTED] None
> >> [EMAIL PROTECTED] [EMAIL PROTECTED] <[EMAIL PROTECTED]
> >> [EMAIL PROTECTED] [EMAIL PROTECTED] None
> >>
> >>
> >> Where did *None* come from? I thought FLATTEN would prune records with
> >> empty columns, and I'm OK with it not doing so, but... what operators does
> >> None respond to? It is not null. How do I prune these?
> >> --
> >> Russell Jurney twitter.com/rjurney [EMAIL PROTECTED]
> >> datasyndrome.com
> >>
> >
> >
> >
> > --
> > Robert Yerex
> > Data Scientist
> > Civitas Learning
> > www.civitaslearning.com
>

--
Robert Yerex
Data Scientist
Civitas Learning
www.civitaslearning.com