Pig >> mail # user >> None. wtf is None?


Re: None. wtf is None?
No. No Python UDF.

Russell Jurney http://datasyndrome.com

On Jul 24, 2012, at 6:50 AM, Robert Yerex
<[EMAIL PROTECTED]> wrote:

> Python UDF? That would explain the None instead of null
>
> On Tue, Jul 24, 2012 at 12:49 AM, Russell Jurney
> <[EMAIL PROTECTED]> wrote:
>
>> Can someone explain this script to me? It is freaking me out. When did Pig
>> start spitting out 'None' in place of null?
>>
>> register /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
>> register /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
>> register /me/pig/contrib/piggybank/java/piggybank.jar
>>
>> define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
>>
>> rmf /tmp/sent_mails
>> rmf /tmp/replies
>>
>> /* Get rid of emails with reply_to, as they confuse everything in mailing
>> lists. */
>> avro_emails = load '/me/tmp/thu_emails' using AvroStorage();
>> clean_emails = filter avro_emails by froms is not null and reply_tos is
>> null;
>>
>> /* Treat emails without in_reply_to as sent emails */
>> combined_emails = foreach clean_emails generate froms, tos, message_id;
>> sent_mails = foreach combined_emails generate flatten(froms.address) as from,
>>                                              flatten(tos.address) as to,
>>                                              message_id;
>> store sent_mails into '/tmp/sent_mails';
>>
>> /* Treat in_reply_tos separately, as our FLATTEN() will filter out the
>> nulls */
>> replies = filter clean_emails by in_reply_to is not null;
>> replies = foreach replies generate flatten(froms.address) as from,
>>                                    flatten(tos.address) as to,
>>                                    in_reply_to;
>> store replies into '/tmp/replies';
>>
>>
>> Despite filtering replies down to only those emails that have the
>> 'in_reply_to' field, I get the same number of records in both relations
>> I store:
>>
>> russell-jurneys-macbook-pro:pig rjurney$ cat /tmp/sent_mails/p*|wc -l
>>   17431
>> russell-jurneys-macbook-pro:pig rjurney$ cat /tmp/replies/p*|wc -l
>>   17431
>>
>>
>> Investigating shows me:
>>
>> cat /tmp/replies/part-00001
>>
>> [EMAIL PROTECTED] [EMAIL PROTECTED] None
>> [EMAIL PROTECTED] [EMAIL PROTECTED] <[EMAIL PROTECTED]
>> [EMAIL PROTECTED] [EMAIL PROTECTED] None
>>
>>
>> Where did *None* come from? I thought FLATTEN would prune records with
>> empty columns, and I'm OK with it not doing so, but what operators does
>> None respond to? It is not null. How do I prune these?
>> --
>> Russell Jurney twitter.com/rjurney [EMAIL PROTECTED]
>> datasyndrome.com
>>
>
>
>
> --
> Robert Yerex
> Data Scientist
> Civitas Learning
> www.civitaslearning.com
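
One possible way to prune those rows, as a rough sketch only. It assumes the in_reply_to field is a chararray that already holds the literal text 'None' in the Avro data (for example, because whatever wrote the file stringified a missing value), which nothing in this thread confirms. If that is the case, "is not null" alone lets those rows through, and an ordinary string comparison should catch them. The fragment reuses the clean_emails relation from the quoted script:

```pig
/* Assumption: missing in_reply_to values arrive as the string 'None',
   not as a real null, so both conditions are checked. */
replies = filter clean_emails by in_reply_to is not null
                             and in_reply_to != 'None';
replies = foreach replies generate flatten(froms.address) as from,
                                   flatten(tos.address) as to,
                                   in_reply_to;
store replies into '/tmp/replies';
```

If the record count in /tmp/replies drops after this change, the 'None' values were strings in the source data rather than something Pig produced from nulls.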