Pig >> mail # user >> problem filtering null values with pig


Re: problem filtering null values with pig
Just for the record, I'm posting the solution to my problem here.

Thank you for your help.

In the end the problem seems to be with the JsonLoader I was using. I don't
know exactly why, but it seems to have a bug with my strings.

I finally changed my code to use https://github.com/kevinweil/elephant-bird.

the code now looks like this:

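    -- Register elephant-bird and its dependencies.
    register 'elephant-bird-core-3.0.0.jar'
    register 'elephant-bird-pig-3.0.0.jar'
    register 'google-collections-1.0.jar'
    register 'json-simple-1.1.jar'

    -- Each input line is loaded as a single map field ($0).
    json_lines = LOAD '/twitterecho/tweets/stream/v1/json/2012_10_10/08'
        USING com.twitter.elephantbird.pig.load.JsonLoader();

    -- Project the two fields of interest and cast them to chararray.
    geo_tweets = FOREACH json_lines GENERATE
        (CHARARRAY) $0#'id' AS id,
        (CHARARRAY) $0#'geoLocation' AS geoLocation;

    -- Keep one tweet per id.
    tweets_grp = GROUP geo_tweets BY id;
    unique_tweets = FOREACH tweets_grp {
          -- The bag inside each group is named after the grouped relation.
          first_tweet = LIMIT geo_tweets 1;
          GENERATE FLATTEN(first_tweet);
    };

    -- Note: as posted, this filter reads geo_tweets, so unique_tweets is not used below.
    only_not_nulls = FILTER geo_tweets BY geoLocation is not null;
    store only_not_nulls into '/twitter_data/results/geo_tweets';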
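
For readers without a Pig setup, the same three steps (project, keep one row per id, drop null geoLocation) can be modeled on a few in-memory lines. This is only a rough Python sketch of the logic, not the Pig script itself: the function names and sample lines are made up for illustration, the string form of the map is an assumption (Pig's own cast may print it differently), and it keeps the first row seen per id, whereas Pig's nested LIMIT does not promise which row it keeps. Unlike the script as posted, it applies the filter to the deduplicated rows.

```python
import json


def load_geo_tweets(json_lines):
    """Project each JSON line to an (id, geoLocation) pair of strings or None."""
    for line in json_lines:
        record = json.loads(line)
        tweet_id = record.get("id")
        geo = record.get("geoLocation")
        yield (
            None if tweet_id is None else str(tweet_id),
            # Assumption: serialize the map as JSON text to stand in for the cast.
            None if geo is None else json.dumps(geo, sort_keys=True),
        )


def unique_by_id(tweets):
    """Keep the first pair seen for each id (models GROUP BY id + LIMIT 1)."""
    seen = {}
    for tweet_id, geo in tweets:
        seen.setdefault(tweet_id, (tweet_id, geo))
    return list(seen.values())


def only_not_nulls(tweets):
    """Drop pairs whose geoLocation is missing or null."""
    return [tweet for tweet in tweets if tweet[1] is not None]


# Hypothetical sample input: one duplicated tweet, one null and one missing geoLocation.
sample_lines = [
    '{"id": 1, "geoLocation": {"longitude": 70.95853, "latitude": 30.9773}}',
    '{"id": 1, "geoLocation": {"longitude": 70.95853, "latitude": 30.9773}}',
    '{"id": 2, "geoLocation": null}',
    '{"id": 3}',
]

result = only_not_nulls(unique_by_id(load_geo_tweets(sample_lines)))
print(result)
```

Only the tweet with id 1 survives here, once, because the other two rows have no geoLocation.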
cheers
thanks again for your support
Arian P

2012/11/1 Arian Pasquali <[EMAIL PROTECTED]>

> You are right Cheolsoo,
> Indeed, it doesn't make any sense to write a UDF to compare datatypes. I
> know it's possible, but it doesn't sound like the right way.
> Maybe it is a bug in the JsonLoader I'm using
> https://github.com/mmay/PigJsonLoader/blob/master/JsonLoader.java
>
> I will share the script and the data with you shortly.
>
> Thanks for the hints.
>
> Arian Rodrigo Pasquali
> FEUP, SAPO Labs
> http://www.arianpasquali.com
> twitter @arianpasquali
>
>
>
> 2012/10/31 Cheolsoo Park <[EMAIL PROTECTED]>
>
>> Hi,
>>
>> > What would be the best way to filter only the valid rows, since some of
>> > them are strings and others maps?
>>
>> This shouldn't happen. The data type is defined per column, so it should be
>> either string or map for all rows. If that's not the case, it should be a
>> bug.
>>
>> > Can I create an expression to compare datatypes? Is it possible?
>>
>> Technically, you should be able to write a UDF that checks type. But I am
>> more interested in knowing why you're running into this problem. Can you
>> please share your script and sample data? I'd like to reproduce it.
>>
>> Thanks,
>> Cheolsoo
>>
>> On Wed, Oct 31, 2012 at 2:54 PM, Arian Pasquali <[EMAIL PROTECTED]> wrote:
>>
>> > Can I create an expression to compare datatypes?
>> > Is it possible?
>> >
>> > ArianP
>> >
>> > 2012/10/31 Arian Pasquali <[EMAIL PROTECTED]>
>> >
>> > > You are right, it doesn't seem like a null value.
>> > > It looks like a chararray. But the expression causes an error when comparing
>> > > a string with ([longitude#-9.15199849,latitude#38.71179122])
>> > >
>> > > geoinfo_no_nulls = FILTER geoinfo BY $0!='null'
>> > >
>> > > I get
>> > > ERROR 2997: Unable to recreate exception from backed error:
>> > > org.apache.pig.backend.executionengine.ExecException: ERROR 1071: Cannot
>> > > convert a map to a String
>> > >
>> > > What would be the best way to filter only the valid rows, since some of them
>> > > are strings and others maps?
>> > >
>> > > Arian
>> > >
>> > >
>> > >
>> > > 2012/10/31 Cheolsoo Park <[EMAIL PROTECTED]>
>> > >
>> > >> Hi,
>> > >>
>> > >> I am not sure what's the problem because I can't reproduce it. To me, null
>> > >> values are printed as an empty "( )" not "(null)", so it doesn't seem like
>> > >> null.
>> > >>
>> > >> I am wondering whether OpenJDK is the problem. Can you try Oracle HotSpot
>> > >> JDK 1.6 and see if that fixes it?
>> > >>
>> > >> Thanks,
>> > >> Cheolsoo
>> > >>
>> > >> On Wed, Oct 31, 2012 at 1:06 PM, Arian Pasquali <[EMAIL PROTECTED]> wrote:
>> > >>
>> > >> > Hey people,
>> > >> > I'm having some trouble with a silly task: I can't find a way to filter
>> > >> > null values from my rows. This is the result when I dump the object
>> > >> > geoinfo:
>> > >> >
>> > >> > DUMP geoinfo;
>> > >> > ([longitude#70.95853,latitude#30.9773])