Pig >> mail # user >> problem filtering null values with pig


Re: problem filtering null values with pig
Just for the record, I'm posting the solution to my problem here.

Thank you for your help.

In the end, the problem seems to be with the JsonLoader I was using. I don't
know why exactly, but it seems to have a bug with my strings.

I finally changed my code to use https://github.com/kevinweil/elephant-bird.

The code now looks like this:

    register 'elephant-bird-core-3.0.0.jar'
    register 'elephant-bird-pig-3.0.0.jar'
    register 'google-collections-1.0.jar'
    register 'json-simple-1.1.jar'

    json_lines = LOAD '/twitterecho/tweets/stream/v1/json/2012_10_10/08'
        USING com.twitter.elephantbird.pig.load.JsonLoader();

    geo_tweets = FOREACH json_lines GENERATE
        (CHARARRAY) $0#'id' AS id,
        (CHARARRAY) $0#'geoLocation' AS geoLocation;

    tweets_grp = GROUP geo_tweets BY id;
    unique_tweets = FOREACH tweets_grp {
          first_tweet = LIMIT geo_tweets 1;
          GENERATE FLATTEN(first_tweet);
    };

    only_not_nulls = FILTER geo_tweets BY geoLocation is not null;
    store only_not_nulls into '/twitter_data/results/geo_tweets';
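
Note that the script above filters geo_tweets, so the deduplication done in unique_tweets does not reach the stored output. A minimal sketch of a variant that filters the deduplicated relation instead, assuming that was the intent. It reuses the relation names and output path from the script above; the AS clause on FLATTEN is an addition for illustration, and inside the nested block the bag is referred to by its relation name, geo_tweets:

```pig
-- Sketch only: assumes geo_tweets (id, geoLocation) was built as in the script above.
tweets_grp = GROUP geo_tweets BY id;

-- Keep one tweet per id; the grouped bag carries the name of the input relation.
unique_tweets = FOREACH tweets_grp {
    first_tweet = LIMIT geo_tweets 1;
    GENERATE FLATTEN(first_tweet) AS (id:chararray, geoLocation:chararray);
};

-- Filter the deduplicated relation rather than the raw one.
only_not_nulls = FILTER unique_tweets BY geoLocation IS NOT NULL;
STORE only_not_nulls INTO '/twitter_data/results/geo_tweets';
```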

Cheers, and thanks again for your support.
Arian P

2012/11/1 Arian Pasquali <[EMAIL PROTECTED]>

> You are right Cheolsoo,
> Indeed, it doesn't make any sense to write a UDF to compare datatypes. I
> know it's possible, but it doesn't sound like the right way.
> Maybe it's a bug in the JsonLoader I'm using:
> https://github.com/mmay/PigJsonLoader/blob/master/JsonLoader.java
>
> I will share the script and the data with you shortly.
>
> Thanks for the hints.
>
> Arian Rodrigo Pasquali
> FEUP, SAPO Labs
> http://www.arianpasquali.com
> twitter @arianpasquali
>
>
>
> 2012/10/31 Cheolsoo Park <[EMAIL PROTECTED]>
>
>> Hi,
>>
>> > What would be the best way to filter only the valid rows, since some of
>> > them are strings and others are maps?
>>
>> This shouldn't happen. The data type is defined per column, so it should be
>> either string or map for all rows. If that's not the case, it should be a
>> bug.
>>
>> > Can I create an expression to compare datatypes? Is it possible?
>>
>> Technically, you should be able to write a UDF that checks type. But I am
>> more interested in knowing why you're running into this problem. Can you
>> please share your script and sample data? I'd like to reproduce it.
>>
>> Thanks,
>> Cheolsoo
>>
>> On Wed, Oct 31, 2012 at 2:54 PM, Arian Pasquali <[EMAIL PROTECTED]> wrote:
>>
>> > Can I create an expression to compare datatypes?
>> > Is it possible?
>> >
>> > ArianP
>> >
>> > 2012/10/31 Arian Pasquali <[EMAIL PROTECTED]>
>> >
>> > > You are right, it doesn't seem like a null value.
>> > > It looks like a chararray, but the expression causes an error when
>> > > comparing a string with ([longitude#-9.15199849,latitude#38.71179122])
>> > >
>> > > geoinfo_no_nulls = FILTER geoinfo BY $0 != 'null';
>> > >
>> > > I get
>> > > ERROR 2997: Unable to recreate exception from backed error:
>> > > org.apache.pig.backend.executionengine.ExecException: ERROR 1071: Cannot
>> > > convert a map to a String
>> > >
>> > > What would be the best way to filter only the valid rows, since some of
>> > > them are strings and others are maps?
>> > >
>> > > Arian
>> > >
>> > >
>> > >
>> > > 2012/10/31 Cheolsoo Park <[EMAIL PROTECTED]>
>> > >
>> > >> Hi,
>> > >>
>> > >> I am not sure what the problem is because I can't reproduce it. To me,
>> > >> null values are printed as an empty "( )" not "(null)", so it doesn't
>> > >> seem like null.
>> > >>
>> > >> I am wondering whether OpenJDK is the problem. Can you try Oracle HotSpot
>> > >> JDK 1.6 and see if that fixes it?
>> > >>
>> > >> Thanks,
>> > >> Cheolsoo
>> > >>
>> > >> On Wed, Oct 31, 2012 at 1:06 PM, Arian Pasquali <[EMAIL PROTECTED]> wrote:
>> > >>
>> > >> > Hey people,
>> > >> > I'm having some trouble with a silly task: I can't find a way to filter
>> > >> > null values from my rows. This is the result when I dump the object
>> > >> > geoinfo:
>> > >> >
>> > >> > DUMP geoinfo;
>> > >> > ([longitude#70.95853,latitude#30.9773])