Re: problem filtering null values with pig
Arian Pasquali 2012-11-17, 05:01
Just for the record, I'm posting the solution to my problem here. Thank you for your help.

In the end the problem seems to be with the JsonLoader I was using. I don't know exactly why, but it seems to have a bug with my strings. I finally changed my code to use https://github.com/kevinweil/elephant-bird.

The code now looks like this:

    register 'elephant-bird-core-3.0.0.jar';
    register 'elephant-bird-pig-3.0.0.jar';
    register 'google-collections-1.0.jar';
    register 'json-simple-1.1.jar';

    json_lines = LOAD '/twitterecho/tweets/stream/v1/json/2012_10_10/08'
        USING com.twitter.elephantbird.pig.load.JsonLoader();

    geo_tweets = FOREACH json_lines GENERATE
        (CHARARRAY) $0#'id' AS id,
        (CHARARRAY) $0#'geoLocation' AS geoLocation;

    tweets_grp = GROUP geo_tweets BY id;

    unique_tweets = FOREACH tweets_grp {
        -- the bag inside each group is named after the grouped relation
        first_tweet = LIMIT geo_tweets 1;
        GENERATE FLATTEN(first_tweet);
    };

    only_not_nulls = FILTER geo_tweets BY geoLocation is not null;

    store only_not_nulls into '/twitter_data/results/geo_tweets';

cheers
thanks again for your support

Arian P

2012/11/1 Arian Pasquali <[EMAIL PROTECTED]>

> You are right Cheolsoo,
> Indeed, it doesn't make any sense to write a UDF to compare datatypes. I
> know it's possible, but it doesn't sound like the right way.
> Maybe it is a bug in the JsonLoader I'm using:
> https://github.com/mmay/PigJsonLoader/blob/master/JsonLoader.java
>
> I will share the script and the data with you in a few.
>
> tks for the hints.
>
> Arian Rodrigo Pasquali
> FEUP, SAPO Labs
> http://www.arianpasquali.com
> twitter @arianpasquali
>
> 2012/10/31 Cheolsoo Park <[EMAIL PROTECTED]>
>
>> Hi,
>>
>> > what would be the best way to filter only the valid rows, since some
>> > of them are string and others map?
>>
>> This shouldn't happen. The data type is defined per column, so it should
>> be either string or map for all rows. If that's not the case, it should
>> be a bug.
>>
>> > can I create an expression to compare datatypes? is it possible?
>>
>> Technically, you should be able to write a UDF that checks type. But I am
>> more interested in knowing why you're running into this problem. Can you
>> please share your script and sample data? I'd like to reproduce it.
>>
>> Thanks,
>> Cheolsoo
>>
>> On Wed, Oct 31, 2012 at 2:54 PM, Arian Pasquali <[EMAIL PROTECTED]> wrote:
>>
>> > can I create an expression to compare datatypes?
>> > is it possible?
>> >
>> > ArianP
>> >
>> > 2012/10/31 Arian Pasquali <[EMAIL PROTECTED]>
>> >
>> > > you are right, it doesn't seem like a null value.
>> > > it looks like a chararray. But the expression causes an error when
>> > > comparing a string with ([longitude#-9.15199849,latitude#38.71179122])
>> > >
>> > > geoinfo_no_nulls = FILTER geoinfo BY $0!='null'
>> > >
>> > > I get
>> > > ERROR 2997: Unable to recreate exception from backed error:
>> > > org.apache.pig.backend.executionengine.ExecException: ERROR 1071:
>> > > Cannot convert a map to a String
>> > >
>> > > what would be the best way to filter only the valid rows, since some
>> > > of them are string and others map?
>> > >
>> > > Arian
>> > >
>> > > 2012/10/31 Cheolsoo Park <[EMAIL PROTECTED]>
>> > >
>> > >> Hi,
>> > >>
>> > >> I am not sure what the problem is because I can't reproduce it. To
>> > >> me, null values are printed as an empty "( )" not "(null)", so it
>> > >> doesn't seem like null.
>> > >>
>> > >> I am wondering whether OpenJDK is the problem. Can you try Oracle
>> > >> HotSpot JDK 1.6 and see if that fixes it?
>> > >>
>> > >> Thanks,
>> > >> Cheolsoo
>> > >>
>> > >> On Wed, Oct 31, 2012 at 1:06 PM, Arian Pasquali <[EMAIL PROTECTED]> wrote:
>> > >>
>> > >> > hey people
>> > >> > I'm having some trouble with a silly task: I can't find a way to
>> > >> > filter null values from my rows. This is the result when I dump
>> > >> > the object geoinfo:
>> > >> >
>> > >> > DUMP geoinfo;
>> > >> > ([longitude#70.95853,latitude#30.9773])
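
A note on the script at the top of this message: unique_tweets is computed, but the final FILTER and STORE read from geo_tweets, so the stored rows are not deduplicated. If the goal is one non-null row per id, one possible arrangement is sketched below. This is an untested sketch, not the author's script. It assumes the geo_tweets relation defined in the script above, with fields id and geoLocation. The alias unique_geo_tweets is made up for illustration, and the output path is reused from that script.

```pig
-- Sketch only: assumes geo_tweets (id, geoLocation) from the script above.
-- Drop rows without a location first, so that a null row is never the one
-- kept for an id.
only_not_nulls = FILTER geo_tweets BY geoLocation IS NOT NULL;

-- Group the surviving rows by id and keep one row per group.
tweets_grp = GROUP only_not_nulls BY id;

unique_geo_tweets = FOREACH tweets_grp {
    -- inside the group, the bag carries the name of the grouped relation
    first_tweet = LIMIT only_not_nulls 1;
    GENERATE FLATTEN(first_tweet);
};

STORE unique_geo_tweets INTO '/twitter_data/results/geo_tweets';
```

Filtering before grouping also means fewer rows go through the GROUP step.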