|
Russell Jurney
2012-11-17, 22:09
Dan Young
2012-11-18, 01:23
Arian Pasquali
2012-11-18, 02:30
Russell Jurney
2012-11-18, 04:32
Russell Jurney
2012-11-18, 17:19
Arian Pasquali
2012-11-18, 22:46
Arian Pasquali
2012-11-19, 00:31
Russell Jurney
2012-11-19, 16:23
Russell Jurney
2012-11-19, 19:27
Russell Jurney
2012-11-19, 19:30
Russell Jurney
2012-11-19, 19:33
Russell Jurney
2012-11-19, 19:35
Deepak Tiwari
2012-11-19, 20:22
Saxifrage Cucvara
2012-11-21, 05:56
David LaBarbera
2012-11-21, 14:25
Saxifrage Cucvara
2012-11-21, 22:36
Adam Kawa
2012-11-17, 23:40
Russell Jurney
2012-11-18, 22:46
|
-
How do I load JSON in Pig?
Russell Jurney 2012-11-17, 22:09
I have some JSON data with a uniform schema. I want to load it in Pig. JsonStorage doesn't work, because the data has no schema.

How can I load JSON data in Pig?

--
Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
-
Re: How do I load JSON in Pig?
Dan Young 2012-11-18, 01:23
Not sure if this helps, but in 0.11 I've been using this on EMR for some of our JSON data:

    raw = load 'hdfs:///cleaned_logs/clicks2/$year_id/$month_id/part-*' USING
        JsonLoader('a:chararray,at:chararray,c1:(url:chararray,useragent:chararray,referrer:chararray,window:(innerheight:chararray,innerwidth:chararray,outerheight:chararray,outerwidth:chararray),resolution:(height:chararray,width:chararray)),cst:chararray,d:(a:chararray,b:chararray),i:chararray,id:chararray,ip:chararray,k:chararray,l:(lat:chararray,lng:chararray),p:chararray,pv:chararray,sa:chararray,sid:chararray,sst:chararray,t:chararray,uuid:chararray,v:chararray');

Regards,
Dano
-
Re: How do I load JSON in Pig?
Arian Pasquali 2012-11-18, 02:30
Keep calm and use elephant-bird:
https://github.com/kevinweil/elephant-bird
(JsonLoader source: https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/load/JsonLoader.java)

Yesterday I posted an example here of how to load tweets in JSON. Here it is again; I hope it helps.

    register 'elephant-bird-core-3.0.0.jar'
    register 'elephant-bird-pig-3.0.0.jar'
    register 'google-collections-1.0.jar'
    register 'json-simple-1.1.jar'

    json_lines = LOAD '/twitter_data/tweets/stream/v1/json/2012_10_10/08'
        USING com.twitter.elephantbird.pig.load.JsonLoader();

    geo_tweets = FOREACH json_lines GENERATE
        (CHARARRAY) $0#'id' AS id,
        (CHARARRAY) $0#'geoLocation' AS geoLocation;

    only_not_nulls = FILTER geo_tweets BY geoLocation is not null;
    store only_not_nulls into '/twitter_data/results/geo_tweets';

Arian Rodrigo Pasquali
FEUP, SAPO Labs
http://www.arianpasquali.com
twitter @arianpasquali
-
Re: How do I load JSON in Pig?
Russell Jurney 2012-11-18, 04:32
Thanks, that is excellent.
-
Re: How do I load JSON in Pig?
Russell Jurney 2012-11-18, 17:19
Thanks - looks like I don't have to specify the schema, which is good. I'll try to build elephant-bird.
-
Re: How do I load JSON in Pig?
Arian Pasquali 2012-11-18, 22:46
You don't need to build it either. Just download those two jars I used in my example.

Arian
-
Re: How do I load JSON in Pig?
Arian Pasquali 2012-11-19, 00:31
I don't think you really need to build it. You can find it in any Maven repository.
-
Re: How do I load JSON in Pig?
Russell Jurney 2012-11-19, 16:23
It seems that everyone can build elephant-bird but me:
https://github.com/kevinweil/elephant-bird/issues/272
-
Re: How do I load JSON in Pig?
Russell Jurney 2012-11-19, 19:27
Got it building. Are google collections and json-simple external deps?
-
Re: How do I load JSON in Pig?
Russell Jurney 2012-11-19, 19:30
Talking to myself... never mind, guava and json-simple are included with Pig.
-
Re: How do I load JSON in Pig?
Russell Jurney 2012-11-19, 19:33
Wait... com.twitter.elephantbird.pig.load.JsonLoader() does not infer the schema from a record. Schema inference is what I was looking for. Looks like I have to write that myself.

And yes, I understand the tradeoffs in doing so. Assuming that a sample represents the overall schema is a big assumption.
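
A minimal sketch of the kind of schema inference described above. This is not something Pig or elephant-bird provides; it reads one sample record and emits a schema string in the style of the JsonLoader('...') argument from Dan's post. The function names (pig_field, infer_pig_schema) and the type mapping are assumptions made for illustration, arrays are deliberately not handled, and a single sample may not represent the whole data set.

```python
import json


def pig_field(name, value):
    """Return a 'name:type' schema fragment for one JSON value."""
    # bool must be tested before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return f"{name}:boolean"
    if isinstance(value, int):
        return f"{name}:long"
    if isinstance(value, float):
        return f"{name}:double"
    if isinstance(value, dict):
        # Nested objects become Pig tuples: name:(field:type,...)
        inner = ",".join(pig_field(k, v) for k, v in value.items())
        return f"{name}:({inner})"
    if isinstance(value, list):
        raise ValueError(f"arrays are not handled in this sketch: {name}")
    # Strings and nulls fall back to chararray (an assumption, not a Pig rule).
    return f"{name}:chararray"


def infer_pig_schema(sample_line):
    """Infer a schema string from a single JSON object given as text."""
    record = json.loads(sample_line)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    return ",".join(pig_field(k, v) for k, v in record.items())


sample = '{"id": "42", "l": {"lat": 1.5, "lng": 2.5}, "pv": 3}'
print(infer_pig_schema(sample))
# prints: id:chararray,l:(lat:double,lng:double),pv:long
```

The resulting string could then be pasted into a loader call such as the one in Dan's post, after checking it against more than one record.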
-
Re: How do I load JSON in Pig?Russell Jurney 2012-11-19, 19:35
Ok, its even worse. My data is a big array.
Am I being negative in saying that JSON and Pig is like a nightmare?

Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com +
Russell Jurney 2012-11-19, 19:35
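On the schema-inference point raised in this thread: a minimal sketch, in plain Java, of what "infer the schema from a record" could look like. It walks one sample record and emits a schema string in the same style as the JsonLoader('a:chararray,...,l:(lat:chararray,lng:chararray),...') example earlier in the thread. The class and method names (SchemaSketch, pigType, inferSchema) are hypothetical and not part of Pig or elephant-bird, the record is assumed to be already parsed into nested java.util.Map objects, and, as noted above, treating one sample as the overall schema is a big assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Sketch only: derives a Pig-style schema string from ONE sample record.
// Names here are hypothetical, not part of Pig or elephant-bird.
final class SchemaSketch {

    // Map a Java value to a Pig type name. Nested maps become tuples,
    // written as "(field:type,...)".
    static String pigType(Object value) {
        if (value instanceof Map) {
            return "(" + inferSchema((Map<?, ?>) value) + ")";
        }
        if (value instanceof Integer) return "int";
        if (value instanceof Long) return "long";
        if (value instanceof Float) return "float";
        if (value instanceof Double) return "double";
        if (value instanceof Boolean) return "boolean";
        // Strings, nulls and anything unrecognised fall back to chararray.
        return "chararray";
    }

    // Build "name:type,name:type,..." in the record's iteration order.
    static String inferSchema(Map<?, ?> record) {
        StringJoiner fields = new StringJoiner(",");
        for (Map.Entry<?, ?> e : record.entrySet()) {
            fields.add(e.getKey() + ":" + pigType(e.getValue()));
        }
        return fields.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> loc = new LinkedHashMap<>();
        loc.put("lat", 1.5);
        loc.put("lng", 2.5);
        Map<String, Object> rec = new LinkedHashMap<>();
        rec.put("id", "abc");
        rec.put("pv", 3);
        rec.put("l", loc);
        // prints: id:chararray,pv:int,l:(lat:double,lng:double)
        System.out.println(inferSchema(rec));
    }
}
```

Arrays are deliberately left out of the sketch (they fall through to chararray); data that is "a big array" would need a bag type on top of this.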
-
Re: How do I load JSON in Pig? Deepak Tiwari 2012-11-19, 20:22
I also ran into the same dilemma. Here is something that I found easier and that is working for me. I compiled some sources from http://www.json.org/java/

    import java.io.IOException;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;
    import org.json.JSONArray;
    import org.json.JSONException;
    import org.json.JSONObject;

    // Pig UDF: parses a JSON string and returns a one-field tuple holding the
    // concatenated "text" values of the objects in the "jsonKey" array.
    public class JsonParser extends EvalFunc<Tuple> {
        @Override
        public Tuple exec(Tuple input) throws IOException {
            TupleFactory tf = TupleFactory.getInstance();
            Tuple t = tf.newTuple();

            if (input.get(0) != null) {
                String inString = (String) input.get(0);
                try {
                    JSONObject jsn = new JSONObject(inString);
                    t.append(getJsonArr(jsn));
                } catch (JSONException e) {
                    e.printStackTrace();
                }
            }
            return t;
        }

        // Returns null when the record has no "jsonKey" field.
        private String getJsonArr(JSONObject jsn) {
            String jsnArrVal = "";
            try {
                if (!jsn.has("jsonKey"))
                    return null;
                JSONArray jTagArray = jsn.getJSONArray("jsonKey");
                for (int i = 0; i < jTagArray.length(); i++) {
                    JSONObject hst = jTagArray.getJSONObject(i);
                    // Assign to the outer variable; redeclaring it here does not compile.
                    jsnArrVal = hst.getString("text") + jsnArrVal;
                }
            } catch (JSONException e) {
                e.printStackTrace();
            }
            return jsnArrVal;
        }
    }

+
Deepak Tiwari 2012-11-19, 20:22
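One detail of the loop in getJsonArr above: each new "text" value is placed in front of the accumulated string, so the values come out in reverse array order. The sketch below shows the difference between prepending and appending with plain strings. It is an illustration only; ConcatOrderSketch and its methods are hypothetical names, and it uses a List<String> in place of the JSONArray so that it needs no JSON library.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: contrasts prepending (as in the UDF's loop) with appending.
final class ConcatOrderSketch {

    // Same accumulation pattern as the UDF: new value goes in FRONT.
    static String prepend(List<String> texts) {
        String acc = "";
        for (String t : texts) {
            acc = t + acc;
        }
        return acc;
    }

    // Alternative that keeps the original array order.
    static String append(List<String> texts) {
        StringBuilder acc = new StringBuilder();
        for (String t : texts) {
            acc.append(t);
        }
        return acc.toString();
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList("a", "b", "c");
        System.out.println(prepend(texts)); // prints: cba
        System.out.println(append(texts));  // prints: abc
    }
}
```

Which order is wanted depends on the data; if array order matters, the append form is the one to use.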
-
Re: How do I load JSON in Pig? Saxifrage Cucvara 2012-11-21, 05:56
I'm also experiencing problems working with JSON objects in Pig.
I have managed to load a log file in JSON format, but I can only query the top-level objects. Whenever I try to access anything that is nested, it fails.

    -- Register JARS
    register elephant-bird-2.2.3.jar;
    register json-simple-1.1.jar;

    -- Load data
    nestobject = LOAD '/Users/Path/GoogleDrive/test.json'
        USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true')
        AS (json:map[]);
    DUMP nestobject;

    -- Example query
    tester = FOREACH nestobject GENERATE json#'event', json#'uid',
        json#'data'#'expired_reason' as reason;
    DUMP tester;

The above fails ...

Does anyone have any ideas?

Thanks

Sax

Saxifrage Cucvara
Senior Data Analyst
JBA Online Consultancy
E: [EMAIL PROTECTED] M: +61 424 622 534 W: www.jbadigital.com
A: Level 6, 69 Reservoir Street, Surry Hills NSW 2010
+
Saxifrage Cucvara 2012-11-21, 05:56
-
Re: How do I load JSON in Pig? David LaBarbera 2012-11-21, 14:25
Try

    com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')

This should allow access to nested objects as nested maps ($0#'level1'#'level2'#'level3' …)

David
+
David LaBarbera 2012-11-21, 14:25
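The chained lookup $0#'level1'#'level2'#'level3' only works if every intermediate value really is a map. A minimal Java sketch of the same idea, following a path of keys through nested java.util.Map objects, is given here for illustration. NestedLookupSketch and lookup are hypothetical names, and returning null when an intermediate value is not a map is a choice made for this sketch, not a description of what Pig does in that case.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: mimics a chained map dereference over nested Java maps.
final class NestedLookupSketch {

    // Follow each key in turn; give up with null as soon as the current
    // value is missing or is not a map.
    static Object lookup(Object root, String... path) {
        Object current = root;
        for (String key : path) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> data = new HashMap<>();
        data.put("expired_reason", "timeout");
        Map<String, Object> json = new HashMap<>();
        json.put("event", "expire");
        json.put("data", data);

        // Nested value reachable because "data" is itself a map.
        System.out.println(lookup(json, "data", "expired_reason")); // prints: timeout
        // "event" is a plain string, so there is nothing to descend into.
        System.out.println(lookup(json, "event", "anything"));      // prints: null
    }
}
```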
-
Re: How do I load JSON in Pig? Saxifrage Cucvara 2012-11-21, 22:36
Thanks David.
However, I did try this. I can read things on the first level of the JSON file, but anything in any of the nested levels is failing.

Not sure if the errors below help with identifying what the problem might be:

    2012-11-22 09:29:07,065 [Thread-39] WARN org.apache.hadoop.mapred.FileOutputCommitter - Output path is null in cleanup
    2012-11-22 09:29:07,065 [Thread-39] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0009
    org.apache.pig.backend.executionengine.ExecException: ERROR 1081: Cannot cast to map. Expected bytearray but received: chararray
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:1422)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.processInput(POMapLookUp.java:87)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:98)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:117)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:320)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:271)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:266)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
    Caused by: java.lang.ClassCastException
    2012-11-22 09:29:07,199 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local_0009
    2012-11-22 09:29:07,199 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
    2012-11-22 09:29:12,207 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_local_0009 has failed! Stop running all dependent jobs
    2012-11-22 09:29:12,207 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
    2012-11-22 09:29:12,207 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!

+
Saxifrage Cucvara 2012-11-21, 22:36
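The "Cannot cast to map ... received: chararray" error above says that the value reached by the first lookup was a string rather than a map. One possible explanation, stated here as an assumption and not confirmed anywhere in this thread, is that nested loading was not in effect in the version being used, so nested objects arrived as their string form. The Java sketch below only illustrates that difference between a "flat" record (every value stringified) and a nested one; FlatVsNestedSketch, flatten and canDereference are hypothetical names and do not reproduce elephant-bird's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: shows why a second-level lookup needs the first-level
// value to be a map, not a string.
final class FlatVsNestedSketch {

    // Hypothetical "flat" load: every value, nested or not, becomes a string.
    static Map<String, Object> flatten(Map<String, Object> record) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : record.entrySet()) {
            Object v = e.getValue();
            out.put(e.getKey(), v == null ? null : v.toString());
        }
        return out;
    }

    // A second-level lookup is only possible when the value is a map.
    static boolean canDereference(Map<String, Object> record, String key) {
        return record.get(key) instanceof Map;
    }

    public static void main(String[] args) {
        Map<String, Object> data = new LinkedHashMap<>();
        data.put("expired_reason", "timeout");
        Map<String, Object> json = new LinkedHashMap<>();
        json.put("uid", "u1");
        json.put("data", data);

        System.out.println(canDereference(json, "data"));          // prints: true
        System.out.println(canDereference(flatten(json), "data")); // prints: false
    }
}
```

A quick way to check which case applies is to DUMP the loaded relation and look at whether the nested field is shown as a map or as a quoted string.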
-
Re: How do I load JSON in Pig? Adam Kawa 2012-11-17, 23:40
Maybe JsonLoader from ElephantBird can be useful?
https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/load/JsonLoader.java
+
Adam Kawa 2012-11-17, 23:40
-
Re: How do I load JSON in Pig? Russell Jurney 2012-11-18, 22:46
They come prebuilt? Neat!
Russell Jurney twitter.com/rjurney +
Russell Jurney 2012-11-18, 22:46
|