Pig >> mail # user >> UDF discussion? Here or on the dev list? / Json Loading


Alex McLintock 2011-01-29, 12:12
Jacob Perkins 2011-01-29, 13:43
Alex McLintock 2011-01-30, 20:09
Jacob Perkins 2011-01-30, 21:01
Re: UDF discussion? Here or on the dev list? / Json Loading
Hello,

On Sat, Jan 29, 2011 at 5:42 PM, Alex McLintock
<[EMAIL PROTECTED]> wrote:
> I wonder if discussion of the Piggybank and other User Defined Fields is
> best done here (since it is *using* Pig) or on the Development list (because
> it is enhancing Pig).
>
> I'm trying to load some Json into pig using the PigJsonLoader.java UDF which
> Kim Vogt posted about back in September. (It isn't in Piggybank AFAICS)
> https://gist.github.com/601331
>
>
> The class works for me - mostly....
>
>
> This works when the Json is just a single level
>
> {"field1": "value1", "field2": "value2", "field3": "value3"}
>
> But doesn't seem to work when the json is nested
>
> {"field1": "value1", "field2": "value2", {"field4": "value4", "field5":
> "value5", "field6": "value6"}, "field3": "value3"}
>

The json-simple library for Java will build the entire JSON
representation as a JSONObject, which is _exactly_ what you need. This
is a Java Map-like class which would contain your structure properly.
What remains is to properly convert this to a Pig-acceptable Map
structure.

But what's happening in Vogt's code (and also Elephant-Bird's
LzoJsonLoader from which it was sourced) is that the Map is
down-converted to a simple Key-Value mapping instead of a Map
containing another Map. This was done due to a limitation in Pig
0.6.0, where the Map type could not hold complex types in it -- as
noted in the latter class's javadoc [1].
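
To make the effect of that down-conversion concrete, here is a rough sketch in plain Java. It is not the code from either loader: the class FlattenDemo, the method flatten and the key "nested" are made up for illustration, and a java.util.Map stands in for the parsed JSONObject. It only assumes the flattening reduces every value to its string form.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FlattenDemo {
    // Hypothetical stand-in for the flattening loop: every value,
    // including a nested map, is reduced to its string form.
    static Map<String, String> flatten(Map<?, ?> parsed) {
        Map<String, String> flat = new LinkedHashMap<>();
        for (Map.Entry<?, ?> e : parsed.entrySet()) {
            Object v = e.getValue();
            flat.put(String.valueOf(e.getKey()), v == null ? null : v.toString());
        }
        return flat;
    }

    public static void main(String[] args) {
        Map<String, Object> inner = new LinkedHashMap<>();
        inner.put("field4", "value4");

        Map<String, Object> outer = new LinkedHashMap<>();
        outer.put("field1", "value1");
        outer.put("nested", inner); // "nested" is an assumed key name

        // The inner map survives only as one opaque string, so its
        // fields can no longer be addressed individually.
        System.out.println(flatten(outer).get("nested"));
    }
}
```

Once the nested object has been turned into a string this way, a script has no way to reach field4 inside it, which would match the behaviour you are seeing.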

I think this limitation went away in 0.7.0+ (the Pig Map spec now
supports <String, {Atom, Tuple, Bag, Map}>), so you can feel free to
change or remove the iteration inside parseStringToTuple(...) so that
it no longer 'flattens' the Map.
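
A non-flattening conversion could look roughly like the sketch below. It is plain Java with no Pig or json-simple on the classpath, so treat it as an outline rather than a drop-in replacement: NestedMapDemo and convert are hypothetical names, and the input is any java.util.Map/List tree (json-simple's JSONObject and JSONArray are Map and List implementations, so they should fit). In a real loader the list branch would presumably need to build a Pig bag of tuples instead of a java.util.List.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NestedMapDemo {
    // Recursively copy a parsed JSON tree, keeping nested maps as maps
    // instead of collapsing them to strings.
    static Object convert(Object value) {
        if (value instanceof Map) {
            Map<String, Object> out = new LinkedHashMap<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                out.put(String.valueOf(e.getKey()), convert(e.getValue()));
            }
            return out;
        }
        if (value instanceof List) {
            // Assumption: a real Pig loader would emit a bag here.
            List<Object> out = new ArrayList<>();
            for (Object item : (List<?>) value) {
                out.add(convert(item));
            }
            return out;
        }
        // Scalars (String, Number, Boolean) and null pass through as-is.
        return value;
    }

    public static void main(String[] args) {
        Map<String, Object> inner = new LinkedHashMap<>();
        inner.put("field4", "value4");

        Map<String, Object> outer = new LinkedHashMap<>();
        outer.put("field1", "value1");
        outer.put("nested", inner); // "nested" is an assumed key name

        Map<?, ?> result = (Map<?, ?>) convert(outer);
        Map<?, ?> nested = (Map<?, ?>) result.get("nested");
        System.out.println(nested.get("field4"));
    }
}
```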

Additionally, I think the json-simple dependency could be dropped in
favor of the Jackson Core/Mapper libraries that Hadoop itself now
ships, which would eliminate an extra JAR (Pig does not ship
json-simple). But be careful about which version of Jackson
Core/Mapper your Hadoop bundles: much more recent releases, with
improvements, are available.

Perhaps, if you feel like it, you can contribute your change back to
elephant-bird [2]. I think they're open to changes for newer Pig
versions.

[1] - https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/load/LzoJsonLoader.java
[2] -  https://github.com/kevinweil/elephant-bird

--
Harsh J
www.harshj.com