Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig >> mail # user >> UDF discussion? Here or on the dev list? / Json Loading


Copy link to this message
-
Re: UDF discussion? Here or on the dev list? / Json Loading
Hello,

On Sat, Jan 29, 2011 at 5:42 PM, Alex McLintock
<[EMAIL PROTECTED]> wrote:
> I wonder if discussion of the Piggybank and other User Defined Fields is
> best done here (since it is *using* Pig) or on the Development list (because
> it is enhancing Pig).
>
> I'm trying to load some Json into pig using the PigJsonLoader.java UDF which
> Kim Vogt posted about back in September. (It isn't in Piggybank AFAICS)
> https://gist.github.com/601331
>
>
> The class works for me - mostly....
>
>
> This works when the Json is just a single level
>
> {"field1": "value1", "field2": "value2", "field3": "value3"}
>
> But doesn't seem to work when the json is nested
>
> {"field1": "value1", "field2": "value2", {"field4": "value4", "field5":
> "value5", "field6": "value6"}, "field3": "value3"}
>

The json-simple library for Java will build the entire JSON
representation as a JSONObject, which is _exactly_ what you need. This
is a Java Map-like class which would contain your structure properly.
What remains is to properly convert this to a Pig-acceptable Map
structure.

But what's happening in Vogt's code (and also Elephant-Bird's
LzoJsonLoader from which it was sourced) is that the Map is
down-converted to a simple Key-Value mapping instead of a Map
containing another Map. This was done due to a limitation in Pig
0.6.0, where the Map type could not hold complex types in it -- as
noted in the latter class's javadoc [1].

This limitation has gone away in 0.7.0+ I think (As the Pig Map spec
now supports <String, {Atom, Tuple, Bag, Map}>, so you can feel free
to change/get rid of the iteration inside parseStringToTuple(...) to
not 'flatten' the Map.

Additionally I think the json-simple dependency can perhaps be removed
in favor of Jackson Core/Mapper libraries that are now being shipped
by Hadoop itself (eliminating an extra JAR). Pig does not ship the
json-simple library along. But you may want to be careful about the
version of Jackson Core/Mapper in place inside your Hadoop. There are
much more recent updates of it available with benefits.

Perhaps, if you feel like, you can contribute your change back to
elephant-bird [2]. I think they're open to newer-Pig related changes.

[1] - https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/load/LzoJsonLoader.java
[2] -  https://github.com/kevinweil/elephant-bird

--
Harsh J
www.harshj.com