Re: Aggregating data nested into JSON documents
Hi..

Have you thought about HBase?

If you're using Hive or Pig, I would suggest taking these files and putting the JSON records into a sequence file, or a set of sequence files. (Then look at HBase to help index them.) 200KB is small.
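
Something like this would do it (a rough, untested sketch; I'm assuming your .json.gz files sit in one HDFS dir (args[0]), the output SequenceFile is args[1], and you keep the file name as a Text key and the uncompressed JSON as a Text value):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs lots of small .json.gz files into one SequenceFile:
// key = original file name, value = the uncompressed JSON text.
public class JsonToSequenceFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, Text.class);
    try {
      for (FileStatus stat : fs.listStatus(new Path(args[0]))) {
        // Gunzip each small file into memory (200KB each, so that's fine).
        InputStream in = new GZIPInputStream(fs.open(stat.getPath()));
        ByteArrayOutputStream json = new ByteArrayOutputStream();
        try {
          IOUtils.copyBytes(in, json, conf, false);
        } finally {
          in.close();
        }
        writer.append(new Text(stat.getPath().getName()),
                      new Text(json.toString("UTF-8")));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}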

That would be the same for either Pig or Hive.

In terms of SerDes, I've worked with Pig and ElephantBird, and it's pretty nice. And yes, you get each record as a row, but you can always flatten them as needed.

Hive?
I haven't worked with the latest SerDe, but maybe Dean Wampler or Edward Capriolo could give you a better answer.
Going from memory, I don't know that there is a good Hive SerDe that will write JSON, just read it.

IMHO Pig/ElephantBird is the best so far, but then again I may be dated and biased.

I think you're on the right track or at least train of thought.

HTH

-Mike
On Jun 12, 2013, at 7:57 PM, Tecno Brain <[EMAIL PROTECTED]> wrote:

> Hello,
>    I'm new to Hadoop.
>    I have a large quantity of JSON documents with a structure similar to what is shown below.  
>
>    {
>      g : "some-group-identifier",
>      sg: "some-subgroup-identifier",
>      j      : "some-job-identifier",
>      page     : 23,
>      ... // other fields omitted
>      important-data : [
>          {
>            f1  : "abc",
>            f2  : "a",
>            f3  : "/",
>            ...
>          },
>          ...
>          {
>            f1 : "xyz",
>            f2  : "q",
>            f3  : "/",
>            ...
>          },
>      ],
>     ... // other fields omitted
>      other-important-data : [
>         {
>            x1  : "ford",
>            x2  : "green",
>            x3  : 35,
>            map : {
>                "free-field" : "value",
>                "other-free-field" : "value2"
>               }
>          },
>          ...
>          {
>            x1 : "vw",
>            x2  : "red",
>            x3  : 54,
>            ...
>          },    
>      ]
>    }
>  
>
> Each file contains a single JSON document (gzip compressed; roughly 200KB of uncompressed, pretty-printed JSON text per document).
>
> I am interested in analyzing only the  "important-data" array and the "other-important-data" array.
> My data would be easier to analyze if it looked like a couple of tables with a fixed set of columns. Only the "map" column would be a complex type; all the others would be primitives.
>
> ( g, sg, j, page, f1, f2, f3 )
>  
> ( g, sg, j, page, x1, x2, x3, map )
>
> So, for each JSON document, I would like to "create" several rows, but I would like to avoid the intermediate step of persisting (and duplicating) the "flattened" data.
>
> In order to avoid persisting the flattened data, I thought I had to write my own map-reduce job in Java, but I discovered that others have had the same problem of using JSON as the source, and there are somewhat "standard" solutions.
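>
> Roughly what I had in mind (just a sketch using Jackson; I'm assuming each map() call gets one whole JSON document as its value, e.g. read from a sequence file, and the class name is made up):
>
> import java.io.IOException;
>
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
>
> import com.fasterxml.jackson.databind.JsonNode;
> import com.fasterxml.jackson.databind.ObjectMapper;
>
> // Emits one tab-separated ( g, sg, j, page, f1, f2, f3 ) row per
> // element of the "important-data" array.
> public class FlattenJsonMapper
>     extends Mapper<Text, Text, Text, NullWritable> {
>
>   private final ObjectMapper json = new ObjectMapper();
>
>   @Override
>   protected void map(Text fileName, Text jsonDoc, Context context)
>       throws IOException, InterruptedException {
>     JsonNode doc = json.readTree(jsonDoc.toString());
>     String prefix = doc.path("g").asText() + "\t"
>         + doc.path("sg").asText() + "\t"
>         + doc.path("j").asText() + "\t"
>         + doc.path("page").asInt();
>     for (JsonNode item : doc.path("important-data")) {
>       context.write(new Text(prefix + "\t"
>           + item.path("f1").asText() + "\t"
>           + item.path("f2").asText() + "\t"
>           + item.path("f3").asText()), NullWritable.get());
>     }
>     // ... and the same idea for "other-important-data"
>   }
> }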
>
> By reading about the SerDe approach for Hive, I get the impression that each JSON document is transformed into a single "row" of the table, with some columns being an array, a map, or other nested structures.
> a) Is there a way to break each JSON document into several "rows" for a Hive external table?
> b) It seems there are too many JSON SerDe libraries! Is any of them considered the de facto standard?
>
> The Pig approach using Elephant Bird also seems promising. Does anybody have pointers to more user documentation on this project? Or is browsing through the examples on GitHub my only source?
>
> Thanks
>