Re: Aggregating data nested into JSON documents
Hi Mike,

Yes, I have also thought about HBase or Cassandra, but my data is pretty
much a snapshot; it does not require updates. Most of my aggregations will
also need to be computed only once and won't change over time, with the
exception of some aggregations that are based on the last N days of data.
Should I still consider HBase? I think it will probably be a good fit for
the aggregated data.

I have no idea what sequence files are, but I will take a look.  My raw
data is stored in the cloud, not in my Hadoop cluster.

I'll keep looking at Pig with ElephantBird.
Thanks,

-Jorge
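
A minimal sketch of the Pig + ElephantBird route discussed above, assuming
ElephantBird's JsonLoader with its -nestedLoad option and the field names
from the sample document quoted below. JsonLoader is line-oriented, so this
also assumes each pretty-printed document has been collapsed to a single
line first; exact jar names and casts vary by Pig/ElephantBird version:

    -- Register the ElephantBird jars (names and versions vary by install).
    REGISTER elephant-bird-core.jar;
    REGISTER elephant-bird-pig.jar;

    -- -nestedLoad keeps nested arrays/objects as Pig bags and maps
    -- instead of flattening them to strings.
    docs = LOAD '/data/json/*.gz'
           USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
           AS (json: map[]);

    -- One row per element of the important-data array.
    f_rows = FOREACH docs GENERATE
                json#'g'    AS g,
                json#'sg'   AS sg,
                json#'j'    AS j,
                json#'page' AS page,
                FLATTEN(json#'important-data') AS item: map[];

    f_table = FOREACH f_rows GENERATE
                g, sg, j, page,
                item#'f1' AS f1,
                item#'f2' AS f2,
                item#'f3' AS f3;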

On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <[EMAIL PROTECTED]> wrote:

> Hi..
>
> Have you thought about HBase?
>
> I would suggest that if you're using Hive or Pig, you look at taking these
> files and putting the JSON records into a sequence file.
> Or a set of sequence files.... (Then look at HBase to help index them...)
> 200KB is small.
>
> That would be the same for either pig/hive.
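
One hedged sketch of the sequence-file idea: if the many small gzipped
files were packed into SequenceFiles of (filename, json-text) pairs (the
packing itself would be a small one-off Java or MapReduce job, not shown
here), Pig could read them back with PiggyBank's SequenceFileLoader. The
key/value layout below is an assumption about how the packing was done:

    REGISTER piggybank.jar;

    -- Each record: key = original file name, value = one whole JSON document.
    docs = LOAD '/data/packed'
           USING org.apache.pig.piggybank.storage.SequenceFileLoader()
           AS (filename: chararray, json: chararray);

    -- From here the json string can be handed to a JSON-parsing UDF
    -- (for example ElephantBird's JsonStringToMap) to reach the nested fields.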
>
> In terms of SerDes, I've worked with Pig and ElephantBird; it's pretty nice.
> And yes, you get each record as a row, but you can always flatten them
> as needed.
>
> Hive?
> I haven't worked with the latest SerDe, but maybe Dean Wampler or Edward
> Capriolo could give you a better answer.
> Going from memory, I don't know of a good Hive SerDe that would
> write JSON, just ones that read it.
>
> IMHO Pig/ElephantBird is the best so far, but then again I may be dated
> and biased.
>
> I think you're on the right track, or at least the right train of thought.
>
> HTH
>
> -Mike
>
>
> On Jun 12, 2013, at 7:57 PM, Tecno Brain <[EMAIL PROTECTED]>
> wrote:
>
> Hello,
>    I'm new to Hadoop.
>    I have a large quantity of JSON documents with a structure similar to
> what is shown below.
>
>    {
>      g : "some-group-identifier",
>      sg: "some-subgroup-identifier",
>      j      : "some-job-identifier",
>      page     : 23,
>      ... // other fields omitted
>      important-data : [
>          {
>            f1  : "abc",
>            f2  : "a",
>            f3  : "/",
>            ...
>          },
>          ...
>          {
>            f1 : "xyz",
>            f2  : "q",
>            f3  : "/",
>            ...
>          },
>      ],
>     ... // other fields omitted
>      other-important-data : [
>         {
>            x1  : "ford",
>            x2  : "green",
>            x3  : 35,
>            map : {
>                "free-field" : "value",
>                "other-free-field" : "value2"
>               }
>          },
>          ...
>          {
>            x1 : "vw",
>            x2  : "red",
>            x3  : 54,
>            ...
>          },
>      ]
>    }
>
>
> Each file contains a single JSON document (gzip-compressed, roughly
> 200KB of pretty-printed JSON text per document when uncompressed).
>
> I am interested in analyzing only the "important-data" array and the
> "other-important-data" array.
> Ideally, my source data would look like a couple of tables with a fixed
> set of columns. Only the "map" column would be complex; all the others
> would be primitives.
>
> ( g, sg, j, page, f1, f2, f3 )
>
> ( g, sg, j, page, x1, x2, x3, map )
>
> So, for each JSON document I would like to "create" several rows, while
> avoiding the intermediate step of persisting -and duplicating- the
> "flattened" data.
>
> To avoid persisting the flattened data, I thought I would have to write
> my own MapReduce job in Java, but I discovered that others have had the
> same problem of using JSON as a source, and that there are somewhat
> "standard" solutions.
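
As a sketch of that train of thought in Pig (reusing the ElephantBird
JsonLoader relation "docs" from the first sketch above), both target row
shapes can be produced from a single load, with no intermediate flattened
copy written out. Field names follow the sample document, and the nested
"map" object simply stays a Pig map column:

    -- Second target table: one row per element of other-important-data.
    x_rows = FOREACH docs GENERATE
                json#'g'    AS g,
                json#'sg'   AS sg,
                json#'j'    AS j,
                json#'page' AS page,
                FLATTEN(json#'other-important-data') AS item: map[];

    x_table = FOREACH x_rows GENERATE
                g, sg, j, page,
                item#'x1'  AS x1,
                item#'x2'  AS x2,
                item#'x3'  AS x3,
                item#'map' AS m;  -- 'map' is a reserved word in Pig, so alias it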
>
> By reading about the SerDe approach for Hive, I get the impression that
> each JSON document is transformed into a single "row" of the table, with
> some columns being arrays or maps of other nested structures.
> a) Is there a way to break each JSON document into several "rows" for a
> Hive external table?
> b) It seems there are too many JSON SerDe libraries! Is there any of them