Pig, mail # user - Json and split into multiple files


Re: Json and split into multiple files
Mohit Anchlia 2012-09-13, 14:01
On Wed, Sep 12, 2012 at 7:51 PM, Alan Gates <[EMAIL PROTECTED]> wrote:

> I don't understand your use case or why you need to use exec or
> outputSchema.  Would it be possible to send a more complete example that
> makes clear why you need these?
>
My JSON has many fields and several parent elements. I already have a POJO
that I can parse it into and read fields from, instead of hand-typing all of
them. I also have a mapper and formatter that map the JSON fields to database
fields, each at a fixed position in the file. Hand-typing all of that in Pig
would be really painful. With exec I can easily parse my JSON and then use the
mappers to write to tuples. It's faster to develop and easy to unit test.
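
For readers following along, the approach described above can be sketched in plain Java. This is only an illustration under stated assumptions: it uses no Pig classes, `List<Object>` stands in for a Pig tuple, and the names `User`, `Product`, `mapUser`, `mapProduct` and this `exec` signature are all hypothetical. JSON parsing is skipped, on the assumption that the existing POJO parser already handles it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java stand-in: no Pig API is used here. In a real UDF the
// List<Object> values would be Pig tuples built inside exec().
class PojoToTupleSketch {

    // Hypothetical POJOs for the two parent elements of the JSON.
    static final class User {
        final int id;
        final String name;
        User(int id, String name) { this.id = id; this.name = name; }
    }

    static final class Product {
        final int id;
        final String name;
        Product(int id, String name) { this.id = id; this.name = name; }
    }

    // Hypothetical mappers: each one fixes the field order in a single place,
    // so positions are not hand-typed anywhere else.
    static List<Object> mapUser(String key, User u) {
        return Arrays.<Object>asList(key, u.id, u.name);
    }

    static List<Object> mapProduct(String key, Product p) {
        return Arrays.<Object>asList(key, p.id, p.name);
    }

    // Stand-in for exec(): one outer result holding one tuple per dimension.
    static List<List<Object>> exec(String key, User u, Product p) {
        List<List<Object>> outer = new ArrayList<>();
        outer.add(mapUser(key, u));     // position 0: user tuple
        outer.add(mapProduct(key, p));  // position 1: product tuple
        return outer;
    }

    // Render one tuple as a comma-separated line.
    static String toCsv(List<Object> tuple) {
        return tuple.stream().map(String::valueOf).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        List<List<Object>> out =
            exec("key", new User(1, "user1"), new Product(1, "product1"));
        System.out.println(toCsv(out.get(0))); // prints key,1,user1
        System.out.println(toCsv(out.get(1))); // prints key,1,product1
    }
}
```

Because the mappers are ordinary static methods, they can be unit tested without starting Pig, which is the development benefit mentioned above.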
> Alan.
>
> A tuple can contain a tuple, so it's certainly possible with
> outputSchema() to generate a schema that declares both your tuples.  But I
> don't think this answers your questions.
>
> On Sep 7, 2012, at 10:21 AM, Mohit Anchlia wrote:
>
> > It looks like I can use the outputSchema(Schema input) call to do this, but
> > the examples I see are only for one tuple. In my case, if I am reading it
> > right, I need a tuple for each dimension and hence a schema for each. For
> > instance, there'll be one user tuple and one product tuple, so I need a
> > schema for each.
> >
> > How can I do this using outputSchema so that the result is like the one
> > below, where I can access each tuple and each named field? Thanks for your
> > help.
> >
> > A = load 'inputfile' using JsonLoader() as (user: tuple(id: int, name:
> > chararray), product: tuple(id: int, name:chararray))
> >
> > On Tue, Sep 4, 2012 at 8:37 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> >
> >> I have JSON that looks something like this:
> >>
> >> {
> >>   "user": {
> >>     "id": 1,
> >>     "name": "user1"
> >>   },
> >>   "product": {
> >>     "id": 1,
> >>     "name": "product1"
> >>   }
> >> }
> >>
> >> I want to be able to read this file and create 2 files as follows:
> >>
> >> user file:
> >> key,1,user1
> >>
> >> product file:
> >> key,1,product1
> >>
> >> I know I need to call exec, but the method will return bags for each of
> >> these dimensions. Since it's all unordered, how do I split it further
> >> to write them to separate files?
> >>
>
>
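
The split asked about in the original question amounts to projecting each record once per dimension and sending each projection to its own output. A minimal plain-Java sketch of that idea follows, assuming each record is already flattened into a five-field row. The row layout, the class name `SplitByDimensionSketch` and the method `split` are hypothetical, and no Pig API is involved. In Pig itself the same effect would more likely come from one projection and one STORE per dimension than from hand-written output code.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

// Illustration only: routes one input row to two separate outputs.
class SplitByDimensionSketch {

    // Assumed row layout: {key, userId, userName, productId, productName}.
    static void split(List<String[]> rows, PrintWriter userOut, PrintWriter productOut) {
        for (String[] r : rows) {
            // User projection: key plus the two user fields.
            userOut.print(String.join(",", r[0], r[1], r[2]));
            userOut.print('\n');
            // Product projection: key plus the two product fields.
            productOut.print(String.join(",", r[0], r[3], r[4]));
            productOut.print('\n');
        }
        userOut.flush();
        productOut.flush();
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"key", "1", "user1", "1", "product1"});

        // In-memory writers keep the sketch self-contained; a PrintWriter
        // opened on a file path could be passed in the same way.
        StringWriter users = new StringWriter();
        StringWriter products = new StringWriter();
        split(rows, new PrintWriter(users), new PrintWriter(products));

        System.out.print(users);    // prints key,1,user1
        System.out.print(products); // prints key,1,product1
    }
}
```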