Pig, mail # user - Design issue, need feedback


Re: Design issue, need feedback
Andy Schlaikjer 2012-05-22, 14:56
Another possible solution: use JSON for your storage and load it via
JsonLoader:

https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/load/JsonLoader.java

Then you could project the fields you'd like to use out of the loaded maps
via Pig's map dereference operator:

http://pig.apache.org/docs/r0.9.2/basic.html#deref

Andy
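
A minimal sketch of what the JSON approach amounts to, written in plain Python rather than Pig so it can be run on its own. It is not elephant-bird's code. The sample records and the function name `project_name_age` are made up for illustration, and storing one JSON object per line is an assumption. In Pig, the lookup `record["Name"]` would be written with the map dereference operator, e.g. `m#'Name'`.

```python
import json


def project_name_age(json_line):
    """Parse one JSON record and return (name, age), ignoring any other keys."""
    record = json.loads(json_line)  # plays the role of the map a JSON loader yields
    return record["Name"], record["Age"]


# Hypothetical input: one JSON object per line, keys in any order.
lines = [
    '{"Name": "Fred", "Age": "23"}',
    '{"Age": "25", "Sex": "male", "Name": "Adam"}',
]

rows = [project_name_age(line) for line in lines]
for name, age in rows:
    print(f"{name}\t{age}")
```

Because fields are looked up by key, adding or reordering keys in the stored records does not change the projection.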
On Tue, May 22, 2012 at 4:43 AM, Ruslan Al-fakikh <
[EMAIL PROTECTED]> wrote:

> Hey Nerius,
>
> As for the columns number changes - yes, Avro or Thrift can handle that.
> As for transforming a value of a row from this 'Age=23' to this '23' -
> this is what Pig can do for you.
> Try something like
> b = foreach a generate SUBSTRING($1, 4, 6) AS Age;
> (I haven't tested it, there can be typos; this assumes a two-digit age.)
> Or maybe some other builtins can do it, like the regex ones.
>
> Ruslan Al-Fakikh
>
> -----Original Message-----
> From: Jonathan Coveney [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, May 22, 2012 3:47 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Design issue, need feedback
>
> There are a couple of ways that you can do this. One is that you could
> write a special loader that converts your format to a map of (key,value)
> pairs, and then you can project however you want.
>
> Another (better, if at all possible) way would be to use something like
> Avro or Thrift that allows you to update your schemas in backwards
> compatible ways. I highly recommend leveraging one of these projects
> instead of rolling your own janky file format. Starting from nothing Avro
> would probably be easier, though here at Twitter we have standardized on
> Thrift. Pig can work with both.
>
> 2012/5/21 Nerius Landys <[EMAIL PROTECTED]>
>
> > I'm trying to do something with Pig that I believe Pig wasn't really
> > designed/intended to handle.
> > Normally, the way I'd do things with Pig is by feeding it data like so:
> >
> >    Fred     23
> >    Adam   25
> >    Mary     21
> >
> > ...where the columns are name and age.
> >
> > Now, a requirement came up to have the Pig script be more flexible
> > when the format of the data changes.  In fact with the system I'm
> > working in, the format of the data will change quite often.  Let me
> > explain.  So now in one scenario the data incoming would be this,
> > literally:
> >
> >    Name=Fred     Age=23
> >    Name=Adam   Age=25
> >    Name=Mary     Age=21
> >
> > ...and if the format of the data changes somewhat it may appear like
> > this for example:
> >
> >    Age=23    Sex=male      Name=Fred
> >    Age=25    Sex=male      Name=Adam
> >    Age=21    Sex=female    Name=Mary
> >
> > <All data is tab delimited of course.> However in all cases the only
> > data I'm interested in is name and age, in that order.  And in all
> > cases these two pieces of data are guaranteed to appear.
> >
> > So my question is, how easy and efficient would it be to write
> > something in the Pig language that transforms the key/value rows into
> > the format in my first example with just name and age columns?
> >
> > That is, I want to transform, using a fixed series of Pig statements,
> > this type of row:
> >
> >    Name=Fred     Age=23
> >
> > to this:
> >
> >    Fred     23
> >
> > And I want the same series of Pig statements to transform this type of
> > row:
> >
> >    Age=23  Sex=male  Name=Fred
> >
> > to this:
> >
> >    Fred     23
> >
> > How would I go about doing this, and would this be terribly
> > inefficient?  Is this just not the way Pig was meant to work?
> > I can think of a way to do this using a UDF maybe but is there a way
> > to do this using builtins?
> >
> > Thanks, Nerius
> >
>
>
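
The transformation asked about in the original question (tab-delimited `key=value` fields in any order, reduced to name and age) can be sketched as follows. This is plain Python rather than Pig, meant only to show the per-row logic that a custom loader or UDF would need to implement. The function names `parse_row` and `name_and_age` are hypothetical, and the sketch assumes every row contains both `Name` and `Age`, as the question states.

```python
def parse_row(line):
    """Split a tab-delimited row of key=value fields into a dict."""
    fields = {}
    for field in line.rstrip("\n").split("\t"):
        if not field:
            continue  # tolerate repeated tabs between fields
        key, _, value = field.partition("=")
        fields[key] = value
    return fields


def name_and_age(line):
    """Return (name, age) regardless of field order or extra fields."""
    fields = parse_row(line)
    return fields["Name"], fields["Age"]


# The two row layouts from the question.
samples = [
    "Name=Fred\tAge=23",
    "Age=23\tSex=male\tName=Fred",
]

results = [name_and_age(s) for s in samples]
for name, age in results:
    print(f"{name}\t{age}")
```

Looking fields up by key instead of by position is what keeps the same statements working when columns are added or reordered. A positional substring only works while the column order and value widths stay fixed.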