Pig >> mail # user >> Design issue, need feedback


Re: Design issue, need feedback
Another possible solution: use JSON for your storage and load it via
JsonLoader:

https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/load/JsonLoader.java

Then you could project the fields you'd like to use out of the loaded maps
via Pig's map dereference operator:

http://pig.apache.org/docs/r0.9.2/basic.html#deref
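
As a rough illustration of that idea (plain Python rather than Pig, and not the elephant-bird implementation): if each record is stored as one JSON object per line, every line becomes a map, and the fields you want are picked out by key, much as a map dereference would do in Pig. The sample records and the function name project_name_age are made up for this sketch.

```python
import json


def project_name_age(line):
    """Parse one JSON record and keep only Name and Age, in that order."""
    record = json.loads(line)  # the whole record becomes a dict (a "map")
    return record["Name"], int(record["Age"])


# Hypothetical sample data: field order and extra fields vary between lines.
lines = [
    '{"Name": "Fred", "Age": 23}',
    '{"Age": 25, "Sex": "male", "Name": "Adam"}',
]

rows = [project_name_age(line) for line in lines]
for name, age in rows:
    print(f"{name}\t{age}")
```

Because fields are looked up by key, adding or reordering fields in the stored records does not affect the projection.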

Andy
On Tue, May 22, 2012 at 4:43 AM, Ruslan Al-fakikh <
[EMAIL PROTECTED]> wrote:

> Hey Nerius,
>
> As for the number of columns changing: yes, Avro or Thrift can handle that.
> As for transforming a field value from 'Age=23' to '23',
> this is what Pig can do for you.
> Try something like this (I haven't tested it, there can be typos):
> b = FOREACH a GENERATE SUBSTRING($1, 4, (int)SIZE($1)) AS Age;
> Or maybe some other builtins can do it, like the regex ones.
>
> Ruslan Al-Fakikh
>
> -----Original Message-----
> From: Jonathan Coveney [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, May 22, 2012 3:47 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Design issue, need feedback
>
> There are a couple of ways that you can do this. One is that you could
> make a special loader that converts your format to a map of (key,value)
> pairs, and then you can project however you want.
>
> Another (better, if at all possible) way would be to use something like
> Avro or Thrift that allows you to update your schemas in backwards
> compatible ways. I highly recommend leveraging one of these projects
> instead of rolling your own janky file format. Starting from nothing Avro
> would probably be easier, though here at Twitter we have standardized on
> Thrift. Pig can work with both.
>
> 2012/5/21 Nerius Landys <[EMAIL PROTECTED]>
>
> > I'm trying to do something with Pig that I believe Pig wasn't really
> > designed/intended to handle.
> > Normally, the way I'd do things with Pig is by feeding it data like so:
> >
> >    Fred     23
> >    Adam   25
> >    Mary     21
> >
> > ...where the columns are name and age.
> >
> > Now, a requirement came up to have the Pig script be more flexible
> > when the format of the data changes.  In fact with the system I'm
> > working in, the format of the data will change quite often.  Let me
> > explain.  So now in one scenario the data incoming would be this,
> > literally:
> >
> >    Name=Fred     Age=23
> >    Name=Adam   Age=25
> >    Name=Mary     Age=21
> >
> > ...and if the format of the data changes somewhat it may appear like
> > this for example:
> >
> >    Age=23    Sex=male      Name=Fred
> >    Age=25    Sex=male      Name=Adam
> >    Age=21    Sex=female    Name=Mary
> >
> > <All data is tab delimited of course.> However in all cases the only
> > data I'm interested in is name and age, in that order.  And in all
> > cases these two pieces of data are guaranteed to appear.
> >
> > So my question is, how easy and efficient would it be to write
> > something in the Pig language that transforms the key/value rows into
> > the format in my first example with just name and age columns?
> >
> > That is, I want to transform, using a fixed series of Pig statements,
> > this type of row:
> >
> >    Name=Fred     Age=23
> >
> > to this:
> >
> >    Fred     23
> >
> > And I want the same series of Pig statements to transform this type of
> > row:
> >
> >    Age=23  Sex=male  Name=Fred
> >
> > to this:
> >
> >    Fred     23
> >
> > How would I go about doing this, and would this be terribly
> > inefficient?  Is this just not the way Pig was meant to work?
> > I can think of a way to do this using a UDF maybe but is there a way
> > to do this using builtins?
> >
> > Thanks, Nerius
> >
>
>
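
For reference, the transformation asked about in the original question can be sketched outside Pig. The following is plain Python, not Pig Latin and not a tested UDF. It only shows the logic that a custom loader or UDF, as suggested in the thread, would need: split the row on tabs, split each field on the first '=', build a map, then project name and age. The function names parse_row and name_and_age are invented for this sketch.

```python
def parse_row(line):
    """Turn a tab-delimited row of key=value fields into a dict."""
    fields = {}
    for field in line.rstrip("\n").split("\t"):
        key, sep, value = field.partition("=")  # split on the first '=' only
        if sep:  # ignore fields that contain no '='
            fields[key.strip()] = value.strip()
    return fields


def name_and_age(line):
    """Project name and age, in that order, regardless of column order."""
    fields = parse_row(line)
    return fields["Name"], int(fields["Age"])


# The two row shapes from the original question.
sample_rows = ["Name=Fred\tAge=23", "Age=23\tSex=male\tName=Fred"]
results = [name_and_age(row) for row in sample_rows]
for name, age in results:
    print(f"{name}\t{age}")
```

Each row is handled independently in a single pass, so the same code covers both layouts without knowing the column order in advance.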