Pig, mail # user - multiple folder loading or passing comma on parameter with Amazon Pig


Re: multiple folder loading or passing comma on parameter with Amazon Pig
Dexin Wang 2011-08-18, 19:28
I will.

There is also a "bug" in the Pig documentation here:

http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html

where it says

   In this example the command is executed and its stdout is used as the
parameter value.

  %declare CMD 'generate_date';

it should really be `generate_date` with backticks, not single quotes.
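The difference matters because backticks execute the command, while single quotes just pass the literal string. A plain-shell analogy (not Pig itself; generate_date below is a stand-in for the hypothetical command named in the docs example):

```shell
# Plain-shell analogy of Pig's %declare semantics (not Pig itself).
# generate_date is a stand-in for the command in the docs example.
generate_date() { echo 2011-08-18; }

literal='generate_date'      # single quotes: just the literal string
executed=`generate_date`     # backticks: the command runs, stdout is captured

echo "$literal"   # generate_date
echo "$executed"  # 2011-08-18
```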

On Wed, Aug 17, 2011 at 6:18 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:

> Nice job figuring out a fix!
> You should seriously file a bug with AMR for that. That's kind of
> ridiculous.
>
> D
>
> On Wed, Aug 17, 2011 at 6:03 PM, Dexin Wang <[EMAIL PROTECTED]> wrote:
>
> > I solved my own problem and just want to share with whoever might
> encounter
> > the same issue.
> >
> > I pass a colon-separated list, then convert it to a comma-separated list
> > inside the Pig script using the %declare command.
> >
> > Submit the Pig job like this:
> >
> >     -p SOURCE_DIRS="2011-08:2011-07:2011-06"
> >
> > and in Pig script
> >
> >     %declare SOURCE_DIRS_CONVERTED `echo $SOURCE_DIRS | tr ':' ','`;
> >     raw = LOAD '/root_dir/{$SOURCE_DIRS_CONVERTED}' ...
> >
> >
> > On Wed, Aug 17, 2011 at 4:21 PM, Dexin Wang <[EMAIL PROTECTED]> wrote:
> >
> > > Hi,
> > >
> > > I'm running Pig jobs using Amazon's Pig support, where you submit jobs
> > > with comma-separated arguments like this:
> > >
> > >      elastic-mapreduce --pig-script --args myscript.pig --args
> > > -p,PARAM1=value1,-p,PARAM2=value2,-p,PARAM3=value3
> > >
> > > In my script, I need to pass multiple directories for the pig script to
> > > load like this:
> > >
> > >      raw = LOAD '/root_dir/{$SOURCE_DIRS}'
> > >
> > > and SOURCE_DIRS is computed. For example, it can be
> > > "2011-08,2011-07,2011-06", meaning my Pig script needs to load data for
> > > the past 3 months. This works fine when I run my job in local or direct
> > > Hadoop mode. But with Amazon Pig, I have to do something like this:
> > >
> > >      elastic-mapreduce --pig-script --args myscript.pig
> > > -p,SOURCE_DIRS="2011-08,2011-07,2011-06"
> > >
> > > but EMR will just replace the commas with spaces, which breaks the
> > > parameter-passing syntax. I've tried adding backslashes before the
> > > commas, but I simply end up with a backslash and a space in between.
> > >
> > > So question becomes:
> > >
> > > 1. can I do something differently than what I'm doing to pass multiple
> > > folders to the Pig script (without commas), or
> > > 2. does anyone know how to properly pass commas to elastic-mapreduce?
> > >
> > > Thanks!
> > >
> > > Dexin
> > >
> >
>
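For reference, the colon-to-comma conversion in the workaround above can be checked in plain shell; the SOURCE_DIRS value here mirrors what is passed with -p:

```shell
# Simulate what the `echo $SOURCE_DIRS | tr ':' ','` backtick command
# inside the Pig script produces for the value passed via -p.
SOURCE_DIRS="2011-08:2011-07:2011-06"
SOURCE_DIRS_CONVERTED=$(echo "$SOURCE_DIRS" | tr ':' ',')
echo "$SOURCE_DIRS_CONVERTED"   # 2011-08,2011-07,2011-06
```

Since EMR only mangles commas, any separator character that survives the --args splitting (here, the colon) works the same way.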