Pig, mail # user - store less files


Re: store less files
Jameson Li 2011-04-02, 02:32
Thanks to all of you.

I have tested that, and it works well.
Below is the Pig code:
    a = load '/logs/2011-03-31';
    b = filter a by $1=='a' and $2=='b';
    -- The parallel value sets the number of reducers, and therefore the
    -- number of output files; change it as needed.
    c = group b by RANDOM() parallel 30;
    d = foreach c generate flatten(b);
    store d into 'youroutputdir';
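
If the file count needs to change from run to run, the same script could take it as a parameter. This is only a sketch of that idea: NUM_FILES is a hypothetical parameter name that is not part of the script above, and the default of 30 simply mirrors the value used there.

```pig
-- Sketch only: NUM_FILES is a hypothetical parameter name (an assumption).
%default NUM_FILES 30

a = load '/logs/2011-03-31';
b = filter a by $1 == 'a' and $2 == 'b';
-- Each reducer writes one part file, so NUM_FILES reducers give NUM_FILES files.
c = group b by RANDOM() parallel $NUM_FILES;
-- Flatten the grouped bags to get the original records back.
d = foreach c generate flatten(b);
store d into 'youroutputdir';
```

Assuming the script is saved as script.pig (a hypothetical file name), it could be run with something like `pig -param NUM_FILES=10 script.pig`.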

But I am still concerned that 'group by RANDOM()' adds extra steps.
Is there no way to directly specify the number of output files that I want?
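
One possibility that may avoid the extra reduce step, assuming Pig 0.8 or later, is to let Pig combine small input splits so that the map-only job starts fewer maps and therefore writes fewer files. The sketch below relies on the pig.splitCombination and pig.maxCombinedSplitSize properties; the 256 MB value is an arbitrary illustration. The number of files then depends on the input size rather than on an exact count you choose.

```pig
-- Assumption: Pig 0.8+ where split combination is supported.
set pig.splitCombination true;
-- Illustrative value only: combine input splits up to about 256 MB per map.
set pig.maxCombinedSplitSize 268435456;

a = load '/logs/2011-03-31';
b = filter a by $1 == 'a' and $2 == 'b';
-- Still a map-only job: one output file per map, but with fewer maps.
store b into '20110331-ab';
```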
2011/4/2 Dmitriy Ryaboy <[EMAIL PROTECTED]>

> Don't order, that's expensive.
> Just group by rand(), specify parallelism on the group by, and store the
> result of "foreach grouped generate FLATTEN(name_of_original_relation);"
>
> On Fri, Apr 1, 2011 at 11:22 AM, Xiaomeng Wan <[EMAIL PROTECTED]> wrote:
>
> > Hi Jameson,
> >
> > Would you mind adding something like this:
> >
> > c = order b by $0 parallel n;
> > store c into '20110331-ab';
> >
> > You can order on anything. It will add a reduce phase and give you fewer files.
> >
> > Regards,
> > Shawn
> > On Fri, Apr 1, 2011 at 1:57 AM, Jameson Li <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > When I run the Pig code below:
> > > a = load '/logs/2011-03-31';
> > > b = filter a by $1=='a' and $2=='b';
> > > store b into '20110331-ab';
> > >
> > > It runs an M/R job with thousands of maps, and then creates an output
> > > directory containing the same number of files.
> > >
> > > How can I store fewer files when I use Pig to store files in HDFS?
> > >
> > >
> > > Thanks,
> > > Jameson Li.
> > >
> >
>