Pig >> mail # user >> store less files


Re: store less files
Thanks all of you.

I have tested that. It works well.
Below is the Pig code:
    a = load '/logs/2011-03-31';
    b = filter a by $1=='a' and $2=='b';
    -- The parallel value sets the number of reducers, and so the number of output files.
    c = group b by RANDOM() parallel 30;
    d = foreach c generate flatten(b);
    store d into 'youroutputdir';

But I still suspect that 'group by RANDOM()' adds an extra step.
Is there no way to directly set the number of output files that I want?
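
On the extra step: the filter alone is a map-only job, so every map writes its
own file, and the group adds a reduce phase only to merge them. A possible
alternative is to make each map read more input, so that fewer maps (and so
fewer files) are created. The following is an untested sketch that assumes
Pig 0.8 or later, where split combination is available. The 1 GB value and the
output directory are placeholders, and the setting bounds the bytes read per
map rather than fixing an exact file count:

    -- Assumption: Pig 0.8+; combine input splits up to about 1 GB per map.
    set pig.splitCombination true;
    set pig.maxCombinedSplitSize 1073741824;
    a = load '/logs/2011-03-31';
    b = filter a by $1=='a' and $2=='b';
    -- Still a map-only job: one output file per (now larger) map.
    store b into 'youroutputdir';
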
2011/4/2 Dmitriy Ryaboy <[EMAIL PROTECTED]>

> Don't order, that's expensive.
> Just group by RANDOM(), specify parallelism on the group by, and store the
> result of "foreach grouped generate FLATTEN(name_of_original_relation);"
>
> On Fri, Apr 1, 2011 at 11:22 AM, Xiaomeng Wan <[EMAIL PROTECTED]> wrote:
>
> > Hi Jameson,
> >
> > Would you mind adding something like this:
> >
> > c = order b by $0 parallel n;
> > store c into '20110331-ab';
> >
> > You can order on anything. It will add a reduce phase and give you fewer files.
> >
> > Regards,
> > Shawn
> > On Fri, Apr 1, 2011 at 1:57 AM, Jameson Li <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > When I run the Pig code below:
> > > a = load '/logs/2011-03-31';
> > > b = filter a by $1=='a' and $2=='b';
> > > store b into '20110331-ab';
> > >
> > > It runs an M/R job with thousands of maps, and then creates an output
> > > directory containing the same number of files.
> > >
> > > How can I store fewer files when I use Pig to store files in HDFS?
> > >
> > >
> > > Thanks,
> > > Jameson Li.
> > >
> >
>