|
|
-
store less files
Jameson Li 2011-04-01, 07:57
Hi,
When I run the Pig code below:
    a = load '/logs/2011-03-31';
    b = filter a by $1=='a' and $2=='b';
    store b into '20110331-ab';
it runs an M/R job with thousands of maps and then creates an output directory that contains the same number of files.
How can I store fewer files when I use Pig to store data in HDFS?
Thanks,
Jameson Li.
-
Re: store less files
Jameson Lopp 2011-04-01, 17:51
I can't think of a simple way to accomplish that without reducing the parallelism of your M/R jobs, which of course would affect the performance of your script.
Things I'd take into account:
* how much data are you reading / writing with this pig script?
* do you really need thousands of mappers / how adversely would your M/R performance be affected by reducing parallelism?
* why do you need to reduce the number of files stored to HDFS?
--
Jameson Lopp
Software Engineer
Bronto Software, Inc.
-
Re: store less files
Xiaomeng Wan 2011-04-01, 18:22
Hi Jameson,
Would you mind adding something like this:
    c = order b by $0 parallel n;
    store c into '20110331-ab';
You can order on anything. It will add a reduce phase and give you fewer files.
Regards,
Shawn
-
Re: store less files
Dmitriy Ryaboy 2011-04-01, 18:53
Don't order, that's expensive.
Just group by rand(), specify parallelism on the group by, and store the result of "foreach grouped generate FLATTEN(name_of_original_relation);"
-
Re: store less files
Jameson Li 2011-04-02, 02:26
If I have many TB of input and a configured block size of 128 MB, the job will generate thousands of mappers and therefore thousands of output files.
Too many files increase the load on the NameNode and the I/O load on the cluster, so I need to reduce the number of files stored to HDFS.
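The arithmetic behind this message can be sketched roughly as follows. This is only an estimate, assuming one input split (and so one map task) per 128 MB block and one output file per map task in a map-only job; the 1 TiB input size is a hypothetical figure, and real split counts depend on file boundaries and the input format.

```python
BLOCK_SIZE = 128 * 1024**2  # 128 MiB block size, as stated in the message


def estimate_map_tasks(input_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Estimate map tasks assuming exactly one input split per block."""
    # Integer ceiling division: a partly filled last block still needs a task.
    return (input_bytes + block_size - 1) // block_size


# Hypothetical input size of 1 TiB; a map-only job writes one part file per map.
one_tib = 1024**4
print(estimate_map_tasks(one_tib))  # 8192
```

Under these assumptions, every additional TiB of input adds roughly eight thousand files to the output directory unless a reduce phase is introduced.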
-
Re: store less files
Jameson Li 2011-04-02, 02:32
Thanks, all of you.
I have tested that. It works well. Below is the Pig code:
    a = load '/logs/2011-03-31';
    b = filter a by $1=='a' and $2=='b';
    c = group b by RANDOM() parallel 30; /* the parallel value sets the number of reducers, and so the number of output files */
    d = foreach c generate flatten(b);
    store d into 'youroutputdir';
But I am still concerned that 'group by RANDOM()' adds an extra step. Is there no way to directly store the number of files that I want?
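The effect of the script above can be modeled with a small Python sketch. It is a simplified illustration, not Pig's or Hadoop's actual partitioner: `partition_by_random_key` is a hypothetical helper, and mapping the key with `int(key * num_reducers)` is an assumption made only to keep the model short. The point is that the number of output files is bounded by the number of reducers rather than by the number of maps.

```python
import random
from collections import defaultdict


def partition_by_random_key(records, num_reducers, seed=0):
    """Assign each record to a reducer via a random key; return {reducer: [records]}."""
    rng = random.Random(seed)  # seeded only so this sketch is repeatable
    partitions = defaultdict(list)
    for record in records:
        key = rng.random()                 # stands in for Pig's RANDOM(), a random double
        reducer = int(key * num_reducers)  # simplified stand-in for the partitioner
        partitions[reducer].append(record)
    return dict(partitions)


# 5000 filtered records spread over 30 reducers: at most 30 non-empty part files.
records = [f"row-{i}" for i in range(5000)]
parts = partition_by_random_key(records, num_reducers=30)
print(len(parts), sum(len(v) for v in parts.values()))
```

Every record still has to travel through the shuffle to reach its reducer, which is the extra step the message asks about.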
-
Re: store less files
Mridul Muralidharan 2011-04-02, 13:57
Using rand() as the group key is, in general, a pretty bad idea in case of failures.
- Mridul
-
RE: store less files
Dmitriy Ryaboy 2011-04-02, 20:25
That's a good call, thanks Mridul. Something reproducible like taking a hash of a tuple field is much better.
As for the concern about having to move all the data: until HDFS allows multiple writers to a single file (not on the roadmap afaik), there isn't a good way to have multiple mappers write a single file. In fact, even if multiple writers were possible, you'd get in trouble if mappers failed...
|
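One way to read the point about failures is that a re-executed map task draws fresh random keys, so the same record may be routed to a different reducer on the retry, while a key derived from the record's content lands in the same place every time. The Python sketch below illustrates only that difference. The helper names and the use of CRC32 are assumptions for illustration; the thread does not say which hash function to use, and this is not Pig code.

```python
import random
import zlib


def random_buckets(records, num_reducers, rng):
    """Non-reproducible: each bucket depends on the RNG state, not on the record."""
    return [rng.randrange(num_reducers) for _ in records]


def hashed_buckets(records, num_reducers):
    """Reproducible: each bucket is a pure function of the record's content."""
    # CRC32 is stable across processes, unlike Python's salted str hash().
    return [zlib.crc32(r.encode("utf-8")) % num_reducers for r in records]


records = [f"row-{i}" for i in range(1000)]
n = 30

# Two attempts of the same task; different seeds model a retry after a failure.
first_random = random_buckets(records, n, random.Random(1))
retry_random = random_buckets(records, n, random.Random(2))
print(first_random == retry_random)   # False

first_hashed = hashed_buckets(records, n)
retry_hashed = hashed_buckets(records, n)
print(first_hashed == retry_hashed)   # True
```

A content-based key can be skewed if the chosen field has few distinct values, so the field to hash should be one with many distinct values.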