If I have many TB of input and the block size is configured as "128M",
the job will generate thousands of mappers and thousands of output files.
Because too many files increase the load on the Namenode, and also
increase the I/O load on the cluster, I need to reduce the number of
files stored to HDFS.
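One option worth trying (assuming Pig 0.8 or later, where input split combination is available) is to combine small splits so fewer mappers run, or to force a reduce phase whose parallelism caps the output file count. A rough sketch; the 1 GB split size and the parallelism of 10 are arbitrary illustrative values:

-- Option 1: combine small input splits so fewer mappers (and thus
-- fewer map-only output files) are created.
set pig.splitCombination true;
set pig.maxCombinedSplitSize 1073741824;  -- combine splits up to ~1 GB

a = load '/logs/2011-03-31';
b = filter a by $1=='a' and $2=='b';
store b into '20110331-ab';

-- Option 2: force a reduce phase; the number of output files then
-- equals the reducer count. ORDER adds a sort cost, but caps the
-- output at 10 files here.
sorted = order b by $0 parallel 10;
store sorted into '20110331-ab-sorted';

Option 1 keeps the job map-only, so it should be cheaper than Option 2, which pays for a full sort just to funnel the data through a fixed number of reducers.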
2011/4/2 Jameson Lopp <[EMAIL PROTECTED]>
> I can't think of a simple way to accomplish that without reducing the
> parallelism of your M/R jobs, which of course would affect the performance
> of your script.
> Things I'd take into account:
> * how much data are you reading / writing with this pig script?
> * do you really need thousands of mappers / how adversely would your
> M/R performance be affected by reducing parallelism?
> * why do you need to reduce the number of files stored to HDFS?
> Jameson Lopp
> Software Engineer
> Bronto Software, Inc.
> On 04/01/2011 03:57 AM, Jameson Li wrote:
>> When I run the below pig codes:
>> a = load '/logs/2011-03-31';
>> b = filter a by $1=='a' and $2=='b';
>> store b into '20110331-ab';
>> It runs an M/R job with thousands of maps, and then creates an output
>> directory containing the same number of files.
>> I am wondering how I can store fewer files when I use pig to store
>> files in HDFS.
>> Jameson Li.