The RandomTextWriter execution ran 100 maps in total, and
hence WordCount has 100 input files.
My dfs.block.size is 128MB. I have not changed
mapred.max.split.size, and I could not find it in my Job.xml file.
Hence, referring to the formula *max(minsplitsize, min(maxsplitsize, blocksize))*,
I am assuming the mapred.max.split.size to be 128MB.
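
To make the formula concrete, here is a minimal sketch in Python. The function name compute_split_size and the minimum split size of 1 byte are assumptions made for illustration; only the 128MB block size and the assumed 128MB maximum come from the numbers above.

```python
MB = 1024 * 1024

def compute_split_size(min_split_size, max_split_size, block_size):
    """Apply max(minsplitsize, min(maxsplitsize, blocksize)) from the mail."""
    return max(min_split_size, min(max_split_size, block_size))

block_size = 128 * MB      # dfs.block.size quoted above
max_split_size = 128 * MB  # assumed value, as stated above
min_split_size = 1         # hypothetical minimum, for illustration only

split_size = compute_split_size(min_split_size, max_split_size, block_size)
print(split_size // MB)  # split size in MB
```

With these inputs the split size equals the block size. If the maximum were set lower than the block size (say 64MB), the same formula would return the smaller value.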
If I calculate the blocks per file [bytes per file / block size (128 MB)],
I get 8.21 for every file. Summing these over all files gives 821.22 (the
same as my previous calculation).
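
As a rough check of this arithmetic, the sketch below compares the fractional sum with a per-file count in which any trailing partial block is counted as one whole split. The file sizes are hypothetical (100 equal files of about 8.21 blocks each), the helper splits_for_file is an illustrative name, and the sketch ignores any tolerance the real input format may apply to a small final remainder.

```python
MB = 1024 * 1024

def splits_for_file(file_bytes, split_size):
    """Count whole splits for one file, rounding a partial last chunk up."""
    if file_bytes == 0:
        return 0
    # Integer ceiling division avoids floating-point rounding issues.
    return (file_bytes + split_size - 1) // split_size

split_size = 128 * MB  # assumed equal to the block size

# Hypothetical input: 100 files, each about 8.21 blocks long.
file_sizes = [int(8.21 * split_size)] * 100

# Summing fractional blocks, as in the calculation above.
fractional_total = sum(size / split_size for size in file_sizes)

# Counting per file and rounding each file up to whole splits.
whole_total = sum(splits_for_file(size, split_size) for size in file_sizes)

print(round(fractional_total, 2))
print(whole_total)  # 900: each file contributes 9 splits
```

Under these assumptions the per-file count comes to 9 splits per file, which is higher than the fractional sum because a split cannot span two files.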
I have somehow managed to make a neat copy of the Job.xml in a Word doc. I
copied it from the browser, as I cannot recover it from HDFS. Please find it
in the attachment. You may refer to the parameters and configuration there. I
have also attached the console output for the bytes per file in the
On Fri, Aug 17, 2012 at 3:28 PM, Bejoy Ks <[EMAIL PROTECTED]> wrote:
> Hi Gaurav
> To add more clarity to my previous mail:
> If you are using the default TextInputFormat, there will be *at least* one
> task generated per file, even if the file size is less than
> the block size (assuming your split size equals the block size).
> So the right way to calculate the number of splits is per file, not on
> the whole input data size. Calculating the number of blocks per file and
> summing those values across all files gives the number of mappers.
> What is the value of mapred.max.split.size in your job? If it is less than
> the HDFS block size, there will be more than one split even for a single HDFS block.
> Bejoy KS