Re: Difference between HDFS_BYTES_READ and the actual size of input files
Probably because records are split across blocks, so some of the data has to be read twice. With a 64 MB block size, 128 GB of data spans about 2048 blocks; if on average half a record is re-read at each block boundary, the overhead is roughly 2048 x 0.5 MB, or about 1 GB, for a 1 MB record size, and 2048 x 16 MB = 32 GB for a 32 MB record size. Your observed overhead (152 GB - 128 GB = 24 GB) is about 75% of that estimate. Maybe my arithmetic is off, or there is some intelligence in HDFS that reduces how often records straddle blocks (say, when a big record is about to be written but little space is left in the current block; anybody know?). You can test this theory by increasing the HDFS block size to 128 MB or 256 MB and then re-creating or copying dataset2; the overhead should drop to 1/2 or 1/4 of what it is now.
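For what it's worth, here is the back-of-the-envelope model behind those numbers as a small Python sketch. The assumption (mine, not taken from the HDFS source) is that every internal block boundary splits exactly one record and that, on average, half of that record ends up being read twice:

```python
# Rough model of re-read overhead when fixed-size records straddle
# HDFS block boundaries. Assumption: each internal block boundary
# splits one record, and on average half of it is read twice.

def estimated_reread_overhead(total_bytes, block_size, record_size):
    boundaries = total_bytes // block_size - 1  # internal boundaries only
    return boundaries * record_size // 2

GB = 1 << 30
MB = 1 << 20

print(f"{estimated_reread_overhead(128 * GB, 64 * MB, 1 * MB) / GB:.1f} GB")
print(f"{estimated_reread_overhead(128 * GB, 64 * MB, 32 * MB) / GB:.1f} GB")
```

Under this model the predicted overhead is about 1 GB for 1 MB records and about 32 GB for 32 MB records, which is where my estimate above comes from; the measured 24 GB falling below the 32 GB prediction is the part I can't fully explain.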
----- Original Message -----
From: "Jeff LI" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Wednesday, March 6, 2013 10:21:09 AM
Subject: Difference between HDFS_BYTES_READ and the actual size of input files
Dear Hadoop Users,
I recently noticed that there is a difference between the File System Counter "HDFS_BYTES_READ" and the actual size of the input files in map-reduce jobs, and the difference seems to grow as the size of each key,value pair increases.
For example, I'm running the same job on two datasets of the same size, about 128 GB each, with integer keys. The datasets differ only in the number of key,value pairs and thus the size of each value: dataset1 has 2^17 key,value pairs with 1 MB per value; dataset2 has 2^12 key,value pairs with 32 MB per value.
The HDFS_BYTES_READ counter is 128.77 GB for dataset1 and 152.00 GB for dataset2.
I have also tested other value sizes. There doesn't seem to be any difference when each value is small (<= 128 KB), but the difference becomes noticeable as the value size increases.
Could you give me some idea of why this is happening?
By the way, I'm running Hadoop 0.20.2-cdh3u5.