Re: Difference between HDFS_BYTES_READ and the actual size of input files
Thanks, Jeff, for your suggestions.
I did run another test to study the extra copies you suggested.
First, I forgot to mention that in the previous test there were 256 files for
the input data.
I ran another test with one change, the number of files, from 256 to 1024,
which means in the latter case there is one input split per file (given that
the block size is 128MB). It turned out that there is no difference, or only
a very slight difference, in HDFS_BYTES_READ.
So I think this confirms your suggestion in some way: there is probably a lot
of "copied twice" going on.
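For reference, here is a quick sketch of the split arithmetic behind the two tests. It assumes FileInputFormat's usual behavior of roughly one input split per HDFS block (ceiling division); the file counts and sizes are the ones from this thread:

```python
# Rough split arithmetic for the two tests in this thread.
# Assumption: FileInputFormat produces ~one split per block, and a record
# can only be read twice where it straddles a split boundary.

GB = 1024 ** 3
MB = 1024 ** 2

TOTAL = 128 * GB      # total input size (same for both tests)
BLOCK = 128 * MB      # HDFS block size

def splits_per_file(file_bytes, block_bytes):
    """Approximate number of input splits for one file (ceiling division)."""
    return -(-file_bytes // block_bytes)

for n_files in (256, 1024):
    file_size = TOTAL // n_files
    splits = splits_per_file(file_size, BLOCK)
    # Boundaries inside a file are where a record could straddle a split.
    boundaries = (splits - 1) * n_files
    print(f"{n_files} files: {splits} split(s) per file, "
          f"{boundaries} internal split boundaries in total")
```

With 1024 files, the file size equals the block size, so each file is a single split and there are no internal boundaries for a record to straddle, which is consistent with HDFS_BYTES_READ matching the input size in the second test.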
On Wed, Mar 6, 2013 at 2:12 PM, Jeffrey Buell <[EMAIL PROTECTED]> wrote:
> Probably because records are split across blocks, so some of the data has
> to be read twice. Assuming you have a 64 MB block size and 128 GB of data,
> I'd estimate the overhead at 1 GB for 1 MB record size, and 32 GB for 32 MB
> record size. Your overhead is about 75% of that; maybe my arithmetic is
> off, or there is some intelligence in HDFS that reduces how often records
> are split across blocks (say, if a big record needs to be written but there
> is little space left in the current block; does anybody know?). Test this
> theory
> by increasing the HDFS block size to 128 MB or 256 MB, and then re-create
> or copy dataset2. Overhead should be 1/2 or 1/4 of what it is now.
> *From: *"Jeff LI" <[EMAIL PROTECTED]>
> *To: *[EMAIL PROTECTED]
> *Sent: *Wednesday, March 6, 2013 10:21:09 AM
> *Subject: *Difference between HDFS_BYTES_READ and the actual size of
> input files
> Dear Hadoop Users,
> I recently noticed there is a difference between the File System Counter
> "HDFS_BYTES_READ" and the actual size of input files in map-reduce jobs.
> And the difference seems to increase with the size of each key,value pair.
> For example, I'm running the same job on two datasets. The sizes of both
> datasets are the same, which is about 128 GB. And the keys are integers.
> The difference between these two datasets is the number of key,values
> pairs and thus the size of each value: dataset1 has 2^17 key,value pairs
> and 1MB for each value; dataset2 has 2^12 key,value pairs and 32MB for each
> value.
> The HDFS_BYTES_READ counter is 128.77GB for dataset1 and 152.00GB for
> dataset2.
> I have also tested other sizes for each value. There doesn't seem to be any
> difference when the size of each value is small (<=128KB), but there is a
> noticeable difference as the size increases.
> Could you give me some idea of why this is happening?
> By the way, I'm running Hadoop 0.20.2-cdh3u5.
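Jeff's back-of-envelope estimate above can be written out explicitly. A minimal sketch, using his stated assumptions (64 MB blocks, and on average about half a record re-read at each block boundary a record straddles); the observed numbers in the comment are the counters quoted in this thread:

```python
# Back-of-envelope estimate of extra HDFS reads caused by records
# straddling block boundaries, following the reasoning quoted above.
# Assumption (not measured): ~half a record is re-read per block boundary.

GB = 1024 ** 3
MB = 1024 ** 2

def estimated_overhead(total_bytes, block_bytes, record_bytes):
    boundaries = total_bytes // block_bytes   # ~one boundary per block
    return boundaries * record_bytes // 2     # ~half a record re-read each

total = 128 * GB
for record in (1 * MB, 32 * MB):
    est = estimated_overhead(total, 64 * MB, record)
    print(f"{record // MB} MB records: ~{est / GB:.0f} GB estimated overhead")

# Observed in this thread: 128.77 - 128 = 0.77 GB extra for 1 MB values,
# and 152.00 - 128 = 24 GB extra for 32 MB values (about 75% of the
# 32 GB estimate, matching Jeff's "about 75% of that").
```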