Razen Al Harbi 2013-06-10, 13:36
Daryn Sharp 2013-06-10, 13:53
On 10.06.2013 at 15:36, Razen Al Harbi <[EMAIL PROTECTED]> wrote:
> I have deployed Hadoop on a cluster of 20 machines. I set the replication factor to one. When I put a file (larger than HDFS block size) into HDFS, all the blocks are stored on the machine where the Hadoop put command is invoked.
> For higher replication factor, I see the same behavior but the replicated blocks are stored randomly on all the other machines.
> Is this a normal behavior, if not what would be the cause?
Yes, this is normal behavior. When an HDFS client happens to run on a host that is also a DataNode (always the case when a reducer writes its output), the first copy of a block is stored on that very node. This optimizes write latency: writing to a local disk is faster than writing across the network.
The second copy of the block gets stored on a random host in another rack (if your cluster is configured to be rack-aware), to spread the data across failure domains.
The third copy of the block gets stored on a different random host in that same remote rack, i.e. the rack holding the second copy.
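The placement rules above can be sketched as a small simulation. This is a simplified model of the default policy, not Hadoop's actual BlockPlacementPolicyDefault implementation; the node and rack names are made up for illustration:

```python
import random

def place_replicas(writer, topology, replication=3):
    """Simplified model of HDFS default block placement.

    topology: dict mapping node name -> rack name
    writer:   the node where the client runs (must be in topology
              to get the local-write optimization)
    Returns the list of nodes chosen for the replicas.
    """
    nodes = list(topology)
    replicas = []

    # 1st replica: the local node, if the client runs on a DataNode;
    # otherwise a random node.
    first = writer if writer in topology else random.choice(nodes)
    replicas.append(first)

    # 2nd replica: a random node in a *different* rack.
    other_rack = [n for n in nodes if topology[n] != topology[first]]
    if len(replicas) < replication and other_rack:
        second = random.choice(other_rack)
        replicas.append(second)

        # 3rd replica: a different node in the *same* rack as the 2nd.
        same_rack = [n for n in nodes
                     if topology[n] == topology[second] and n != second]
        if len(replicas) < replication and same_rack:
            replicas.append(random.choice(same_rack))

    return replicas
```

With replication=1 only the local replica is ever written, which matches the behavior the original poster observed: every block of the file lands on the machine where the put command was invoked.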
So your observations are correct.
Shahab Yunus 2013-06-10, 13:57