Yeah, Kai is right.
You can read more details at:
and right from the horse's mouth (Pgs 70-75):
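If you want to verify the placement yourself, here is a rough sketch (untested; the class name and the assumption that args[0] is an existing HDFS path are mine, but getFileBlockLocations() is the standard FileSystem API):

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Prints, for each block of the given HDFS file, the DataNodes
    // holding a replica of it. With replication factor 1 and a put
    // issued on a DataNode, every block should list that same host.
    public class BlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path(args[0]));
            // One BlockLocation per block of the file.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
            }
            fs.close();
        }
    }

You can get the same information from the command line with
hadoop fsck <path> -files -blocks -locations.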
On Mon, Jun 10, 2013 at 9:47 AM, Kai Voigt <[EMAIL PROTECTED]> wrote:
> On 10.06.2013 at 15:36, Razen Al Harbi <[EMAIL PROTECTED]> wrote:
> > I have deployed Hadoop on a cluster of 20 machines, with the
> > replication factor set to one. When I put a file (larger than the
> > HDFS block size) into HDFS, all of its blocks are stored on the
> > machine where the hadoop put command is invoked.
> > For higher replication factors I see the same behavior, but the
> > replica blocks are stored randomly across the other machines.
> > Is this normal behavior? If not, what would be the cause?
> Yes, this is normal behavior. When an HDFS client happens to run on a
> host that is also a DataNode (always the case when a reducer writes
> its output), the first copy of a block is stored on that very node.
> This optimizes latency: writing to a local disk is faster than
> writing across the network.
> The second copy of the block gets stored on a random host in another
> rack (if your cluster is configured to be rack-aware), to increase
> the distribution of the data.
> The third copy of the block gets stored on a random host in that
> other rack.
> So your observations are correct.
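(Note that the rack-aware placement Kai describes only kicks in if you
configure a topology script, e.g. by pointing topology.script.file.name
in core-site.xml at a script that maps each DataNode address to a rack
name such as /rack1. Without one, HDFS treats every node as being in
the single rack /default-rack, and the remote replicas simply land on
random other nodes, which matches what you saw.)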
> Kai Voigt
> [EMAIL PROTECTED]