Re: Understanding Hadoop - help needed
Harsh J 2013-01-23, 14:39
Building on Bharath's responses, my answers to each of your questions are
inline:
On Wed, Jan 23, 2013 at 2:54 PM, Dibyendu Karmakar
> I am doing some performance testing in HADOOP. But while testing, I faced
> a situation. I need your help.
> My HADOOP cluster :
> 6 Datanodes, 1 Namenode, 2 Clients.
> Replication factor = 3
> 2 clients write a file (put operation) whose size is 2 x block size.
> The DFS.DATA.DIR capacity on each Datanode is equal and is the same as the
> block size. That means each Datanode stores a single block.
This isn't strictly correct (block placement involves some randomness and
depends on the client's location), but let us assume a balanced state for
the sake of the question.
> Now, if the 2 clients simultaneously read the file (get operation),
> will they read the 2 blocks from different Datanodes?
> Or will they read from the same Datanodes?
It depends on where the client is located. If it is one of the DNs and
holds a replica, it reads locally. If it is elsewhere, each client may read
from a different DN or even the same DN (the NN returns the replicas in
random order). Ideally, though, the DN closest to the client is picked, at
least rack-wise, if the NN is aware of the topology.
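
The ordering described above can be sketched as a toy model. This is not
Hadoop's code: the function name order_replicas, the (host, rack) tuples and
the host names are all hypothetical, and the only assumption is the one
stated in the text (local replica first, then same-rack replicas, then the
rest in random order).

```python
import random

def order_replicas(client_host, client_rack, replicas, rng=random):
    """Order the replica locations of one block as a reader might see them.

    replicas: list of (host, rack) tuples holding the block.
    Toy model only: local node first, then same-rack nodes, then all
    others, with the last two groups shuffled.
    """
    local = [r for r in replicas if r[0] == client_host]
    same_rack = [r for r in replicas
                 if r[0] != client_host and r[1] == client_rack]
    remote = [r for r in replicas
              if r[0] != client_host and r[1] != client_rack]
    rng.shuffle(same_rack)
    rng.shuffle(remote)
    return local + same_rack + remote

# Hypothetical layout: three replicas of one block on two racks.
replicas = [("dn1", "rackA"), ("dn2", "rackB"), ("dn3", "rackB")]

# A client running on dn2 gets dn2 first, then its rack neighbour dn3.
print(order_replicas("dn2", "rackB", replicas))

# A client outside the cluster gets the replicas in random order.
print(order_replicas("client1", "rackC", replicas))
```

In this model two remote clients can end up with the same first DN or with
different ones, which matches the "unique DN or even the same DN" case above.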
> Does Namenode know which Datanode is busy and which one is idle?
The NN does health checks at write time (things like space, load and recent
availability). At read time, the client does more of a fail-over: if the
DNs are all very busy, it tries them one at a time, in the order provided,
until one accepts its request.
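
A minimal sketch of that read-side fail-over, assuming a hypothetical
try_read callable that either returns the block data or raises when a DN
refuses the request. The names read_block, try_read and DataNodeBusy are
made up for illustration and are not part of any Hadoop API.

```python
class DataNodeBusy(Exception):
    """Raised by the hypothetical try_read when a DN refuses a request."""

def read_block(ordered_dns, try_read):
    """Try each DN in the given order; return the first successful read.

    ordered_dns: DN names in the order provided for this block.
    try_read: hypothetical callable(dn) -> bytes, raising DataNodeBusy
              when the DN cannot serve the request.
    """
    tried = []
    for dn in ordered_dns:
        try:
            return dn, try_read(dn)
        except DataNodeBusy:
            tried.append(dn)  # remember the refusal and move to the next DN
    raise IOError("no DN could serve the block; tried %s" % tried)
```

The point of the sketch is that the choice of a less busy DN happens on the
client, by moving down the list, rather than by the NN tracking read load.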
> What I am trying to find is that...
> Is it possible to decrease the read time by increasing replication factor?
Yes, more replicas generally mean more DNs available to serve a read, but
at the same time it slows down writes, as there is more synchronous waiting
to take care of.
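
As a rough back-of-the-envelope illustration of the read side, assume two
remote clients each pick a replica of the same block independently and
uniformly at random (a simplifying assumption, not a measured behaviour).
The chance that they land on the same DN is then 1 / replication. The
function name below is hypothetical.

```python
from fractions import Fraction

def same_dn_probability(replication):
    """Chance that two independent readers pick the same replica of one
    block, assuming each picks uniformly among `replication` replicas."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return Fraction(1, replication)

for r in (1, 2, 3, 6):
    # Each extra replica lowers the chance of both reads hitting one DN,
    # but also adds one more DN that every write has to wait for.
    print(r, same_dn_probability(r))
```

Under this assumption, a replication factor of 3 gives a 1/3 chance of both
clients reading a given block from the same DN, and 6 gives 1/6.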
> I have attached an image to better understand my question. Kindly take a
> look. Thank you. And if possible please give references.
"Will access be distributed [for a series of block of the same file]?" -
Yes, for remote client reads. Access order is randomized for this kind of
client, possibly leading to a different pattern each time.