MapReduce >> mail # user >> Re: Understanding harpoon - help needed


Re: Understanding harpoon - help needed
Building on Bharath's responses, my answers to each of your questions are
inline:
On Wed, Jan 23, 2013 at 2:54 PM, Dibyendu Karmakar
<[EMAIL PROTECTED]> wrote:

> Hi,
> I am doing some performance testing in HADOOP. But while testing, I faced
> a situation. I need your help.
>
> My HADOOP cluster :
> 6 Datanodes, 1 Namenode, 2 Clients.
>
> Replication factor = 3
>
> 2 clients write a file(put operation) whose size is 2 x block size.
> DFS.DATA.DIR in each Datanodes is equal and is same as block size. That
> means each Datanodes stores a single block.
>

This isn't strictly correct (block placement involves some randomness and
also depends on the client's location), but let's assume a balanced state
for the sake of the question.

> Now, if 2 clients simultaneously reads the file( get operation),
> Will 2 clients read 2 blocks from different Datanodes ?
> Or they will read from the same datanodes?
>

It depends on where the client is located. If it runs on one of the DNs, the
read is local. If it runs elsewhere, each client may read from a different
DN or even from the same DN (the NN returns the replica locations in random
order). Ideally, though, the DN closest to the client is picked, at least
rack-wise, if the NN is aware of the topology.
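
To make the ordering idea concrete, here is a rough Python sketch. It is a
simplified model and not HDFS code: the three distance levels, the
(rack, host) tuples and the function names are all assumptions made up for
illustration.

```python
import random

# Hypothetical distance levels, loosely modelled on a rack/node topology.
LOCAL, SAME_RACK, OFF_RACK = 0, 1, 2

def distance(client, dn):
    """client and dn are (rack, host) tuples; a smaller value means closer."""
    if client == dn:
        return LOCAL
    if client[0] == dn[0]:
        return SAME_RACK
    return OFF_RACK

def order_replicas(client, replicas, rng=random):
    """Return replicas closest-first, in random order among equally distant ones."""
    shuffled = list(replicas)
    rng.shuffle(shuffled)  # randomize first ...
    # ... then a stable sort by distance keeps ties in their shuffled order
    return sorted(shuffled, key=lambda dn: distance(client, dn))

replicas = [("rack1", "dn1"), ("rack1", "dn2"), ("rack2", "dn4")]

# A client running on dn2 gets its own node first (a local read).
print(order_replicas(("rack1", "dn2"), replicas)[0])
# A client on an unrelated rack sees all replicas as equally far,
# so its first choice can differ from run to run.
print(order_replicas(("rack9", "client1"), replicas)[0])
```

In this model, two remote clients asking for the same block may or may not
end up on the same DN, which is the behaviour described above.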

> Does Namenode know which Datanode is busy and which one is idle?
>

The NN does health checks at write time (things like free space, load and
recent availability). At read time, the client does more of a fail-over: if
the DNs are all very busy, it tries them one at a time, in the order the NN
provided, until one accepts its request.
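
The read-side fail-over could be sketched roughly like this in Python. This
is only an illustration under my own assumptions: DataNodeBusy, read_block
and fake_read are hypothetical names, and the real client does considerably
more than this.

```python
class DataNodeBusy(Exception):
    """Raised by the hypothetical read function when a DN refuses the request."""

def read_block(block_id, ordered_dns, read_from):
    """Try each DN in the given order; return (dn, data) from the first that serves the block."""
    failures = []
    for dn in ordered_dns:
        try:
            return dn, read_from(dn, block_id)
        except DataNodeBusy as err:
            failures.append((dn, err))  # remember the failure, move on to the next replica
    raise IOError(f"could not read {block_id} from any of {len(failures)} replicas")

# Toy stand-in for the network call: dn1 and dn2 are overloaded, dn3 answers.
busy = {"dn1", "dn2"}

def fake_read(dn, block_id):
    if dn in busy:
        raise DataNodeBusy(dn)
    return f"<bytes of {block_id} from {dn}>"

print(read_block("blk_1", ["dn1", "dn2", "dn3"], fake_read))
```

Note that the client only learns a DN is busy by trying it; nothing in this
sketch asks the NN about DN load before reading.
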
> What I am trying to find is that...
> Is it possible to decrease the read time by increasing replication factor?
>

Yes, more replicas generally mean more DNs available to serve a read, but
at the same time they slow down writes, since there is more synchronous
waiting to take care of.

> I have attached an image to better understand my question. Kindly take a
> look. Thank you. And if possible please give references.
>
"Will access be distributed [for a series of block of the same file]?" -
Yes, for remote client reads. The access order is randomized for this kind
of client, possibly leading to a different pattern each time.
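
If you want a feel for how randomized access spreads the read load as the
replication factor grows, a toy simulation along these lines may help. It
is a sketch under simplifying assumptions (one block, every reader remote,
each reader picking a replica uniformly at random) and the function name is
my own; it does not measure a real cluster.

```python
import random
from collections import Counter

def busiest_dn_share(num_dns, replication, num_readers, rng):
    """Fraction of all reads of one block that land on its most-loaded DN."""
    holders = rng.sample(range(num_dns), replication)  # DNs holding a replica
    # Each remote reader picks one replica holder at random.
    hits = Counter(rng.choice(holders) for _ in range(num_readers))
    return max(hits.values()) / num_readers

rng = random.Random(42)  # fixed seed so runs are repeatable
for replication in (1, 3, 6):
    share = busiest_dn_share(num_dns=6, replication=replication,
                             num_readers=10_000, rng=rng)
    print(f"replication={replication}: busiest DN served {share:.0%} of reads")
```

With a single replica one DN serves every read; with more replicas the
busiest DN's share drops towards 1/replication in this model.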

--
Harsh J