HDFS user mailing list thread: Re: distributed cache

  Lin Ma 2012-12-26, 10:11
  Lin Ma 2012-12-26, 12:06
  Harsh J 2012-12-26, 12:19
  Lin Ma 2012-12-28, 10:02

Re: distributed cache
Lin Ma 2012-12-28, 10:02
I have figured out the second issue; I would appreciate it if anyone
could advise on the first one.

regards,
Lin

On Sat, Dec 22, 2012 at 9:24 PM, Lin Ma <[EMAIL PROTECTED]> wrote:

> Hi Kai,
>
> Smart answer! :-)
>
>    - Your assumption is that one distributed cache replica can serve
>    only one download session for a TaskTracker node at a time (which is
>    why you get the concurrency n/r). The question is: why can't one
>    replica serve multiple concurrent download sessions? For example, if
>    a TaskTracker takes elapsed time t to download a file from a specific
>    replica, couldn't two TaskTrackers download from that replica in
>    parallel in time t as well, or 1.5t, which would be faster than the
>    sequential download time 2t you mentioned before?
>    - "In total, r+n/r concurrent operations. If you optimize r depending
>    on n, SQRT(n) is the optimal replication level." -- how do you get
>    SQRT(n) when minimizing r+n/r? I'd appreciate a pointer to more
>    details (see the short derivation just below).
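
Minimizing Kai's total cost f(r) = r + n/r over r is a one-step calculus
exercise; a sketch:

    f(r) = r + n/r
    f'(r) = 1 - n/r^2 = 0  =>  r^2 = n  =>  r = sqrt(n)
    f''(r) = 2n/r^3 > 0, so r = sqrt(n) is indeed the minimum.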
>
> regards,
> Lin
>
>
> On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> It's simple math. Assume you have n TaskTrackers in your cluster that
>> will need to access the files in the distributed cache, and that r is
>> the replication level of those files.
>>
>> Copying the files into HDFS requires r copy operations over the
>> network. The n TaskTrackers then need to get their local copies from
>> HDFS, so the n TaskTrackers download from r DataNodes, i.e. about n/r
>> downloads per replica. In total, r+n/r operations. If you optimize r
>> as a function of n, SQRT(n) is the optimal replication level. So 10 is
>> a reasonable default setting for most clusters that are not 500+ nodes
>> big.
>>
>> Kai
>>
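
A quick numeric check of the r+n/r trade-off; this standalone snippet is
illustrative only, and n = 100 is an assumed cluster size, not one taken
from the thread:

    public class ReplicationCost {
        public static void main(String[] args) {
            int n = 100; // assumed number of TaskTrackers, for illustration
            for (int r : new int[] {2, 5, 10, 20, 50}) {
                // r copies into HDFS plus n/r downloads served per replica
                System.out.printf("r=%2d  cost=%d%n", r, r + n / r);
            }
            // Prints costs 52, 25, 20, 25, 52: the minimum is at
            // r = 10 = sqrt(100), matching the SQRT(n) claim above.
        }
    }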
>> On 22.12.2012, at 13:46, Lin Ma <[EMAIL PROTECTED]> wrote:
>>
>> Thanks Kai. What is the purpose of using a higher replication count?
>>
>> regards,
>> Lin
>>
>> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>> On 22.12.2012, at 13:03, Lin Ma <[EMAIL PROTECTED]> wrote:
>>>
>>> > I want to confirm that when a mapper or reducer on a task node
>>> > accesses a distributed cache file, the file resides on disk, not in
>>> > memory. I just want to make sure a distributed cache file is not
>>> > fully loaded into memory, where it would compete for memory with the
>>> > mapper/reducer tasks. Is that correct?
>>>
>>>
>>> Yes, you are correct. The JobTracker will put files for the distributed
>>> cache into HDFS with a higher replication count (10 by default). Whenever a
>>> TaskTracker needs those files for a task it is launching locally, it will
>>> fetch a copy to its local disk. So it won't need to do this again for
>>> future tasks on this node. After a job is done, all local copies and the
>>> HDFS copies of files in the distributed cache are cleaned up.
>>>
>>> Kai
>>>
>>> --
>>> Kai Voigt
>>> [EMAIL PROTECTED]
>>>
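
To make the mechanics Kai describes concrete, here is a minimal
client-side sketch using the Hadoop 1.x API that was current when this
thread was written; the cache file path is hypothetical, and
mapred.submit.replication is the setting behind the "10 by default"
replication of job submission files:

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheSetup {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(CacheSetup.class);

            // Replication used for files the client uploads at job
            // submission; the default of 10 is the "higher replication
            // count" mentioned above, roughly sqrt(n) for a ~100-node
            // cluster.
            conf.setInt("mapred.submit.replication", 10);

            // Register an HDFS file with the distributed cache. Each
            // TaskTracker running a task of this job fetches one copy to
            // local disk and reuses it for the job's later tasks there.
            DistributedCache.addCacheFile(new URI("/cache/lookup.dat"), conf);
        }
    }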
>>
>>  --
>> Kai Voigt
>> [EMAIL PROTECTED]
>>
>