


Re: distributed cache
Lin Ma 2012-12-26, 06:06
I have figured out the 2nd issue; I would appreciate it if anyone could
advise on the first issue.

regards,
Lin

On Sat, Dec 22, 2012 at 9:24 PM, Lin Ma <[EMAIL PROTECTED]> wrote:

> Hi Kai,
>
> Smart answer! :-)
>
>    - The assumption you have is one distributed cache replica could only
>    serve one download session for tasktracker node (this is why you get
>    concurrency n/r). The question is, why one distributed cache replica cannot
>    serve multiple concurrent download session? For example, supposing a
>    tasktracker use elapsed time t to download a file from a specific
>    distributed cache replica, it is possible for 2 tasktrackers to download
>    from the specific distributed cache replica in parallel using elapsed time
>    t as well, or 1.5 t, which is faster than sequential download time 2t you
>    mentioned before?
>    - "In total, r+n/r concurrent operations. If you optimize r depending
>    on n, SRQT(n) is the optimal replication level." -- how do you get SRQT(n)
>    for minimize r+n/r? Appreciate if you could point me to more details.
>
> regards,
> Lin
>
>
> On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> simple math. Assuming you have n TaskTrackers in your cluster that will
>> need to access the files in the distributed cache. And r is the replication
>> level of those files.
>>
>> Copying the files into HDFS requires r copy operations over the network.
>> The n TaskTrackers need to get their local copies from HDFS, so the n
>> TaskTrackers copy from r DataNodes, so n/r concurrent operation. In total,
>> r+n/r concurrent operations. If you optimize r depending on n, SRQT(n) is
>> the optimal replication level. So 10 is a reasonable default setting for
>> most clusters that are not 500+ nodes big.
>>
>> Kai
>>
>> Am 22.12.2012 um 13:46 schrieb Lin Ma <[EMAIL PROTECTED]>:
>>
>> Thanks Kai, using higher replication count for the purpose of?
>>
>> regards,
>> Lin
>>
>> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>> Am 22.12.2012 um 13:03 schrieb Lin Ma <[EMAIL PROTECTED]>:
>>>
>>> > I want to confirm when on each task node either mapper or reducer
>>> access distributed cache file, it resides on disk, not resides in memory.
>>> Just want to make sure distributed cache file does not fully loaded into
>>> memory which compete memory consumption with mapper/reducer tasks. Is that
>>> correct?
>>>
>>>
>>> Yes, you are correct. The JobTracker will put files for the distributed
>>> cache into HDFS with a higher replication count (10 by default). Whenever a
>>> TaskTracker needs those files for a task it is launching locally, it will
>>> fetch a copy to its local disk. So it won't need to do this again for
>>> future tasks on this node. After a job is done, all local copies and the
>>> HDFS copies of files in the distributed cache are cleaned up.
>>>
>>> Kai
>>>
>>> --
>>> Kai Voigt
>>> [EMAIL PROTECTED]
>>>
>>>
>>>
>>>
>>>
>>
>>  --
>> Kai Voigt
>> [EMAIL PROTECTED]
>>
>>
>>
>>
>>
>