Re: distributed cache
Thanks Harsh. Are multiple concurrent reads generally faster?

regards,
Lin

On Wed, Dec 26, 2012 at 6:21 PM, Harsh J <[EMAIL PROTECTED]> wrote:

> There is no limitation in HDFS that limits reads of a block to a
> single client at a time (no reason to do so) - so downloads can be as
> concurrent as possible.
>
> On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma <[EMAIL PROTECTED]> wrote:
> > Thanks Harsh,
> >
> > Suppose the DistributedCache file is uploaded by the client. In Hadoop's
> > design, can each replica serve only one download session at a time (a
> > download from a mapper or reducer that requires the file) until that
> > download completes, or can it serve multiple concurrent download
> > sessions (from multiple mappers or reducers that require the file)?
> >
> > regards,
> > Lin
> >
> >
> > On Wed, Dec 26, 2012 at 4:51 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> >>
> >> Hi Lin,
> >>
> >> DistributedCache files are first stored on HDFS by the client.
> >> The TaskTrackers then download and localize them. Therefore, as with
> >> any other file on HDFS, "downloads" can be efficiently parallel with
> >> a higher replication factor.
> >>
> >> The point of having higher replication for these files is also tied to
> >> the concept of racks in a cluster - you would want more replicas
> >> spread across racks such that on task bootup the downloads happen with
> >> rack locality.
> >>
> >> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <[EMAIL PROTECTED]> wrote:
> >> > Hi Kai,
> >> >
> >> > Smart answer! :-)
> >> >
> >> > Your assumption is that one distributed cache replica can serve only
> >> > one download session from a tasktracker node at a time (this is why
> >> > you get concurrency n/r). My question is: why can't one distributed
> >> > cache replica serve multiple concurrent download sessions? For
> >> > example, suppose a tasktracker takes elapsed time t to download a file
> >> > from a specific distributed cache replica. Could 2 tasktrackers
> >> > download from that replica in parallel in elapsed time t as well, or
> >> > 1.5t, which is faster than the sequential download time 2t you
> >> > mentioned before?
> >> >
> >> > "In total, r+n/r concurrent operations. If you optimize r depending on
> >> > n, SQRT(n) is the optimal replication level." -- how do you get
> >> > SQRT(n) as the minimizer of r+n/r? I would appreciate it if you could
> >> > point me to more details.
> >> >
> >> > regards,
> >> > Lin
> >> >
> >> >
> >> > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <[EMAIL PROTECTED]> wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> Simple math. Assume you have n TaskTrackers in your cluster that will
> >> >> need to access the files in the distributed cache, and r is the
> >> >> replication level of those files.
> >> >>
> >> >> Copying the files into HDFS requires r copy operations over the
> >> >> network. The n TaskTrackers need to get their local copies from HDFS,
> >> >> so the n TaskTrackers copy from r DataNodes, which gives n/r
> >> >> concurrent operations. In total, r+n/r concurrent operations. If you
> >> >> optimize r depending on n, SQRT(n) is the optimal replication level.
> >> >> So 10 is a reasonable default setting for most clusters that are not
> >> >> 500+ nodes big.
> >> >>
> >> >> Kai
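
Kai's estimate can be checked with a few lines of Python. This is only a minimal sketch: it takes his cost model r + n/r at face value (r uploads plus n/r rounds of downloads, one reader per replica at a time) and ignores Harsh's point that a replica can serve several readers at once. The function names total_cost and best_replication are illustrative and not part of Hadoop. Analytically, setting the derivative 1 - n/r^2 to zero gives r = sqrt(n); the brute-force search below agrees for cluster sizes that are perfect squares.

```python
import math


def total_cost(n: int, r: int) -> float:
    """Kai's cost model: r copy operations plus n/r download rounds."""
    return r + n / r


def best_replication(n: int) -> int:
    """Brute-force the integer r in [1, n] that minimizes r + n/r."""
    return min(range(1, n + 1), key=lambda r: total_cost(n, r))


if __name__ == "__main__":
    # Perfect squares, so sqrt(n) is itself a valid integer replication level.
    for n in (25, 100, 400):
        r = best_replication(n)
        print(f"n={n}: best r={r}, sqrt(n)={math.isqrt(n)}, cost={total_cost(n, r)}")
```

For n = 100 TaskTrackers, the search returns r = 10 with a total of 20 operations under this model.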
> >> >>
> >> >> On 22.12.2012 at 13:46, Lin Ma <[EMAIL PROTECTED]> wrote:
> >> >>
> >> >> Thanks Kai. What is the purpose of using a higher replication count?
> >> >>
> >> >> regards,
> >> >> Lin
> >> >>
> >> >> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <[EMAIL PROTECTED]> wrote:
> >> >>>
> >> >>> Hi,
> >> >>>
> >> >>> On 22.12.2012 at 13:03, Lin Ma <[EMAIL PROTECTED]> wrote:
> >> >>>
> >> >>> > I want to confirm that when a mapper or reducer on a task node
> >> >>> > accesses a distributed cache file, the file resides on disk, not
> >> >>> > in memory. I just want to make sure the distributed cache file is
> >> >>> > not fully loaded