Lin Ma 2012-12-26, 10:11
Re: distributed cache
Thanks Harsh,

   1. By "long read", do you mean reading a large contiguous part of a file,
   rather than a small chunk of it?
   2. "gradually decreasing performance for long reads" -- do you mean that
   long reads from multiple parallel threads degrade performance, or that a
   single thread's exclusive long read degrades performance?

regards,
Lin
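
One way to explore the two cases asked about above is to measure them. The following is a minimal sketch in Python, using the local-filesystem analogy from Harsh's quoted reply below rather than HDFS itself. The helper names (`read_whole_file`, `timed_parallel_reads`, `make_test_file`) and the file sizes are assumptions made for illustration. A small temporary file will normally be served from the OS cache, so this mostly shows the "small read" case; observing the long-read slowdown would likely need a file much larger than memory.

```python
import os
import tempfile
import threading
import time


def read_whole_file(path, chunk_size=1 << 20):
    """Read a file sequentially in chunks and return the number of bytes read."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def timed_parallel_reads(path, num_readers):
    """Run num_readers threads that each read the whole file.

    Returns (elapsed_seconds, list_of_bytes_read_per_reader).
    """
    counts = [0] * num_readers

    def worker(i):
        counts[i] = read_whole_file(path)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_readers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, counts


def make_test_file(size_bytes):
    """Create a temporary file filled with random bytes and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size_bytes))
    return path


if __name__ == "__main__":
    path = make_test_file(8 * 1024 * 1024)  # hypothetical 8 MiB test file
    try:
        for n in (1, 2, 4):
            elapsed, counts = timed_parallel_reads(path, n)
            print(f"{n} reader(s): {elapsed:.4f}s, bytes per reader: {counts[0]}")
    finally:
        os.remove(path)
```

Every reader gets the complete file regardless of how many run at once; only the elapsed time changes, and how it changes depends on the disk, the cache state, and the file size.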

On Wed, Dec 26, 2012 at 7:48 PM, Harsh J <[EMAIL PROTECTED]> wrote:

> Hi Lin,
>
> It is comparable (and also logically similar) to reading a file
> multiple times in parallel on a local filesystem: not much of a
> performance hit for small reads (by virtue of OS caches and quick
> completion per read, as is usually the case for distributed cache
> files), and gradually decreasing performance for long reads (due to
> frequent physical disk movement). Thankfully, due to block sizes the
> latter isn't a problem for large files on a proper DN, as the blocks
> are spread over the disks and across the nodes.
>
> On Wed, Dec 26, 2012 at 4:13 PM, Lin Ma <[EMAIL PROTECTED]> wrote:
> > Thanks Harsh. Are multiple concurrent reads generally faster, or not?
> >
> > regards,
> > Lin
> >
> >
> > On Wed, Dec 26, 2012 at 6:21 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> >>
> >> There is no limitation in HDFS that limits reads of a block to a
> >> single client at a time (no reason to do so) - so downloads can be as
> >> concurrent as possible.
> >>
> >> On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma <[EMAIL PROTECTED]> wrote:
> >> > Thanks Harsh,
> >> >
> >> > Supposing the DistributedCache file is uploaded by the client: in the
> >> > Hadoop design, can each replica serve only one download session (a
> >> > download from a mapper or reducer that requires the DistributedCache
> >> > file) at a time until that download completes, or can it serve
> >> > multiple concurrent download sessions (downloads from multiple
> >> > mappers or reducers that require the file)?
> >> >
> >> > regards,
> >> > Lin
> >> >
> >> >
> >> > On Wed, Dec 26, 2012 at 4:51 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> >> >>
> >> >> Hi Lin,
> >> >>
> >> >> DistributedCache files are first stored on HDFS by the client.
> >> >> The TaskTrackers then download and localize them. Therefore, as with
> >> >> any other file on HDFS, "downloads" can be efficiently parallel with
> >> >> a higher replication factor.
> >> >>
> >> >> The point of having higher replication for these files is also tied
> >> >> to the concept of racks in a cluster - you would want more replicas
> >> >> spread across racks such that on task bootup the downloads happen
> >> >> with rack locality.
> >> >>
> >> >> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <[EMAIL PROTECTED]> wrote:
> >> >> > Hi Kai,
> >> >> >
> >> >> > Smart answer! :-)
> >> >> >
> >> >> > The assumption you have is that one distributed cache replica can
> >> >> > serve only one download session for a tasktracker node at a time
> >> >> > (this is why you get concurrency n/r). The question is, why can't
> >> >> > one distributed cache replica serve multiple concurrent download
> >> >> > sessions? For example, supposing a tasktracker takes elapsed time t
> >> >> > to download a file from a specific distributed cache replica, is it
> >> >> > possible for 2 tasktrackers to download from that replica in
> >> >> > parallel in elapsed time t as well, or 1.5t, which is faster than
> >> >> > the sequential download time 2t you mentioned before?
> >> >> > "In total, r+n/r concurrent operations. If you optimize r depending
> >> >> > on n, SQRT(n) is the optimal replication level." -- how do you get
> >> >> > SQRT(n) as the minimizer of r+n/r? I would appreciate it if you
> >> >> > could point me to more details.
> >> >> >
> >> >> > regards,
> >> >> > Lin
> >> >> >
> >> >> >
> >> >> > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <[EMAIL PROTECTED]> wrote:
> >> >> >>
> >> >> >> Hi,
> >> >> >>
> >> >> >> simple math. Assuming you have n TaskTrackers in your cluster that
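
For the question quoted above about where the square root comes from, here is a short worked derivation. It takes the quoted cost r + n/r as given (the message that sets it up is truncated in this archive) and treats r as a continuous positive variable, which is an assumption made only for the calculus.

```latex
\begin{align*}
f(r) &= r + \frac{n}{r}, \qquad r > 0 \\
f'(r) &= 1 - \frac{n}{r^{2}} = 0 \;\Longrightarrow\; r^{2} = n \;\Longrightarrow\; r = \sqrt{n} \\
f''(r) &= \frac{2n}{r^{3}} > 0 \quad \text{for } r > 0 \text{, so this is a minimum} \\
f(\sqrt{n}) &= \sqrt{n} + \frac{n}{\sqrt{n}} = 2\sqrt{n}
\end{align*}
```

As a check with n = 100 TaskTrackers: r = 10 gives 10 + 100/10 = 20, while r = 3 gives 3 + 100/3, roughly 36.3. Note that this model rests on the one-download-per-replica premise, which Harsh's replies in this thread say HDFS does not impose, so in practice the estimate is pessimistic.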