HDFS >> mail # user >> Re: distributed cache


Re: distributed cache
Thanks Harsh,

   1. By "long read", do you mean reading a large contiguous part of a file,
   rather than a small chunk of it?
   2. On "gradually decreasing performance for long reads" -- do you mean that
   multiple parallel long reads degrade performance, or that even a single
   exclusive long read degrades performance?

regards,
Lin

On Wed, Dec 26, 2012 at 7:48 PM, Harsh J <[EMAIL PROTECTED]> wrote:

> Hi Lin,
>
> It is comparable (and is also logically similar) to reading a file
> multiple times in parallel in a local filesystem - not too much of a
> performance hit for small reads (by virtue of OS caches, and quick
> completion per read, as is usually the case for distributed cache
> files), and gradually decreasing performance for long reads (due to
> frequent disk physical movement)? Thankfully, due to block sizes the
> latter isn't a problem for large files on a proper DN, as the blocks
> are spread over the disks and across the nodes.
>
> On Wed, Dec 26, 2012 at 4:13 PM, Lin Ma <[EMAIL PROTECTED]> wrote:
> > Thanks Harsh. So multiple concurrent reads are generally faster?
> >
> > regards,
> > Lin
> >
> >
> > On Wed, Dec 26, 2012 at 6:21 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> >>
> >> There is no limitation in HDFS that limits reads of a block to a
> >> single client at a time (no reason to do so) - so downloads can be as
> >> concurrent as possible.
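
The point above, that nothing serializes readers of the same block, can be illustrated with an ordinary local file. This is a minimal sketch of the analogy, not HDFS itself; it assumes Python's standard threading and tempfile modules:

```python
# Sketch (not HDFS itself): an analogy for "no single-reader limit" using an
# ordinary local file and Python threads. Several readers open the same file
# concurrently and each receives the complete, identical contents.
import tempfile
import threading

# Create a file to stand in for one block replica.
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as f:
    f.write("block-data " * 1000)
    path = f.name

results = []
lock = threading.Lock()

def reader():
    with open(path) as fh:  # nothing serializes these opens
        data = fh.read()
    with lock:
        results.append(len(data))

threads = [threading.Thread(target=reader) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All eight concurrent readers saw the full file.
print(results)
```

All eight reads complete with the full file length, just as multiple map tasks can fetch the same replica's blocks without waiting on each other.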
> >>
> >> On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma <[EMAIL PROTECTED]> wrote:
> >> > Thanks Harsh,
> >> >
> >> > Supposing the DistributedCache file is uploaded by the client: in
> >> > Hadoop's design, can each replica serve only one download session at a
> >> > time (a download from a mapper or reducer that requires the
> >> > DistributedCache) until the file download is completed, or can it serve
> >> > multiple concurrent parallel download sessions (downloads from multiple
> >> > mappers or reducers that require the DistributedCache)?
> >> >
> >> > regards,
> >> > Lin
> >> >
> >> >
> >> > On Wed, Dec 26, 2012 at 4:51 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> >> >>
> >> >> Hi Lin,
> >> >>
> >> >> DistributedCache files are stored onto HDFS by the client first.
> >> >> The TaskTrackers then download and localize them. Therefore, as with
> >> >> any other file on HDFS, "downloads" can be efficiently parallel given
> >> >> a higher replication factor.
> >> >>
> >> >> The point of having higher replication for these files is also tied to
> >> >> the concept of racks in a cluster - you would want more replicas
> >> >> spread across racks such that on task bootup the downloads happen with
> >> >> rack locality.
> >> >>
> >> >> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <[EMAIL PROTECTED]> wrote:
> >> >> > Hi Kai,
> >> >> >
> >> >> > Smart answer! :-)
> >> >> >
> >> >> > The assumption you have is that one distributed cache replica can
> >> >> > serve only one download session for a TaskTracker node at a time
> >> >> > (which is why you get concurrency n/r). The question is: why can't
> >> >> > one distributed cache replica serve multiple concurrent download
> >> >> > sessions? For example, supposing a TaskTracker takes elapsed time t
> >> >> > to download a file from a specific distributed cache replica, isn't
> >> >> > it possible for 2 TaskTrackers to download from that replica in
> >> >> > parallel in elapsed time t as well, or 1.5t, which would be faster
> >> >> > than the sequential download time 2t you mentioned before?
> >> >> > "In total, r+n/r concurrent operations. If you optimize r depending
> >> >> > on n, SQRT(n) is the optimal replication level." -- how do you get
> >> >> > SQRT(n) for minimizing r+n/r? I'd appreciate it if you could point
> >> >> > me to more details.
> >> >> >
> >> >> > regards,
> >> >> > Lin
> >> >> >
> >> >> >
> >> >> > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <[EMAIL PROTECTED]> wrote:
> >> >> >>
> >> >> >> Hi,
> >> >> >>
> >> >> >> simple math. Assuming you have n TaskTrackers in your cluster that
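
Kai's quoted explanation is truncated in this archive copy, but the r+n/r formula he cites can be checked numerically. A minimal sketch, assuming his model (n TaskTrackers, replication level r, roughly n/r sequential downloads per replica plus r operations to write the replicas):

```python
# Numeric check of the r + n/r cost model: for n TaskTrackers and replication
# level r, total ~ r + n/r operations. Calculus gives the minimum at
# r = sqrt(n), since d/dr (r + n/r) = 1 - n/r**2 = 0  =>  r = sqrt(n).
import math

def total_ops(n, r):
    return r + n / r

n = 100  # example cluster size (hypothetical)
best_r = min(range(1, n + 1), key=lambda r: total_ops(n, r))
print(best_r, math.isqrt(n))  # prints "10 10": the minimum sits at sqrt(n)
```

With n = 100, r = 10 gives 10 + 100/10 = 20 total operations, and every other integer r costs more, matching the SQRT(n) optimum Lin asks about above.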