

Re: distributed cache
Hi,

Sorry for having been ambiguous. For (1) I meant a large block (if the
block size is large). For (2) I meant multiple, concurrent threads.
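
To make the local-filesystem analogy quoted below concrete, here is a minimal Python sketch (not Hadoop code) in which several threads read the same file at once. The helper names `read_whole_file`, `concurrent_reads` and `make_sample_file` are made up for illustration, and the file is a small temporary one standing in for a distributed cache file. It only shows that nothing serializes the readers; it does not measure disk behaviour.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_whole_file(path, chunk_size=64 * 1024):
    """Read the file sequentially in chunks and return the number of bytes read."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def concurrent_reads(path, readers=4):
    """Have several threads read the same file at the same time."""
    with ThreadPoolExecutor(max_workers=readers) as pool:
        return list(pool.map(read_whole_file, [path] * readers))


def make_sample_file(size=1024 * 1024):
    """Create a throwaway file (hypothetical stand-in for a small cache file)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size))
    return path


if __name__ == "__main__":
    sample = make_sample_file()
    try:
        print(concurrent_reads(sample, readers=4))
    finally:
        os.remove(sample)
```

Each reader opens its own file handle, so every thread sees the full file. For a small file the repeated reads are likely served from the OS cache, which is the "small reads" case described below.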

On Wed, Dec 26, 2012 at 5:36 PM, Lin Ma <[EMAIL PROTECTED]> wrote:
> Thanks Harsh,
>
> By a long read, do you mean reading a large contiguous part of a file,
> rather than a small chunk of a file?
> "gradually decreasing performance for long reads" -- do you mean that long
> reads by multiple parallel threads degrade performance, or that a long read
> by a single, exclusive thread degrades performance?
>
> regards,
> Lin
>
>
> On Wed, Dec 26, 2012 at 7:48 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>>
>> Hi Lin,
>>
>> It is comparable (and is also logically similar) to reading a file
>> multiple times in parallel in a local filesystem - not too much of a
>> performance hit for small reads (by virtue of OS caches, and quick
>> completion per read, as is usually the case for distributed cache
>> files), and gradually decreasing performance for long reads (due to
>> frequent physical disk movement). Thankfully, due to block sizes the
>> latter isn't a problem for large files on a proper DN, as the blocks
>> are spread over the disks and across the nodes.
>>
>> On Wed, Dec 26, 2012 at 4:13 PM, Lin Ma <[EMAIL PROTECTED]> wrote:
>> > Thanks Harsh. Are multiple concurrent reads generally faster, or not?
>> >
>> > regards,
>> > Lin
>> >
>> >
>> > On Wed, Dec 26, 2012 at 6:21 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>> >>
>> >> There is no limitation in HDFS that limits reads of a block to a
>> >> single client at a time (no reason to do so) - so downloads can be as
>> >> concurrent as possible.
>> >>
>> >> On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma <[EMAIL PROTECTED]> wrote:
>> >> > Thanks Harsh,
>> >> >
>> >> > Suppose a DistributedCache file has been uploaded by the client. By
>> >> > Hadoop's design, can each replica serve only one download session at a
>> >> > time (a download by a mapper or reducer that requires the
>> >> > DistributedCache file) until that download completes, or can it serve
>> >> > multiple concurrent download sessions (downloads by multiple mappers or
>> >> > reducers that require the file)?
>> >> >
>> >> > regards,
>> >> > Lin
>> >> >
>> >> >
>> >> > On Wed, Dec 26, 2012 at 4:51 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>> >> >>
>> >> >> Hi Lin,
>> >> >>
>> >> >> DistributedCache files are stored on HDFS by the client first.
>> >> >> The TaskTrackers download and localize them. Therefore, as with any
>> >> >> other file on HDFS, "downloads" can be efficiently parallel with
>> >> >> a higher replication factor.
>> >> >>
>> >> >> The point of having higher replication for these files is also tied to
>> >> >> the concept of racks in a cluster - you would want more replicas
>> >> >> spread across racks such that on task bootup the downloads happen with
>> >> >> rack locality.
>> >> >>
>> >> >> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <[EMAIL PROTECTED]> wrote:
>> >> >> > Hi Kai,
>> >> >> >
>> >> >> > Smart answer! :-)
>> >> >> >
>> >> >> > The assumption you have is that one distributed cache replica can only
>> >> >> > serve one download session for a tasktracker node at a time (this is why
>> >> >> > you get concurrency n/r). The question is, why can't one distributed
>> >> >> > cache replica serve multiple concurrent download sessions? For example,
>> >> >> > suppose a tasktracker takes elapsed time t to download a file from a
>> >> >> > specific distributed cache replica. Is it possible for 2 tasktrackers to
>> >> >> > download from that replica in parallel in elapsed time t as well, or
>> >> >> > 1.5t, which is faster than the sequential download time 2t you mentioned
>> >> >> > before?
>> >> >> > "In total, r+n/r concurrent operations. If you optimize r depending on
>> >> >> > n, SQRT(n) is the optimal replication level." -- how do you get

Harsh J
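
The square-root claim quoted above can be checked from the cost model r + n/r. The quoted message is cut off, so reading the r term as the cost of creating r replicas and n/r as the rounds of downloads is an assumption here. Setting the derivative 1 - n/r^2 to zero gives r = sqrt(n). The Python sketch below brute-forces the same result; `total_steps` and `best_replication` are illustrative names, not Hadoop APIs. The model also assumes one download at a time per replica, which HDFS does not enforce.

```python
import math


def total_steps(n, r):
    """Quoted cost model: r + n/r for n tasktrackers and r replicas."""
    return r + n / r


def best_replication(n):
    """Brute-force the integer r in 1..n that minimizes r + n/r."""
    return min(range(1, n + 1), key=lambda r: total_steps(n, r))


if __name__ == "__main__":
    for n in (100, 400, 10000):
        r = best_replication(n)
        print(n, r, math.isqrt(n), total_steps(n, r))
```

For n that is a perfect square, the brute-force minimum matches `math.isqrt(n)`.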