|
Lin Ma
2012-12-22, 12:03
Kai Voigt
2012-12-22, 12:44
Lin Ma
2012-12-22, 12:46
Kai Voigt
2012-12-22, 12:51
Lin Ma
2012-12-22, 13:24
Harsh J
2012-12-26, 08:51
Harsh J
2012-12-26, 10:21
Lin Ma
2012-12-26, 10:43
Harsh J
2012-12-26, 11:48
|
-
distributed cacheLin Ma 2012-12-22, 12:03
Hi guys,
I want to confirm when on each task node either mapper or reducer access distributed cache file, it resides on disk, not resides in memory. Just want to make sure distributed cache file does not fully loaded into memory which compete memory consumption with mapper/reducer tasks. Is that correct? thanks in advance, Lin
-
Re: distributed cacheKai Voigt 2012-12-22, 12:44
Hi,
Am 22.12.2012 um 13:03 schrieb Lin Ma <[EMAIL PROTECTED]>: > I want to confirm when on each task node either mapper or reducer access distributed cache file, it resides on disk, not resides in memory. Just want to make sure distributed cache file does not fully loaded into memory which compete memory consumption with mapper/reducer tasks. Is that correct? Yes, you are correct. The JobTracker will put files for the distributed cache into HDFS with a higher replication count (10 by default). Whenever a TaskTracker needs those files for a task it is launching locally, it will fetch a copy to its local disk. So it won't need to do this again for future tasks on this node. After a job is done, all local copies and the HDFS copies of files in the distributed cache are cleaned up. Kai -- Kai Voigt [EMAIL PROTECTED]
-
Re: distributed cacheLin Ma 2012-12-22, 12:46
Thanks Kai, using higher replication count for the purpose of?
regards, Lin On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <[EMAIL PROTECTED]> wrote: > Hi, > > Am 22.12.2012 um 13:03 schrieb Lin Ma <[EMAIL PROTECTED]>: > > > I want to confirm when on each task node either mapper or reducer access > distributed cache file, it resides on disk, not resides in memory. Just > want to make sure distributed cache file does not fully loaded into memory > which compete memory consumption with mapper/reducer tasks. Is that correct? > > > Yes, you are correct. The JobTracker will put files for the distributed > cache into HDFS with a higher replication count (10 by default). Whenever a > TaskTracker needs those files for a task it is launching locally, it will > fetch a copy to its local disk. So it won't need to do this again for > future tasks on this node. After a job is done, all local copies and the > HDFS copies of files in the distributed cache are cleaned up. > > Kai > > -- > Kai Voigt > [EMAIL PROTECTED] > > > > >
-
Re: distributed cacheKai Voigt 2012-12-22, 12:51
Hi,
simple math. Assuming you have n TaskTrackers in your cluster that will need to access the files in the distributed cache. And r is the replication level of those files. Copying the files into HDFS requires r copy operations over the network. The n TaskTrackers need to get their local copies from HDFS, so the n TaskTrackers copy from r DataNodes, so n/r concurrent operation. In total, r+n/r concurrent operations. If you optimize r depending on n, SRQT(n) is the optimal replication level. So 10 is a reasonable default setting for most clusters that are not 500+ nodes big. Kai Am 22.12.2012 um 13:46 schrieb Lin Ma <[EMAIL PROTECTED]>: > Thanks Kai, using higher replication count for the purpose of? > > regards, > Lin > > On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <[EMAIL PROTECTED]> wrote: > Hi, > > Am 22.12.2012 um 13:03 schrieb Lin Ma <[EMAIL PROTECTED]>: > > > I want to confirm when on each task node either mapper or reducer access distributed cache file, it resides on disk, not resides in memory. Just want to make sure distributed cache file does not fully loaded into memory which compete memory consumption with mapper/reducer tasks. Is that correct? > > > Yes, you are correct. The JobTracker will put files for the distributed cache into HDFS with a higher replication count (10 by default). Whenever a TaskTracker needs those files for a task it is launching locally, it will fetch a copy to its local disk. So it won't need to do this again for future tasks on this node. After a job is done, all local copies and the HDFS copies of files in the distributed cache are cleaned up. > > Kai > > -- > Kai Voigt > [EMAIL PROTECTED] > > > > > -- Kai Voigt [EMAIL PROTECTED]
-
Re: distributed cacheLin Ma 2012-12-22, 13:24
Hi Kai,
Smart answer! :-) - The assumption you have is one distributed cache replica could only serve one download session for tasktracker node (this is why you get concurrency n/r). The question is, why one distributed cache replica cannot serve multiple concurrent download session? For example, supposing a tasktracker use elapsed time t to download a file from a specific distributed cache replica, it is possible for 2 tasktrackers to download from the specific distributed cache replica in parallel using elapsed time t as well, or 1.5 t, which is faster than sequential download time 2t you mentioned before? - "In total, r+n/r concurrent operations. If you optimize r depending on n, SRQT(n) is the optimal replication level." -- how do you get SRQT(n) for minimize r+n/r? Appreciate if you could point me to more details. regards, Lin On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <[EMAIL PROTECTED]> wrote: > Hi, > > simple math. Assuming you have n TaskTrackers in your cluster that will > need to access the files in the distributed cache. And r is the replication > level of those files. > > Copying the files into HDFS requires r copy operations over the network. > The n TaskTrackers need to get their local copies from HDFS, so the n > TaskTrackers copy from r DataNodes, so n/r concurrent operation. In total, > r+n/r concurrent operations. If you optimize r depending on n, SRQT(n) is > the optimal replication level. So 10 is a reasonable default setting for > most clusters that are not 500+ nodes big. > > Kai > > Am 22.12.2012 um 13:46 schrieb Lin Ma <[EMAIL PROTECTED]>: > > Thanks Kai, using higher replication count for the purpose of? > > regards, > Lin > > On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <[EMAIL PROTECTED]> wrote: > >> Hi, >> >> Am 22.12.2012 um 13:03 schrieb Lin Ma <[EMAIL PROTECTED]>: >> >> > I want to confirm when on each task node either mapper or reducer >> access distributed cache file, it resides on disk, not resides in memory. >> Just want to make sure distributed cache file does not fully loaded into >> memory which compete memory consumption with mapper/reducer tasks. Is that >> correct? >> >> >> Yes, you are correct. The JobTracker will put files for the distributed >> cache into HDFS with a higher replication count (10 by default). Whenever a >> TaskTracker needs those files for a task it is launching locally, it will >> fetch a copy to its local disk. So it won't need to do this again for >> future tasks on this node. After a job is done, all local copies and the >> HDFS copies of files in the distributed cache are cleaned up. >> >> Kai >> >> -- >> Kai Voigt >> [EMAIL PROTECTED] >> >> >> >> >> > > -- > Kai Voigt > [EMAIL PROTECTED] > > > > >
-
Re: distributed cacheHarsh J 2012-12-26, 08:51
Hi Lin,
DistributedCache files are stored onto the HDFS by the client first. The TaskTrackers download and localize it. Therefore, as with any other file on HDFS, "downloads" can be efficiently parallel with higher replicas. The point of having higher replication for these files is also tied to the concept of racks in a cluster - you would want more replicas spread across racks such that on task bootup the downloads happen with rack locality. On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <[EMAIL PROTECTED]> wrote: > Hi Kai, > > Smart answer! :-) > > The assumption you have is one distributed cache replica could only serve > one download session for tasktracker node (this is why you get concurrency > n/r). The question is, why one distributed cache replica cannot serve > multiple concurrent download session? For example, supposing a tasktracker > use elapsed time t to download a file from a specific distributed cache > replica, it is possible for 2 tasktrackers to download from the specific > distributed cache replica in parallel using elapsed time t as well, or 1.5 > t, which is faster than sequential download time 2t you mentioned before? > "In total, r+n/r concurrent operations. If you optimize r depending on n, > SRQT(n) is the optimal replication level." -- how do you get SRQT(n) for > minimize r+n/r? Appreciate if you could point me to more details. > > regards, > Lin > > > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <[EMAIL PROTECTED]> wrote: >> >> Hi, >> >> simple math. Assuming you have n TaskTrackers in your cluster that will >> need to access the files in the distributed cache. And r is the replication >> level of those files. >> >> Copying the files into HDFS requires r copy operations over the network. >> The n TaskTrackers need to get their local copies from HDFS, so the n >> TaskTrackers copy from r DataNodes, so n/r concurrent operation. In total, >> r+n/r concurrent operations. If you optimize r depending on n, SRQT(n) is >> the optimal replication level. So 10 is a reasonable default setting for >> most clusters that are not 500+ nodes big. >> >> Kai >> >> Am 22.12.2012 um 13:46 schrieb Lin Ma <[EMAIL PROTECTED]>: >> >> Thanks Kai, using higher replication count for the purpose of? >> >> regards, >> Lin >> >> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <[EMAIL PROTECTED]> wrote: >>> >>> Hi, >>> >>> Am 22.12.2012 um 13:03 schrieb Lin Ma <[EMAIL PROTECTED]>: >>> >>> > I want to confirm when on each task node either mapper or reducer >>> > access distributed cache file, it resides on disk, not resides in memory. >>> > Just want to make sure distributed cache file does not fully loaded into >>> > memory which compete memory consumption with mapper/reducer tasks. Is that >>> > correct? >>> >>> >>> Yes, you are correct. The JobTracker will put files for the distributed >>> cache into HDFS with a higher replication count (10 by default). Whenever a >>> TaskTracker needs those files for a task it is launching locally, it will >>> fetch a copy to its local disk. So it won't need to do this again for future >>> tasks on this node. After a job is done, all local copies and the HDFS >>> copies of files in the distributed cache are cleaned up. >>> >>> Kai >>> >>> -- >>> Kai Voigt >>> [EMAIL PROTECTED] >>> >>> >>> >>> >> >> >> -- >> Kai Voigt >> [EMAIL PROTECTED] >> >> >> >> > -- Harsh J
-
Re: distributed cacheHarsh J 2012-12-26, 10:21
There is no limitation in HDFS that limits reads of a block to a
single client at a time (no reason to do so) - so downloads can be as concurrent as possible. On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma <[EMAIL PROTECTED]> wrote: > Thanks Harsh, > > Supposing DistributedCache is uploaded by client, for each replica, in > Hadoop design, it could only serve one download session (download from a > mapper or a reducer which requires the DistributedCache) at a time until > DistributedCache file download is completed, or it could serve multiple > concurrent parallel download session (download from multiple mappers or > reducers which requires the DistributedCache). > > regards, > Lin > > > On Wed, Dec 26, 2012 at 4:51 PM, Harsh J <[EMAIL PROTECTED]> wrote: >> >> Hi Lin, >> >> DistributedCache files are stored onto the HDFS by the client first. >> The TaskTrackers download and localize it. Therefore, as with any >> other file on HDFS, "downloads" can be efficiently parallel with >> higher replicas. >> >> The point of having higher replication for these files is also tied to >> the concept of racks in a cluster - you would want more replicas >> spread across racks such that on task bootup the downloads happen with >> rack locality. >> >> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <[EMAIL PROTECTED]> wrote: >> > Hi Kai, >> > >> > Smart answer! :-) >> > >> > The assumption you have is one distributed cache replica could only >> > serve >> > one download session for tasktracker node (this is why you get >> > concurrency >> > n/r). The question is, why one distributed cache replica cannot serve >> > multiple concurrent download session? For example, supposing a >> > tasktracker >> > use elapsed time t to download a file from a specific distributed cache >> > replica, it is possible for 2 tasktrackers to download from the specific >> > distributed cache replica in parallel using elapsed time t as well, or >> > 1.5 >> > t, which is faster than sequential download time 2t you mentioned >> > before? >> > "In total, r+n/r concurrent operations. If you optimize r depending on >> > n, >> > SRQT(n) is the optimal replication level." -- how do you get SRQT(n) for >> > minimize r+n/r? Appreciate if you could point me to more details. >> > >> > regards, >> > Lin >> > >> > >> > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <[EMAIL PROTECTED]> wrote: >> >> >> >> Hi, >> >> >> >> simple math. Assuming you have n TaskTrackers in your cluster that will >> >> need to access the files in the distributed cache. And r is the >> >> replication >> >> level of those files. >> >> >> >> Copying the files into HDFS requires r copy operations over the >> >> network. >> >> The n TaskTrackers need to get their local copies from HDFS, so the n >> >> TaskTrackers copy from r DataNodes, so n/r concurrent operation. In >> >> total, >> >> r+n/r concurrent operations. If you optimize r depending on n, SRQT(n) >> >> is >> >> the optimal replication level. So 10 is a reasonable default setting >> >> for >> >> most clusters that are not 500+ nodes big. >> >> >> >> Kai >> >> >> >> Am 22.12.2012 um 13:46 schrieb Lin Ma <[EMAIL PROTECTED]>: >> >> >> >> Thanks Kai, using higher replication count for the purpose of? >> >> >> >> regards, >> >> Lin >> >> >> >> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <[EMAIL PROTECTED]> wrote: >> >>> >> >>> Hi, >> >>> >> >>> Am 22.12.2012 um 13:03 schrieb Lin Ma <[EMAIL PROTECTED]>: >> >>> >> >>> > I want to confirm when on each task node either mapper or reducer >> >>> > access distributed cache file, it resides on disk, not resides in >> >>> > memory. >> >>> > Just want to make sure distributed cache file does not fully loaded >> >>> > into >> >>> > memory which compete memory consumption with mapper/reducer tasks. >> >>> > Is that >> >>> > correct? >> >>> >> >>> >> >>> Yes, you are correct. The JobTracker will put files for the >> >>> distributed >> >>> cache into HDFS with a higher replication count (10 by default). >> >>> Whenever a >> >>> TaskTracker needs those files for a task it is launching locally, it Harsh J
-
Re: distributed cacheLin Ma 2012-12-26, 10:43
Thanks Harsh, multiple concurrent read is generally faster or?
regards, Lin On Wed, Dec 26, 2012 at 6:21 PM, Harsh J <[EMAIL PROTECTED]> wrote: > There is no limitation in HDFS that limits reads of a block to a > single client at a time (no reason to do so) - so downloads can be as > concurrent as possible. > > On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma <[EMAIL PROTECTED]> wrote: > > Thanks Harsh, > > > > Supposing DistributedCache is uploaded by client, for each replica, in > > Hadoop design, it could only serve one download session (download from a > > mapper or a reducer which requires the DistributedCache) at a time until > > DistributedCache file download is completed, or it could serve multiple > > concurrent parallel download session (download from multiple mappers or > > reducers which requires the DistributedCache). > > > > regards, > > Lin > > > > > > On Wed, Dec 26, 2012 at 4:51 PM, Harsh J <[EMAIL PROTECTED]> wrote: > >> > >> Hi Lin, > >> > >> DistributedCache files are stored onto the HDFS by the client first. > >> The TaskTrackers download and localize it. Therefore, as with any > >> other file on HDFS, "downloads" can be efficiently parallel with > >> higher replicas. > >> > >> The point of having higher replication for these files is also tied to > >> the concept of racks in a cluster - you would want more replicas > >> spread across racks such that on task bootup the downloads happen with > >> rack locality. > >> > >> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <[EMAIL PROTECTED]> wrote: > >> > Hi Kai, > >> > > >> > Smart answer! :-) > >> > > >> > The assumption you have is one distributed cache replica could only > >> > serve > >> > one download session for tasktracker node (this is why you get > >> > concurrency > >> > n/r). The question is, why one distributed cache replica cannot serve > >> > multiple concurrent download session? For example, supposing a > >> > tasktracker > >> > use elapsed time t to download a file from a specific distributed > cache > >> > replica, it is possible for 2 tasktrackers to download from the > specific > >> > distributed cache replica in parallel using elapsed time t as well, or > >> > 1.5 > >> > t, which is faster than sequential download time 2t you mentioned > >> > before? > >> > "In total, r+n/r concurrent operations. If you optimize r depending on > >> > n, > >> > SRQT(n) is the optimal replication level." -- how do you get SRQT(n) > for > >> > minimize r+n/r? Appreciate if you could point me to more details. > >> > > >> > regards, > >> > Lin > >> > > >> > > >> > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <[EMAIL PROTECTED]> wrote: > >> >> > >> >> Hi, > >> >> > >> >> simple math. Assuming you have n TaskTrackers in your cluster that > will > >> >> need to access the files in the distributed cache. And r is the > >> >> replication > >> >> level of those files. > >> >> > >> >> Copying the files into HDFS requires r copy operations over the > >> >> network. > >> >> The n TaskTrackers need to get their local copies from HDFS, so the n > >> >> TaskTrackers copy from r DataNodes, so n/r concurrent operation. In > >> >> total, > >> >> r+n/r concurrent operations. If you optimize r depending on n, > SRQT(n) > >> >> is > >> >> the optimal replication level. So 10 is a reasonable default setting > >> >> for > >> >> most clusters that are not 500+ nodes big. > >> >> > >> >> Kai > >> >> > >> >> Am 22.12.2012 um 13:46 schrieb Lin Ma <[EMAIL PROTECTED]>: > >> >> > >> >> Thanks Kai, using higher replication count for the purpose of? > >> >> > >> >> regards, > >> >> Lin > >> >> > >> >> On Sat, Dec 22, 2012 at 8:44 PM, Kai Voigt <[EMAIL PROTECTED]> wrote: > >> >>> > >> >>> Hi, > >> >>> > >> >>> Am 22.12.2012 um 13:03 schrieb Lin Ma <[EMAIL PROTECTED]>: > >> >>> > >> >>> > I want to confirm when on each task node either mapper or reducer > >> >>> > access distributed cache file, it resides on disk, not resides in > >> >>> > memory. > >> >>> > Just want to make sure distributed cache file does not fully > loaded
-
Re: distributed cacheHarsh J 2012-12-26, 11:48
Hi Lin,
It is comparable (and is also logically similar) to reading a file multiple times in parallel in a local filesystem - not too much of a performance hit for small reads (by virtue of OS caches, and quick completion per read, as is usually the case for distributed cache files), and gradually decreasing performance for long reads (due to frequent disk physical movement)? Thankfully, due to block sizes the latter isn't a problem for large files on a proper DN, as the blocks are spread over the disks and across the nodes. On Wed, Dec 26, 2012 at 4:13 PM, Lin Ma <[EMAIL PROTECTED]> wrote: > Thanks Harsh, multiple concurrent read is generally faster or? > > regards, > Lin > > > On Wed, Dec 26, 2012 at 6:21 PM, Harsh J <[EMAIL PROTECTED]> wrote: >> >> There is no limitation in HDFS that limits reads of a block to a >> single client at a time (no reason to do so) - so downloads can be as >> concurrent as possible. >> >> On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma <[EMAIL PROTECTED]> wrote: >> > Thanks Harsh, >> > >> > Supposing DistributedCache is uploaded by client, for each replica, in >> > Hadoop design, it could only serve one download session (download from a >> > mapper or a reducer which requires the DistributedCache) at a time until >> > DistributedCache file download is completed, or it could serve multiple >> > concurrent parallel download session (download from multiple mappers or >> > reducers which requires the DistributedCache). >> > >> > regards, >> > Lin >> > >> > >> > On Wed, Dec 26, 2012 at 4:51 PM, Harsh J <[EMAIL PROTECTED]> wrote: >> >> >> >> Hi Lin, >> >> >> >> DistributedCache files are stored onto the HDFS by the client first. >> >> The TaskTrackers download and localize it. Therefore, as with any >> >> other file on HDFS, "downloads" can be efficiently parallel with >> >> higher replicas. >> >> >> >> The point of having higher replication for these files is also tied to >> >> the concept of racks in a cluster - you would want more replicas >> >> spread across racks such that on task bootup the downloads happen with >> >> rack locality. >> >> >> >> On Sat, Dec 22, 2012 at 6:54 PM, Lin Ma <[EMAIL PROTECTED]> wrote: >> >> > Hi Kai, >> >> > >> >> > Smart answer! :-) >> >> > >> >> > The assumption you have is one distributed cache replica could only >> >> > serve >> >> > one download session for tasktracker node (this is why you get >> >> > concurrency >> >> > n/r). The question is, why one distributed cache replica cannot serve >> >> > multiple concurrent download session? For example, supposing a >> >> > tasktracker >> >> > use elapsed time t to download a file from a specific distributed >> >> > cache >> >> > replica, it is possible for 2 tasktrackers to download from the >> >> > specific >> >> > distributed cache replica in parallel using elapsed time t as well, >> >> > or >> >> > 1.5 >> >> > t, which is faster than sequential download time 2t you mentioned >> >> > before? >> >> > "In total, r+n/r concurrent operations. If you optimize r depending >> >> > on >> >> > n, >> >> > SRQT(n) is the optimal replication level." -- how do you get SRQT(n) >> >> > for >> >> > minimize r+n/r? Appreciate if you could point me to more details. >> >> > >> >> > regards, >> >> > Lin >> >> > >> >> > >> >> > On Sat, Dec 22, 2012 at 8:51 PM, Kai Voigt <[EMAIL PROTECTED]> wrote: >> >> >> >> >> >> Hi, >> >> >> >> >> >> simple math. Assuming you have n TaskTrackers in your cluster that >> >> >> will >> >> >> need to access the files in the distributed cache. And r is the >> >> >> replication >> >> >> level of those files. >> >> >> >> >> >> Copying the files into HDFS requires r copy operations over the >> >> >> network. >> >> >> The n TaskTrackers need to get their local copies from HDFS, so the >> >> >> n >> >> >> TaskTrackers copy from r DataNodes, so n/r concurrent operation. In >> >> >> total, >> >> >> r+n/r concurrent operations. If you optimize r depending on n, >> >> >> SRQT(n) >> >> >> is >> >> >> the optimal replication level. So 10 is a reasonable default setting Harsh J |