-
where distributed cache start working Gang Luo 2010-08-20, 15:08
Hi all,
I went through the code but couldn't find the place where the distributed cache starts working. Between DistributedCache.addCacheFile at the master node and DistributedCache.getLocalCacheFiles at the client side, when and where do the files actually get distributed?

Thanks,
-Gang
-
Re: where distributed cache start working Jeff Zhang 2010-08-20, 15:22
Hi Gang,
In TaskRunner's run() method, Hadoop downloads the cache files that you set on the client side to the local disk, so the forked child JVM can then use these cache files locally.

--
Best Regards
Jeff Zhang
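To make the order of events concrete, here is a toy model in plain Java. It is not Hadoop code: the `JobConf` class, the `localize` method and the paths are hypothetical stand-ins, and the actual download is replaced by a path computation. It only illustrates the point above: adding a cache file on the client merely records a URI in the job configuration, and the copy to local disk happens later on the task node, before the child JVM starts.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the flow, NOT Hadoop code: every name here is hypothetical.
public class CacheFlowSketch {
    // Stands in for the job configuration shipped from the client to the task nodes.
    static class JobConf {
        final List<String> cacheUris = new ArrayList<>();
    }

    // Client side: only records the URI; nothing is copied at this point.
    static void addCacheFile(String uri, JobConf conf) {
        conf.cacheUris.add(uri);
    }

    // Task node, before the child JVM is forked: map each URI to a local path.
    // A real system would download the file here; this sketch only builds the path.
    static List<String> localize(JobConf conf, String localDir) {
        List<String> localPaths = new ArrayList<>();
        for (String uri : conf.cacheUris) {
            String name = uri.substring(uri.lastIndexOf('/') + 1);
            localPaths.add(localDir + "/" + name);
        }
        return localPaths;
    }

    public static void main(String[] args) {
        JobConf conf = new JobConf();
        addCacheFile("hdfs://namenode/A/B/C/D/file", conf);
        // The child task would later read these local paths.
        System.out.println(localize(conf, "/tmp/cache")); // prints [/tmp/cache/file]
    }
}
```

Because the copy step runs outside the child JVM in this model, a tracer attached only to the child would never see it.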
-
Re: where distributed cache start working Gang Luo 2010-08-22, 16:48
Thanks Jeff.
However, are you sure TaskRunner.run() is also used in the new API? I used btrace to trace the function calls but didn't find this function being called anywhere.

One more question about the distributed cache. After I call DistributedCache.purgeCache, I would expect the local cached files to be deleted or invalidated. However, when I run the same job multiple times with the purge operation at the end, the local files are never deleted, and their modification time is still that of the first job run. How can I make my job re-distribute the cache anyway?

Thanks,
-Gang
-
Re: where distributed cache start working Jeff Zhang 2010-08-23, 04:47
Are you debugging it with LocalJobRunner? In local mode TaskRunner won't be called, because the mapper task runs in a thread rather than in a forked JVM. TaskRunner is only called in distributed mode.

--
Best Regards
Jeff Zhang
-
Re: where distributed cache start working Gang Luo 2010-08-25, 15:12
Hi Jeff,
I realize that the profiling runs within each JVM, while the distributed cache seems to start before the JVM starts. That is probably why I couldn't trace it.

Thanks,
-Gang
-
Re: where distributed cache start working Gang Luo 2010-08-26, 04:59
Thanks Arun. Changing the mTime is a good idea. However, given a file (the path is A/B/C/D/file) distributed to all the nodes, if I just change the mTime of the file to an earlier timestamp, it will not be replaced next time. Should I also change the mTime of all the directories along the path (A, B, C and D)? Whose timestamp is used by DistributedCache?

Thanks.
-Gang

----- Original Message -----
From: Arun C Murthy <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: 2010/8/22 (Sunday) 9:38:02 PM
Subject: Re: where distributed cache start working

Moving to mapreduce-user@, bcc common-dev@. Please use the project-specific lists.

DistributedCache.purgeCache isn't a public API. You shouldn't be calling it from the task.

A simple way of doing what you want is to change the mtime of the cache files on HDFS.

Arun
-
Re: where distributed cache start working Hemanth Yamijala 2010-08-27, 14:04
Hi,

> Thanks Arun. Change the mTime is a good idea. However, given a file (the path is
> A/B/C/D/file) distributed to all the nodes, if I just change the mTime of file
> to a earlier time stamp, it will not be replaced next time. Should I also change
> the mTime for all the directories along the path (A, B, C and D). Whose
> timestamp is used by DistributedCache?

It is the timestamp of the file on DFS. So if you modify the file's timestamp on DFS, it should be re-distributed to all the nodes.

Thanks
Hemanth
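The rule described above can be sketched as a small, self-contained model. This is not Hadoop's implementation: the class and method names are hypothetical, and treating any difference in modification time as "stale" is an assumption made for illustration. What the sketch does take from the thread is that only the timestamp of the file itself on DFS is consulted, not the timestamps of the directories along its path.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of timestamp-based cache freshness, NOT Hadoop code: names are hypothetical.
public class CacheTimestampSketch {
    // Cache URI -> DFS modification time recorded when the file was last localized.
    private final Map<String, Long> localizedMtime = new HashMap<>();

    // True when the local copy is missing or was made from a different file timestamp.
    // Assumption: any difference counts as stale; parent directories are never checked.
    boolean needsRefresh(String uri, long dfsMtime) {
        Long seen = localizedMtime.get(uri);
        return seen == null || seen.longValue() != dfsMtime;
    }

    // Records that the file was copied locally at the given DFS modification time.
    void localize(String uri, long dfsMtime) {
        localizedMtime.put(uri, dfsMtime);
    }

    public static void main(String[] args) {
        CacheTimestampSketch cache = new CacheTimestampSketch();
        String uri = "hdfs://namenode/A/B/C/D/file";

        System.out.println(cache.needsRefresh(uri, 1000L)); // true: never localized
        cache.localize(uri, 1000L);
        System.out.println(cache.needsRefresh(uri, 1000L)); // false: timestamp unchanged
        System.out.println(cache.needsRefresh(uri, 2000L)); // true: file mtime changed on DFS
    }
}
```

In this model, re-running a job against an unchanged file reuses the local copy, which matches the observation earlier in the thread that the local files kept the modification time of the first run.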