Gang Luo 2010-07-29, 14:37
Hi all, if I use the distributed cache to send some files to all the nodes in one MR job, can I reuse these cached files locally in my next job, or will Hadoop re-send these files?
Thanks, -Gang
Hemanth Yamijala 2010-07-30, 04:24
Hi,
> if I use distributed cache to send some files to all the nodes in one MR job, can I reuse these cached files locally in my next job, or will hadoop re-send these files again?
Cache files are reused across jobs. From trunk onwards, reuse will be restricted to jobs of the same user, unless the files are marked 'public', in which case they can be reused by jobs of all users.
Thanks, Hemanth
Gang Luo 2010-07-30, 13:14
Thanks Hemanth. Is there any way to invalidate the reuse and ask Hadoop to resend exactly the same files to the cache for every job?
Thanks, -Gang
Hemanth Yamijala 2010-08-02, 04:56
Hi,
> Thanks Hemanth. Is there any way to invalidate the reuse and ask Hadoop to resend exactly the same files to cache for every job?
I may be able to answer this better if I understand the use case. If you need the same files for every job, why would you need to send them afresh each time? If something is cached, it can be reused, no? I am sure I must be missing something in your requirement ...
Thanks Hemanth
Gang Luo 2010-08-02, 13:27
I am actually running some tests to measure performance, and I want to eliminate the interference of the distributed cache. I found that there is a method in the API to purge the cache. That might be what I want.
Thanks, -Gang
Hemanth Yamijala 2010-08-03, 04:33
Hi,
> I am actually doing some test to see the performance. I want to eliminate the interference of distributed cache. I find there is method in the api to purge the cache. That might be what I want.
So, you want to run multiple versions of a job (possibly with different job parameters) and measure them relative to each other. Is that correct?
I can think of some options:
- Is it possible not to use the distributed cache at all? You could bundle the files along with the job jar.
- You could run the job on fresh cluster instances (a more costly option, though).
- You could change the timestamps of the distributed cache files on DFS before each invocation of the job. This will make Hadoop believe that the files have changed, and will cause the distributed cache to fetch the files again.
The purgeCache API you are seeing is specific to the MapReduce framework. It is *not* meant to be used by client code, and is not guaranteed to work. In later versions of Hadoop (0.21 and trunk), these methods have been deprecated in the public API and will be removed altogether.
Thanks, Hemanth
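The timestamp option above can be illustrated with a small toy model. This is only a sketch of the idea that a cached copy is reused while the recorded modification time matches and is fetched again once it differs. It is not Hadoop's actual localization code: the class name TimestampKeyedCache, its methods, and the path /cache/lookup.dat are all hypothetical. On a real cluster the modification time could presumably be bumped through the FileSystem API (for example FileSystem.setTimes), but the thread does not say how, so treat that as an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only (hypothetical names); not Hadoop's real distributed cache. */
class TimestampKeyedCache {
    // DFS path -> modification time of the copy that was already localized.
    private final Map<String, Long> localized = new HashMap<>();
    private int fetches = 0;

    /** Returns true if the file had to be fetched, false if the local copy was reused. */
    boolean localize(String dfsPath, long dfsModificationTime) {
        Long cached = localized.get(dfsPath);
        if (cached != null && cached.longValue() == dfsModificationTime) {
            return false; // same path and same timestamp: reuse the local copy
        }
        localized.put(dfsPath, dfsModificationTime);
        fetches++;
        return true; // new file or changed timestamp: fetch again
    }

    int fetchCount() {
        return fetches;
    }

    public static void main(String[] args) {
        TimestampKeyedCache cache = new TimestampKeyedCache();
        String path = "/cache/lookup.dat"; // hypothetical cache file

        System.out.println("job 1 fetched: " + cache.localize(path, 100L));
        System.out.println("job 2 fetched: " + cache.localize(path, 100L));
        // Simulate touching the file on DFS before the next run.
        System.out.println("job 3 fetched: " + cache.localize(path, 200L));
    }
}
```

In this model only the timestamp changes between job 2 and job 3, which is the point of the third option: the file contents stay identical, yet the cache treats the file as new.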