> I am actually doing some test to see the performance. I want to eliminate the
> interference of distributed cache. I find there is method in the api to purge
> the cache. That might be what I want.
So, you want to run multiple versions of a job (possibly with different job
parameters) and measure them relative to one another. Is that correct?
I can think of some options:
- Is it possible not to use the distributed cache at all? You could
possibly bundle the files along with the job jar.
- You could run the job on fresh cluster instances (a more costly option).
- You could change the timestamps of the distributed cache files on
DFS somehow before each invocation of the job. This will make Hadoop
believe that the files have been changed, and this will cause
distributed cache to fetch the files again.
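The timestamp option above could be done from the command line, for
example like this (the path is hypothetical; re-uploading the file gives
it a fresh modification time on DFS, which the distributed cache compares
before deciding whether to reuse its local copy):

```shell
# Hypothetical cache file path; adjust to your setup.
CACHE_FILE=/user/me/cache/dict.txt

# Re-upload the file so its DFS modification time changes.
# On the next job submission the distributed cache will see a
# newer timestamp and fetch the file afresh instead of reusing it.
hadoop fs -get "$CACHE_FILE" /tmp/dict.txt
hadoop fs -rm  "$CACHE_FILE"
hadoop fs -put /tmp/dict.txt "$CACHE_FILE"
rm /tmp/dict.txt
```

This is just a sketch: it assumes no job is reading the file during the
swap, since the file briefly does not exist on DFS between the -rm and
the -put.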
The purgeCache API you are seeing is internal to the MapReduce
framework. It is *not* meant to be used by client code, and is not
guaranteed to work. In later versions of Hadoop (0.21 and trunk),
these methods have been deprecated in the public API and will likely be
removed.
> ----- Original Message ----
> From: Hemanth Yamijala <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Monday, 2010/8/2, 12:56:25 AM
> Subject: Re: reuse cached files
>> Thanks Hemanth. Is there any way to invalidate the reuse and ask Hadoop to
>> re-send exactly the same files to the cache for every job?
> I may be able to answer this better if I understand the use case. If
> you need the same files for every job, why would you need to send them
> afresh each time? If something is cached, it can be reused, no? I am
> sure I must be missing something in your requirement ...