There is currently no way to explicitly delete data from the cache when you are done. It is garbage collected when the cache starts to fill up (in LRU order if you are on a newer release). DistributedCache.addCacheFile modifies the JobConf behind the scenes for you; if you want to dig into the details of what it does, you can look at its source code.
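For the record, here is a minimal sketch of that pattern using the old org.apache.hadoop.mapred API. The path /user/andy/lookup.dat, the class names, and the method split between driver and mapper are illustrative assumptions, not code from this thread:

```java
// Sketch only: requires the Hadoop jars on the classpath.
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CacheExample {

  // Driver side: registering the file only records its URI in the JobConf
  // (the mapred.cache.files property); the TaskTrackers then copy the file
  // onto each node's local disk before any tasks start.
  public static void registerCacheFile(JobConf conf) throws Exception {
    DistributedCache.addCacheFile(new URI("/user/andy/lookup.dat"), conf);
  }

  // Mapper side: the localized copies are exposed as local-filesystem paths.
  public static class CacheMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private Path lookupFile;

    @Override
    public void configure(JobConf job) {
      try {
        Path[] cached = DistributedCache.getLocalCacheFiles(job);
        if (cached != null && cached.length > 0) {
          lookupFile = cached[0]; // a path on this node's local disk
        }
      } catch (IOException e) {
        throw new RuntimeException("Could not locate cached file", e);
      }
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // Open lookupFile with ordinary java.io and consult it per record here.
    }
  }
}
```

Calling DistributedCache.createSymlink(conf) in the driver additionally makes the cached files appear under their own names in each task's working directory, which some people find more convenient than walking the getLocalCacheFiles array.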
On 11/28/11 4:46 AM, "Andy Doddington" <[EMAIL PROTECTED]> wrote:
Thanks for that link Prashant - very useful.
Two brief follow-up questions:
1) Having put data in the cache, I would like to be a good citizen by deleting it once I've finished - how do I do that?
2) Would it be simpler to pass the data as a value in the jobConf object?
On 25 Nov 2011, at 12:14, Prashant Kommireddi wrote:
> I believe you want to ship data to each node in your cluster before MR begins so the mappers can access files local to their machine. The Hadoop tutorial on YDN has some good info on this.
> -Prashant Kommireddi
> On Fri, Nov 25, 2011 at 1:05 AM, Andy Doddington <[EMAIL PROTECTED]> wrote:
>> I have a series of mappers that I would like to be passed data using the distributed cache mechanism.
>> At the moment, I am using HDFS to pass the data, but this seems wasteful to me, since they are all reading the same data.
>> Is there a piece of example code that shows how data files can be placed in the cache and accessed by mappers?
>> Andy Doddington