On Feb 3, 2011, at 9:46 AM, Harsh J wrote:
> On Thu, Feb 3, 2011 at 10:46 PM, Keith Wiley <[EMAIL PROTECTED]> wrote:
>> I've seen this asked before, but haven't seen a response yet.
>> If the input to a streaming job is not actual data splits but simple HDFS file names which are then read by the mappers, then how can data locality be achieved?
> Also, if you're only looking to not split the files, you can pass in a
> custom FileInputFormat with isSplitable returning false? You'll lose
> completeness in locality because of blocks not present in the chosen
> node though, yes -- But I believe that adding a hundred files to
> DistributedCache is not the solution, as the DistributedCache data is
> set to ALL the nodes AFAIK.

The files won't be split; they're only 6 MB each. I'm looking to get the files to my streaming job somehow, and the method I've chosen is to send mere fileNAMES via the streaming API and have the streaming program open each file from HDFS through a symbolic link in the distributed cache (the link originating from -cacheFile, presumably).

My understanding is that the -cacheFile option and the DistributedCache.addCacheFile() method don't copy the entire file into the distributed cache, but rather create tiny symbolic links to the actual HDFS file. Correct? If you don't think I should add hundreds of files to the distributed cache (or even hundreds of links), then how else can I make the files available to my streaming program?
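For what it's worth, the filenames-plus-symlinks approach can be sketched roughly like this (a minimal stand-in, not the actual program: the function name, the `resolve` hook, and the line-count payload are all mine for illustration). The assumption is that -cacheFile was given a URI of the form `hdfs://host/path/file#linkname`, so the task's working directory contains a symlink named `linkname` and resolving each record to its basename is enough to open it:

```python
import os

def map_filenames(lines, resolve=os.path.basename):
    """Streaming-style mapper sketch: each input record is a file NAME
    sent through the streaming API, not file contents.

    `resolve` maps a record to a local path; with -cacheFile symlinks
    in the task's working directory, the basename suffices.  As a
    stand-in for real per-file work, emit tab-separated
    (name, line_count) records.
    """
    for raw in lines:
        name = raw.strip()
        if not name:
            continue  # ignore blank records
        with open(resolve(name)) as f:
            count = sum(1 for _ in f)
        yield "%s\t%d" % (name, count)
```

In a real job the loop would run over sys.stdin and print each emitted record; note that because the file is opened via the cache symlink on whatever node the task landed on, this does nothing to restore data locality.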
Put another way, do you know of another method by which to permit the streaming programs to read files from HDFS?
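One alternative that skips the distributed cache entirely is to have the streaming program shell out to the stock `hadoop fs -cat` command for each record. A sketch (again mine, not anything from this thread; `cat_cmd` is parameterised only so the sketch can be exercised off-cluster with plain `cat`):

```python
import subprocess

def map_hdfs_paths(lines, cat_cmd=("hadoop", "fs", "-cat")):
    """Streaming-style mapper sketch: each input record is an HDFS
    path, and the mapper streams the file's bytes through
    `hadoop fs -cat` rather than a cache symlink.  Emits tab-separated
    (path, line_count) records as placeholder work.
    """
    for raw in lines:
        path = raw.strip()
        if not path:
            continue  # ignore blank records
        data = subprocess.check_output(list(cat_cmd) + [path])
        yield "%s\t%d" % (path, data.decode("utf-8").count("\n"))
```

The trade-off: no cache bookkeeping for hundreds of files, but one JVM spin-up per -cat invocation and, again, no data locality, so for small files read once the process overhead may dominate.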
Keith Wiley
[EMAIL PROTECTED]
keithwiley.com
music.keithwiley.com

"And what if we picked the wrong religion? Every week, we're just making God
madder and madder!"
   -- Homer Simpson