Right, sorry for the ambiguity, I was talking about HDFS writes only.
So my application doesn't need to do anything to signal whether it is writing from inside or outside the Hadoop cluster; HDFS figures that out from the client's IP or hostname?
From: Harsh J [mailto:[EMAIL PROTECTED]]
Sent: Thursday, May 16, 2013 11:12 PM
To: <[EMAIL PROTECTED]>
Subject: Re: Question about writing HDFS files
Thanks for the clarification, Rahul. In that case, the reading is correct (and an HDFS client behaves the same in and out of MR; it's not really related to MR at all).
A "client outside" would write to a random set of datanodes, spread across at least two racks for 3 replicas if rack awareness is turned on.
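To make the inside vs. outside difference concrete, here is a minimal sketch of the placement idea in plain Java. It is a toy model, not Hadoop code: the class ReplicaPlacementSketch, the method chooseTargets, and the hostnames and rack ids are all made up for illustration. It assumes rack awareness is on, a replication factor of 3, and a topology with at least two racks of two datanodes each, and it follows the default policy as I understand it: first replica on the writer's own datanode if there is one (otherwise a random node), second on a different rack, third on another node in the second replica's rack.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Predicate;

// Toy model only: this is NOT Hadoop code and does not call any HDFS API.
public class ReplicaPlacementSketch {

    // Returns three datanode hostnames for one block.
    // nodeToRack maps each datanode hostname to its rack id (hypothetical names).
    public static List<String> chooseTargets(String writerHost,
                                             Map<String, String> nodeToRack,
                                             Random rng) {
        List<String> nodes = new ArrayList<>(nodeToRack.keySet());
        List<String> targets = new ArrayList<>();

        // Replica 1: the writer's own node if the writer runs on a datanode,
        // otherwise (client outside the cluster) a random datanode.
        String first = nodeToRack.containsKey(writerHost)
                ? writerHost
                : nodes.get(rng.nextInt(nodes.size()));
        targets.add(first);

        // Replica 2: a random node on a different rack than replica 1.
        String firstRack = nodeToRack.get(first);
        String second = pick(nodes, rng, n -> !nodeToRack.get(n).equals(firstRack));
        targets.add(second);

        // Replica 3: a different node on the same rack as replica 2.
        String secondRack = nodeToRack.get(second);
        String third = pick(nodes, rng,
                n -> !targets.contains(n) && nodeToRack.get(n).equals(secondRack));
        targets.add(third);

        return targets;
    }

    // Picks a random node that satisfies the predicate.
    private static String pick(List<String> nodes, Random rng, Predicate<String> ok) {
        List<String> candidates = new ArrayList<>();
        for (String n : nodes) {
            if (ok.test(n)) {
                candidates.add(n);
            }
        }
        if (candidates.isEmpty()) {
            throw new IllegalStateException("topology too small for this sketch");
        }
        return candidates.get(rng.nextInt(candidates.size()));
    }

    public static void main(String[] args) {
        Map<String, String> topology = new LinkedHashMap<>();
        topology.put("dn1", "/rackA");
        topology.put("dn2", "/rackA");
        topology.put("dn3", "/rackB");
        topology.put("dn4", "/rackB");

        Random rng = new Random();
        // Writer co-located with a datanode: first replica stays on dn1.
        System.out.println("writer on dn1:  " + chooseTargets("dn1", topology, rng));
        // Writer outside the cluster: all three targets are chosen at random.
        System.out.println("outside client: " + chooseTargets("edge-host", topology, rng));
    }
}
```

The only input that changes the outcome is whether the writer's host is itself a datanode, which is why the application does not need to signal anything. In both cases the three replicas span exactly two racks.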
On Fri, May 17, 2013 at 8:17 AM, Rahul Bhattacharjee <[EMAIL PROTECTED]> wrote:
> Hi Harsh,
> I think what John meant by writing to local disk is writing to the
> same data node first which has initiated the write call.
> John can further clarify.
> On Fri, May 17, 2013 at 4:23 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>> That is not true. HDFS writes are not staged to a local disk first
>> before being written onto the DataNodes. The old architecture docs
>> seem to suggest that the writes get staged to a local disk but that's
>> not true anymore, see https://issues.apache.org/jira/browse/HDFS-1454.
>> Also worth noting that an HDFS client behaves the same way in almost
>> all contexts, whether it's invoked from an MR framework or directly
>> from the shell.
>> On Fri, May 17, 2013 at 3:38 AM, John Lilley <[EMAIL PROTECTED]> wrote:
>> > I seem to recall reading that when a MapReduce task writes a file,
>> > the blocks of the file are always written to local disk, and
>> > replicated to other nodes. If this is true, is this also true for
>> > non-MR applications writing to HDFS from Hadoop worker nodes? What
>> > about clients outside of the cluster doing a file load?
>> > Thanks
>> > John
>> Harsh J