If you are using a standalone client application to do this, there is just
one instance of it running, and you'd be writing the sequence file to one
HDFS block at a time. Once the data reaches the HDFS block size, writing
continues on the next block; in the meantime the first block is replicated.
If you run the same job distributed as MapReduce, you'd be writing to n
files at a time, where n is the number of tasks in your MapReduce job.
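As a rough illustration of the one-block-at-a-time behaviour, here is a
hypothetical simulation (not Hadoop code; the block size and write sizes are
made-up numbers, and `write_stream` is an invented helper): only the last
block is ever open for writing, and each earlier block is sealed, and hence
eligible for replication, as soon as it fills.

```python
# Hypothetical sketch of how a single SequenceFile writer fills HDFS blocks
# sequentially: a new block starts only once the current one reaches the
# configured block size, at which point the finished block can be replicated.

BLOCK_SIZE = 100  # pretend dfs.block.size, in bytes, for the simulation

def write_stream(chunks, block_size=BLOCK_SIZE):
    """Append each chunk of bytes in order, one block at a time.

    Returns (blocks, sealed): the fill level of every block, and the
    indices of blocks that filled up (and so could start replicating).
    """
    blocks = [0]   # fill level of each block; only the last one is open
    sealed = []    # blocks that reached block_size while writing
    for size in chunks:
        remaining = size
        while remaining > 0:
            space = block_size - blocks[-1]
            wrote = min(space, remaining)
            blocks[-1] += wrote
            remaining -= wrote
            if blocks[-1] == block_size:  # block is full: seal it, open next
                sealed.append(len(blocks) - 1)
                blocks.append(0)
    return blocks, sealed

blocks, sealed = write_stream([60, 60, 120])  # 240 bytes total
print(blocks)  # [100, 100, 40] -> only the last block is still being written
print(sealed)  # [0, 1] -> earlier blocks were sealed (replicated) in order
```

A MapReduce job with n tasks would run n such writers in parallel, one per
output file, which is where the n-blocks-at-a-time behaviour comes from.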
AFAIK the data node where the blocks are placed is determined by Hadoop; it
is not controlled by the end-user application. But if you are running the
standalone job on a particular data node and it has space, one replica will
be stored on that same node. The same applies to MR tasks as well.
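For intuition, the default HDFS placement heuristic (first replica on the
writer's own node when the writer runs on a datanode, second replica on a
node in a different rack, third on another node in that same remote rack)
can be sketched roughly as follows. This is a simplified toy model, not the
actual NameNode policy: the cluster layout and the `place_replicas` helper
are invented, and it ignores disk space and load checks, and assumes every
rack has at least two nodes.

```python
import random

def place_replicas(cluster, writer_node, rng=random):
    """Toy model of default HDFS replica placement.

    cluster: dict mapping rack name -> list of datanode names.
    Returns the three datanodes chosen for a block's replicas.
    """
    node_rack = {n: r for r, nodes in cluster.items() for n in nodes}
    # 1st replica: the writer's own node, if the writer runs on a datanode;
    # otherwise any datanode in the cluster.
    if writer_node in node_rack:
        first = writer_node
    else:
        first = rng.choice([n for nodes in cluster.values() for n in nodes])
    # 2nd replica: some node on a different rack than the first replica.
    remote_rack = rng.choice([r for r in cluster if r != node_rack[first]])
    second = rng.choice(cluster[remote_rack])
    # 3rd replica: a different node on that same remote rack
    # (assumes the rack has at least two nodes).
    third = rng.choice([n for n in cluster[remote_rack] if n != second])
    return [first, second, third]

cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas(cluster, "dn1")[0])  # -> dn1 (local node gets replica 1)
```

This is why triggering the standalone writer on a particular data node tends
to put one replica there, while the other replicas land wherever Hadoop's
rack-aware policy decides.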
On Thu, Mar 15, 2012 at 6:17 AM, Mohit Anchlia <[EMAIL PROTECTED]>wrote:
> I have a client program that creates sequencefile, which essentially merges
> small files into a big file. I was wondering how is sequence file splitting
> the data across nodes. When I start the sequence file is empty. Does it
> get split when it reaches the dfs.block size? If so then does it mean that
> I am always writing to just one node at a given point in time?
> If I start a new client writing a new sequence file then is there a way to
> select a different data node?