Hadoop >> mail # user >> SequenceFile split question


Re: SequenceFile split question
Thanks, that helps. I am reading small XML files from an external file system
and writing them to a SequenceFile. I made it a standalone client because I
thought MapReduce might not be the best way to do this kind of writing. My
understanding was that MapReduce is best suited to processing data that is
already in HDFS. Is MapReduce also an option I should consider?

On Thu, Mar 15, 2012 at 2:15 AM, Bejoy Ks <[EMAIL PROTECTED]> wrote:

> Hi Mohit
>      If you use a standalone client application for this, there is
> definitely just one instance of it running, and you'd be writing the
> sequence file to one HDFS block at a time. Once the file reaches the HDFS
> block size, writing continues in the next block; in the meantime the first
> block is replicated. If you run the same job distributed as MapReduce,
> you'd be writing to n files at a time, where n is the number of
> tasks in your MapReduce job.
>     AFAIK the data node where the blocks are placed is determined
> by Hadoop; it is not controlled by the end-user application. But if you
> launch the standalone job on a particular data node and it has
> space, one replica will be stored on that node. The same applies to MR
> tasks as well.
>
> Regards
> Bejoy.K.S
>
> On Thu, Mar 15, 2012 at 6:17 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>
> > I have a client program that creates a sequence file, which essentially
> > merges small files into a big file. I was wondering how the sequence file
> > splits the data across nodes. When I start, the sequence file is empty.
> > Does it get split when it reaches dfs.block.size? If so, does that mean
> > I am always writing to just one node at a given point in time?
> >
> > If I start a new client writing a new sequence file, is there a way to
> > select a different data node?
> >
>
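
The block-at-a-time behaviour described in Bejoy's reply comes down to
offset arithmetic: a single writer only ever appends at the current end of
the file, so the block being written is the one that contains that offset.
The following is a minimal sketch in plain Python, not the Hadoop API. The
function names are hypothetical, the 64 MB block size is an assumption (check
dfs.block.size on your cluster), and the sketch ignores the SequenceFile
header and sync markers, which add some bytes of their own.

```python
# Sketch only: models byte offsets, not real HDFS or SequenceFile behaviour.
BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size in bytes; cluster-specific


def blocks_needed(file_bytes, block_size=BLOCK_SIZE):
    """Number of blocks a file of file_bytes occupies (an empty file needs none)."""
    return (file_bytes + block_size - 1) // block_size


def block_of_offset(offset, block_size=BLOCK_SIZE):
    """Zero-based index of the block that holds the byte at this file offset."""
    return offset // block_size


def trace_single_writer(record_sizes, block_size=BLOCK_SIZE):
    """Follow one writer appending records of the given (positive) sizes.

    Returns one (first_block, last_block) pair per record. A record whose
    two indices differ straddles a block boundary.
    """
    spans = []
    offset = 0
    for size in record_sizes:
        first = block_of_offset(offset, block_size)
        last = block_of_offset(offset + size - 1, block_size)
        spans.append((first, last))
        offset += size
    return spans


if __name__ == "__main__":
    # Four 40-byte records with a toy 100-byte block size.
    print(trace_single_writer([40, 40, 40, 40], block_size=100))
    print(blocks_needed(160, block_size=100))
```

With the toy numbers above, the third record starts in block 0 and ends in
block 1, which is the moment the writer moves on to the next block. Running
n such writers in parallel, as n MapReduce tasks would, gives n independent
files, each with its own current block.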