Hadoop, mail # user - SequenceFile split question


Re: SequenceFile split question
Bejoy Ks 2012-03-15, 14:58
Hi Mohit
     You are right. If your smaller XML files are already in HDFS, then MR would be
the best approach to combine them into a sequence file. It would do the job
in parallel.

Regards
Bejoy.K.S
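The MapReduce approach suggested above can be sketched roughly as below. This is a hedged sketch, not a tested job: it assumes hadoop-client on the classpath and a custom `WholeFileInputFormat` (hypothetical here, not part of stock Hadoop) that delivers each small XML file as a single (filename, bytes) record.

```java
// Sketch only: assumes hadoop-client on the classpath and a hypothetical
// WholeFileInputFormat that reads each small file as one
// (Text filename, BytesWritable contents) record.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SmallXmlToSequenceFile {

  // Identity mapper: passes each (filename, contents) pair straight through
  // to the SequenceFile output.
  public static class PassThroughMapper
      extends Mapper<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void map(Text key, BytesWritable value, Context ctx)
        throws java.io.IOException, InterruptedException {
      ctx.write(key, value);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "xml-to-seqfile");
    job.setJarByClass(SmallXmlToSequenceFile.class);
    job.setMapperClass(PassThroughMapper.class);
    job.setNumReduceTasks(0);                            // map-only job
    job.setInputFormatClass(WholeFileInputFormat.class); // hypothetical custom format
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(BytesWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because it is map-only, each map task writes its own output SequenceFile, which is the "n files at a time" parallelism described later in the thread.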

On Thu, Mar 15, 2012 at 8:17 PM, Mohit Anchlia <[EMAIL PROTECTED]>wrote:

> Thanks! That helps. I am reading small XML files from an external file system
> and then writing them to the SequenceFile. I made it a stand-alone client, thinking
> that MapReduce may not be the best way to do this type of writing. My
> understanding was that MapReduce is best suited for processing data within
> HDFS. Is MapReduce also one of the options I should consider?
>
> On Thu, Mar 15, 2012 at 2:15 AM, Bejoy Ks <[EMAIL PROTECTED]> wrote:
>
> > Hi Mohit
> >      If you are using a stand-alone client application to do this, there is
> > definitely just one instance of it running, and you'd be writing the
> > sequence file to one HDFS block at a time. Once it reaches the HDFS
> > block size, writing continues to the next block; in the meantime, the first
> > block is replicated. If you are doing the same job distributed as map
> > reduce, you'd be writing to n files at a time, where n is the number of
> > tasks in your map reduce job.
> >     AFAIK, the data node where the blocks are placed is determined by
> > Hadoop; it is not controlled by the end-user application. But if you are
> > triggering the stand-alone job on a particular data node and it has
> > space, one replica will be stored on that node. The same applies to MR
> > tasks as well.
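The single-writer behavior described above corresponds to a stand-alone client along these lines. This is a hedged sketch, not the poster's actual program: it assumes hadoop-client on the classpath and a reachable HDFS, with the local input directory and HDFS output path taken from the command line.

```java
// Sketch only: a stand-alone client that appends small local XML files
// into one HDFS SequenceFile. HDFS splits the growing file into blocks
// and replicates them transparently; the client writes one block at a time.
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class XmlMergeClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path out = new Path(args[1]);  // HDFS path of the merged sequence file
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(out),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class));
    try {
      // args[0]: local directory holding the small XML files
      for (File xml : new File(args[0]).listFiles()) {
        byte[] bytes = Files.readAllBytes(xml.toPath());
        writer.append(new Text(xml.getName()), new BytesWritable(bytes));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}
```

As the reply notes, such a client fills one HDFS block at a time, and if it runs on a data node with free space, one replica of each block will typically land on that node.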
> >
> > Regards
> > Bejoy.K.S
> >
> > On Thu, Mar 15, 2012 at 6:17 AM, Mohit Anchlia <[EMAIL PROTECTED]
> > >wrote:
> >
> > > I have a client program that creates a sequence file, which essentially
> > > merges small files into a big file. I was wondering how the sequence
> > > file splits
> > > the data across nodes. When I start, the sequence file is empty. Does
> > it
> > > get split when it reaches the dfs.block size? If so, does that mean
> > that
> > > I am always writing to just one node at a given point in time?
> > >
> > > If I start a new client writing a new sequence file, is there a way
> > to
> > > select a different data node?
> > >
> >
>