Re: Regarding loading a big XML file to HDFS
Hi All
I'm sharing my understanding here. Please correct me if I'm wrong (Uma
and Michael).
What Michael explained is, I believe, how MapReduce programs normally
work. Take a plain text file of 96 MB with an HDFS block size of 64 MB:
the file is stored as two blocks, block A (64 MB) and block B (32 MB).
HDFS splits and stores the file purely by size, never by end-of-line
characters. This means the last line of block A may not be complete: part
of it can sit in block A and the rest in block B. That is how the file is
stored in HDFS.
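
To make the size-only splitting concrete, here is a small sketch in plain Python (not Hadoop code). The helper name block_layout and the toy text are my own assumptions for illustration; it only reproduces the arithmetic of the 96 MB / 64 MB example and shows a cut landing in the middle of a line.

```python
MB = 1024 * 1024

def block_layout(file_size, block_size):
    """Return (offset, length) pairs for fixed-size blocks; the last may be shorter."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# The 96 MB file with a 64 MB block size from the example above.
layout = block_layout(96 * MB, 64 * MB)
print([(off // MB, length // MB) for off, length in layout])  # [(0, 64), (64, 32)]

# The same cut at toy scale (hypothetical 16-byte blocks): the boundary
# ignores line delimiters, so "second line" ends up in two pieces.
text = b"first line\nsecond line\nthird line\n"
pieces = [text[off:off + n] for off, n in block_layout(len(text), 16)]
print(pieces[0])  # b'first line\nsecon'
```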
When we process the file stored in HDFS with MapReduce (say with the
default TextInputFormat), the JobTracker spawns two mappers, mapper A and
mapper B. Mapper A reads block A, and when it reaches the last line it
does not find a line delimiter, so it keeps reading up to the first line
delimiter in block B. Mapper B starts processing block B only after that
first line delimiter. A mapper knows from the offset whether it is reading
the first block of a file or a later one: if the offset is 0, it is the
first block. Please add on if anything other than the offset, such as some
metadata, is also considered. So we don't need a custom input format or
record reader to get this default behavior of reading up to the end of a
line/record.
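
The boundary rule described above can be sketched as follows. This is a simplified model in plain Python, not Hadoop's actual record reader; read_split and the sample data are hypothetical names of mine. A reader whose offset is not 0 discards everything up to the first line delimiter, and every reader is allowed to run past the end of its split to finish its last line.

```python
def read_split(data, start, length):
    """Return the lines owned by the split [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # Not the first split: the partial (or boundary) line belongs to
        # the previous reader, so skip past the first delimiter.
        nl = data.find(b"\n", pos)
        if nl == -1:
            return []
        pos = nl + 1
    records = []
    # Keep reading while a line *starts* at or before the split end,
    # even if that line finishes in the next block.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            records.append(data[pos:])
            break
        records.append(data[pos:nl])
        pos = nl + 1
    return records

data = b"alpha\nbravo\ncharlie\ndelta\n"
block = 10  # hypothetical block size in bytes
splits = [(off, min(block, len(data) - off)) for off in range(0, len(data), block)]
for start, length in splits:
    print(start, read_split(data, start, length))
```

With these numbers the first reader returns alpha and bravo (bravo crosses the boundary), the second returns charlie and delta, and the third returns nothing, so each line is read exactly once.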
Such processing hardly makes sense for complex XML, because XML is built
entirely on parent-child relationships. (It would work well for simple XML
with just one level of hierarchy.) For example, consider a mock XML like
the one below:


Even if we split it in between (even if the split happens at a line
boundary), it would be hard to process, because the opening tags fall in
one block within one mapper's boundary and the closing tags fall in
another block within another mapper's boundary. So mining data from either
part alone hardly makes sense. We need to add logic here, in terms of
regex or the like, to identify the closing tags in the second block.
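
A quick way to see the problem, as a sketch in plain Python: the XML below is a hypothetical stand-in of mine (not the mock XML from this mail), cut at a line boundary. Neither half parses on its own, because the opening tags are in one piece and the closing tags in the other.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-level document, used only for illustration.
DOC = (
    "<library>\n"
    "  <book>\n"
    "    <title>Hadoop</title>\n"
    "  </book>\n"
    "</library>\n"
)

def is_well_formed(text):
    """True if the text parses as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Cut exactly at a line boundary, as a line-oriented reader would.
lines = DOC.splitlines(keepends=True)
first_half = "".join(lines[:2])   # opening <library> and <book> only
second_half = "".join(lines[2:])  # <title> plus the closing tags

print(is_well_formed(DOC), is_well_formed(first_half), is_well_formed(second_half))
```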
Maybe one question remains: why use MapReduce for XML if we can't exploit
parallel processing?
- We can process multiple small XML files in parallel, one per mapper,
without splitting them, to mine and extract information for processing.
But we lose a good deal of data locality here.
Hadoop: The Definitive Guide gives a sample user-defined input format
called WholeFileInputFormat that serves this purpose.

- For larger XML files we have to process the splits themselves in
parallel.
Hadoop provides a class for this, StreamXmlRecordReader, which can be
used outside of streaming as well. For details I have posted the
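
The idea behind a begin/end-tag record reader can be sketched in plain Python as below. This is my own simplified model, not the StreamXmlRecordReader source; read_xml_records, the tag names and the sample data are assumptions, and it assumes records of the same tag are not nested. A split owns every record whose begin tag starts inside it, and the reader runs past the split end to reach the matching end tag.

```python
def read_xml_records(data, start, length, begin, end_tag):
    """Return the records whose begin tag starts in [start, start + length)."""
    split_end = start + length
    records = []
    pos = start
    while True:
        b = data.find(begin, pos)
        if b == -1 or b >= split_end:
            break  # next record (if any) belongs to a later split
        e = data.find(end_tag, b + len(begin))
        if e == -1:
            break  # unterminated record: nothing complete to emit
        e += len(end_tag)
        records.append(data[b:e])  # may extend beyond split_end
        pos = e
    return records

# Hypothetical input: three <book> records inside a wrapper element.
xml = b"<library><book>A</book><book>B</book><book>C</book></library>"
size = 20  # hypothetical split size in bytes
for off in range(0, len(xml), size):
    n = min(size, len(xml) - off)
    print(off, read_xml_records(xml, off, n, b"<book>", b"</book>"))
```

Each record comes out whole from exactly one split, which is what lets the splits of one large file be processed in parallel.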

Hope it helps!


On Tue, Nov 22, 2011 at 9:31 AM, Inder Pall <[EMAIL PROTECTED]> wrote:

> what about the records at skipped boundaries?
> Instead is there a way to define a custom splitter in hadoop which can
> understand record boundaries.
> - Inder
> On Tue, Nov 22, 2011 at 9:28 AM, Michael Segel <[EMAIL PROTECTED]
> >wrote:
> >
> > Just wanted to address this:
> > > >Basically in My mapreduce program i am expecting a complete XML as my
> > > >input.i have a CustomReader(for XML) in my mapreduce job
> > configuration.My
> > > >main confusion is if namenode distribute data to DataNodes ,there is a