Hadoop, mail # user - Regarding loading a big XML file to HDFS

hari708 2011-11-22, 01:20
Uma Maheswara Rao G 2011-11-22, 03:03
Uma Maheswara Rao G 2011-11-22, 03:08
Michael Segel 2011-11-22, 03:58
Inder Pall 2011-11-22, 04:01
Bejoy Ks 2011-11-22, 07:33
Re: Regarding loading a big XML file to HDFS
Steve Loughran 2011-11-22, 11:19
On 22/11/11 07:33, Bejoy Ks wrote:

> Such processing would hardly make sense for complex XML documents, as
> they are based entirely on parent-child relationships. (It would work
> well for simple XML documents with just one level of hierarchy.)

That holds only provided nobody is using XML namespace declarations:

<m1:vehicle xmlns:m1="uri:model1" xmlns="uri:model2">
  <car> ... </car>
</m1:vehicle>

In such a world the vehicle element name is the tuple ("uri:model1",
"vehicle") but that of the nested element is ("uri:model2", "car").

The way XML namespace handling is done implies the entire parent tree
needs to be parsed before you can be confident of the namespace which an
XML element and attributes belong to.
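
A minimal sketch of that point in Python, using the standard library's
xml.etree.ElementTree on a well-formed version of the vehicle example.
The helper name qualified_names is made up for illustration. The same
<car/> text resolves to a different name once the parent's declarations
are out of sight:

```python
import xml.etree.ElementTree as ET

# Well-formed version of the example: the prefix m1 and the default
# namespace are both declared on the parent element.
DOC = '<m1:vehicle xmlns:m1="uri:model1" xmlns="uri:model2"><car/></m1:vehicle>'


def qualified_names(xml_text):
    """Return each element's name as '{namespace-uri}local-name', in document order."""
    return [el.tag for el in ET.fromstring(xml_text).iter()]


print(qualified_names(DOC))       # ['{uri:model1}vehicle', '{uri:model2}car']

# The nested element parsed on its own, as a mapper starting mid-file
# would see it: the default namespace declared on the parent is lost.
print(qualified_names("<car/>"))  # ['car']
```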

> for example consider the mock XML like below
> <Vehicle>
>      <Car>
>          <BMW>
>              <Sedan>
>                  <3-Series>
>                      <min-torque></min-torque>
> -----------------------------------------------------------------------------------------------------------------------------------
>                      <max-torque></max-torque>
>                  </3-Series>
>              </Sedan>
>              <SUV>
>              </SUV>
>          </BMW>
>      </Car>
>      <Truck>
>      </Truck>
>      <Bus>
>      </Bus>
> </Vehicle>
> Even if we split it in between (even if the split happens at a line
> boundary) it would be hard to process, as the opening tags come in one
> block under one mapper's boundary and the closing tags come in another
> block under another mapper's boundary. So if we are mining some data from
> them it hardly makes sense.

Most record scans pull in a bit of trailing data from the next block;
it's generally not very much and not worth worrying about. Collect some
data on average record length and assume that as your usual over-read.
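
A rough sketch of that over-read in Python, not Hadoop's actual record
reader. It assumes flat, non-nested records delimited by hypothetical
<rec>...</rec> tags in a single-byte-compatible encoding, and the
function name read_split is made up. Each split owns the records whose
opening tag starts inside it and reads past its end to finish the last
one:

```python
def read_split(data, start, end, open_tag=b"<rec>", close_tag=b"</rec>"):
    """Return (records, over_read) for the byte range [start, end)."""
    records, over_read = [], 0
    pos = data.find(open_tag, start)
    while pos != -1 and pos < end:
        close = data.find(close_tag, pos)
        if close == -1:
            break  # truncated record: nothing more to read
        stop = close + len(close_tag)
        records.append(data[pos:stop])
        # Bytes consumed beyond this split's nominal end, if any.
        over_read = max(over_read, stop - end)
        pos = data.find(open_tag, stop)
    return records, over_read


data = b"<rec>a</rec><rec>bb</rec><rec>ccc</rec>"
first, spill = read_split(data, 0, 20)       # split boundary falls mid-record
second, _ = read_split(data, 20, len(data))
print(len(first), spill, len(second))        # 2 5 1
```

Every record is read exactly once, and the first split spills five bytes
into the second, which is less than one record length.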

> We need to incorporate the logic in here in terms of regex or so to
> identify the closing tags from the second block,

Regexps invariably contain assumptions about the encoding of content
within the XML document, break if the encoding is UTF-16 or something
else, and are still namespace-brittle.
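
To illustrate the encoding problem, here is a small Python sketch (the
helper name finds_closing_tag is made up). A byte-level pattern written
with ASCII or UTF-8 in mind finds the closing tag in a UTF-8 document
but misses it entirely when the same document is stored as UTF-16:

```python
import re

# Byte pattern that silently assumes one byte per character.
CLOSING_TAG = re.compile(rb"</car>")


def finds_closing_tag(raw):
    """Return True if the byte-level pattern matches anywhere in raw."""
    return CLOSING_TAG.search(raw) is not None


doc = "<car>x</car>"
print(finds_closing_tag(doc.encode("utf-8")))   # True
print(finds_closing_tag(doc.encode("utf-16")))  # False: two bytes per character
```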

> Maybe one question remains: why use MapReduce for XML if we can't exploit
> parallel processing?

Why use XML for your persistent format if you can only parse it through
a (stateful) recursive process, so limiting you to the bandwidth of your
parser accessing a single file?

> - We can process multiple small xml files in parallel one in each mapper
> without splitting to mine and extract some information for processing. But
> we lose a good extent of data locality here.

No, you aggregate lots of small XML records into a HAR (Hadoop Archive).
Mridul Muralidharan 2011-11-22, 12:48
Joey Echeverria 2011-11-22, 11:20