

Russell Jurney 2012-12-24, 07:24
Vitalii Tymchyshyn 2012-12-24, 08:09
Russell Jurney 2012-12-24, 08:13
Re: XML -> Pig UDF
Nope, sorry. I wish I could open-source this. I did make some patches to the
loader (e.g. it did not handle empty tags); those have been submitted as pull
requests.

Some more hints:
1) I've found a Pig-style concat function to be very useful. Mine could take
any input, skip nulls, and flatten bags and tuples.

2) I had to introduce a custom type. Pig does not like top-level custom types,
but it works OK with tuples of custom types.
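
The concat helper from hint 1 is not public, so the following is only a minimal sketch of the idea in plain Python, not the author's code. It assumes Pig tuples and bags can be modeled as Python tuples and lists; the function name `concat` and the `sep` parameter are hypothetical.

```python
def concat(*args, sep=""):
    """Join any inputs into one string, skipping nulls and
    flattening nested bags (lists) and tuples.

    Hypothetical helper: the name and signature are illustrative only.
    """
    parts = []

    def walk(value):
        if value is None:
            return  # skip nulls rather than failing or nulling the result
        if isinstance(value, (list, tuple)):
            for item in value:  # flatten bags and tuples recursively
                walk(item)
        else:
            parts.append(str(value))  # accept any scalar input

    for arg in args:
        walk(arg)
    return sep.join(parts)


if __name__ == "__main__":
    print(concat("a", None, ("b", [1, 2]), 3))
```

A real Pig UDF would do the same walk over Pig's own tuple and bag types instead of Python containers.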
On 24 Dec 2012, 10:13, "Russell Jurney" <[EMAIL PROTECTED]> wrote:

> Thanks - any chance of contributing some of that code? :)
>
> I have thought of a similar approach: starting with an XMLToPig
> EvalFunc that takes the output of the existing XMLLoader and converts
> it to tuple/bag/map form. Easier to baby step that, just a matter of
> plugging that code in to the xml slice trimmed by XMLLoader, and much
> easier once the EvalFunc works.
>
> Russell Jurney http://datasyndrome.com
>
> On Dec 24, 2012, at 12:10 AM, Vitalii Tymchyshyn <[EMAIL PROTECTED]> wrote:
>
> > I was doing such a thing in my previous project, but I parsed on demand.
> > What I mean is that I created a set of XML-processing functions, each of
> > which can take a string or a DOM as input, plus an explicit parse function.
> > I did this because I was usually doing concatenation/grouping on parsed
> > input files, and processing was done only after that. Alternatively,
> > processing can be done in another MR step, and serializing a string is
> > much easier than serializing a DOM.
> > On 24 Dec 2012, 09:24, "Russell Jurney" <[EMAIL PROTECTED]> wrote:
> >
> >> I want to extend the existing XMLLoader to go beyond capturing the text
> >> inside a tag and to actually create a Pig mapping of the Document Object
> >> Model the XML represents. This would be similar to elephant-bird's
> >> JsonLoader.
> >>
> >> For instance, check this example: https://gist.github.com/4368194
> >>
> >> Semi-structured data can vary, so this behavior can be risky, but I want
> >> people to be able to load JSON and XML data easily in their first
> >> session with Pig.
> >>
> >> Russell Jurney http://datasyndrome.com
> >>
>
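
The two ideas discussed in this thread, converting an XML slice into a nested map/bag form and accepting either a string or an already-parsed DOM, can be sketched together in plain Python. This is only an illustration under stated assumptions, not the code from either poster and not the XMLLoader API: the function names `ensure_parsed` and `element_to_map` and the output keys (`tag`, `attributes`, `text`, `children`) are hypothetical, and Python dicts and lists stand in for Pig maps and bags.

```python
import xml.etree.ElementTree as ET


def ensure_parsed(xml):
    """Parse on demand: accept an XML string or an already-parsed Element."""
    if isinstance(xml, ET.Element):
        return xml  # already a DOM node, nothing to do
    return ET.fromstring(xml)  # parse only when a string is passed in


def element_to_map(xml):
    """Convert an element into a nested dict (map) with a list (bag) of children.

    The key names are illustrative assumptions, not a Pig schema.
    """
    elem = ensure_parsed(xml)
    text = (elem.text or "").strip()
    return {
        "tag": elem.tag,
        "attributes": dict(elem.attrib),
        "text": text or None,  # empty tags yield None instead of failing
        "children": [element_to_map(child) for child in elem],
    }


if __name__ == "__main__":
    doc = '<order id="7"><item>pen</item><note/></order>'
    print(element_to_map(doc))
```

Because `ensure_parsed` accepts both forms, the raw string can be carried between steps and parsed only where the structure is actually needed.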