Re: XML -> Pig UDF
Thanks - any chance of contributing some of that code? :)

I have thought of a similar approach: starting with an XMLToPig
EvalFunc that takes the output of the existing XMLLoader and converts
it to tuple/bag/map form. That is easier to do in baby steps: it is just
a matter of applying the conversion code to the XML slice trimmed by
XMLLoader, and the rest is much easier once the EvalFunc works.
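
A minimal sketch of that conversion step, written in plain Python rather
than as a real Pig EvalFunc. The names xml_to_pig and element_to_pig and
the keys "tag", "attributes", "text" and "children" are assumptions made
for illustration only; a dict stands in for a Pig map and a list for a
Pig bag.

```python
import xml.etree.ElementTree as ET


def element_to_pig(elem):
    """Convert one XML element into a dict (stand-in for a Pig map)."""
    node = {"tag": elem.tag}
    if elem.attrib:
        node["attributes"] = dict(elem.attrib)
    text = (elem.text or "").strip()
    if text:
        node["text"] = text
    # Child elements keep document order; the list stands in for a Pig bag.
    children = [element_to_pig(child) for child in elem]
    if children:
        node["children"] = children
    return node


def xml_to_pig(xml_slice):
    """Parse one XML slice (a string, as a loader might emit) and convert it."""
    return element_to_pig(ET.fromstring(xml_slice))


record = xml_to_pig(
    '<book id="1"><title>Pig</title><author>A</author><author>B</author></book>'
)
print(record["tag"], [child["tag"] for child in record["children"]])
```

Malformed input makes ET.fromstring raise ParseError, so a real UDF would
have to decide whether to skip, log, or fail on such records.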

Russell Jurney http://datasyndrome.com

On Dec 24, 2012, at 12:10 AM, Vitalii Tymchyshyn <[EMAIL PROTECTED]> wrote:

> I did something like this in my previous project, but I parsed on demand.
> What I mean is that I created a set of XML-processing functions, each of
> which can take either a string or a DOM as input, plus an explicit parse
> function.
> I did this because I usually ran concatenation/grouping on parsed
> input files, and processing was done only after that. Alternatively,
> processing can be done in another MR step, and serializing a string is
> much easier than serializing a DOM.
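
The parse-on-demand idea described above can be sketched as follows, in
plain Python rather than as Pig UDFs. All function names here (parse_xml,
as_element, find_text, find_all_text) are hypothetical and are not taken
from the original project.

```python
import xml.etree.ElementTree as ET


def parse_xml(value):
    """Explicit parse step: XML string in, parsed element out."""
    return ET.fromstring(value)


def as_element(value):
    """Accept either raw XML (str/bytes) or an already-parsed element."""
    if isinstance(value, (str, bytes)):
        return parse_xml(value)
    return value


def find_text(value, path):
    """Text of the first element matching path, or None if nothing matches."""
    return as_element(value).findtext(path)


def find_all_text(value, path):
    """Text of every element matching path, in document order."""
    return [elem.text for elem in as_element(value).findall(path)]


raw = "<order><id>42</id><item>a</item><item>b</item></order>"

# Passing the raw string parses it only when the function is called.
print(find_text(raw, "id"))

# Parsing once up front avoids re-parsing across several calls.
doc = parse_xml(raw)
print(find_all_text(doc, "item"))
```

Keeping the value as a string until a function needs it means it can be
grouped, concatenated, or passed between steps without serializing a
parsed tree.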
> On Dec 24, 2012, at 09:24, "Russell Jurney" <[EMAIL PROTECTED]> wrote:
>> I want to extend the existing XMLLoader to go beyond capturing the text
>> inside a tag and to actually create a Pig mapping of the Document Object
>> Model the XML represents. This would be similar to elephant-bird's
>> JsonLoader.
>> For instance, check this example: https://gist.github.com/4368194
>> Semi-structured data can vary, so this behavior can be risky, but I want
>> people to be able to load JSON and XML data easily in their first session
>> with Pig.
>> Russell Jurney http://datasyndrome.com