Re: XML -> Pig UDF
Nope, sorry, I wish I could open-source this. I did write some patches to the
loader (e.g. it did not like empty tags), and those are submitted as pull
requests.

Some more hints:
1) I've found a Pig-style concat function to be very useful. Mine could take
any input, skip nulls, and flatten bags and tuples.

2) I had to introduce a custom type. Pig does not like top-level custom types,
but works OK with tuples of custom types.
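
A minimal sketch of the concat behavior described in hint 1, not the original (unreleased) code. It is plain Python 3 rather than a registered Pig UDF, and it assumes that Python lists and tuples stand in for Pig bags and tuples and that None stands in for a Pig null. The name concat and the sep parameter are hypothetical.

```python
def concat(*values, sep=""):
    """Join any mix of scalars, tuples and bags into one string, skipping nulls.

    Assumption: list/tuple stand in for Pig bag/tuple, None for a Pig null.
    """
    parts = []

    def walk(value):
        if value is None:
            # Null field: contributes nothing to the result.
            return
        if isinstance(value, (list, tuple)):
            # Bag or tuple: flatten recursively, in order.
            for item in value:
                walk(item)
        else:
            # Scalar of any type: use its string form.
            parts.append(str(value))

    for value in values:
        walk(value)
    return sep.join(parts)
```

For example, concat("a", None, ("b", ["c", 1])) skips the null and flattens the nested tuple and bag before joining the remaining values.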
On 24 Dec 2012 at 10:13, "Russell Jurney" <[EMAIL PROTECTED]> wrote:

> Thanks - any chance of contributing some of that code? :)
>
> I have thought of a similar approach: starting with an XMLToPig
> EvalFunc that takes the output of the existing XMLLoader and converts
> it to tuple/bag/map form. Easier to baby step that, just a matter of
> plugging that code in to the xml slice trimmed by XMLLoader, and much
> easier once the EvalFunc works.
>
> Russell Jurney http://datasyndrome.com
>
> On Dec 24, 2012, at 12:10 AM, Vitalii Tymchyshyn <[EMAIL PROTECTED]> wrote:
>
> > I was doing such a thing in my previous project, but I parsed on demand.
> > What I mean is that I created a set of XML-processing functions, each of
> > which can take a string or a DOM as input, plus an explicit parse function.
> > I did this because I was usually using concatenation/grouping on parsed
> > input files and processing was done only after that. Or processing can be
> > done in another MR step, and serialization of a string is much easier than
> > of a DOM.
> > On 24 Dec 2012 at 09:24, "Russell Jurney" <[EMAIL PROTECTED]> wrote:
> >
> >> I want to extend the existing XMLLoader to go beyond capturing the text
> >> inside a tag and to actually create a Pig mapping of the Document Object
> >> Model the XML represents. This would be similar to elephant-bird's
> >> JsonLoader.
> >>
> >> For instance, check this example: https://gist.github.com/4368194
> >>
> >> Semi-structured data can vary, so this behavior can be risky but... I want
> >> people to be able to load JSON and XML data easily in their first session
> >> with Pig.
> >>
> >> Russell Jurney http://datasyndrome.com
> >>
>
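
The two ideas quoted above (functions that accept either a string or an already-parsed DOM, and a conversion of the XML tree into a nested tuple/bag/map shape) can be sketched together as follows. This is a hedged illustration in plain Python 3, not code from either poster and not the layout used in the linked gist. The function names and the dictionary keys (tag, attributes, text, children) are assumptions, with a dict standing in for a Pig map and a list for a bag.

```python
import xml.etree.ElementTree as ET


def parse(xml_text):
    """Explicit parse step: XML text in, parsed element out."""
    return ET.fromstring(xml_text)


def as_element(value):
    """Parse on demand: accept raw XML text or an already-parsed Element."""
    if isinstance(value, ET.Element):
        return value
    return parse(value)


def to_nested(value):
    """Convert an element (or XML text) into nested dicts and lists.

    Assumed layout: each element becomes a dict with its tag, attributes,
    stripped text, and a list of converted child elements.
    """
    node = as_element(value)
    return {
        "tag": node.tag,
        "attributes": dict(node.attrib),
        # Empty tags have no text; normalize that to an empty string.
        "text": (node.text or "").strip(),
        "children": [to_nested(child) for child in node],
    }
```

Because every function goes through as_element, the XML can stay a plain string through grouping or an extra MapReduce step and only be parsed when a field is actually needed.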