Pig >> mail # user >> XML -> Pig UDF


+
Russell Jurney 2012-12-24, 07:24
+
Vitalii Tymchyshyn 2012-12-24, 08:09
Re: XML -> Pig UDF
Thanks - any chance of contributing some of that code? :)

I have thought of a similar approach: starting with an XMLToPig
EvalFunc that takes the output of the existing XMLLoader and converts
it to tuple/bag/map form. That is easier to do in baby steps: it is just
a matter of plugging that code into the XML slice trimmed by XMLLoader,
and extending the loader is much easier once the EvalFunc works.
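
To make the idea concrete, here is a rough sketch of the conversion step only, in plain Python rather than against Pig's EvalFunc API. It assumes the input is one XML slice as a string, like the ones XMLLoader emits. The names xml_to_pig and element_to_pig and the keys "tag", "attributes", "text" and "children" are hypothetical choices for illustration, not the layout used in the gist; dicts stand in for Pig maps and lists for Pig bags.

```python
import xml.etree.ElementTree as ET


def element_to_pig(elem):
    """Convert one XML element into a dict (standing in for a Pig map).

    The key names are assumptions made for this sketch.
    Tail text after a closing tag is ignored.
    """
    node = {"tag": elem.tag}
    if elem.attrib:
        node["attributes"] = dict(elem.attrib)
    text = (elem.text or "").strip()
    if text:
        node["text"] = text
    # A list of child dicts stands in for a Pig bag of maps.
    children = [element_to_pig(child) for child in elem]
    if children:
        node["children"] = children
    return node


def xml_to_pig(xml_slice):
    """Parse one XML slice (a string) and return its nested structure."""
    return element_to_pig(ET.fromstring(xml_slice))


record = xml_to_pig(
    '<book id="1"><title>Pig</title><author>A</author><author>B</author></book>'
)
```

Repeated child tags (the two author elements here) stay as separate entries in the "children" list, so documents whose shape varies still convert without a schema.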

Russell Jurney http://datasyndrome.com

On Dec 24, 2012, at 12:10 AM, Vitalii Tymchyshyn <[EMAIL PROTECTED]> wrote:

> I did something like this in my previous project, but I parsed on demand.
> What I mean is that I created a set of XML-processing functions, each of
> which can take a string or a DOM as input, plus an explicit parse function.
> I did this because I was usually using concatenation/grouping on parsed
> input files, and processing was done only after that. Alternatively,
> processing can be done in another MR step, and serializing a string is
> much easier than serializing a DOM.
> On 24 Dec 2012 09:24, "Russell Jurney" <[EMAIL PROTECTED]> wrote:
>
>> I want to extend the existing XMLLoader to go beyond capturing the text
>> inside a tag and to actually create a Pig mapping of the Document Object
>> Model the XML represents. This would be similar to elephant-bird's
>> JsonLoader.
>>
>> For instance, check this example: https://gist.github.com/4368194
>>
>> Semi-structured data can vary, so this behavior can be risky but... I want
>> people to be able to load JSON and XML data easily in their first session
>> with Pig.
>>
>> Russell Jurney http://datasyndrome.com
>>
+
Vitalii Tymchyshyn 2012-12-29, 23:00