krishnan N 2012-04-20, 00:27
RE: Parsing XML using PIG
I just use XMLLoader to break the input XML into records, then stream those records through an XML parser that pulls out what I need into the fields of a relation for subsequent Pig processing. For example:
 -- The analyze_src_recs.py script reads input xml from stdin, and writes to
 -- stdout for each relevant part of each input record:
 -- citeddocid,citingdocid,collection,seqno,year,[....]
 define analyze_src `analyze_src_recs.py`
    input  (stdin)
    output (stdout USING PigStreaming(','))
    ship   ('$scriptDir/analyze_src_recs.py');

 SrcLines  = load '$src_xml/*.xml*'
    using org.apache.pig.piggybank.storage.XMLLoader('REC')
    as (doc:chararray);
 ParseOut = stream SrcLines through analyze_src
          as (rec_type   : int,
              citeddocid : int,
              citingdocid: int,
              col        : chararray,
              seq        : chararray,

 -- rec_type determines which of two kinds of records the UDF streaming
 -- function analyze_src_recs.py has generated
 split ParseOut into
     ParseOutCitation if rec_type == 0,
     ParseOutSrc      if rec_type == 1;
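
The analyze_src_recs.py script itself is not shown in this message. Below is a minimal sketch of what a streaming parser of this kind might look like, under stated assumptions: each record arrives on stdin as a <REC>...</REC> element that may span several lines, the `id` attribute and the <cite> child elements are hypothetical names (not taken from the real data), and only the first three output columns (rec_type, citeddocid, citingdocid) are produced.

```python
import csv
import sys
import xml.etree.ElementTree as ET


def iter_records(lines, tag="REC"):
    """Yield complete <tag>...</tag> strings from an iterable of text lines.

    A record may span several lines, so lines are buffered until one
    containing the closing tag is seen. Assumes a new record never starts
    on the same line that closes the previous one.
    """
    closing = "</%s>" % tag
    buf = []
    for line in lines:
        buf.append(line)
        if closing in line:
            yield "".join(buf)
            buf = []


def parse_record(xml_text):
    """Return the output rows for one record.

    Hypothetical layout: <REC id="..."><cite id="..."/>...</REC>.
    rec_type 1 is a source row, rec_type 0 is a citation row.
    Columns: rec_type, citeddocid, citingdocid.
    """
    rec = ET.fromstring(xml_text)
    rec_id = rec.get("id", "")
    rows = [[1, "", rec_id]]
    for cite in rec.iter("cite"):
        rows.append([0, cite.get("id", ""), rec_id])
    return rows


def main(stdin, stdout):
    """Read records from stdin and write comma-separated rows to stdout."""
    writer = csv.writer(stdout, lineterminator="\n")
    for xml_text in iter_records(stdin):
        try:
            rows = parse_record(xml_text)
        except ET.ParseError:
            continue  # skip malformed records instead of failing the task
        for row in rows:
            writer.writerow(row)

# When shipped as the streaming command, the file would end with:
#     main(sys.stdin, sys.stdout)
```

Every row is written with the same number of columns so that it lines up positionally with the schema in the `as` clause; an empty column is expected to load as a null.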


William F Dowling
Senior Technologist
Thomson Reuters

-----Original Message-----
From: krishnan N [mailto:[EMAIL PROTECTED]]
Sent: Thursday, April 19, 2012 8:28 PM
Subject: Parsing XML using PIG

Hi All,

I am trying to parse XML with Pig. The code below uses the XMLLoader
class. I want to convert the XML to a text file, with the attributes as
columns and the attribute values as the column values.

register /usr/lib/pig/contrib/piggybank/java/piggybank.jar;

xml_file = LOAD '/home/test2.xml' using
org.apache.pig.piggybank.storage.XMLLoader('field') as (doc:chararray);

loof_file = foreach xml_file generate doc;

store loof_file into '/home/xml2_to_text.dat';

XMLLoader matches only the ‘tag’ supplied as its input parameter and
returns just the result below for that tag. Is there any way to get the
attribute values?

<field id="productId">



<field id="AckLevel">



<field id="AckDate">



<field id="Success">



Required Output :

Product_Id | AckLevel | AckDate             | Success
12354678   | LEVEL2   | 2012-02-29T16:21:54 | true
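
One way to get from the <field id="..."> elements to the required row is to load on the enclosing element instead of 'field' and stream each record through a small parser, as in the reply above. The sketch below shows only the parsing step. It rests on assumptions: the enclosing <record> element is a hypothetical name (the message does not show it), and the value of each field is assumed to be the text inside the element, since the element contents did not survive in the message.

```python
import xml.etree.ElementTree as ET

# Output column order, taken from the "Required Output" header.
COLUMNS = ["productId", "AckLevel", "AckDate", "Success"]


def fields_to_row(xml_text, columns=COLUMNS, sep="|"):
    """Turn the <field id="..."> children of one record into a delimited row.

    The `id` attribute selects the column; the element's text (including
    text inside nested elements) is the value. Missing fields become
    empty columns.
    """
    record = ET.fromstring(xml_text)
    values = {}
    for field in record.iter("field"):
        values[field.get("id")] = "".join(field.itertext()).strip()
    return sep.join(values.get(name, "") for name in columns)


if __name__ == "__main__":
    # Hypothetical input record built from the values in the required output.
    sample = (
        '<record>'
        '<field id="productId">12354678</field>'
        '<field id="AckLevel">LEVEL2</field>'
        '<field id="AckDate">2012-02-29T16:21:54</field>'
        '<field id="Success">true</field>'
        '</record>'
    )
    print(fields_to_row(sample))  # prints 12354678|LEVEL2|2012-02-29T16:21:54|true
```

Looking values up by `id` rather than by position keeps the columns in a fixed order even if the fields appear in a different order in the XML.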