Pig >> mail # user >> Parsing XML using PIG


RE: Parsing XML using PIG
I just use XMLLoader to break the input XML into records, then stream those records through an XML parser that pulls out what I need into the fields of a relation for subsequent Pig processing. For example:
 -- The analyze_src_recs.py script reads input xml from stdin, and writes to
 -- stdout for each relevant part of each input record:
 -- citeddocid,citingdocid,collection,seqno,year,[....]
 --
 define analyze_src `analyze_src_recs.py`
    input  (stdin)
    output (stdout USING PigStreaming(','))
    ship   ('$scriptDir/analyze_src_recs.py');

 SrcLines  = load '$src_xml/*.xml*'
    using org.apache.pig.piggybank.storage.XMLLoader('REC')
    as (doc:chararray);
 ParseOut = stream SrcLines through analyze_src
    as (rec_type    : int,
        citeddocid  : int,
        citingdocid : int,
        col         : chararray,
        seq         : chararray,
        [....]
       );

 -- rec_type determines which of two kinds of records the UDF streaming
 -- function analyze_src_recs.py has generated
 split ParseOut into
     ParseOutCitation if rec_type == 0,
     ParseOutSrc      if rec_type == 1;
 [...]
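
To make the streaming step concrete, here is a minimal sketch of what such a script could look like for the `<field id="..."><value>...</value></field>` sample in the question quoted below. It is not the actual analyze_src_recs.py. The function names, the regular expression, and the assumption that `<field>` elements are never nested or self-closing are all illustrative. It also reads the whole of stdin into memory, which is only reasonable for small inputs.

```python
import re
import sys
import xml.etree.ElementTree as ET

# Assumption: records are non-nested <field ...>...</field> elements,
# as in the sample from the original question (not Will's 'REC' records).
FIELD_RE = re.compile(r"<field\b.*?</field>", re.DOTALL)


def parse_fields(text):
    """Return (id, value) pairs for every <field> element found in text."""
    pairs = []
    for fragment in FIELD_RE.findall(text):
        elem = ET.fromstring(fragment)
        value = elem.findtext("value", default="")
        pairs.append((elem.get("id", ""), value.strip()))
    return pairs


def to_row(pairs, delimiter=","):
    """Join the values, in input order, into one delimited output line."""
    return delimiter.join(value for _, value in pairs)


def main(inp, out, delimiter=","):
    """Read XML records from inp and write one delimited row to out."""
    pairs = parse_fields(inp.read())
    if pairs:
        out.write(to_row(pairs, delimiter) + "\n")
```

To use this as a streaming command, the script would end with a call such as `main(sys.stdin, sys.stdout)` under an `if __name__ == "__main__":` guard, and the delimiter would need to match the one given to PigStreaming in the define statement.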

HTH,

Will
William F Dowling
Senior Technologist
Thomson Reuters

-----Original Message-----
From: krishnan N [mailto:[EMAIL PROTECTED]]
Sent: Thursday, April 19, 2012 8:28 PM
To: [EMAIL PROTECTED]
Subject: Parsing XML using PIG

Hi All,

I am trying to parse XML using Pig. Below is the code, which uses the
XMLLoader class. I am trying to convert the XML to a text file, with each
attribute as a column and its value as the column value.

register /usr/lib/pig/contrib/piggybank/java/piggybank.jar;

xml_file = LOAD '/home/test2.xml' using
org.apache.pig.piggybank.storage.XMLLoader('field') as (doc:chararray);

-- the loader's schema names its single column 'doc'
loof_file = foreach xml_file generate doc;

-- STORE is a statement and cannot be assigned to an alias
store loof_file into '/home/xml2_to_text.dat';

XMLLoader only recognizes the tag supplied as its input parameter, and
returns the result below for that tag alone. Is there any way to get the
attribute values?

<field id="productId">
    <value>12354678</value>
</field>
<field id="AckLevel">
    <value>LEVEL2</value>
</field>
<field id="AckDate">
    <value>2012-02-29T16:21:54</value>
</field>
<field id="Success">
    <value>true</value>
</field>

Required output:

Product_Id | AckLevel | AckDate             | Success
12354678   | LEVEL2   | 2012-02-29T16:21:54 | true

Thanks
Krishnan