Pig >> mail # user >> can't parse the values using XML loader


RE: can't parse the values using XML loader
Part of the problem might be that the regexp has

<COMPANY>(.*)<COMPANY>

but you need
<COMPANY>(.*)</COMPANY>
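
A quick way to see the difference between the two patterns is to try both against the COMPANY line from the input file. This is only a sketch: it uses Python's re module as a stand-in for the Java regex engine that Pig uses, on the assumption that the two behave the same for a pattern this simple.

```python
import re

# The COMPANY line exactly as it appears in the input file.
line = "<COMPANY>ITC</COMPANY>"

# Pattern as written in the script: the closing tag is missing its slash.
unclosed = re.compile(r"<COMPANY>(.*)<COMPANY>")
# Pattern with the closing tag spelled correctly.
closed = re.compile(r"<COMPANY>(.*)</COMPANY>")

# The first pattern needs a second literal "<COMPANY>", which never occurs.
print(unclosed.search(line))
# The second pattern matches and captures the text between the tags.
print(closed.search(line).group(1))
```

Because REGEX_EXTRACT_ALL has to match the whole input string, a single sub-pattern that cannot match is enough to leave the result empty.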

Using regexps to parse XML is awfully brittle. An alternative is to use a UDF that calls out to an XML parser. I use ElementTree from Python UDFs.
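
As a rough illustration of the ElementTree approach, a UDF along these lines could pull the fields out of each CATALOG string. It is a minimal sketch, not the code actually in use: the function name parse_catalog, the FIELDS tuple, and the schema string are made up for this example, and it assumes Pig's Jython engine supplies the outputSchema decorator (a no-op stand-in is defined so the file also runs outside Pig).

```python
import xml.etree.ElementTree as ET

try:
    outputSchema  # assumed to be injected by Pig's Jython script engine
except NameError:
    def outputSchema(schema):
        # No-op stand-in so the module can also be run outside Pig.
        def decorator(func):
            return func
        return decorator

# Child tags of <CD>, in the order they should appear in the output tuple.
FIELDS = ("TITLE", "ARTIST", "COUNTRY", "COMPANY", "PRICE", "YEAR")


@outputSchema("cds:bag{t:(title:chararray, artist:chararray, "
              "country:chararray, company:chararray, "
              "price:chararray, year:chararray)}")
def parse_catalog(xml_text):
    """Return one tuple of field values per <CD> child of <CATALOG>."""
    root = ET.fromstring(xml_text)
    rows = []
    for cd in root.findall("CD"):
        # findtext returns None when a tag is missing, so a malformed
        # record yields a null field instead of an empty result.
        rows.append(tuple(cd.findtext(tag) for tag in FIELDS))
    return rows
```

If the file were saved as, say, xml_udfs.py (a hypothetical name), the script could register it with "register 'xml_udfs.py' using jython as xmludf;" and replace the REGEX_EXTRACT_ALL call with FLATTEN(xmludf.parse_catalog(x)). All values come back as chararray here; casting PRICE and YEAR would be a separate step.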

Will Dowling

________________________________________
From: Muni mahesh [[EMAIL PROTECTED]]
Sent: Wednesday, August 21, 2013 6:58 AM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: can't parse the values using XML loader

*Input file :*

<CATALOG>
<CD>
<TITLE>hadoop developer</TITLE>
<ARTIST>ajay</ARTIST>
<COUNTRY>india</COUNTRY>
<COMPANY>ITC</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>2013</YEAR>
</CD>
</CATALOG>

*Pig Script:*

register /usr/lib/pig/piggybank.jar;

A = load '/home/sudeep/Desktop/CATALOG.xml' using org.apache.pig.piggybank.storage.XMLLoader('CATALOG') as (x:chararray);
B = foreach A GENERATE
FLATTEN(REGEX_EXTRACT_ALL(x,'<CATALOG>\\n*<CD>\\n<TITLE>(.*)</TITLE>\\n\\s*<ARTIST>(.*)</ARTIST>\\n\\s*<COUNTRY>(.*)</COUNTRY>\\n\\s*<COMPANY>(.*)<COMPANY>\\n\\s*<PRICE>(.*)</PRICE>\\n\\s*<YEAR>(.*)</YEAR>\\n\\s*</CD>\\n*</CATALOG>'))
as (id: int, name:chararray);
*Output Expected :*

(hadoop, ajay, india, ITC, 10.90, 2013)

*Issue:*

But the output I am getting is:

()

I suspect it is not able to parse the values between the tags.