Pig >> mail # user >> Converting xml to csv


RE: Converting xml to csv
This is one way to get employee_id and email:

 A = load 'xxx.xml' using org.apache.pig.piggybank.storage.XMLLoader('employee') as (x:chararray);
 B = foreach A generate REPLACE(x,'[\\n]','') as x;  
 C = foreach B generate
     REGEX_EXTRACT_ALL(x,'.*(?:<employee_id>)([^<]*).*(?:<email>)([^<]*).*');
 dump C;

But it's a hack. You should use a UDF that calls out to an XML-parsing library.
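
As a rough illustration of the parser-based approach, here is a minimal Python sketch that pulls employee_id and email out of one <employee> record with the standard library's xml.etree.ElementTree. The function name parse_employee and the sample email address are hypothetical, and the wiring needed to register this as a Pig UDF (for example through Pig's Jython support) is not shown.

```python
import xml.etree.ElementTree as ET


def parse_employee(xml_text):
    """Return (employee_id, email) for one <employee>...</employee> string.

    Hypothetical helper: a missing field comes back as None, and malformed
    XML yields (None, None) instead of raising.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return (None, None)
    # findtext returns None when the child element is absent
    return (root.findtext('employee_id'), root.findtext('email'))


# Sample record shaped like the one in this thread; newlines need no
# special handling because the parser does not care about them.
record = """<employee>
    <employee_id>1234</employee_id>
    <email>someone@example.com</email>
  </employee>"""

print(parse_employee(record))
```

Unlike the regex version, this does not depend on the order of the child elements or on stripping newlines first.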

William F Dowling
Senior Technologist
Thomson Reuters
-----Original Message-----
From: ajay kumar [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, September 17, 2013 2:15 AM
To: [EMAIL PROTECTED]
Subject: Re: Converting xml to csv

Yeah, thank you.

Now I am also stuck. If possible, can you share the solution?
On Mon, Sep 16, 2013 at 7:21 PM, <[EMAIL PROTECTED]> wrote:

> Your example had newlines in the <employee> element. The regular
> expression .* does not match newlines. One way to remove newlines is
> REPLACE(x,'[\\n]',''). If the text ranges you are interested in do not
> contain newlines, for example if you are interested in <employee_id> but do
> not care about its relation to other elements inside the same <employee>
> element, then you do not need to do this.
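
The newline behavior described above can be reproduced outside Pig. The sketch below uses Python's re module rather than the Java regex engine that Pig uses, on the assumption that the two agree on this point: by default `.` does not match a newline. The sample record and its email address are made up for illustration.

```python
import re

# A multi-line record, similar in shape to the <employee> example
record = ("<employee>\n"
          "  <employee_id>1234</employee_id>\n"
          "  <email>someone@example.com</email>\n"
          "</employee>")

# Whole-string pattern, analogous to the one passed to REGEX_EXTRACT_ALL
pattern = r'.*<employee_id>([^<]*).*<email>([^<]*).*'

# On the raw text the match fails: '.*' stops at the first newline
raw_match = re.fullmatch(pattern, record)

# After removing newlines (the REPLACE step), the same pattern matches
flattened = record.replace('\n', '')
flat_match = re.fullmatch(pattern, flattened)

print(raw_match)
print(flat_match.groups())
```

In Python, passing re.DOTALL is another way to let `.` cross newlines without altering the text.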
>
> William F Dowling
> Senior Technologist
> Thomson Reuters
>
>
> -----Original Message-----
> From: ajay kumar [mailto:[EMAIL PROTECTED]]
> Sent: Monday, September 16, 2013 1:11 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Converting xml to csv
>
> Sorry if I am wrong, but why do we need to use REPLACE? I mean, what is
> the advantage?
>
>
> On Fri, Sep 13, 2013 at 7:02 PM, <[EMAIL PROTECTED]>
> wrote:
>
> > Ajay's suggestion will work for elements like <employee_id> in your
> > example, that occur all on one line. If you want to get the whole
> > <employee> element, and that spans more than one line, you will not be
> > able to get it with matching (.*) since that will not match a newline
> > character.
> >
> > You can remove newline characters using
> > B = foreach A generate REPLACE(x,'[\\n]','');
> >
> >
> > William F Dowling
> > Senior Technologist
> > Thomson Reuters
> >
> >
> > -----Original Message-----
> > From: ajay kumar [mailto:[EMAIL PROTECTED]]
> > Sent: Friday, September 13, 2013 2:21 AM
> > To: [EMAIL PROTECTED]
> > Subject: Re: Converting xml to csv
> >
> > Try this:
> >
> > register /usr/lib/pig/piggybank.jar;
> > A = load '/home/sudeep/Desktop/test1' using
> >     org.apache.pig.piggybank.storage.XMLLoader('employee_id') as (x:chararray);
> > B = foreach A generate
> >     REGEX_EXTRACT(x,'<employee_id>(.*)</employee_id>',1);
> >
> >
> > On Fri, Sep 13, 2013 at 3:54 AM, jamal sasha <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Hi,
> > > >  I am trying to parse the following XML:
> > > >
> > > >  <employee>
> > > >     <employee_id>1234</employee_id>
> > > >     <email>[EMAIL PROTECTED]</email>
> > > >     <name>(first_name_1234,middle_initial_1234,last_name_1234)</name>
> > > >     <projects>{(project_1234_1),(project_1234_2),(project_1234_3)}</projects>
> > > >     <skills>[programming:SQL,rdbms:Oracle]</skills>
> > > >   </employee>
> > >
> > > > And my script is:
> > > >
> > > > a = LOAD 'sample.xml' USING
> > > >     org.apache.pig.piggybank.storage.XMLLoader('employee') as (x:chararray);
> > > > B = foreach a generate REGEX_EXTRACT(x,'<employee>(.*)</employee>',1);
> > > > dump B;
> > > >
> > > > Now B is an empty tuple. I am not sure what I am missing.
> > >
> > >
> > >
> > >
> > > > On Wed, Sep 11, 2013 at 11:35 PM, ajay kumar <[EMAIL PROTECTED]> wrote:
> > >
> > > > > Use org.apache.pig.piggybank.storage.XMLLoader and then extract them
> > > > > using REGEX_EXTRACT_ALL.
> > > >
> > > >
> > > > > On Thu, Sep 12, 2013 at 11:18 AM, jamal sasha <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > > Umm, yes, but how do I generalize it?
> > > > > > What I am looking for is something like the JSON parsers we have in,
> > > > > > say, Java: if I give it a valid JSON string, I can parse it and then
> > > > > > access it as a hashmap.

*Thanks & Regards,*
*S. Ajay Kumar
+91-9966159106*