Pig >> mail # user >> Converting xml to csv


RE: Converting xml to csv
Your example had newlines inside the <employee> element. In a regular expression the dot does not match a newline by default, so (.*) cannot span more than one line. One way to remove the newlines first is REPLACE(x,'[\\n]',''). If the text you are interested in does not contain newlines, for example if you only want <employee_id> and do not care about its relation to the other elements inside the same <employee> element, then you do not need to do this.
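
As a rough sketch of the approach described above (not run against the poster's data), the following Pig script strips newlines from each <employee> record and then extracts values with REGEX_EXTRACT. The jar location and input path come from earlier messages in this thread; the aliases no_newlines and cols, the output directory employees_csv, and the choice of employee_id and email as columns are assumptions made for illustration.

```pig
-- Sketch only: aliases and the output path are illustrative assumptions.
REGISTER /usr/lib/pig/piggybank.jar;

-- XMLLoader returns each <employee>...</employee> block as one chararray,
-- with its newlines still in place.
a = LOAD 'sample.xml'
    USING org.apache.pig.piggybank.storage.XMLLoader('employee')
    AS (x:chararray);

-- Remove newlines so that (.*) can span what used to be several lines.
no_newlines = FOREACH a GENERATE REPLACE(x, '[\\n]', '') AS x;

-- Whole element body, as in the original question.
B = FOREACH no_newlines GENERATE
    REGEX_EXTRACT(x, '<employee>(.*)</employee>', 1);
DUMP B;

-- Individual fields, written out comma-separated.
cols = FOREACH no_newlines GENERATE
    REGEX_EXTRACT(x, '<employee_id>(.*?)</employee_id>', 1) AS employee_id,
    REGEX_EXTRACT(x, '<email>(.*?)</email>', 1) AS email;
STORE cols INTO 'employees_csv' USING PigStorage(',');
```

A possible alternative to REPLACE is to prefix the pattern with the Java regex flag (?s), which lets the dot match newlines. This assumes Pig passes the pattern through to java.util.regex unchanged, so it is worth checking on your Pig version.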

William F Dowling
Senior Technologist
Thomson Reuters
-----Original Message-----
From: ajay kumar [mailto:[EMAIL PROTECTED]]
Sent: Monday, September 16, 2013 1:11 AM
To: [EMAIL PROTECTED]
Subject: Re: Converting xml to csv

Sorry if I am wrong, but why do we need to use REPLACE? I mean, what is the advantage?
On Fri, Sep 13, 2013 at 7:02 PM, <[EMAIL PROTECTED]> wrote:

> Ajay's suggestion will work for elements like <employee_id> in your
> example, that occur all on one line. If you want to get the whole
> <employee> element, and that spans more than one line, you will not be able
> to get it with matching (.*) since that will not match a newline character.
>
> You can remove newline characters using
> B = foreach A generate REPLACE(x,'[\\n]','');
>
>
> William F Dowling
> Senior Technologist
> Thomson Reuters
>
>
> -----Original Message-----
> From: ajay kumar [mailto:[EMAIL PROTECTED]]
> Sent: Friday, September 13, 2013 2:21 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Converting xml to csv
>
> try this ...
>
> register /usr/lib/pig/piggybank.jar
> A = load '/home/sudeep/Desktop/test1' using
> org.apache.pig.piggybank.storage.XMLLoader('employee_id') as (x:chararray);
> B = foreach A generate
> REGEX_EXTRACT(x,'<employee_id>(.*)</employee_id>',1);
>
>
> On Fri, Sep 13, 2013 at 3:54 AM, jamal sasha <[EMAIL PROTECTED]>
> wrote:
>
> > Hi,
> >  I am trying to parse the following XML:
> >
> >
> >  <employee>
> >     <employee_id>1234</employee_id>
> >     <email>[EMAIL PROTECTED]</email>
> >     <name>(first_name_1234,middle_initial_1234,last_name_1234)</name>
> >
> > <projects>{(project_1234_1),(project_1234_2),(project_1234_3)}</projects>
> >     <skills>[programming:SQL,rdbms:Oracle]</skills>
> >   </employee>
> >
> > And my script is
> >
> > a = LOAD 'sample.xml' USING
> > org.apache.pig.piggybank.storage.XMLLoader('employee') as (x:chararray);
> > B = foreach a generate REGEX_EXTRACT(x,'<employee>(.*)</employee>',1);
> > dump B;
> > Now B is an empty tuple.
> > Not sure what I am missing.
> >
> >
> >
> >
> > > On Wed, Sep 11, 2013 at 11:35 PM, ajay kumar <[EMAIL PROTECTED]>
> > > wrote:
> >
> > > > use org.apache.pig.piggybank.storage.XMLLoader and then extract them
> > > > using regex_all
> > >
> > >
> > > On Thu, Sep 12, 2013 at 11:18 AM, jamal sasha <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > > Umm, yes, but how do I generalize it?
> > > > > What I am looking for is something like the JSON parsers we have in,
> > > > > say, Java: if I give it a valid JSON string, I can parse it and then
> > > > > access it as a hashmap.
> > > > > But with the XML loader, do I still have to specify regex rules?
> > > >
> > > > Actually, is it possible to just flatten the xml..
> > > > so for example
> > > > convert
> > > > <aux>
> > > > <foobar>1</foobar>
> > > > <fushbar>foo</fushbar>
> > > > </aux>
> > > > to
> > > > <aux><foobar>1</foobar><fushbar>foo</fushbar></aux>
> > > > ???
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Sep 11, 2013 at 10:32 PM, Jagat Singh <[EMAIL PROTECTED]>
> > > > wrote:
> > > >
> > > > > Use piggybank xmlloader
> > > > >  On 12/09/2013 10:14 AM, "jamal sasha" <[EMAIL PROTECTED]>
> > wrote:
> > > > >
> > > > > > Hi,
> > > > > >   So I have different xml data sources...For example:
> > > > > >
> > > > > > src1.txt
> > > > > >
> > > > > > <foo>
> > > > > > <bar>1</bar>
> > > > > > </foo>
> > > > > > <foo>
> > > > > > <bar>2</bar>
> > > > > > </foo>
> > > > > > .. and so on
> > > > > >
> > > > > >
> > > > > > and another data
> > > > > >
> > > > > > src2.txt
> > > > > >
> > > > > > <aux>

*Thanks & Regards,*
*S. Ajay Kumar
+91-9966159106*