Pig >> mail # dev >> Need help


Hi Daniel

I badly need help, and I hope you will understand my situation.

The use case is this: I have one folder that contains multiple XML files, and
I need to write a Pig script that recursively parses all the files and
generates one flat file.

The XML looks like this. Each XML file has a different rank attribute on its
clinical_study element, such as <clinical_study rank="687">:
<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="687">
  <!-- This xml conforms to an XML Schema at:
    http://clinicaltrials.gov/ct2/html/images/info/public.xsd
 and an XML DTD at:
    http://clinicaltrials.gov/ct2/html/images/info/public.dtd -->
  <required_header>
    <download_date>ClinicalTrials.gov processed this data on November 07, 2013</download_date>
    <link_text>Link to the current ClinicalTrials.gov record.</link_text>
    <url>http://clinicaltrials.gov/show/NCT00000611</url>
  </required_header>
  <id_info>
    <org_study_id>114</org_study_id>
    <nct_id>NCT00000611</nct_id>
  </id_info>
  <brief_title>Women's Health Initiative (WHI)</brief_title>
  <sponsors>
    <lead_sponsor>
      <agency>National Heart, Lung, and Blood Institute (NHLBI)</agency>
      <agency_class>NIH</agency_class>
    </lead_sponsor>
    <collaborator>
      <agency>National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS)</agency>
      <agency_class>NIH</agency_class>
    </collaborator>
    <collaborator>
      <agency>National Cancer Institute (NCI)</agency>
      <agency_class>NIH</agency_class>
    </collaborator>
    <collaborator>
      <agency>National Institute on Aging (NIA)</agency>
      <agency_class>NIH</agency_class>
    </collaborator>
  </sponsors>
</clinical_study>

I have written the script below for a single XML file, but it does not meet
the requirement: it generates many small files, and I don't know how to merge
them into one.
Below is my Pig script.

register piggybank.jar;

-- Extract the <id_info> block and its two fields.
A = load 'piglab/NCT00000611.xml' using org.apache.pig.piggybank.storage.XMLLoader('id_info') as (x: chararray);
B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x, '<id_info>\\n\\s*<org_study_id>(.*)</org_study_id>\\n\\s*<nct_id>(.*)</nct_id>\\n\\s*</id_info>'))
    as (org_study_id: chararray, nct_id: chararray);
-- Prefix a constant key (1) so the two result sets can be joined later.
C = foreach B GENERATE CONCAT('1$', CONCAT(CONCAT(org_study_id, '$'), nct_id));
STORE C into 'piglab/result1';
data = load 'piglab/result1' USING PigStorage('$') as (a1: int, a2: chararray, a3: chararray);

-- Extract the <lead_sponsor> block and its two fields the same way.
A1 = load 'piglab/NCT00000611.xml' using org.apache.pig.piggybank.storage.XMLLoader('lead_sponsor') as (y: chararray);
B1 = foreach A1 GENERATE FLATTEN(REGEX_EXTRACT_ALL(y, '<lead_sponsor>\\n\\s*<agency>(.*)</agency>\\n\\s*<agency_class>(.*)</agency_class>\\n\\s*</lead_sponsor>'))
    as (agency: chararray, agency_class: chararray);
D = foreach B1 GENERATE CONCAT('1$', CONCAT(CONCAT(agency, '$'), agency_class));
STORE D into 'piglab/result2';
data1 = load 'piglab/result2' USING PigStorage('$') as (b1: int, b2: chararray, b3: chararray);

-- Join the two intermediate results on the constant key.
result = JOIN data by a1, data1 by b1;
store result into 'piglab/result' USING PigStorage('$');

If you can give me a sample Pig script that parses such a nested XML file,
I can move forward from there.
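
One possible direction, offered only as a sketch under stated assumptions and not as a tested solution: load every file with a glob, read each whole clinical_study element as one record so that all fields come from the same row (no constant join key is needed), and force a single reducer so that the output directory holds one part file. The aliases studies, rows, and ordered and the glob 'piglab/*.xml' are hypothetical. The sketch assumes that XMLLoader accepts the opening tag with its rank attribute and that every file has the layout of the sample above. A glob like this covers only one directory level, so nested folders would need a wider pattern.

```pig
-- Sketch only; the paths and aliases are assumptions, not from a tested job.
REGISTER piggybank.jar;

-- The glob makes Pig read every matching file in the folder.
-- XMLLoader emits one chararray per <clinical_study>...</clinical_study> element.
studies = LOAD 'piglab/*.xml'
          USING org.apache.pig.piggybank.storage.XMLLoader('clinical_study')
          AS (doc: chararray);

-- Pull each field out of the same record.
-- (?s) lets '.' span newlines; the leading and trailing '.*' make each
-- pattern cover the whole record, so it works whether the function
-- searches for a match or requires a full match.
rows = FOREACH studies GENERATE
         REGEX_EXTRACT(doc, '(?s).*<org_study_id>(.*?)</org_study_id>.*', 1) AS org_study_id,
         REGEX_EXTRACT(doc, '(?s).*<nct_id>(.*?)</nct_id>.*', 1) AS nct_id,
         REGEX_EXTRACT(doc, '(?s).*<lead_sponsor>.*?<agency>(.*?)</agency>.*', 1) AS agency,
         REGEX_EXTRACT(doc, '(?s).*<lead_sponsor>.*?<agency_class>(.*?)</agency_class>.*', 1) AS agency_class;

-- ORDER BY needs a reduce phase; PARALLEL 1 asks for a single reducer,
-- so the store writes a single part file.
ordered = ORDER rows BY nct_id PARALLEL 1;
STORE ordered INTO 'piglab/result' USING PigStorage('$');
```

If the single reducer becomes a bottleneck on a large folder, another option is to keep the default parallelism and merge the part files afterwards with hadoop fs -getmerge.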