Pig >> mail # dev >> Need help


Hi Daniel,

I badly need help, and I hope you will understand my situation.

The use case is this: I have one folder that contains multiple XML files,
and I need to write a Pig script that recursively parses all the files and
generates one flat file.

The XML looks like this. Each XML file has a different clinical_study rank,
such as <clinical_study rank="687">:
<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="687">
  <!-- This xml conforms to an XML Schema at:
    http://clinicaltrials.gov/ct2/html/images/info/public.xsd
 and an XML DTD at:
    http://clinicaltrials.gov/ct2/html/images/info/public.dtd -->
  <required_header>
    <download_date>ClinicalTrials.gov processed this data on November 07, 2013</download_date>
    <link_text>Link to the current ClinicalTrials.gov record.</link_text>
    <url>http://clinicaltrials.gov/show/NCT00000611</url>
  </required_header>
  <id_info>
    <org_study_id>114</org_study_id>
    <nct_id>NCT00000611</nct_id>
  </id_info>
  <brief_title>Women's Health Initiative (WHI)</brief_title>
  <sponsors>
    <lead_sponsor>
      <agency>National Heart, Lung, and Blood Institute (NHLBI)</agency>
      <agency_class>NIH</agency_class>
    </lead_sponsor>
    <collaborator>
      <agency>National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS)</agency>
      <agency_class>NIH</agency_class>
    </collaborator>
    <collaborator>
      <agency>National Cancer Institute (NCI)</agency>
      <agency_class>NIH</agency_class>
    </collaborator>
    <collaborator>
      <agency>National Institute on Aging (NIA)</agency>
      <agency_class>NIH</agency_class>
    </collaborator>
  </sponsors>
</clinical_study>

I have written the script below for a single XML file, but it does not work
as required: it generates many small files, and I don't know how to merge
them into one.
Below is my Pig script.

register piggybank.jar;

-- Extract org_study_id and nct_id from the <id_info> element.
A = load 'piglab/NCT00000611.xml' using
    org.apache.pig.piggybank.storage.XMLLoader('id_info') as (x: chararray);
B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,
    '<id_info>\\n\\s*<org_study_id>(.*)</org_study_id>\\n\\s*<nct_id>(.*)</nct_id>\\n\\s*</id_info>'))
    as (org_study_id: chararray, nct_id: chararray);
C = foreach B GENERATE CONCAT('1$', CONCAT(CONCAT(org_study_id, '$'), nct_id));
STORE C into 'piglab/result1';
data = load 'piglab/result1' USING PigStorage('$') as (a1: int, a2: chararray, a3: chararray);

-- Extract agency and agency_class from the <lead_sponsor> element.
A1 = load 'piglab/NCT00000611.xml' using
    org.apache.pig.piggybank.storage.XMLLoader('lead_sponsor') as (y: chararray);
B1 = foreach A1 GENERATE FLATTEN(REGEX_EXTRACT_ALL(y,
    '<lead_sponsor>\\n\\s*<agency>(.*)</agency>\\n\\s*<agency_class>(.*)</agency_class>\\n\\s*</lead_sponsor>'))
    as (agency: chararray, agency_class: chararray);
D = foreach B1 GENERATE CONCAT('1$', CONCAT(CONCAT(agency, '$'), agency_class));
STORE D into 'piglab/result2';
data1 = load 'piglab/result2' USING PigStorage('$') as (b1: int, b2: chararray, b3: chararray);

-- Join the two intermediate results on the constant key 1.
result = JOIN data by a1, data1 by b1;
store result into 'piglab/result' USING PigStorage('$');

If you can give me one sample Pig script that parses such a nested XML file,
then I can go forward with that.
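
One possible direction, offered as an untested sketch rather than a definitive answer: load the whole folder with a glob, pull the fields out of each clinical_study record in a single pass, and force a single reducer so that only one output file is written. The 'piglab/*.xml' glob, the relation names (studies, flat, sorted), and the output path 'piglab/flat_result' are assumptions for illustration. The sketch also assumes that XMLLoader returns each whole <clinical_study> element as one chararray, and it ignores the repeated <collaborator> entries.

```pig
-- Sketch only: paths and relation names are hypothetical, not tested on the real data.
REGISTER piggybank.jar;

-- The glob picks up every XML file in the folder; XMLLoader is assumed to
-- emit one record per <clinical_study> element.
studies = LOAD 'piglab/*.xml'
          USING org.apache.pig.piggybank.storage.XMLLoader('clinical_study')
          AS (doc: chararray);

-- (?s) lets '.' match newlines, and the leading and trailing '.*' let each
-- pattern cover the whole record, so no intermediate STORE/LOAD or JOIN is needed.
flat = FOREACH studies GENERATE
         REGEX_EXTRACT(doc, '(?s).*<org_study_id>(.*?)</org_study_id>.*', 1) AS org_study_id,
         REGEX_EXTRACT(doc, '(?s).*<nct_id>(.*?)</nct_id>.*', 1) AS nct_id,
         REGEX_EXTRACT(doc, '(?s).*<lead_sponsor>.*?<agency>(.*?)</agency>.*', 1) AS agency,
         REGEX_EXTRACT(doc, '(?s).*<lead_sponsor>.*?<agency_class>(.*?)</agency_class>.*', 1) AS agency_class;

-- Sorting with PARALLEL 1 runs a single reducer, which should leave one part file.
sorted = ORDER flat BY nct_id PARALLEL 1;

STORE sorted INTO 'piglab/flat_result' USING PigStorage('$');
```

If the job is kept map-only instead, the small part files can also be concatenated afterwards with hadoop fs -getmerge, which copies a directory's files into one local file.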