Pig, mail # user - How to read and store XML files from a folder one by one


How to read and store XML files from a folder one by one
Haider 2013-11-30, 14:37
Hi,

I am new to Pig scripting and need help or suggestions to resolve the
problem below.

I have 1000 XML files in a folder. My Pig script has to read them one by
one, parse out some values, and store those values in a single file.
I tried the script below, but it is not working as expected.
register piggybank.jar;

A = load 'XML/NCT{00000611,00000768}.xml'
    using org.apache.pig.piggybank.storage.XMLLoader('org_study_id')
    as (x: chararray);
A2 = foreach A GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(x,'<org_study_id>(.*)</org_study_id>'))
    as (org_study_id: chararray);
A3 = foreach A2 GENERATE CONCAT('#$',CONCAT(org_study_id,'$'));
STORE A3 into 'piglab/result1';
data = load 'piglab/result1' USING PigStorage('$')
    as (a1: chararray, a2: chararray);

C = load 'XML/NCT{00000611,00000768}.xml'
    using org.apache.pig.piggybank.storage.XMLLoader('nct_id')
    as (x1: chararray);
C2 = foreach C GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(x1,'<nct_id>(.*)</nct_id>'))
    as (nct_id: chararray);
C3 = foreach C2 GENERATE CONCAT('#$',CONCAT(nct_id,'$'));
STORE C3 into 'piglab/result11';
data11 = load 'piglab/result11' USING PigStorage('$')
    as (c1: chararray, c2: chararray);

I = load 'piglab/NCT{00000611,00000768}.xml'
    using org.apache.pig.piggybank.storage.XMLLoader('minimum_age')
    as (x5: chararray);
I2 = foreach I GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(x5,'<minimum_age>(.*)</minimum_age>'))
    as (minimum_age: chararray);
I3 = foreach I2 GENERATE CONCAT('#$',CONCAT(minimum_age,'$'));
STORE I3 into 'piglab/result9';
data8 = load 'piglab/result9' USING PigStorage('$')
    as (i1: chararray, i2: chararray);

result3 = JOIN data by a1, data11 by c1, data8 by i1;
Store result3 into 'piglab/result';

The XML looks like this. Each XML file has a different rank attribute on its
clinical_study element, such as <clinical_study rank="687">:

<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="687">
  <!-- This xml conforms to an XML Schema at:
    http://clinicaltrials.gov/ct2/html/images/info/public.xsd
 and an XML DTD at:
    http://clinicaltrials.gov/ct2/html/images/info/public.dtd -->
  <required_header>
    <download_date>ClinicalTrials.gov processed this data on November 07, 2013</download_date>
    <link_text>Link to the current ClinicalTrials.gov record.</link_text>
    <url>http://clinicaltrials.gov/show/NCT00000611</url>
  </required_header>
  <id_info>
    <org_study_id>114</org_study_id>
    <nct_id>NCT00000611</nct_id>
  </id_info>
  <brief_title>Women's Health Initiative (WHI)</brief_title>
  <sponsors>
    <lead_sponsor>
      <agency>National Heart, Lung, and Blood Institute (NHLBI)</agency>
      <agency_class>NIH</agency_class>
    </lead_sponsor>
    <collaborator>
      <agency>National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS)</agency>
      <agency_class>NIH</agency_class>
    </collaborator>
    <collaborator>
      <agency>National Cancer Institute (NCI)</agency>
      <agency_class>NIH</agency_class>
    </collaborator>
    <collaborator>
      <agency>National Institute on Aging (NIA)</agency>
      <agency_class>NIH</agency_class>
    </collaborator>
  </sponsors>
</clinical_study>

Any help on this would be highly appreciated.

Thanks
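
For reference, one way the task described above might be restructured is as a single pass: load every file with a glob, let XMLLoader emit one record per clinical_study element, and pull the three values out of each record. This is only an untested sketch under several assumptions: that all files sit under 'XML/' and match '*.xml', that each tag occurs at most once per file, that the installed Pig version provides the built-in REGEX_EXTRACT, and that XMLLoader accepts a root tag carrying an attribute. The relation names and the output path 'piglab/result_single' are made up for illustration. The patterns are written with (?s) and surrounding .* so that they should behave the same whether the function matches the whole string or only searches within it.

```pig
register piggybank.jar;

-- Assumption: all input files live under XML/ and end in .xml.
-- XMLLoader is asked for the root tag, so each record holds one whole study.
studies = load 'XML/*.xml'
    using org.apache.pig.piggybank.storage.XMLLoader('clinical_study')
    as (doc: chararray);

-- Extract the three values from the same record, so they stay on one row
-- and no join is needed. (?s) lets '.' span the line breaks inside doc.
fields = foreach studies generate
    REGEX_EXTRACT(doc, '(?s).*<org_study_id>(.*?)</org_study_id>.*', 1) as org_study_id,
    REGEX_EXTRACT(doc, '(?s).*<nct_id>(.*?)</nct_id>.*', 1) as nct_id,
    REGEX_EXTRACT(doc, '(?s).*<minimum_age>(.*?)</minimum_age>.*', 1) as minimum_age;

-- Hypothetical output path; '$' is kept as the delimiter used in the script above.
store fields into 'piglab/result_single' using PigStorage('$');
```

One detail of the script in the question that this sidesteps: each intermediate row is written as '#$value$' and read back with PigStorage('$'), so the first field (a1, c1, i1) is '#' on every row. Joining on those fields therefore pairs every row with every other row rather than matching values that came from the same file.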