Pig >> mail # user >> How to read and store XML files from a folder one by one


How to read and store XML files from a folder one by one
Hi,

I am new to Pig scripting and need help or suggestions to resolve the
problem below.

I have 1000 XML files in a folder. My Pig script has to take them one by
one, parse each for some values, and store those values in a single file.
I tried the script below, but it is not working as expected.

register piggybank.jar;

A = load 'XML/NCT{00000611,00000768}.xml'
    using org.apache.pig.piggybank.storage.XMLLoader('org_study_id') as (x: chararray);
A2 = foreach A GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(x,'<org_study_id>(.*)</org_study_id>')) as (org_study_id: chararray);
A3 = foreach A2 GENERATE CONCAT('#$',CONCAT(org_study_id,'$'));
STORE A3 into 'piglab/result1';
data = load 'piglab/result1' USING PigStorage('$') as (a1: chararray, a2: chararray);

C = load 'XML/NCT{00000611,00000768}.xml'
    using org.apache.pig.piggybank.storage.XMLLoader('nct_id') as (x1: chararray);
C2 = foreach C GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(x1,'<nct_id>(.*)</nct_id>')) as (nct_id: chararray);
C3 = foreach C2 GENERATE CONCAT('#$',CONCAT(nct_id,'$'));
STORE C3 into 'piglab/result11';
data11 = load 'piglab/result11' USING PigStorage('$') as (c1: chararray, c2: chararray);

I = load 'piglab/NCT{00000611,00000768}.xml'
    using org.apache.pig.piggybank.storage.XMLLoader('minimum_age') as (x5: chararray);
I2 = foreach I GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(x5,'<minimum_age>(.*)</minimum_age>')) as (minimum_age: chararray);
I3 = foreach I2 GENERATE CONCAT('#$',CONCAT(minimum_age,'$'));
STORE I3 into 'piglab/result9';
data8 = load 'piglab/result9' USING PigStorage('$') as (i1: chararray, i2: chararray);

result3 = JOIN data by a1, data11 by c1, data8 by i1;
STORE result3 into 'piglab/result';
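
For comparison, a single-pass variant of the script above might look like the sketch below. It is untested and rests on several assumptions: that a glob such as 'XML/*.xml' matches every file in the folder, that XMLLoader is given the root tag clinical_study so each file yields one record, and that each wanted tag appears at most once per file. The relation names docs and fields are made up for illustration. It also avoids the JOIN: after the round trip through PigStorage('$'), the keys a1, c1 and i1 all appear to hold the constant '#', so the join would pair every row with every other row.

```pig
REGISTER piggybank.jar;

-- Assumption: the glob picks up all XML files in the folder in one LOAD.
docs = LOAD 'XML/*.xml'
       USING org.apache.pig.piggybank.storage.XMLLoader('clinical_study')
       AS (doc: chararray);

-- One output row per <clinical_study> element, i.e. one row per file.
-- REGEX_EXTRACT yields null for a tag that is missing from a document.
fields = FOREACH docs GENERATE
    REGEX_EXTRACT(doc, '<org_study_id>(.*?)</org_study_id>', 1) AS org_study_id,
    REGEX_EXTRACT(doc, '<nct_id>(.*?)</nct_id>', 1) AS nct_id,
    REGEX_EXTRACT(doc, '<minimum_age>(.*?)</minimum_age>', 1) AS minimum_age;

-- Values from all files land in one output directory, '$'-delimited.
STORE fields INTO 'piglab/result' USING PigStorage('$');
```

Because all three values come from the same record, they stay aligned per file without any join key.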

The XML looks like this. Each XML file has a different clinical_study rank,
such as <clinical_study rank="687">.

<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="687">
  <!-- This xml conforms to an XML Schema at:
    http://clinicaltrials.gov/ct2/html/images/info/public.xsd
 and an XML DTD at:
    http://clinicaltrials.gov/ct2/html/images/info/public.dtd -->
  <required_header>
    <download_date>ClinicalTrials.gov processed this data on November 07, 2013</download_date>
    <link_text>Link to the current ClinicalTrials.gov record.</link_text>
    <url>http://clinicaltrials.gov/show/NCT00000611</url>
  </required_header>
  <id_info>
    <org_study_id>114</org_study_id>
    <nct_id>NCT00000611</nct_id>
  </id_info>
  <brief_title>Women's Health Initiative (WHI)</brief_title>
  <sponsors>
    <lead_sponsor>
      <agency>National Heart, Lung, and Blood Institute (NHLBI)</agency>
      <agency_class>NIH</agency_class>
    </lead_sponsor>
    <collaborator>
      <agency>National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS)</agency>
      <agency_class>NIH</agency_class>
    </collaborator>
    <collaborator>
      <agency>National Cancer Institute (NCI)</agency>
      <agency_class>NIH</agency_class>
    </collaborator>
    <collaborator>
      <agency>National Institute on Aging (NIA)</agency>
      <agency_class>NIH</agency_class>
    </collaborator>
  </sponsors>
</clinical_study>

Any help on this would be highly appreciated.

Thanks