Pig >> mail # dev >> Need help to parse nested XML


Need help to parse nested XML
Hi All,

I badly need help with the following and hope you can assist.

The use case: I have one folder containing multiple XML files, and I need
to write a Pig script that recursively parses all the files and generates
one flat file.
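
To make the target concrete, here is a minimal sketch of that flattening step written in plain Python rather than Pig. It is only an illustration of the intended result (one row per XML file), not a Pig solution. The element names come from the sample in this message; the choice of columns, the tab-separated output, and the function names flatten_study and flatten_folder are assumptions made for the example.

```python
import csv
import glob
import os
import xml.etree.ElementTree as ET

# Hypothetical column choice: the message does not say which fields the flat file needs.
COLUMNS = ["rank", "nct_id", "brief_title", "lead_agency", "collaborators"]


def flatten_study(xml_text):
    """Turn one <clinical_study> document into a single flat row (a dict)."""
    root = ET.fromstring(xml_text)
    collaborators = [
        (c.findtext("agency") or "").strip()
        for c in root.findall("sponsors/collaborator")
    ]
    return {
        "rank": root.get("rank", ""),
        "nct_id": (root.findtext("id_info/nct_id") or "").strip(),
        "brief_title": (root.findtext("brief_title") or "").strip(),
        "lead_agency": (root.findtext("sponsors/lead_sponsor/agency") or "").strip(),
        # Repeated <collaborator> elements are joined so each study stays on one row.
        "collaborators": "; ".join(collaborators),
    }


def flatten_folder(folder, out_path):
    """Walk `folder` recursively and write one tab-separated row per XML file."""
    pattern = os.path.join(folder, "**", "*.xml")
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        for path in sorted(glob.glob(pattern, recursive=True)):
            # Read bytes so the parser honours the encoding in the XML declaration.
            with open(path, "rb") as f:
                writer.writerow(flatten_study(f.read()))
```

The nested part of the problem is the repeated collaborator element: a flat file needs either one joined column, as above, or one output row per collaborator. Which of the two is wanted is not stated here, so the joined column is just one possible choice.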

The XML looks like the sample below. Each file has a different rank
attribute on its clinical_study element, such as clinical_study rank="687".

<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="687">
  <!-- This xml conforms to an XML Schema at:
    http://clinicaltrials.gov/ct2/html/images/info/public.xsd
 and an XML DTD at:
    http://clinicaltrials.gov/ct2/html/images/info/public.dtd -->
  <required_header>
    <download_date>ClinicalTrials.gov processed this data on November 07, 2013</download_date>
    <link_text>Link to the current ClinicalTrials.gov record.</link_text>
    <url>http://clinicaltrials.gov/show/NCT00000611</url>
  </required_header>
  <id_info>
    <org_study_id>114</org_study_id>
    <nct_id>NCT00000611</nct_id>
  </id_info>
  <brief_title>Women's Health Initiative (WHI)</brief_title>
  <sponsors>
    <lead_sponsor>
      <agency>National Heart, Lung, and Blood Institute (NHLBI)</agency>
      <agency_class>NIH</agency_class>
    </lead_sponsor>
    <collaborator>
      <agency>National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS)</agency>
      <agency_class>NIH</agency_class>
    </collaborator>
    <collaborator>
      <agency>National Cancer Institute (NCI)</agency>
      <agency_class>NIH</agency_class>
    </collaborator>
    <collaborator>
      <agency>National Institute on Aging (NIA)</agency>
      <agency_class>NIH</agency_class>
    </collaborator>
  </sponsors>
  <source>National Heart, Lung, and Blood Institute (NHLBI)</source>
  <oversight_info>
    <authority>United States: Federal Government</authority>
  </oversight_info>
  <brief_summary>
    <textblock>
      To address cardiovascular disease, cancer, and osteoporosis, the most common causes of
      death, disability, and impaired quality of life in postmenopausal women.  The three major
      components of the WHI are: a randomized controlled clinical trial of hormone replacement
      therapy (HRT), dietary modification (DM), and calcium/vitamin D supplementation  (CaD); an
      observational study (OS); and a community prevention study (CPS).  On October 1, 1997,
      administration of the WHI was transferred to the NHLBI where it is conducted as a consortium
      effort led by the NHLBI in cooperation with the National Institute of Arthritis and
      Musculoskeletal and Skin Diseases (NIAMS), the National Cancer Institute (NCI), and the
      National Institute on Aging (NIA).
    </textblock>
  </brief_summary>
  <detailed_description>
    <textblock>
      BACKGROUND:

      Prior to 1991, little research had focused on health issues unique to, or more common for,
      women.  This was especially the case for studies of chronic diseases and their prevention in
      mature women. These conditions (coronary heart disease, cancer, and osteoporosis) are the
      leading causes of impairment of quality of life, morbidity, and mortality in post-menopausal
      United States women.  The WHI, mandated by Congress, was established in 1991 by the National
      Institutes of Health and located in the Office of the Director (OD).  The Clinical
      Coordinating Center for the clinical trial/observational study was funded in September 1992
      and the 16 Vanguard Clinical Centers were funded in March 1993.  The initial protocol was
      developed jointly by the Clinical Coordinating Center and the Program Office and was
      reviewed and approved by the Investigators Committee on April 20, 1993.  Additional clinical
      centers were funded in 1994.

      On October 1, 1997, administration of the WHI was transferred to the NHLBI where it is
      conducted as a consortium effort led by the NHLBI in cooperation with the National Institute
      of Arthritis and Musculoskeletal and Skin Diseases (NIAMS), the National Cancer Institute
      (NCI), and the National Institute on Aging (NIA).

      DESIGN NARRATIVE:

      As has been described in the objective, the WHI has three major components: a randomized
      controlled clinical trial, an observational study, and a study of community approaches to
      developing healthful behaviors.  Recruitment for the WHI began in September 1993 and ended
      in December 1998. Six clinical centers completed recruitment in January 1997.  The remaining
      34 centers completed recruitment in December 1998.

      CLINICAL TRIAL COMPONENT

      The clinical trial component consists of three subtrials: the hormone replacement trial, the
      dietary modification trial, and the calcium /vitamin D supplementation trial. Approximately
      27,500 women aged 50 to 79 are participating in the HRT, which tests whether long-term HRT
      reduces coronary heart disease and fractures without increasing breast cancer risk.  Women
      with a uterus were randomized to receive either estrogen plus progestin or a placebo.
      Progestin was added to protect women with a uterus from endometrial cancer.  Women who have
      had a hysterectomy were randomized to receive either estrogen alone or a placebo. The
      estrogen plus progestin trial was stopped early on July 8, 2002 after an average follow-up
      of 5.2 years on the recommendation of the Data and Safety Monitoring Board. The estrogen
      alone study continued unchanged until March 2, 2004 when the NIH instructed participants to
      stop taking their study pills and to begin the follow-up phase of the study. . Particip