Pig >> mail # dev >> Need help to parse nested XML


Need help to parse nested XML
Hi All,

I badly need help with the following, and I hope you can understand my
situation.

The use case: I have a folder containing multiple XML files, and I need to
write a Pig script that recursively parses all of the files and generates
one flat file.
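
To make the goal concrete, here is a rough sketch of the kind of flattening described above. It is written in Python rather than Pig, purely to illustrate the intended output, and it is not a Pig solution. The column selection in FIELDS, the function names flatten_study and flatten_folder, the tab-separated output, and the choice to join repeated collaborator entries with "; " are all assumptions made for illustration; the element names come from the sample XML in this message.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical column selection: output column -> path relative to <clinical_study>.
FIELDS = {
    "nct_id": "id_info/nct_id",
    "brief_title": "brief_title",
    "lead_sponsor": "sponsors/lead_sponsor/agency",
}


def flatten_study(root):
    """Turn one <clinical_study> element into a flat dict (one output row)."""
    row = {"rank": root.get("rank", "")}
    for column, path in FIELDS.items():
        # findtext returns None when the element is missing; fall back to "".
        row[column] = (root.findtext(path) or "").strip()
    # Repeated elements are joined into a single cell so the row stays flat
    # (an assumption; one row per collaborator would be another option).
    row["collaborators"] = "; ".join(
        (el.text or "").strip()
        for el in root.findall("sponsors/collaborator/agency")
    )
    return row


def flatten_folder(folder, out):
    """Recursively parse every *.xml under folder; write one TSV row per file."""
    columns = ["rank", *FIELDS, "collaborators"]
    writer = csv.DictWriter(
        out, fieldnames=columns, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for path in sorted(Path(folder).rglob("*.xml")):
        writer.writerow(flatten_study(ET.parse(path).getroot()))
```

Calling flatten_folder with a folder path and an open text file (or sys.stdout) would write a header line followed by one line per XML file found anywhere under that folder.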

The XML looks like this. Each XML file has a different rank attribute on
clinical_study, such as clinical_study rank="687":

<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="687">
  <!-- This xml conforms to an XML Schema at:
    http://clinicaltrials.gov/ct2/html/images/info/public.xsd
 and an XML DTD at:
    http://clinicaltrials.gov/ct2/html/images/info/public.dtd -->
  <required_header>
    <download_date>ClinicalTrials.gov processed this data on November 07, 2013</download_date>
    <link_text>Link to the current ClinicalTrials.gov record.</link_text>
    <url>http://clinicaltrials.gov/show/NCT00000611</url>
  </required_header>
  <id_info>
    <org_study_id>114</org_study_id>
    <nct_id>NCT00000611</nct_id>
  </id_info>
  <brief_title>Women's Health Initiative (WHI)</brief_title>
  <sponsors>
    <lead_sponsor>
      <agency>National Heart, Lung, and Blood Institute (NHLBI)</agency>
      <agency_class>NIH</agency_class>
    </lead_sponsor>
    <collaborator>
      <agency>National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS)</agency>
      <agency_class>NIH</agency_class>
    </collaborator>
    <collaborator>
      <agency>National Cancer Institute (NCI)</agency>
      <agency_class>NIH</agency_class>
    </collaborator>
    <collaborator>
      <agency>National Institute on Aging (NIA)</agency>
      <agency_class>NIH</agency_class>
    </collaborator>
  </sponsors>
  <source>National Heart, Lung, and Blood Institute (NHLBI)</source>
  <oversight_info>
    <authority>United States: Federal Government</authority>
  </oversight_info>
  <brief_summary>
    <textblock>
      To address cardiovascular disease, cancer, and osteoporosis, the most common causes of
      death, disability, and impaired quality of life in postmenopausal women.  The three major
      components of the WHI are: a randomized controlled clinical trial of hormone replacement
      therapy (HRT), dietary modification (DM), and calcium/vitamin D supplementation  (CaD); an
      observational study (OS); and a community prevention study (CPS).  On October 1, 1997,
      administration of the WHI was transferred to the NHLBI where it is conducted as a consortium
      effort led by the NHLBI in cooperation with the National Institute of Arthritis and
      Musculoskeletal and Skin Diseases (NIAMS), the National Cancer Institute (NCI), and the
      National Institute on Aging (NIA).
    </textblock>
  </brief_summary>
  <detailed_description>
    <textblock>
      BACKGROUND:

      Prior to 1991, little research had focused on health issues unique to, or more common for,
      women.  This was especially the case for studies of chronic diseases and their prevention in
      mature women. These conditions (coronary heart disease, cancer, and osteoporosis) are the
      leading causes of impairment of quality of life, morbidity, and mortality in post-menopausal
      United States women.  The WHI, mandated by Congress, was established in 1991 by the National
      Institutes of Health and located in the Office of the Director (OD).  The Clinical
      Coordinating Center for the clinical trial/observational study was funded in September 1992
      and the 16 Vanguard Clinical Centers were funded in March 1993.  The initial protocol was
      developed jointly by the Clinical Coordinating Center and the Program Office and was
      reviewed and approved by the Investigators Committee on April 20, 1993.  Additional clinical
      centers were funded in 1994.

      On October 1, 1997, administration of the WHI was transferred to the NHLBI where it is
      conducted as a consortium effort led by the NHLBI in cooperation with the National Institute
      of Arthritis and Musculoskeletal and Skin Diseases (NIAMS), the National Cancer Institute
      (NCI), and the National Institute on Aging (NIA).

      DESIGN NARRATIVE:

      As has been described in the objective, the WHI has three major components: a randomized
      controlled clinical trial, an observational study, and a study of community approaches to
      developing healthful behaviors.  Recruitment for the WHI began in September 1993 and ended
      in December 1998. Six clinical centers completed recruitment in January 1997.  The remaining
      34 centers completed recruitment in December 1998.

      CLINICAL TRIAL COMPONENT

      The clinical trial component consists of three subtrials: the hormone replacement trial, the
      dietary modification trial, and the calcium /vitamin D supplementation trial. Approximately
      27,500 women aged 50 to 79 are participating in the HRT, which tests whether long-term HRT
      reduces coronary heart disease and fractures without increasing breast cancer risk.  Women
      with a uterus were randomized to receive either estrogen plus progestin or a placebo.
      Progestin was added to protect women with a uterus from endometrial cancer.  Women who have
      had a hysterectomy were randomized to receive either estrogen alone or a placebo. The
      estrogen plus progestin trial was stopped early on July 8, 2002 after an average follow-up
      of 5.2 years on the recommendation of the Data and Safety Monitoring Board. The estrogen
      alone study continued unchanged until March 2, 2004 when the NIH instructed participants to
      stop taking their study pills and to begin the follow-up phase of the study. . Particip