Pig, mail # user - Re: Need help

Re: Need help
Pradeep Gollakota 2013-11-28, 03:16
This question belongs on the user list. The dev list is meant for Pig
developers to discuss issues related to the development of Pig, so I’ve
forwarded it to the user list. It also helps tremendously if you format
your data and scripts nicely, as they’re much easier to read and understand
that way. I use a Chrome extension called Markdown Here to get the proper
HTML (see below).

Data:

<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="687">
  <required_header>
    <download_date>ClinicalTrials.gov processed this data on November 07, 2013</download_date>
    <link_text>Link to the current ClinicalTrials.gov record.</link_text>
    <url>http://clinicaltrials.gov/show/NCT00000611</url>
  </required_header>
  <id_info>
    <org_study_id>114</org_study_id>
    <nct_id>NCT00000611</nct_id>
  </id_info>
  <brief_title>Women's Health Initiative (WHI)</brief_title>
  <sponsors>
    <lead_sponsor>
      <agency>National Heart, Lung, and Blood Institute (NHLBI)</agency>
      <agency_class>NIH</agency_class>
    </lead_sponsor>
    <collaborator>
      <agency>National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS)</agency>
      <agency_class>NIH</agency_class>
    </collaborator>
    <collaborator>
      <agency>National Cancer Institute (NCI)</agency>
      <agency_class>NIH</agency_class>
    </collaborator>
    <collaborator>
      <agency>National Institute on Aging (NIA)</agency>
      <agency_class>NIH</agency_class>
    </collaborator>
  </sponsors>
</clinical_study>

Script:

register piggybank.jar;

-- Extract org_study_id and nct_id from the <id_info> element.
A = load 'piglab/NCT00000611.xml'
    using org.apache.pig.piggybank.storage.XMLLoader('id_info')
    as (x: chararray);
B = foreach A GENERATE FLATTEN(
        REGEX_EXTRACT_ALL(x,
            '<id_info>\\n\\s*<org_study_id>(.*)</org_study_id>\\n\\s*<nct_id>(.*)</nct_id>\\n\\s*</id_info>'))
    as (org_study_id: chararray, nct_id: chararray);
-- Prefix each row with the constant key 1 so the two result sets can be joined.
C = foreach B GENERATE CONCAT('1$', CONCAT(CONCAT(org_study_id, '$'), nct_id));
STORE C into 'piglab/result1';
data = load 'piglab/result1' USING PigStorage('$')
    as (a1: int, a2: chararray, a3: chararray);

-- Extract agency and agency_class from the <lead_sponsor> element.
A1 = load 'piglab/NCT00000611.xml'
    using org.apache.pig.piggybank.storage.XMLLoader('lead_sponsor')
    as (y: chararray);
B1 = foreach A1 GENERATE FLATTEN(
        REGEX_EXTRACT_ALL(y,
            '<lead_sponsor>\\n\\s*<agency>(.*)</agency>\\n\\s*<agency_class>(.*)</agency_class>\\n\\s*</lead_sponsor>'))
    as (agency: chararray, agency_class: chararray);
D = foreach B1 GENERATE CONCAT('1$', CONCAT(CONCAT(agency, '$'), agency_class));
STORE D into 'piglab/result2';
data1 = load 'piglab/result2' USING PigStorage('$')
    as (b1: int, b2: chararray, b3: chararray);

-- Join the two single-row results on the constant key and write one flat row.
result = JOIN data by a1, data1 by b1;
store result into 'piglab/result' USING PigStorage('$');
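
The script above handles a single file. With a folder of files, the constant join key would pair every id_info row with every lead_sponsor row. One possible way around that, offered as a rough and untested sketch rather than a drop-in solution, is to load each whole clinical_study element as one record and pull the fields out of it on the same row. The relation names (studies, flat_rows), the field name doc, and the output path piglab/flat are made up for illustration. The sketch also assumes that every file matching piglab/*.xml is a well-formed clinical_study document and that XMLLoader copes with the rank attribute on the opening tag.

```pig
-- Sketch only: names and the output path are hypothetical, see the note above.
register piggybank.jar;

-- A glob in the load path picks up every matching file in the folder.
-- XMLLoader is asked for one record per <clinical_study> element.
studies = load 'piglab/*.xml'
    using org.apache.pig.piggybank.storage.XMLLoader('clinical_study')
    as (doc: chararray);

-- REGEX_EXTRACT returns the requested group of the first match it finds,
-- so all fields of one study stay on the same row and no join is needed.
flat_rows = foreach studies generate
    REGEX_EXTRACT(doc, '<org_study_id>(.*?)</org_study_id>', 1) as org_study_id,
    REGEX_EXTRACT(doc, '<nct_id>(.*?)</nct_id>', 1) as nct_id,
    REGEX_EXTRACT(doc, '<lead_sponsor>\\s*<agency>(.*?)</agency>', 1) as agency,
    REGEX_EXTRACT(doc,
        '<lead_sponsor>\\s*<agency>.*?</agency>\\s*<agency_class>(.*?)</agency_class>',
        1) as agency_class;

store flat_rows into 'piglab/flat' using PigStorage('$');
```

Because each study is one record here, the intermediate result1 and result2 directories and the constant join key are not needed. Whether the loader behaves well on large or oddly formatted files is something to check against your own data.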

On Wed, Nov 27, 2013 at 6:03 PM, Haider <[EMAIL PROTECTED]> wrote:

> Hi Daniel
>
> I badly need help; I hope you understand my situation.
>
> The use case is this: I have one folder that contains multiple XML files, and
> I need to write a Pig script that recursively parses all the files and
> generates one flat file.
>
> The XML looks like this, and each XML file has a different clinical_study
> rank, such as <clinical_study rank="687">:
> <?xml version="1.0" encoding="UTF-8"?>
> <clinical_study rank="687">
>   <!-- This xml conforms to an XML Schema at:
>     http://clinicaltrials.gov/ct2/html/images/info/public.xsd
>  and an XML DTD at:
>     http://clinicaltrials.gov/ct2/html/images/info/public.dtd -->
>   <required_header>
>     <download_date>ClinicalTrials.gov processed this data on November 07, 2013</download_date>
>     <link_text>Link to the current ClinicalTrials.gov record.</link_text>
>     <url>http://clinicaltrials.gov/show/NCT00000611</url>
>   </required_header>
>   <id_info>
>     <org_study_id>114</org_study_id>
>     <nct_id>NCT00000611</nct_id>