Re: Pig and XML parsing
ajay kumar 2013-10-18, 03:54
how about this,
A = load 'input' using org.apache.pig.piggybank.storage.XMLLoader('property')
    as (variable: datatype);
On Fri, Oct 18, 2013 at 4:38 AM, Sameer Tilak <[EMAIL PROTECTED]> wrote:
> Hi All,
> I have a lot of small (~2 to 3 MB) XML files that I would like to process.
> I was thinking along the following lines, please let me know if you have
> any thoughts on this.
> 1. Create SequenceFiles such that each sequence file is between 60 and 64 MB
> and no XML file is split across two sequence files.
> 2. Write a Pig script that loads the sequence file, then iterates over the
> individual XML files and analyzes them.
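The packing in step 1 can be sketched roughly as follows. This is only a minimal Python illustration of the grouping logic, assuming a simple greedy pass over (name, size) pairs; `plan_batches` and the file names are hypothetical, and actually writing the SequenceFiles is left out.

```python
from typing import Iterable, List, Tuple

MB = 1024 * 1024


def plan_batches(files: Iterable[Tuple[str, int]],
                 max_bytes: int = 64 * MB) -> List[List[str]]:
    """Group (name, size) pairs into batches of at most max_bytes in total.

    A file is never split: each one lands whole in exactly one batch.
    A single file larger than max_bytes ends up alone in its own batch.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        # Close the current batch if this file would push it over the cap.
        if current and current_size + size > max_bytes:
            batches.append(current)
            current = []
            current_size = 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical input: fifty XML files of 3 MB each.
example = [(f"doc{i}.xml", 3 * MB) for i in range(50)]
batches = plan_batches(example)
```

With fifty 3 MB files this gives batches of 21, 21 and 8 files, so the full batches hold 63 MB each. Each batch would then become one SequenceFile, for instance with the file name as key and the XML text as value.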
> I was planning to use Elephant-Bird to read sequencefiles. Here is what
> their documentation says:
> Hadoop SequenceFiles and Pig
> Reading and writing Hadoop SequenceFiles with Pig is supported via classes
> SequenceFileStorage. These
> classes make use of a
> interface, allowing pluggable conversion of key and value instances to and
> from Pig data types.
> Here's a short example: Suppose you have SequenceFile<Text, LongWritable>
> sitting beneath path input. We can load that data with the following Pig
> script:
> REGISTER '/path/to/elephant-bird.jar';
> %declare SEQFILE_LOADER
> %declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
> %declare LONG_CONVERTER
> pairs = LOAD 'input' USING $SEQFILE_LOADER (
> '-c $TEXT_CONVERTER', '-c $LONG_CONVERTER'
> ) AS (key: chararray, value: long);
> I was looking at XMLLoader from piggybank. Has anyone used XPATH queries
> in their Pig scripts?
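To make the XPath question concrete, here is a small Python sketch of that kind of lookup on a single record. It is not piggybank's XMLLoader or any Pig UDF; it assumes each record is the text of one matching element, and the record contents and the helper `xpath_text` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical record, shaped like one element matched on the tag 'property'.
record = """
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
"""


def xpath_text(xml_fragment: str, path: str) -> list:
    """Return the text of every node matching path.

    Uses ElementTree's limited XPath subset, evaluated relative to the
    root element of the fragment.
    """
    root = ET.fromstring(xml_fragment.strip())
    return [node.text for node in root.findall(path)]


names = xpath_text(record, "./name")
values = xpath_text(record, "./value")
```

In a Pig script the same extraction would have to be applied to each record after loading, for example through a UDF.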
*Thanks & Regards,*
*S. Ajay Kumar*