Pig >> mail # user >> Pig and XML parsing


Re: Pig and XML parsing
How about this:
A = load 'input' using org.apache.pig.piggybank.storage.XMLLoader('property') as (variable: datatype);
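
On the XPath part of the question: piggybank also ships an XPath UDF that can be applied to each element XMLLoader returns. A minimal sketch, assuming your piggybank build includes org.apache.pig.piggybank.evaluation.xml.XPath; the jar path, the 'book' element, and its 'title' and 'author' children are hypothetical placeholders, not taken from your data:

```pig
-- Jar path is a placeholder; point it at your own piggybank build.
REGISTER '/path/to/piggybank.jar';

-- Short alias for the piggybank XPath UDF.
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();

-- XMLLoader emits one chararray per <book>...</book> element
-- ('book' is a hypothetical tag name).
books = LOAD 'input'
        USING org.apache.pig.piggybank.storage.XMLLoader('book')
        AS (doc: chararray);

-- Evaluate an XPath expression against each element's XML text.
fields = FOREACH books GENERATE
           XPath(doc, 'book/title')  AS title,
           XPath(doc, 'book/author') AS author;

DUMP fields;
```

The same FOREACH should work on any chararray field holding an XML document, so it could also be applied to values read from a SequenceFile, as long as each value is one complete document.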
On Fri, Oct 18, 2013 at 4:38 AM, Sameer Tilak <[EMAIL PROTECTED]> wrote:

> Hi All,
> I have a lot of small (~2 to 3 MB) XML files that I would like to process.
> I was thinking along the following lines, please let me know if you have
> any thoughts on this.
>
> 1. Create SequenceFiles such that each one is between 60 and 64 MB and no
> XML file is split across two SequenceFiles.
> 2. Write a Pig script that loads the SequenceFiles, then iterates over the
> individual XML files and analyzes them.
> I was planning to use Elephant-Bird to read sequencefiles. Here is what
> their documentation says:
> Hadoop SequenceFiles and Pig
>
> Reading and writing Hadoop SequenceFiles with Pig is supported via classes
> SequenceFileLoader and SequenceFileStorage. These classes make use of a
> WritableConverter interface, allowing pluggable conversion of key and value
> instances to and from Pig data types.
>
>
> Here's a short example: Suppose you have SequenceFile<Text, LongWritable> data
> sitting beneath path input. We can load that data with the following Pig
> script:
>
>
> REGISTER '/path/to/elephant-bird.jar';
>
> %declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
> %declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
> %declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';
>
> pairs = LOAD 'input' USING $SEQFILE_LOADER (
>   '-c $TEXT_CONVERTER', '-c $LONG_CONVERTER'
> ) AS (key: chararray, value: long);
>
>
> I was looking at XMLLoader from piggybank. Has anyone used XPATH queries
> in their Pig scripts?
>
--
*Thanks & Regards,*
*S. Ajay Kumar
+91-9966159106*