
RE: Custom Loaders that use Input Streams for reading data?
I'm using org.apache.pig.piggybank.storage.XMLLoader from piggybank and that's working well for me.  I do something like this:

-- The analyze_src_recs.py script reads XML from stdin and writes
-- comma-separated lines (rec_type,...) to stdout.
define analyze_src `analyze_src_recs.py`
    input  (stdin)
    output (stdout USING PigStreaming(','))
    ship   ('$scriptDir/analyze_src_recs.py');

SrcLines = load '$src_xml/*.xml*'
    using org.apache.pig.piggybank.storage.XMLLoader('REC')
    as (doc:chararray);

ParseOut = stream SrcLines through analyze_src
    as (rec_type : int,
        -- other fields my parser pulled out of the XML
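
For illustration only, here is a minimal sketch of what a streaming script in the role of analyze_src_recs.py might look like. It is not the actual script. The `type` attribute and the `title` child element are hypothetical names invented for this example, and the sketch assumes each record reaches stdin as a complete `<REC>...</REC>` element with no nested REC elements. It reads all of stdin at once, which is simple but may not suit very large inputs.

```python
import re
import sys
import xml.etree.ElementTree as ET

# Matches one <REC ...>...</REC> element, even if it spans several lines.
# Assumption: REC elements are not nested and are not self-closing.
REC_PATTERN = re.compile(r"<REC\b.*?</REC>", re.DOTALL)


def parse_records(text):
    """Yield (rec_type, title) for each well-formed REC element in text."""
    for match in REC_PATTERN.finditer(text):
        try:
            rec = ET.fromstring(match.group(0))
            # 'type' is a hypothetical attribute, not from the original script.
            rec_type = int(rec.get("type", "0"))
        except (ET.ParseError, ValueError):
            continue  # skip malformed XML or a non-integer type
        # 'title' is a hypothetical child element; commas are replaced so
        # they do not clash with the ',' field delimiter used on stdout.
        title = (rec.findtext("title") or "").replace(",", " ")
        yield rec_type, title


def main(stdin=sys.stdin, stdout=sys.stdout):
    # One comma-separated line per record, matching PigStreaming(',').
    for rec_type, title in parse_records(stdin.read()):
        stdout.write("%d,%s\n" % (rec_type, title))


if __name__ == "__main__":
    main()
```

The point of this layout is that the XML parsing happens in an ordinary stream-oriented parser outside Pig, while XMLLoader only has to cut the input into REC-sized chunks.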

William F Dowling
Senior Technologist
Thomson Reuters
0 +1 215 823 3853
-----Original Message-----
From: Rory McCann [mailto:[EMAIL PROTECTED]]
Sent: Friday, January 13, 2012 7:12 AM
Subject: Custom Loaders that use Input Streams for reading data?

Hi all,

I'm new to Pig (and a bit rusty with Java!) and still just playing
around with it, nothing serious yet. I might be misunderstanding
something important here.

I'm trying to write a custom loader for a custom XML file format, i.e.
deserialize the XML into Pig data types. However, all the documentation
and other code is based on taking a RecordReader and spitting out things
from getNext().

Is there any way to make a custom loader that works on InputStreams or
more common java-io-y type stuff? I'd like to use more commonly
available XML parsers (which work on these). Since it's XML, line by
line parsing doesn't really work. I will just have one input file that
will be parsed. Is there some reason why there are no InputStreams?

I have also asked this question on StackOverflow: