Pig >> mail # user >> Custom Loaders that use Input Streams for reading data?


Rory McCann 2012-01-13, 12:12
RE: Custom Loaders that use Input Streams for reading data?
I'm using org.apache.pig.piggybank.storage.XMLLoader from piggybank and that's working well for me.  I do something like this:

-- The analyze_src_recs.py script reads XML from stdin and writes
-- comma-separated lines (rec_type,...) to stdout.
define analyze_src `analyze_src_recs.py`
    input  (stdin)
    output (stdout USING PigStreaming(','))
    ship   ('$scriptDir/analyze_src_recs.py');

SrcLines = load '$src_xml/*.xml*'
    using org.apache.pig.piggybank.storage.XMLLoader('REC')
    as (doc:chararray);

ParseOut = stream SrcLines through analyze_src
    as (rec_type : int,
        -- other fields my parser pulled out of the XML
       );
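
The analyze_src_recs.py script itself is not shown in this thread. As a rough idea of what such a streaming script might look like, here is a minimal Python sketch. It assumes each record is a <REC> element (matching the XMLLoader('REC') argument above); the "type" attribute and "title" child are invented names used purely for illustration. How Pig hands the doc field to the script over stdin (for example, whether one record can span several lines) is also an assumption here, so the sketch scans the whole input for <REC> elements instead of relying on line boundaries.

```python
"""Hypothetical stand-in for analyze_src_recs.py (not the original script)."""
import re
import sys
import xml.etree.ElementTree as ET

# Assumption: records are <REC ...>...</REC> elements, as in XMLLoader('REC').
REC_PATTERN = re.compile(r"<REC\b.*?</REC>", re.DOTALL)


def parse_records(text):
    """Yield (rec_type, title) for each well-formed <REC> element in text."""
    for match in REC_PATTERN.finditer(text):
        try:
            rec = ET.fromstring(match.group(0))
        except ET.ParseError:
            # Skip malformed records instead of failing the whole job.
            continue
        # 'type' and 'title' are illustrative names, not from the thread.
        rec_type = rec.get("type", "0")
        title = (rec.findtext("title") or "").strip()
        # Commas would break the comma-separated output, so replace them.
        yield rec_type, title.replace(",", " ")


def main(stdin=None, stdout=None):
    """Read XML from stdin and write one comma-separated line per record."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    # Reading everything up front keeps the sketch short; a real script
    # would probably process records incrementally.
    for rec_type, title in parse_records(stdin.read()):
        stdout.write("%s,%s\n" % (rec_type, title))
```

Ending the file with a call to main() would make it behave as the stdin-to-stdout filter that the define statement expects. Any off-the-shelf XML parser can be used inside the script, since it only ever sees ordinary streams.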

William F Dowling
Senior Technologist
Thomson Reuters
0 +1 215 823 3853
-----Original Message-----
From: Rory McCann [mailto:[EMAIL PROTECTED]]
Sent: Friday, January 13, 2012 7:12 AM
To: [EMAIL PROTECTED]
Subject: Custom Loaders that use Input Streams for reading data?

Hi all,

I'm new to Pig (and a bit rusty with Java!) and still just playing
around with it, nothing serious yet. I might be misunderstanding
something important here.

I'm trying to write a custom loader for a custom XML file format, i.e.
deserialize the XML into Pig data types. However, all the documentation
and other code is based on taking a RecordReader and spitting out things
from getNext().

Is there any way to make a custom loader that works on InputStreams or
more common java-io-y type stuff? I'd like to use more commonly
available XML parsers (which work on these). Since it's XML, line-by-line
parsing doesn't really work. I will just have one input file that
will be parsed. Is there some reason why there are no InputStreams?

I have also asked this question on StackOverflow:
http://stackoverflow.com/questions/8843790/custom-apache-pig-loadfunc-where-can-i-get-the-inputstream-on-the-file

--
Rory

Dmitriy Ryaboy 2012-01-13, 18:28