Pass one file at a time to pig STREAM command?

I need to decode data files encoded in a proprietary binary format. In
order to be properly decoded, exactly one file must be passed to the
decoder executable per execution.

I'm experimenting with two approaches:
1. starting the process and consuming stdout manually and
2. pushing the file through pig streaming
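For reference, approach 1 might look roughly like the sketch below. It is only an illustration under assumptions: `decode_file`, `decode_files`, and the decoder command are hypothetical names, and the decoder is assumed to read the encoded file on stdin and write the decoded result to stdout.

```python
import subprocess


def decode_file(decoder_cmd, path):
    """Run the decoder once for a single file and return its stdout bytes.

    decoder_cmd is a hypothetical command list, e.g. ["./decoder"].
    Assumption: the decoder reads the encoded file on stdin.
    """
    with open(path, "rb") as encoded:
        result = subprocess.run(
            decoder_cmd,
            stdin=encoded,
            stdout=subprocess.PIPE,
            check=True,  # raise CalledProcessError if the decoder exits non-zero
        )
    return result.stdout


def decode_files(decoder_cmd, paths):
    """Decode each file in its own process: exactly one file per execution."""
    return {path: decode_file(decoder_cmd, path) for path in paths}
```

Each call starts a fresh process, so the one-file-per-execution constraint holds by construction, at the cost of handling the process and its output by hand.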

[1] works, but [2] would be preferable, since Pig streaming was designed for this general use case.

The way I get [2] to sort-of work is to read each file into a tuple with
one item, and pass the tuple to the decoder binary.

The problem is that pig will concatenate the serialized tuples together,
and my decoder won't be able to decode the file properly.
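One direction that might sidestep the concatenation is a small wrapper between Pig and the decoder that launches the decoder once per incoming record. The sketch below is untested against Pig and rests on several assumptions: each streamed tuple carries a file path (not the file contents) as its first tab-separated field, one tuple per line; those paths are readable on the node running the task; and the decoder reads a single file on stdin. `decode_each` is a hypothetical name.

```python
import subprocess


def decode_each(decoder_cmd, records, out):
    """Invoke the decoder once per input record and return the record count.

    records: iterable of text lines, one tuple per line (assumed layout),
             with the file path in the first tab-separated field.
    out: binary stream that receives each decoder run's stdout in turn.
    """
    count = 0
    for line in records:
        path = line.rstrip("\n").split("\t")[0]
        if not path:
            continue  # skip blank lines
        with open(path, "rb") as encoded:
            # A fresh process per record, so the decoder sees exactly one file.
            result = subprocess.run(
                decoder_cmd, stdin=encoded, stdout=subprocess.PIPE, check=True
            )
        out.write(result.stdout)
        count += 1
    return count
```

As a streaming command, the wrapper would call something like `decode_each(["./decoder"], sys.stdin, sys.stdout.buffer)`. Whether the decoded output can then be parsed back into tuples depends on the decoder's output format, which is not addressed here.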

Providing a PigSerializer alternative doesn't look like it will work since
it doesn't support limiting the number of tuples per file (I pass the input
using "input ('file')").

As far as I can tell this is a dead end.
Can anyone offer any suggestions or show otherwise?