Managing stdout in streaming
Keith Wiley 2011-02-01, 20:55
So Hadoop streaming uses stdout to carry the mapper/reducer output: one record per line, with each key/value pair split at the first TAB.
(Presumably additional TABs are permitted and simply become part of the value string; I haven't experimented with this yet.)
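
To make the record format concrete, here is a minimal Python sketch of the split described above. It assumes the default streaming settings (TAB as the separator, split at the first TAB only). The function name split_record is hypothetical, and Python is used purely for illustration.

```python
def split_record(line):
    """Split one streaming record into (key, value) at the first TAB.

    Assumes the default separator. Any later TABs stay inside the value,
    and a line with no TAB at all is treated as a key with an empty value.
    """
    line = line.rstrip("\n")
    key, _sep, value = line.partition("\t")
    return key, value


# Later TABs remain embedded in the value string.
print(split_record("word\t1\textra\n"))   # ('word', '1\textra')
```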
Obviously, one must be very careful not to write any debugging or logging output to stdout. It seems fairly straightforward to simply use stderr instead, so that all such output appears in the job tracker logs.
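
As a rough sketch of that separation (assuming a Python mapper; the helper names format_record, emit, and log are hypothetical), records go to stdout and everything else goes to stderr:

```python
import sys


def format_record(key, value):
    """Build one TAB-delimited, newline-terminated streaming record."""
    return f"{key}\t{value}\n"


def emit(key, value):
    """Write a record to stdout, which is reserved for job output."""
    sys.stdout.write(format_record(key, value))


def log(message):
    """Write diagnostics to stderr so they never mix with the records."""
    print(message, file=sys.stderr, flush=True)
```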
Buuuuut, what if I'm using a third-party library and I can't tell it to send output elsewhere? I know that it is possible to redirect stdout using tricks like freopen(), but I believe it can be quite tricky to redirect stdout back to its original stream. So if I directed stdout away from the original stream for processing, I'm not sure how I would latch it back onto the stream for the purpose of generating my mapper/reducer output data (in the Hadoop streaming TAB-delimited line-per-record format).
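
One approach that may work on Linux is to operate on the file descriptor rather than the stdio stream: save a duplicate of descriptor 1, point descriptor 1 at stderr while the noisy code runs, then copy the saved duplicate back. The sketch below uses Python's os.dup and os.dup2, which wrap the POSIX dup() and dup2() calls. The name divert_fd is hypothetical. One caveat: output that a C library has buffered in its own stdio buffer is not flushed by this code, so such a library may still need an fflush(stdout) before the descriptor is restored.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def divert_fd(fd, to_fd):
    """Temporarily make `fd` refer to whatever `to_fd` refers to."""
    sys.stdout.flush()            # push out anything Python buffered so far
    saved = os.dup(fd)            # keep a handle on the original target
    try:
        os.dup2(to_fd, fd)        # fd now points at to_fd's target
        yield
    finally:
        sys.stdout.flush()        # flush while fd is still diverted
        os.dup2(saved, fd)        # reattach fd to its original target
        os.close(saved)


if __name__ == "__main__":
    with divert_fd(1, 2):
        print("chatter from a noisy library")   # ends up on stderr
    print("key\tvalue")                          # ends up on the real stdout
```

Because the redirection happens at the descriptor level, it also catches writes made by native code that the calling program cannot reconfigure.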
Any thoughts on this? Incidentally, the cluster is running Linux. I realize details like that become important when one starts fiddling with redirecting streams and such.
Keith Wiley [EMAIL PROTECTED] www.keithwiley.com
"What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered."
-- Keith Wiley