load union store
Hi all,

I'm writing a bit of code to grab some logfiles, parse them, and run some
sanity checks on them (before subjecting them to further analysis).
Naturally, logfiles being logfiles, they accumulate, and I was wondering
how efficiently Pig would handle a request to append recently accumulated
log data to a cleaned log that already exists.

In particular, the two approaches I'm contemplating are:

raw = LOAD 'logfile' ...
-- snipped parsing/cleaning steps producing a relation with alias "cleanfile"
oldclean = LOAD 'existing_log';
newclean = UNION oldclean, cleanfile;
STORE newclean INTO 'tmp_log';
rm existing_log;
mv tmp_log existing_log;

...ALTERNATELY...

raw = LOAD 'logfile' ...
-- snipped parsing/cleaning steps producing a relation with alias "cleanfile"
STORE cleanfile INTO 'tmp_log';

followed by renumbering all the part files in tmp_log and copying them
to existing_log.
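
To make the renumbering step concrete, here is a minimal Python sketch of
how the new names could be worked out. It is not part of Pig; the function
name plan_renames and the assumption that part files are named like
part-00000, part-m-00000 or part-r-00000 are illustrative only. It just
plans the renames and does not touch HDFS.

```python
import re

# Assumed naming scheme: part-00000, part-m-00000, part-r-00000.
PART_RE = re.compile(r"^(part-(?:[mr]-)?)(\d+)$")


def plan_renames(existing_names, new_names):
    """Return (old_name, new_name) pairs that renumber the new part files
    so they continue after the highest index already used in the target."""
    used = [int(m.group(2)) for m in map(PART_RE.match, existing_names) if m]
    next_index = max(used, default=-1) + 1
    plan = []
    for name in sorted(new_names):
        m = PART_RE.match(name)
        if m is None:
            continue  # skip anything that is not a part file (e.g. _logs)
        prefix, digits = m.group(1), m.group(2)
        # Keep the original prefix and zero-padding width.
        plan.append((name, f"{prefix}{next_index:0{len(digits)}d}"))
        next_index += 1
    return plan


if __name__ == "__main__":
    existing = ["part-m-00000", "part-m-00001", "_logs"]
    new = ["part-m-00000", "part-m-00001"]
    for old, renamed in plan_renames(existing, new):
        print(f"tmp_log/{old} -> existing_log/{renamed}")
```

Each planned pair would then be moved or copied with something like
hadoop fs -mv tmp_log/<old> existing_log/<new>.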

Is Pig clever enough to handle the first set of instructions reasonably
efficiently? If not, are there any gotchas I'd have to watch out for
with the second approach, e.g. a catalogue file that'd have to be edited
when the new parts are added?

Thanks,
Kris

--
Kris Coward http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3