Pig >> mail # user >> load union store


load union store
Hi all,

I'm writing a bit of code to grab some logfiles, parse them, and run some
sanity checks on them (before subjecting them to further analysis).
Naturally, logfiles being logfiles, they accumulate, and I was wondering
how efficiently pig would handle a request to add recently accumulated
log data to a bit of logfile that's already been started.

In particular, two approaches that I'm contemplating are

raw = LOAD 'logfile' ...
-- snipped parsing/cleaning steps producing a relation with alias "cleanfile"
oldclean = LOAD 'existing_log';
newclean = UNION oldclean, cleanfile;
STORE newclean INTO 'tmp_log';
-- then swap directories with Grunt shell commands:
rm existing_log;
mv tmp_log existing_log;

...ALTERNATELY...

raw = LOAD 'logfile' ...
-- snipped parsing/cleaning steps producing a relation with alias "cleanfile"
STORE cleanfile INTO 'tmp_log';

followed by renumbering all the part files in tmp_log and copying them
to existing_log.
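For the record, the renumbering step I have in mind would look roughly like the sketch below. It works on local directories to show the logic; the directory names, the part-00000 naming convention, and the assumption that there is no metadata file to update are all mine, and on an actual cluster each mkdir/ls/mv would of course be the corresponding hadoop fs command instead.

```shell
# Sketch: shift the part numbers in tmp_log past the highest part number
# already in existing_log, then move the renamed files over.
# Run in a scratch directory so nothing real gets clobbered.
cd "$(mktemp -d)"

# Simulate prior output (two parts) and a fresh STORE result (two parts).
mkdir existing_log tmp_log
touch existing_log/part-00000 existing_log/part-00001
touch tmp_log/part-00000 tmp_log/part-00001

# Highest existing part number; strip leading zeros so the shell
# doesn't parse e.g. 00008 as octal. Empty means part-00000.
last=$(ls existing_log | sed 's/^part-0*//' | sort -n | tail -1)
offset=$(( ${last:-0} + 1 ))

# Renumber each new part by the offset and move it into existing_log.
for f in tmp_log/part-*; do
  num=$(echo "${f##*/part-}" | sed 's/^0*//')
  new=$(printf 'part-%05d' $(( ${num:-0} + offset )))
  mv "$f" "existing_log/$new"
done
rmdir tmp_log

ls existing_log   # part-00000 .. part-00003, contiguous
```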

Is pig clever enough to handle the first set of instructions reasonably
efficiently? And if not, are there any gotchas I'd have to watch out for
with the second approach, e.g. a catalogue file that'd have to be edited
when the new parts are added?

Thanks,
Kris

--
Kris Coward http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3