Pig, mail # user - load union store


Re: load union store
Kris Coward 2011-01-28, 17:11

I want to flatten things at least a little, since I'm looking for
year-long trends in logfiles that are rotated hourly (and loading the
data back out of 8760 distinct directories isn't my idea of a good
time).

Any reason that moving/renaming the part-nnnn files wouldn't work?

Thanks,
Kris
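
A minimal sketch of the move/rename idea raised above, written against the local filesystem as a stand-in for HDFS (on a real cluster the moves would go through the Hadoop filesystem tools instead). The function name `append_parts`, the five-digit `part-NNNNN` naming, and the choice to skip anything that does not look like a part file (for example a `_logs` directory) are all assumptions made for illustration, not something confirmed in this thread.

```python
import re
import shutil
from pathlib import Path

# Matches names such as part-00000, part-m-00000 or part-r-00000 (assumed naming).
PART_RE = re.compile(r"^part-(?:[mr]-)?(\d+)$")


def part_index(name):
    """Return the numeric suffix of a part file name, or None if it is not one."""
    match = PART_RE.match(name)
    return int(match.group(1)) if match else None


def append_parts(src_dir, dst_dir):
    """Move part files from src_dir into dst_dir, renumbering them so they
    continue after the highest part number already present in dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)

    existing = [part_index(p.name) for p in dst.iterdir()]
    next_index = max((i for i in existing if i is not None), default=-1) + 1

    moved = []
    # Sort so the new parts keep their relative order after renumbering.
    for part in sorted(p for p in src.iterdir() if part_index(p.name) is not None):
        target = dst / f"part-{next_index:05d}"
        shutil.move(str(part), str(target))
        moved.append(target.name)
        next_index += 1
    return moved
```

With `existing_log` already holding `part-00000` and `part-00001`, calling `append_parts("tmp_log", "existing_log")` on a `tmp_log` with two part files moves them in as `part-00002` and `part-00003`. The renumbering only avoids name collisions; it does nothing to merge small files, so the file count still grows with every run.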

On Thu, Jan 27, 2011 at 05:57:32PM -0800, Dmitriy Ryaboy wrote:
> Kris,
> As logs accumulate over time the union will get slow since you have to read
> all the data off disk and write it back to disk.
>
> Why not just have a hierarchy in your cleaned log directory? You can do
> something like
> %declare newdir `date +%s`;
>
> store newclean into 'cleaned_files/$newdir/';
>
>
> then to load all logs you can just load 'cleaned_files'
>
> you can also format the date output differently and wind up with your
> cleaned files nicely organized by year/month/day/hour/ ...
>
> D
>
> On Thu, Jan 27, 2011 at 4:40 PM, Kris Coward <[EMAIL PROTECTED]> wrote:
>
> > Hi all,
> >
> > I'm writing a bit of code to grab some logfiles, parse them, and run some
> > sanity checks on them (before subjecting them to further analysis).
> > Naturally, logfiles being logfiles, they accumulate, and I was wondering
> > how efficiently pig would handle a request to add recently accumulated
> > log data to a bit of logfile that's already been started.
> >
> > In particular, two approaches that I'm contemplating are
> >
> > raw = LOAD 'logfile' ...
> > -- snipped parsing/cleaning steps producing a relation with alias
> > "cleanfile"
> > oldclean = LOAD 'existing_log';
> > newclean = UNION oldclean, cleanfile;
> > STORE newclean INTO 'tmp_log';
> > rm existing_log;
> > mv tmp_log existing_log;
> >
> > ...ALTERNATELY...
> >
> > raw = LOAD 'logfile' ...
> > -- snipped parsing/cleaning steps producing a relation with alias
> > "cleanfile"
> > STORE cleanfile INTO 'tmp_log';
> >
> > followed by renumbering all the part files in tmp_log and copying them
> > to existing_log.
> >
> > Is Pig clever enough to handle the first set of instructions reasonably
> > efficiently? If not, are there any gotchas I'd have to watch out for
> > with the second approach, e.g. a catalogue file that'd have to be edited
> > when the new parts are added?
> >
> > Thanks,
> > Kris
> >
> > --
> > Kris Coward                                     http://unripe.melon.org/
> > GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3
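
The year/month/day/hour layout suggested in the thread can be sketched as follows. This only builds directory names; the helper names `partition_dir` and `hourly_dirs` are hypothetical, and the zero-padded `%Y/%m/%d/%H` format is one assumed choice among many. It also shows where the figure of 8760 directories for a year of hourly logs comes from.

```python
from datetime import datetime, timedelta


def partition_dir(base, when):
    """Build a year/month/day/hour directory name under base for a timestamp."""
    return f"{base}/{when.strftime('%Y/%m/%d/%H')}"


def hourly_dirs(base, year):
    """Yield one partition directory per hour of the given year, in order."""
    current = datetime(year, 1, 1)
    while current.year == year:
        yield partition_dir(base, current)
        current += timedelta(hours=1)


if __name__ == "__main__":
    # One directory per hourly log rotation.
    print(partition_dir("cleaned_files", datetime(2011, 1, 28, 17, 11)))
    # A non-leap year has 365 * 24 hourly directories.
    print(len(list(hourly_dirs("cleaned_files", 2011))))
```

For the timestamp of the reply above, `partition_dir` gives `cleaned_files/2011/01/28/17`, and a full non-leap year yields 8760 directories. Grouping by date this way keeps each store append-only, at the cost of the many-directories layout that the reply wants to flatten.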