Re: load union store

I missed the globbing on my previous passes over the documentation for
LOAD. Having missed that, my objection was that with all the files in a
single directory, I can get them with a single LOAD statement; a wildcard
solves that just as well. Thanks for pushing back hard enough to make me
re-read that.
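
For the archives, the load side ends up looking something like this (just a
sketch: the cleaned_files/year/month/day/hour layout is the one suggested
below, and the tab-delimited schema is my own invention):

-- one glob in a single LOAD picks up all 8760 hourly directories for a year
trends = LOAD 'cleaned_files/2011/*/*/*' USING PigStorage('\t');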

Cheers,
Kris

On Fri, Jan 28, 2011 at 01:27:54PM -0800, Dmitriy Ryaboy wrote:
> It's a pain to rename everything, especially since the number of renames
> grows every day. You'll stress out the namenode at some point.
>
> I am not sure why loading data back out of 8760 distinct directories is
> worse than 8760 distinct files. There is no real difference.
>
> That's what we do at Twitter, fwiw, and that's also what the standard setup
> for Hive logs is. Can you explain in greater detail what your objection is,
> if this doesn't work for you?
>
> D
>
>
> On Fri, Jan 28, 2011 at 9:11 AM, Kris Coward <[EMAIL PROTECTED]> wrote:
>
> >
> > I want to flatten things at least a little, since I'm looking for
> > year-long trends in logfiles that are rotated hourly (and loading the
> > data back out of 8760 distinct directories isn't my idea of a good
> > time).
> >
> > Any reason that moving/renaming the part-nnnn files wouldn't work?
> >
> > Thanks,
> > Kris
> >
> > On Thu, Jan 27, 2011 at 05:57:32PM -0800, Dmitriy Ryaboy wrote:
> > > Kris,
> > > As logs accumulate over time the union will get slow since you have to
> > > read all the data off disk and write it back to disk.
> > >
> > > Why not just have a hierarchy in your cleaned log directory? You can do
> > > something like
> > > %declare newdir `date +%s`
> > >
> > > store newclean into 'cleaned_files/$newdir/';
> > >
> > >
> > > then to load all logs you can just load 'cleaned_files'
> > >
> > > you can also format the date output differently and wind up with your
> > > cleaned files nicely organized by year/month/day/hour/ ...
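> > > (e.g. %declare newdir `date +%Y/%m/%d/%H` instead of `date +%s`)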
> > >
> > > D
> > >
> > > On Thu, Jan 27, 2011 at 4:40 PM, Kris Coward <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I'm writing a bit of code to grab some logfiles, parse them, and run
> > > > some sanity checks on them (before subjecting them to further analysis).
> > > > Naturally, logfiles being logfiles, they accumulate, and I was wondering
> > > > how efficiently Pig would handle a request to add recently accumulated
> > > > log data to a logfile that's already been started.
> > > >
> > > > In particular, two approaches that I'm contemplating are
> > > >
> > > > raw = LOAD 'logfile' ...
> > > > -- snipped parsing/cleaning steps producing a relation with alias "cleanfile"
> > > > oldclean = LOAD 'existing_log';
> > > > newclean = UNION oldclean, cleanfile;
> > > > STORE newclean INTO 'tmp_log';
> > > > rm existing_log;
> > > > mv tmp_log existing_log;
> > > >
> > > > ...ALTERNATELY...
> > > >
> > > > raw = LOAD 'logfile' ...
> > > > -- snipped parsing/cleaning steps producing a relation with alias "cleanfile"
> > > > STORE cleanfile INTO 'tmp_log';
> > > >
> > > > followed by renumbering all the part files in tmp_log and copying them
> > > > to existing_log.
> > > >
> > > > Is Pig clever enough to handle the first set of instructions reasonably
> > > > efficiently (and if not, are there any gotchas I'd have to watch out for
> > > > with the second approach, e.g. a catalogue file that'd have to be edited
> > > > when the new parts are added)?
> > > >
> > > > Thanks,
> > > > Kris
> > > >
> > > > --
> > > > Kris Coward                                  http://unripe.melon.org/
> > > > GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3