Does it really though? When dealing with compressed files, is all the
decompression done in memory? Is that realistic for all file sizes? If not,
doesn't the file have to be decompressed to disk first?
On Fri, Aug 21, 2009 at 4:05 PM, Ashish Thusoo <[EMAIL PROTECTED]> wrote:
> In our experience, LZO has better space/compute tradeoffs than gzip.
> Compression usually helps because it also reduces the amount of data that
> has to be read from disk, which removes a major bottleneck.
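> For illustration, the knobs involved look roughly like this (a sketch only;
> gzip ships with Hadoop, while the LZO codec comes from the separately
> installed hadoop-lzo / hadoop-gpl-compression package, and its exact class
> name depends on the version you have):
>
>   -- compress query/job output with gzip
>   SET hive.exec.compress.output=true;
>   SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>
>   -- or, assuming the LZO codec jar and native libs are on the cluster:
>   SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;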
> From: Vijay [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, August 20, 2009 4:45 PM
> To: [EMAIL PROTECTED]
> Subject: Using hive for (advanced) access log analysis
> Hi, I'm quite new to Hive, and so far everything has been working very well.
> I've been able to set up a small VM-based cluster, ingest a lot of our access
> logs, and generate some pretty neat reports, mostly about URL patterns and
> the like. I'm looking for advice on some more advanced forms of analysis from
> people who may have already done similar work.
> 1. First off, many of our daily logs are about 1GB raw, around 120MB
> compressed (gzip). I'm keeping the compressed files in Hive. For numbers like
> these, is that a good idea or a bad one? Obviously Hive has to decompress
> every file for every query, so maybe it isn't a good idea? Of course, there
> is a space/speed trade-off as well.
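> To make the setup concrete, the table looks roughly like this (placeholder
> column names and location, not our real schema); as I understand it, Hadoop
> decompresses .gz text files transparently on read, but a gzip file is not
> splittable, so each file goes to a single mapper:
>
>   CREATE EXTERNAL TABLE access_log (
>     ip STRING, ts STRING, request STRING, status INT, bytes INT
>   )
>   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>   LOCATION '/logs/access/';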
> 2. What are some ideas for doing session-based analysis? For example,
> most-visited URLs, average visit length, and other kinds of "analytics"
> stuff. Are there any useful recipes that people can share here?
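> For what it's worth, the crude version I'd start from looks like this
> (made-up table/column names, assuming ts is 'yyyy-MM-dd HH:mm:ss', and
> treating all hits from one IP in one day as a single "visit", which is
> obviously rough):
>
>   -- most-visited URLs
>   SELECT url, COUNT(1) AS hits
>   FROM access_log
>   GROUP BY url
>   ORDER BY hits DESC
>   LIMIT 20;
>
>   -- average "visit" length in seconds, one ip per day counted as a visit
>   SELECT AVG(span)
>   FROM (
>     SELECT ip, MAX(unix_timestamp(ts)) - MIN(unix_timestamp(ts)) AS span
>     FROM access_log
>     GROUP BY ip, to_date(ts)
>   ) t;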
> Thanks in advance,