-
Using hive for (advanced) access log analysis
Vijay 2009-08-20, 23:44
Hi, I'm quite new to hive and so far everything has been working very well. I've been able to set up a small vm-based cluster, ingest a lot of our access logs and generate some pretty neat reports, mostly to do with patterns of urls, etc. I'm looking for some advice on more advanced forms of analysis from people who might have already done similar work.
1. First off, many of our daily logs are about 1GB raw in size, around 120MB compressed (gzip). I'm keeping the compressed files in hive. For these kinds of numbers, is that good or bad? Obviously for every query hive has to decompress every file, so maybe it's not a good idea? Of course, there is a space/speed trade-off as well.
2. What are some ideas for doing session-based analysis? For example, most visited urls, average visit length, and other kinds of "analytics" stuff. Are there any useful recipes that people can share here?
Thanks in advance, Vijay
-
Re: Using hive for (advanced) access log analysis
Vijay 2009-08-23, 01:25
Does it really though? When dealing with compressed files, is all the decompression done in-memory? Is that realistic for all sizes of files? If not, then doesn't the file have to be decompressed first to disk?
On Fri, Aug 21, 2009 at 4:05 PM, Ashish Thusoo <[EMAIL PROTECTED]> wrote:
> From our experience it seems that lzo has better space/compute tradeoffs
> than gzip. Compression does usually help as it also reduces the amount of
> data to be read from the disk and thus gets rid of a major bottleneck.
>
> Ashish
-
Re: Using hive for (advanced) access log analysis
Edward Capriolo 2009-08-23, 02:50
On Sat, Aug 22, 2009 at 9:25 PM, Vijay <[EMAIL PROTECTED]> wrote:
> Does it really though? When dealing with compressed files, is all the
> decompression done in-memory? Is that realistic for all sizes of files? If
> not, then doesn't the file have to be decompressed first to disk?
The answer is that some compression formats compress the data in independent blocks, so a file can be split at block boundaries. LZO is one of those formats. As a result, the decompression is done on the fly, in memory, as the data is read. You do NOT need to decompress the entire file to disk before processing it.
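To make "on the fly" concrete, here is a minimal sketch in Python, not Hive or Hadoop code. It uses gzip only because it ships with the Python standard library (LZO does not). The helper names make_compressed_log and count_urls_streaming and the fake log format are made up for illustration. The point is just that the reader pulls decompressed lines out of a stream one at a time, so the full uncompressed file never has to exist on disk or in memory.

```python
import gzip
import io


def make_compressed_log(num_lines: int) -> bytes:
    """Build a small gzip-compressed fake access log entirely in memory."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for i in range(num_lines):
            # Hypothetical log format: "<ip> <method> <url>"
            gz.write(f"10.0.0.{i % 256} GET /page/{i % 10}\n".encode("ascii"))
    return buf.getvalue()


def count_urls_streaming(compressed: bytes) -> dict:
    """Count hits per URL while decompressing line by line."""
    counts = {}
    with gzip.GzipFile(fileobj=io.BytesIO(compressed), mode="rb") as gz:
        # Iterating the GzipFile yields one decompressed line at a time;
        # only a small internal buffer is held in memory.
        for raw_line in gz:
            url = raw_line.decode("ascii").split()[2]
            counts[url] = counts.get(url, 0) + 1
    return counts


compressed = make_compressed_log(1000)
counts = count_urls_streaming(compressed)
print(len(compressed), "compressed bytes,", sum(counts.values()), "lines read")
```

Streaming like this works for gzip too. What a block-based format adds is that separate readers can each start at a different block boundary, instead of one reader having to start from the beginning of the file.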