Hive >> mail # user >> Using hive for (advanced) access log analysis


Re: Using hive for (advanced) access log analysis
Does it really though? When dealing with compressed files, is all the
decompression done in-memory? Is that realistic for all sizes of files? If
not, then doesn't the file have to be decompressed first to disk?
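
For what it's worth, gzip is a stream format, so a reader can decompress it in fixed-size chunks without holding the whole file in memory or writing a decompressed copy to disk. Hadoop's compression codecs wrap the input in a similar streaming fashion, as far as I know. Below is a minimal sketch of the idea in Python, not Hive's actual code path; the function name, chunk size, and sample log lines are made up for illustration.

```python
import gzip
import io


def count_lines_streaming(fileobj, chunk_size=64 * 1024):
    """Count newlines in a gzip stream, decompressing one chunk at a time."""
    lines = 0
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz:
        while True:
            # Only about chunk_size bytes of decompressed data are held at once.
            chunk = gz.read(chunk_size)
            if not chunk:
                break
            lines += chunk.count(b"\n")
    return lines


# Hypothetical stand-in for a compressed access log, built in memory.
raw = b"".join(b"10.0.0.1 GET /page/%d\n" % i for i in range(1000))
compressed = gzip.compress(raw)

line_count = count_lines_streaming(io.BytesIO(compressed))
```

Memory use here depends on the chunk size rather than the file size. One caveat that is separate from memory: a plain gzip file cannot be split, so a single map task has to read each file from start to finish.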

On Fri, Aug 21, 2009 at 4:05 PM, Ashish Thusoo <[EMAIL PROTECTED]> wrote:

> From our experience, LZO seems to have a better space/compute tradeoff
> than gzip. Compression usually helps because it also reduces the amount of
> data to be read from disk, which removes a major bottleneck.
>
> Ashish
>
>  ------------------------------
> *From:* Vijay [mailto:[EMAIL PROTECTED]]
> *Sent:* Thursday, August 20, 2009 4:45 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* Using hive for (advanced) access log analysis
>
> Hi, I'm quite new to Hive and so far everything has been working very well.
> I've been able to set up a small VM-based cluster, ingest a lot of our access logs
> and generate some pretty neat reports mostly to do with patterns of urls,
> etc. I'm looking for some advice on some more advanced forms of analysis
> from people who might have already done similar analysis.
>
> 1. First off, many of our daily logs are about 1GB raw in size, around
> 120MB compressed (gzip). I'm keeping the compressed files in Hive. For numbers
> like these, is that good or bad? Obviously Hive has to decompress every file
> for every query, so maybe it's not a good idea? Of course, there is a
> space/speed tradeoff as well.
> 2. What are some ideas for doing session-based analysis? For example, most
> visited urls, average visit length, and other kinds of "analytics" stuff.
> Are there any useful recipes that people can share here?
>
> Thanks in advance,
> Vijay
>
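
On the session-based analysis question (point 2 above), one common approach is to group hits by visitor, sort them by time, and start a new session whenever the gap between consecutive hits exceeds an inactivity timeout. Here is a rough sketch in Python, assuming visitors are identified by IP address and a 30-minute timeout; the record layout, names, and sample data are hypothetical, and the same logic would have to be ported to a Hive query or a custom script to run on the cluster.

```python
from collections import Counter, defaultdict

SESSION_GAP = 30 * 60  # assumed inactivity timeout, in seconds


def sessionize(hits, gap=SESSION_GAP):
    """Split (visitor, epoch_seconds, url) hits into per-visitor sessions."""
    by_visitor = defaultdict(list)
    for visitor, ts, url in hits:
        by_visitor[visitor].append((ts, url))

    sessions = []
    for visitor, events in by_visitor.items():
        events.sort()  # order each visitor's hits by timestamp
        current = [events[0]]
        for prev, cur in zip(events, events[1:]):
            if cur[0] - prev[0] > gap:
                # Gap too long: close the current session and start a new one.
                sessions.append((visitor, current))
                current = []
            current.append(cur)
        sessions.append((visitor, current))
    return sessions


# Hypothetical sample data: (visitor IP, seconds since epoch, URL)
hits = [
    ("10.0.0.1", 0, "/home"),
    ("10.0.0.1", 120, "/products"),
    ("10.0.0.1", 4000, "/home"),  # more than 30 minutes later: new session
    ("10.0.0.2", 60, "/home"),
    ("10.0.0.2", 660, "/cart"),
]

sessions = sessionize(hits)
top_urls = Counter(url for _, _, url in hits).most_common(2)
durations = [events[-1][0] - events[0][0] for _, events in sessions]
avg_visit_seconds = sum(durations) / len(durations)
```

Visit length is measured here as the time between the first and last hit in a session, so single-hit sessions count as zero seconds. Whether that is the right definition depends on the report.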