Hi, I'm quite new to Hive and so far everything has been working very well.
I was able to set up a small VM-based cluster, ingest a lot of our access logs,
and generate some pretty neat reports, mostly to do with patterns of URLs.
I'm looking for advice on some more advanced forms of analysis
from people who might have already done something similar.
1. First off, many of our daily logs are about 1GB raw in size, around 120MB
compressed (gzip). I'm keeping the compressed files in Hive. For these kinds
of numbers, is that good or bad? Obviously, for every query Hive has to
decompress every file, so maybe it's not a good idea? Of course, there is a
space/speed trade-off as well.
2. What are some ideas for doing session-based analysis? For example, most
visited URLs, average visit length, and other kinds of "analytics" stuff.
Are there any useful recipes that people can share here?
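
To make question 2 concrete, here is roughly what I mean by a session,
sketched in plain Python rather than Hive. The 30-minute inactivity timeout,
identifying a visitor by IP address, the sample data, and the function names
are all my own assumptions for illustration, not anything Hive provides.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: a visit ends after 30 minutes without a hit from that visitor.
SESSION_TIMEOUT = timedelta(minutes=30)


def sessionize(hits):
    """Group (visitor, timestamp, url) hits into per-visitor sessions."""
    by_visitor = defaultdict(list)
    for visitor, ts, url in hits:
        by_visitor[visitor].append((ts, url))

    sessions = []
    for visitor, visitor_hits in by_visitor.items():
        visitor_hits.sort()  # order each visitor's hits by timestamp
        current = [visitor_hits[0]]
        for hit in visitor_hits[1:]:
            if hit[0] - current[-1][0] > SESSION_TIMEOUT:
                # Gap too long: close the current session, start a new one.
                sessions.append((visitor, current))
                current = [hit]
            else:
                current.append(hit)
        sessions.append((visitor, current))
    return sessions


def average_visit_seconds(sessions):
    """Mean time between the first and last hit of each session."""
    lengths = [(s[-1][0] - s[0][0]).total_seconds() for _, s in sessions]
    return sum(lengths) / len(lengths) if lengths else 0.0


# Made-up sample hits, only to show the shape of the input.
t = datetime(2010, 1, 1, 12, 0, 0)
sample_hits = [
    ("10.0.0.1", t, "/home"),
    ("10.0.0.1", t + timedelta(minutes=5), "/products"),
    ("10.0.0.1", t + timedelta(hours=2), "/home"),
    ("10.0.0.2", t + timedelta(minutes=1), "/home"),
]
sessions = sessionize(sample_hits)
print(len(sessions), average_visit_seconds(sessions))
```

What I'd like to know is how people express this kind of grouping in Hive
itself, since the per-visitor ordering is the part I'm unsure about.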
Thanks in advance,