I've setup a virtual machine hosting Hive.
My use case is a Web traffic analytics, hence most of requests are :
- how many requests today ?
- how many request today, grouped by country ?
- most requested urls ?
- average http server response time (5 minutes slots) ?
In other words, lets consider :
CREATE TABLE logs ( url STRING, orig_country STRING, http_rt INT )
SELECT COUNT(*) FROM logs;
SELECT COUNT(*),orig_country FROM logs GROUP BY orig_country;
SELECT COUNT(*),url FROM logs BROUP BY url;
SELECT AVG(http_rt) FROM logs ...
2 questions here :
- How to generate 5 minutes slots to make my averages (in Postgresql, I
used to generate_series() and JOIN) ? I wish I could avoid doing multiple
requests each with a 'WHERE date>... AND date <...'. Maybe a mapper,
mapping the date string to a aslot number ?
- What is the best storage method pour this table ? Since it's purpose is
analytical, I thought columnar format was the way to go. So I tried RCFILE
buy the results are as follow for around 1 million rows (quite small, I
know) and are quite the opposite I was expecting :
Storage / query duration / disk table size
TEXTFILE / 22 seconds / 250MB
RCFILE / 31 seconds / 320 MB
I thought getting values in columns would speed up the aggregate process.
Maybe the dataset is too small to tell, or I missed something ? Will adding
Snappy compression help (not sure whether RCFiles are compressed or not) ?
Thank you !