Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Best table storage for analytical use case


Copy link to this message
-
Best table storage for analytical use case
Hi there,

I've setup a virtual machine hosting Hive.
My use case is a Web traffic analytics, hence most of requests are :

- how many requests today ?
- how many request today, grouped by country ?
- most requested urls ?
- average http server response time (5 minutes slots) ?

In other words, lets consider :
CREATE TABLE logs ( url STRING, orig_country STRING, http_rt INT )
and

SELECT COUNT(*) FROM logs;
SELECT COUNT(*),orig_country FROM logs GROUP BY orig_country;
SELECT COUNT(*),url FROM logs BROUP BY url;
SELECT AVG(http_rt) FROM logs ...

2 questions here :
- How to generate 5 minutes slots to make my averages (in Postgresql, I
used to generate_series() and JOIN) ? I wish I could avoid doing multiple
requests each with a 'WHERE date>... AND date <...'. Maybe a mapper,
mapping the date string to a aslot number ?

- What is the best storage method pour this table ? Since it's purpose is
analytical, I thought columnar format was the way to go. So I tried RCFILE
buy the results are as follow for around 1 million rows (quite small, I
know) and are quite the opposite I was expecting :

Storage / query duration / disk table size
TEXTFILE / 22 seconds / 250MB
RCFILE / 31 seconds / 320 MB

 I thought getting values in columns would speed up the aggregate process.
Maybe the dataset is too small to tell, or I missed something ? Will adding
Snappy compression help (not sure whether RCFiles are compressed or not) ?

Thank you !
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB