RCFile won't help much (and apparently not all in this case ;) unless you
have a lot of columns and you always query just a few of them. However, you
should get better results with Sequence Files (binary format) and usually
with a compression scheme like BZip that supports block-level (as opposed
to file-level) compression. Why? compressed files and also using sequence
files reduces the amount of disk IO and hence improves IO performance (a
Do you almost always query with a WHERE clause with a time range? If so,
consider partitioning your data by time ranges, e.g., year/month/day. Your
actual timestamp granularity would be chosen so that each folder (and yes,
they'll be individual folders) has data files at least 64MB or whatever
multiple of 64MB your using in your cluster. It could be that per-day is
the finest granularity or even per hour or minute, if you really have a lot
of data. Briefly, you want to minimize the number of mapper processes used
to process the data, and this is the granularity per mapper. Why partition,
because when you do SELECT * FROM mytable WHERE year = 2012 AND month = 3
AND day = 4, Hive knows it only has to read the contents of that single
directory, not all the directories...
You might also consider clustering by URL. This feature (and the others) is
described on the Hive wiki. It can also speed up sampling of large data
sets and joins.
I assume you're just using the virtual machine for experimenting. Lots of
overhead there, too!
Hope this helps.
On Mon, Mar 4, 2013 at 4:33 PM, Sékine Coulibaly <[EMAIL PROTECTED]>wrote:
> Hi there,
> I've setup a virtual machine hosting Hive.
> My use case is a Web traffic analytics, hence most of requests are :
> - how many requests today ?
> - how many request today, grouped by country ?
> - most requested urls ?
> - average http server response time (5 minutes slots) ?
> In other words, lets consider :
> CREATE TABLE logs ( url STRING, orig_country STRING, http_rt INT )
> SELECT COUNT(*) FROM logs;
> SELECT COUNT(*),orig_country FROM logs GROUP BY orig_country;
> SELECT COUNT(*),url FROM logs BROUP BY url;
> SELECT AVG(http_rt) FROM logs ...
> 2 questions here :
> - How to generate 5 minutes slots to make my averages (in Postgresql, I
> used to generate_series() and JOIN) ? I wish I could avoid doing multiple
> requests each with a 'WHERE date>... AND date <...'. Maybe a mapper,
> mapping the date string to a aslot number ?
> - What is the best storage method pour this table ? Since it's purpose is
> analytical, I thought columnar format was the way to go. So I tried RCFILE
> buy the results are as follow for around 1 million rows (quite small, I
> know) and are quite the opposite I was expecting :
> Storage / query duration / disk table size
> TEXTFILE / 22 seconds / 250MB
> RCFILE / 31 seconds / 320 MB
> I thought getting values in columns would speed up the aggregate process.
> Maybe the dataset is too small to tell, or I missed something ? Will adding
> Snappy compression help (not sure whether RCFiles are compressed or not) ?
> Thank you !
*Dean Wampler, Ph.D.*