Hive, mail # user - external table on flume log files in S3


external table on flume log files in S3
Søren 2012-04-24, 14:20
Hi Hive community

We are collecting huge amounts of data into Amazon S3 using Flume.

In Elastic MapReduce, we have so far managed to create an external Hive
table on JSON-formatted, gzipped log files in S3 using a custom
SerDe. The log files are collected and stored in one single folder, with
file names following this pattern:
usr-20120423-012725137+0000.2392780833002846.00000029.gz
usr-20120423-012928765+0000.2392904461259123.00000029.gz
usr-20120423-013032368+0000.2392968063991639.00000029.gz

There are thousands to millions of these files. Is there a way to make
Hive benefit from the datetime stamp in the file names? For example, to
run queries on smaller subsets, or to filter the files when creating the
external table.

If I filter on INPUT__FILE__NAME, the job gets done but there is no
significant performance gain, I guess due to the evaluation order of
the SQL statement. That is, processing the entire repository takes the same
time as processing only one day's logs, with the same large total number
of open-file jobs.

SELECT *
FROM mytable
WHERE INPUT__FILE__NAME LIKE 's3://myflume-logs/usr-20120423%';
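
For comparison, the kind of layout where Hive can skip files up front is a
partitioned external table, since a predicate on a partition column
restricts which locations are read at all. What follows is only a minimal
sketch under assumptions that do not hold for our data today: it assumes
the files were written or copied into one S3 prefix per day. The table
name mytable_by_day, the by-day/ prefix, the partition column dt and the
single placeholder column are all hypothetical.

```sql
-- Hypothetical partitioned table; the column list is a placeholder, and the
-- ROW FORMAT SERDE clause for the custom SerDe would have to be added.
CREATE EXTERNAL TABLE mytable_by_day (
  line STRING
)
PARTITIONED BY (dt STRING)
LOCATION 's3://myflume-logs/by-day/';

-- Register one day's prefix as a partition (assumed layout, one prefix per day).
ALTER TABLE mytable_by_day ADD PARTITION (dt = '20120423')
LOCATION 's3://myflume-logs/by-day/dt=20120423/';

-- A filter on the partition column reads only the registered location.
SELECT *
FROM mytable_by_day
WHERE dt = '20120423';
```

With everything in one folder, as in our case, there is no such partition
column to filter on.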

Best-practice knowledge from others who have been down this road is very
welcome.

Thanks in advance,
Soren

Bejoy KS 2012-04-24, 14:30