Re: What's the best practice of loading logs into hdfs while using hive to do log analytic?
bejoy.hadoop@... 2012-02-07, 10:29
If you are looking for a solution that ingests data into HDFS in real time, i.e. as soon as it is generated, you should look into Cloudera Flume. It makes real-time data ingestion into HDFS possible. You can configure the Flume collector to dump data into HDFS when the collector buffer reaches a particular size or after a particular time interval.
Currently your log collector gives you one file per logid per hour. You may want to consider a design that replaces the log collector with Flume. You can make Flume ingest data into hourly subdirectories in HDFS. Once that is done, what is left would be:
- run DDL in Hive to add the new partition for each of those tables
- trigger your Hive jobs for the previous hour.
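The first of these two steps can be sketched as a small script that generates the partition DDL for the previous hour. This is only an illustration under assumptions: the table names (`log_1001`, `log_1002`), the base directory `/flume/logs`, and the `yyyyMMddHH` partition value format are hypothetical; only the partition column `dt` comes from the original question.

```python
from datetime import datetime, timedelta


def previous_hour_partition(now):
    """Return the partition value for the hour before `now`, e.g. '2012020717'."""
    prev = now.replace(minute=0, second=0, microsecond=0) - timedelta(hours=1)
    return prev.strftime("%Y%m%d%H")


def add_partition_ddl(table, dt, base_dir="/flume/logs"):
    """Build a Hive statement that registers one hourly directory as a partition.

    base_dir is a hypothetical HDFS directory that Flume writes into.
    """
    location = f"{base_dir}/{table}/{dt}"
    return (f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (dt='{dt}') LOCATION '{location}';")


if __name__ == "__main__":
    dt = previous_hour_partition(datetime.now())
    # Hypothetical table names, one per logid.
    for table in ["log_1001", "log_1002"]:
        print(add_partition_ddl(table, dt))
```

The printed statements could be passed to the Hive command-line tool before the hourly jobs are triggered. `IF NOT EXISTS` keeps the step safe to re-run for the same hour.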
Bejoy K S
From handheld, Please excuse typos.
From: Xiaobin She <[EMAIL PROTECTED]>
Date: Tue, 7 Feb 2012 18:03:57
To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: What's the best practice of loading logs into hdfs while using hive
to do log analytic?
Sorry if it is not appropriate to send the same thread to two mailing lists.
I'm trying to use Hadoop and Hive to do some log analysis jobs.
Our system generates lots of logs every day. For example, it produced about
370GB of logs (spread over many log files) yesterday, and every day the logs
And we want to use Hadoop and Hive to replace our old log analysis system.
We distinguish our logs by logid. We have a log collector which
collects logs from clients and then generates log files.
For every logid there will be one log file every hour; for some logids
this hourly log file can be 1~2GB.
I have set up a test cluster with Hadoop and Hive, and I have run some
tests, which look good for us.
For reference, we will create one table in Hive for every logid, and each
table will be partitioned by hour.
Now I have a question: what's the best practice for loading log files into
HDFS or the Hive warehouse dir?
My first thought is, at the beginning of every hour, to compress the log file
of the last hour for every logid and then use the Hive command-line tool to
load these compressed log files into HDFS,
using commands like "LOAD DATA LOCAL inpath '$logname' OVERWRITE INTO
TABLE $tablename PARTITION (dt='$h') "
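A minimal sketch of this hourly load, assuming a hypothetical local layout of `/data/logs/<logid>/<hour>.log.gz` and hypothetical table names of the form `log_<logid>`. It collects the statements for all logids into one script, so that the Hive command-line tool is started once per hour rather than once per file:

```python
import subprocess


def load_statement(log_path, table, hour):
    """Build one LOAD DATA statement in the form quoted above."""
    return (f"LOAD DATA LOCAL INPATH '{log_path}' OVERWRITE INTO "
            f"TABLE {table} PARTITION (dt='{hour}');")


def build_batch(logids, hour, log_dir="/data/logs"):
    """Join the statements for all logids into one Hive script.

    log_dir and the file naming scheme are assumptions for illustration.
    """
    return "\n".join(
        load_statement(f"{log_dir}/{logid}/{hour}.log.gz", f"log_{logid}", hour)
        for logid in logids
    )


def run_batch(script):
    """Run the whole script in a single `hive -e` invocation."""
    subprocess.run(["hive", "-e", script], check=True)


if __name__ == "__main__":
    # Print the script instead of running it, so the sketch works without Hive.
    print(build_batch(["1001", "1002"], "2012020717"))
```

Batching only reduces the per-invocation overhead; the files still have to be copied into HDFS, so it does not remove the network load or the delay described below.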
I think this can work, and I have run some tests on our 3-node test
But the problem is, there are lots of logids, which means there are lots of
log files, so every hour we will have to load lots of files into HDFS.
And there is another problem: we will run hourly analysis jobs on these
hourly collected log files,
which introduces a problem. Because there are lots of log files, if we
load these log files at the same time at the beginning of every hour, I
think there will be heavy network traffic and there will be data delivery
By the data delivery latency problem, I mean that it will take some time for
the log files to be copied into HDFS, and this will cause our hourly log
analysis job to start later.
So I want to figure out if we can write or append logs to a compressed
file which is already located in HDFS. I have posted a thread on the
mailing list, and from what I have learned, this is not possible.
So, what's the best practice for loading logs into HDFS while using Hive to
do log analysis?
Or what are the common methods to handle the problems I have described above?
Can anyone give me some advice?
Thank you very much for your help!