Re: What's the best practice of loading logs into hdfs while using hive to do log analytic?
bejoy.hadoop@... 2012-02-07, 15:38
Assume your external table's location points to an HDFS dir, say /ext_tables/table1/. You can write custom code in Flume that ingests data into sub-dirs within that parent folder, like /ext_tables/table1/2012-01-01/12 (currentDate/currentHour). Configure the collector to dump into HDFS every hour, with a max buffer matching your block size. So what would happen is: within an hour, if the buffer gets filled, the data is dumped into HDFS that instant; and whatever the buffer size is, at the end of every hour Flume dumps the data into HDFS. At every hour (give a delay of 5 minutes to be on the safe side, i.e. at n hours 05 minutes) issue a DDL on Hive as
add partition with location as /ext_tables/table1/currentDate/previousHour
Now the hour partition may contain one or more blocks/files based on your input data.
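The hourly partition-add step above can be sketched as a small script run from cron; the table name and partition columns (dt, hr) here are hypothetical, so adjust them to your own schema:

```python
from datetime import datetime, timedelta

def hourly_partition_ddl(now, table="table1", root="/ext_tables/table1"):
    """Build the ALTER TABLE statement for the previous hour's directory.

    Table name and partition columns (dt, hr) are assumptions for
    illustration; match them to your actual Hive schema.
    """
    prev = now - timedelta(hours=1)
    location = "%s/%s/%02d" % (root, prev.strftime("%Y-%m-%d"), prev.hour)
    return ("ALTER TABLE %s ADD PARTITION (dt='%s', hr=%d) LOCATION '%s'"
            % (table, prev.strftime("%Y-%m-%d"), prev.hour, location))

# Invoked at n hours 05 minutes, e.g. at 13:05 this registers the 12:00 dir:
# hourly_partition_ddl(datetime(2012, 1, 1, 13, 5))
```

Using `now - 1 hour` rather than tracking the hour by hand also handles the midnight rollover, where the previous hour belongs to the previous day's directory.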
How would this approach fit your use case?
Bejoy K S
From handheld, Please excuse typos.
From: alo alt <[EMAIL PROTECTED]>
Date: Tue, 7 Feb 2012 15:27:18
To: <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Cc: <[EMAIL PROTECTED]>; Xiaobin She <[EMAIL PROTECTED]>
Subject: Re: What's the best practice of loading logs into hdfs while using hive to do log analytic?
You can use partitioned tables in Hive to append new data without moving it. For Flume you can define small sinks, but you're right, the file in HDFS is only finalized when Flume sends the close. Please note, the gzip codec has no sync markers inside, so you have to wait until Flume has closed the file in HDFS before you can process it. Snappy would fit, but I have no long-term experience with it in a production environment.
On block sizing you're right, but I think you can work around that.
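The wait-until-closed rule above can be sketched in Python; the `.tmp` suffix is an assumption (some Flume HDFS sinks mark in-progress files that way by default), so match it to your collector's configuration:

```python
def closed_files(paths, in_use_suffix=".tmp"):
    """Return only files that are safe to read, skipping any still open.

    The ".tmp" in-use suffix is an assumption for illustration; check
    what suffix (if any) your Flume version appends to open files.
    """
    return [p for p in paths if not p.endswith(in_use_suffix)]

# Only a.gz is ready; b.gz.tmp is still being written by the collector.
# closed_files(["/ext_tables/table1/2012-01-01/12/a.gz",
#               "/ext_tables/table1/2012-01-01/12/b.gz.tmp"])
```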
On Feb 7, 2012, at 3:09 PM, Xiaobin She wrote:
> hi Bejoy and Alex,
> thank you for your advice.
> Actually I looked at Scribe first, and I have heard of Flume.
> I looked at Flume's user guide just now, and Flume seems promising. As Bejoy said, the Flume collector can dump data into HDFS when the collector buffer reaches a particular size or after a particular time interval. This is good, and I think it can solve the problem of data delivery latency.
> But what about compression?
> From Flume's user guide, I see that Flume supports compression of log files. But if Flume does not wait until the collector has collected one hour of logs before compressing them and sending them to HDFS, then it will send part of the one-hour log to HDFS, am I right?
> So if I want to use these data in Hive (assume I have an external table in Hive), I have to specify at least two partition keys while creating the table, one for day-month-hour and one for some other time interval like ten minutes, and then add Hive partitions to the existing external table with the specified partition keys.
> Is the above process right?
> If this is right, then there could be other problems: the ten-minute logs, after compression, may not be big enough to fill an HDFS block, which may cause lots of small files (for some of our log ids this will be the case); or if I set the time interval to half an hour, then at the end of the hour it may still cause the data delivery latency problem.
> This doesn't seem like a very good solution; am I making a mistake or misunderstanding something here?
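To put rough numbers on the small-file concern raised above, here is a minimal sketch, assuming one compressed file per interval per log id and a 64 MB block size (the block size and example chunk sizes are assumptions, not measurements from this system):

```python
def files_per_day(interval_minutes):
    """Lower bound on files per day per log id, one file per interval."""
    return (24 * 60) // interval_minutes

def is_small_file(compressed_mb, block_mb=64):
    """True if a compressed chunk leaves most of an HDFS block unused."""
    return compressed_mb < block_mb / 2

# Ten-minute intervals yield 144 files per day per log id; if each
# compressed chunk is, say, 5 MB against a 64 MB block, that is the
# small-file problem the question describes. Half-hour intervals cut
# this to 48 files but reintroduce up to 30 minutes of latency.
```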
> thank you very much!
> 2012/2/7 alo alt <[EMAIL PROTECTED]>
> a first start with flume:
> Facebook's Scribe could also work for you.
> - Alex
> Alexander Lorenz
> On Feb 7, 2012, at 11:03 AM, Xiaobin She wrote:
> > Hi all,
> > Sorry if it is not appropriate to send one thread into two maillist.
> > I'm trying to use Hadoop and Hive to do some log analytic jobs.
> > Our system generates lots of logs every day; for example, it produces about