Ricky Ho 2010-09-06, 05:02
Ranjib Dey 2010-09-06, 05:42
Bill Graham 2010-09-07, 05:50
Re: Real-time log processing in Hadoop
Jeff Hammerbacher 2010-09-09, 06:27
We've got a number of customers at Cloudera using Flume (http://github.com/cloudera/flume) and HBase together to get low-latency aggregates in a reliable fashion. See https://issues.cloudera.org/browse/FLUME-126 for an example of one approach from a recent Cloudera Hackathon.

On Mon, Sep 6, 2010 at 10:50 PM, Bill Graham <[EMAIL PROTECTED]> wrote:

> We're using Chukwa to do steps a-d before writing summary data into MySQL.
> Data is written into new directories every 5 minutes. Our MR jobs and data
> load into MySQL take < 5 minutes, so after a 5-minute window closes, we
> typically have summary data from that interval in MySQL a few minutes
> later.
>
> But as Ranjib points out, how fast you can process your data depends on
> both cluster size and data rate.
>
> thanks,
> Bill
>
> On Sun, Sep 5, 2010 at 10:42 PM, Ranjib Dey <[EMAIL PROTECTED]> wrote:
>
> > We are using Hadoop for log crunching, and the mined data feeds one of
> > our apps. It's not exactly real time; the app is basically a mail
> > responder which provides certain services given an e-mail (with a
> > prescribed format) sent to it ([EMAIL PROTECTED]). We have been able to
> > bring the response time down to 30 mins. This includes automated Hadoop
> > job submission -> processing the output, and intermediate status
> > notification. From our experience we have learned that the entire
> > response time depends on your data size, your Hadoop cluster's strength,
> > etc. And you need to do performance optimization at each level (as
> > required), ranging from JVM tuning (different tuning on name nodes /
> > data nodes) to app-level code refactoring (like using HAR on HDFS for
> > smaller files, etc).
> >
> > regards
> > ranjib
> >
> > On Mon, Sep 6, 2010 at 10:32 AM, Ricky Ho <[EMAIL PROTECTED]> wrote:
> >
> > > Can anyone share their experience in doing real-time log processing
> > > using Chukwa/Scribe + Hadoop?
> > >
> > > I am wondering how "real-time" this can be, given that Hadoop is
> > > designed for batch rather than stream processing ....
> > > 1) The startup / teardown of Hadoop jobs typically takes minutes.
> > > 2) Data is typically stored in HDFS as large files, and it takes some
> > > time to accumulate data to that size.
> > >
> > > All of these add to the latency of Hadoop. So I am wondering what the
> > > shortest latencies are that people doing log processing see in real
> > > life.
> > >
> > > To my understanding, the Chukwa/Scribe model accumulates log requests
> > > (from many machines) and writes them to HDFS (inside a directory).
> > > After the logger switches to a new directory, the old one is ready
> > > for Map/Reduce processing, which then produces the result.
> > >
> > > So the latency is ...
> > > a) Accumulate enough data to fill an HDFS block size
> > > b) Write the block to HDFS
> > > c) Keep doing (b) until the criterion for switching to a new directory
> > > is met
> > > d) Start the Map/Reduce processing in the old directory
> > > e) Write the processed data to the output directory
> > > f) Load the output into a queryable form.
> > >
> > > I think the above can easily take 30 minutes or 1 hour. Is this
> > > ballpark in line with the real-life projects that you have done?
> > >
> > > Rgds,
> > > Ricky
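
The latency arithmetic in the quoted steps (a)-(f) can be put into a rough sketch. This was not posted by anyone in the thread; the function name `latency_bounds` and the two-stage split are assumptions made only for illustration. It treats steps a-c as one accumulation window (the directory rotation interval) and steps d-f as one processing stage, and plugs in Bill's numbers as an upper bound (5-minute windows, MR plus MySQL load under 5 minutes).

```python
def latency_bounds(window_min, processing_min):
    """Return (best, worst) end-to-end latency in minutes for one log event.

    window_min:     how long the logger writes to a directory before
                    switching to a new one (steps a-c). Hypothetical name.
    processing_min: time for the Map/Reduce job plus loading the output
                    into a queryable store (steps d-f). Hypothetical name.
    """
    if window_min < 0 or processing_min < 0:
        raise ValueError("durations must be non-negative")
    # An event logged just before the window closes waits only for processing.
    best = processing_min
    # An event logged just after the window opens waits for the whole window too.
    worst = window_min + processing_min
    return best, worst


if __name__ == "__main__":
    # Bill's setup, using 5 minutes as an upper bound on processing time.
    best, worst = latency_bounds(window_min=5, processing_min=5)
    print(f"best case: {best} min, worst case: {worst} min")
```

Under these assumptions an event becomes queryable somewhere between the processing time and the window length plus the processing time after it is logged, so shrinking the rotation window only helps as long as the job still finishes before the next window closes.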
Steve Hoffman 2010-09-16, 20:56