Bill Graham 2010-09-07, 05:50
Re: Real-time log processing in Hadoop
We've got a number of customers at Cloudera using Flume
(http://github.com/cloudera/flume) and HBase together to get low-latency
aggregates in a reliable fashion. See
https://issues.cloudera.org/browse/FLUME-126 for an example of one approach
from a recent Cloudera Hackathon.
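
For anyone wondering what the HBase write path looks like, here is a
minimal sketch of storing a time-bucketed aggregate with the 0.20-era Java
client. The table name, column family, and row-key layout are assumptions
for illustration, not details taken from FLUME-126:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AggregateWriter {
      public static void main(String[] args) throws Exception {
        // Assumes an existing table, e.g. created in the shell with:
        //   create 'aggregates', 'agg'
        HTable table = new HTable(new HBaseConfiguration(), "aggregates");
        // Hypothetical row key: <5-minute bucket>|<metric name>
        Put put = new Put(Bytes.toBytes("201009060500|pageviews"));
        put.add(Bytes.toBytes("agg"), Bytes.toBytes("count"),
                Bytes.toBytes(12345L));
        table.put(put);  // autoflush is on by default, so this commits now
      }
    }

Because each write is visible to a Get or Scan as soon as it lands, the
aggregates can be queried while the stream is still flowing, which is where
the low latency comes from.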

On Mon, Sep 6, 2010 at 10:50 PM, Bill Graham <[EMAIL PROTECTED]> wrote:

> We're using Chukwa to do steps a-d before writing summary data into MySQL.
> Data is written into new directories every 5 minutes. Our MR jobs and data
> load into MySQL take < 5 minutes, so after a 5-minute window closes, we
> typically have summary data from that interval in MySQL a few minutes
> later.
>
> But as Ranjib points out, how fast you can process your data depends on
> both cluster size and data rate.
>
> thanks,
> Bill
>
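
To make the shape of that per-window job concrete, here is a minimal sketch
against the 0.20 mapred API. The /logs/incoming/<window> directory layout
and the per-URL line count are illustrative assumptions, not Bill's actual
job:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WindowSummary {
      // Emits (url, 1) per log line; assumes the URL is the first field.
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, LongWritable> out, Reporter rep)
            throws IOException {
          String[] fields = line.toString().split("\\s+");
          if (fields.length > 0 && fields[0].length() > 0)
            out.collect(new Text(fields[0]), ONE);
        }
      }

      // Sums the counts for each URL within the window.
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, LongWritable, Text, LongWritable> {
        public void reduce(Text url, Iterator<LongWritable> counts,
                           OutputCollector<Text, LongWritable> out, Reporter rep)
            throws IOException {
          long sum = 0;
          while (counts.hasNext()) sum += counts.next().get();
          out.collect(url, new LongWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        String window = args[0];  // e.g. "201009060500", one dir per window
        JobConf job = new JobConf(WindowSummary.class);
        job.setJobName("summarize-" + window);
        FileInputFormat.setInputPaths(job, new Path("/logs/incoming/" + window));
        FileOutputFormat.setOutputPath(job, new Path("/logs/summaries/" + window));
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        JobClient.runJob(job);  // blocks until the job finishes
      }
    }

JobClient.runJob() blocks, so a cron entry or small scheduler can fire this
right after each 5-minute directory closes and then bulk-load the summary
files into MySQL (e.g. with LOAD DATA INFILE).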
> On Sun, Sep 5, 2010 at 10:42 PM, Ranjib Dey <[EMAIL PROTECTED]> wrote:
>
> > We are using Hadoop for log crunching, and the mined data feeds one of
> > our apps. It's not exactly real time; the app is basically a mail
> > responder which provides certain services given an e-mail (with a
> > prescribed format) sent against it ([EMAIL PROTECTED]). We have been
> > able to bring the response time down to 30 mins. This includes automated
> > Hadoop job submission, processing the output, and intermediate status
> > notification. From our experience we have learned that the entire
> > response time depends on your data size, your Hadoop cluster's strength,
> > etc. And you need to do performance optimization at each level (as
> > required), from JVM tuning (different tuning in name nodes / data nodes)
> > to app-level code refactoring (like using HAR on HDFS for smaller files,
> > etc.).
> >
> > regards
> > ranjib
> >
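
On the HAR point: packing a directory of small log files into an archive
keeps the NameNode's file count down, and the stock tool does it in one
command, along the lines of (paths hypothetical, syntax per Hadoop 0.20):

    hadoop archive -archiveName logs.har -p /logs/raw 2010-09-05 /logs/archived

The archive is then readable through the har:// filesystem, so MapReduce
jobs can take it as input without code changes.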
> > On Mon, Sep 6, 2010 at 10:32 AM, Ricky Ho <[EMAIL PROTECTED]> wrote:
> >
> > > Can anyone share their experience in doing real-time log processing
> > > using Chukwa/Scribe + Hadoop ?
> > >
> > > I am wondering how "real-time" this can be, given that Hadoop is
> > > designed for batch rather than stream processing ....
> > > 1) The startup / teardown time of Hadoop jobs typically takes minutes.
> > > 2) Data is typically stored in HDFS as large files, and it takes some
> > > time to accumulate data to that size.
> > >
> > > All these add up to Hadoop's latencies.  So I am wondering what the
> > > shortest latencies are that people achieve doing log processing in
> > > real life.
> > >
> > > To my understanding, the Chukwa/Scribe model accumulates log requests
> > > (from many machines) and writes them to HDFS (inside a directory).
> > > After the logger switches to a new directory, the old one is ready for
> > > Map/Reduce processing, which then produces the result.
> > >
> > > So the latency is ...
> > > a) Accumulate enough data to fill an HDFS block
> > > b) Write the block to HDFS
> > > c) Keep doing (b) until the criterion for switching to a new directory
> > > is met
> > > d) Start the Map/Reduce processing on the old directory
> > > e) Write the processed data to the output directory
> > > f) Load the output into a queryable form.
> > >
> > > I think the above can easily be a 30-minute or 1-hour duration.  Is
> > > this ballpark in line with the real-life projects that you have done ?
> > >
> > > Rgds,
> > > Ricky
> > >
> > >
> > >
> > >
> > >
> >
>
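
To put rough numbers on steps a-f (illustrative figures, not from any of
the projects above): with the default 64 MB block size, an aggregate log
rate of 1 MB/s fills a block in about a minute (a-b); add a 5-minute
directory roll (c), a minute or so of job startup plus a few minutes of
map/reduce (d-e), and a short load step (f), and the end-to-end floor is on
the order of ten minutes. That matches Bill's Chukwa numbers, and it
stretches toward the 30-minute/1-hour guess as data volume and job
complexity grow.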