Re: Real-time log processing in Hadoop
We've got a number of customers at Cloudera using Flume
(http://github.com/cloudera/flume) and HBase together to get low-latency
aggregates in a reliable fashion. See
https://issues.cloudera.org/browse/FLUME-126 for an example of one approach
from a recent Cloudera Hackathon.
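
For a flavor of what that looks like, here is a minimal, hypothetical
sketch (table, column, and class names are invented, and a recent HBase
client API is assumed): each log event bumps an atomic per-minute counter
in HBase, so aggregates are queryable right away instead of after a batch
job.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.util.Bytes;

  // Hypothetical sketch: bump a per-minute counter for each incoming log
  // event so aggregates are readable without waiting for a batch MR job.
  public class LogCounter {
    private final HTable table;

    public LogCounter(Configuration conf) throws Exception {
      table = new HTable(conf, "log_aggregates");  // invented table name
    }

    public void record(String metric, long timestampMs) throws Exception {
      long minuteBucket = timestampMs / 60000L;    // 1-minute buckets
      byte[] row = Bytes.toBytes(metric + ":" + minuteBucket);
      // Atomic server-side increment; no read-modify-write race.
      table.incrementColumnValue(row, Bytes.toBytes("agg"),
          Bytes.toBytes("count"), 1L);
    }

    public static void main(String[] args) throws Exception {
      LogCounter c = new LogCounter(HBaseConfiguration.create());
      c.record("http.200", System.currentTimeMillis());
    }
  }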

On Mon, Sep 6, 2010 at 10:50 PM, Bill Graham <[EMAIL PROTECTED]> wrote:

> We're using Chukwa to do steps a-d before writing summary data into MySQL.
> Data is written into new directories every 5 minutes. Our MR jobs and the
> data load into MySQL take < 5 minutes, so after a 5-minute window closes,
> we typically have summary data from that interval in MySQL within a few
> minutes.
>
> But as Ranjib points out, how fast you can process your data depends on
> both cluster size and data rate.
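>
> (For illustration only, a rough sketch of the kind of 5-minute bucketing
> this implies; the path layout and names are hypothetical, not our actual
> Chukwa setup:)
>
>   import java.text.SimpleDateFormat;
>   import java.util.Date;
>
>   // Hypothetical: map an event timestamp to its 5-minute HDFS directory.
>   public class WindowPath {
>     private static final long WINDOW_MS = 5 * 60 * 1000L;
>
>     public static String dirFor(long eventTimeMs) {
>       long windowStart = (eventTimeMs / WINDOW_MS) * WINDOW_MS; // round down
>       SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd/HHmm");
>       return "/logs/" + fmt.format(new Date(windowStart));
>     }
>
>     public static void main(String[] args) {
>       // prints e.g. /logs/20100906/2250 for a 10:50 PM event
>       System.out.println(dirFor(System.currentTimeMillis()));
>     }
>   }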
>
> thanks,
> Bill
>
> On Sun, Sep 5, 2010 at 10:42 PM, Ranjib Dey <[EMAIL PROTECTED]> wrote:
>
> > We are using Hadoop for log crunching, and the mined data feeds one of
> > our apps. It's not exactly real time; the app is basically a mail
> > responder which provides certain services given an e-mail (in a
> > prescribed format) sent to it ([EMAIL PROTECTED]). We have been able to
> > bring the response time down to 30 mins. This includes automated Hadoop
> > job submission, processing the output, and intermediate status
> > notifications. From our experience we have learned that the overall
> > response time depends on your data size, your Hadoop cluster's
> > strength, etc., and that you need to do performance optimization at
> > each level as required, from JVM tuning (different tuning for name
> > nodes and data nodes) to app-level code refactoring (like using HAR on
> > HDFS for smaller files, etc.).
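> >
> > (Purely illustrative: a minimal sketch of the automated job submission
> > piece, using the old mapred API; class and path names are made up, and
> > the small-files archiving mentioned above is done separately with the
> > hadoop archive command-line tool:)
> >
> >   import org.apache.hadoop.fs.Path;
> >   import org.apache.hadoop.mapred.FileInputFormat;
> >   import org.apache.hadoop.mapred.FileOutputFormat;
> >   import org.apache.hadoop.mapred.JobClient;
> >   import org.apache.hadoop.mapred.JobConf;
> >
> >   // Hypothetical driver the mail responder could invoke per request:
> >   // submit the crunching job and block until it finishes.
> >   public class CrunchDriver {
> >     public static void runFor(String inputDir, String outputDir)
> >         throws Exception {
> >       JobConf conf = new JobConf(CrunchDriver.class);
> >       conf.setJobName("log-crunch");
> >       // Mapper/reducer left at the identity defaults for brevity.
> >       FileInputFormat.setInputPaths(conf, new Path(inputDir));
> >       FileOutputFormat.setOutputPath(conf, new Path(outputDir));
> >       // runJob blocks and reports progress, which can drive the
> >       // intermediate status notifications mentioned above.
> >       JobClient.runJob(conf);
> >     }
> >   }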
> >
> > regards
> > ranjib
> >
> > On Mon, Sep 6, 2010 at 10:32 AM, Ricky Ho <[EMAIL PROTECTED]> wrote:
> >
> > > Can anyone share their experience doing real-time log processing
> > > using Chukwa/Scribe + Hadoop?
> > >
> > > I am wondering how "real-time" this can be, given that Hadoop is
> > > designed for batch rather than stream processing ...
> > > 1) The startup/teardown time of running Hadoop jobs typically takes
> > > minutes.
> > > 2) Data is typically stored in HDFS as large files, and it takes some
> > > time to accumulate data to that size.
> > >
> > > All of this adds to Hadoop's latency, so I am wondering: what are the
> > > shortest latencies people achieve doing log processing in real life?
> > >
> > > To my understanding, the Chukwa/Scribe model accumulates log requests
> > > (from many machines) and writes them to HDFS (inside a directory).
> > > After the logger switches to a new directory, the old one is ready
> > > for Map/Reduce processing, which then produces the result.
> > >
> > > So the latency is ...
> > > a) Accumulate enough data to fill an HDFS block
> > > b) Write the block to HDFS
> > > c) Keep doing (b) until the criteria for switching to a new directory
> > > are met
> > > d) Start the Map/Reduce processing on the old directory
> > > e) Write the processed data to the output directory
> > > f) Load the output into a queryable form.
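> > >
> > > (To make that concrete with invented numbers: at, say, 100 KB/s of
> > > logs per collector, filling a 64 MB block takes roughly 64 MB / 100
> > > KB/s, about 11 minutes; add a few blocks per directory window, minutes
> > > of job startup plus the Map/Reduce run itself, and the load step, and
> > > 30+ minutes end-to-end seems easy to reach.)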
> > >
> > > I think the above can easily add up to a 30-minute or 1-hour
> > > duration. Is this ballpark in line with the real-life projects you
> > > have done?
> > >
> > > Rgds,
> > > Ricky
> > >
> >
>