Re: Hadoop real time
There are additional offshoots of Hadoop that can specifically address
real-time needs, such as Spark, S4, and HStreaming.

Most real-time-ish applications, however, come with a 100% uptime
requirement.  Put simply, a system that is down and will take tens to
hundreds of minutes to come back is going to miss a lot of real-time
windows.

As such, you may need to investigate derivatives of Hadoop that explicitly
support high availability.

On Sat, Sep 3, 2011 at 11:38 PM, Jacques <[EMAIL PROTECTED]> wrote:

> It is hard to reply to an article that you don't actually reference, but
> I'll do my best.  Also, you don't define real-time, so I'll consider it as
> something that would come back within 1-2 seconds (e.g. an end user on a
> web site is waiting for the info).
>
> >> Can you please tell me why Hadoop is said not to be used for real-time
> >> processing of data?
>
> There are two different parts to the core Hadoop project.  Both of these
> are, by themselves, focused more on batch processing than on real-time
> workflows.
> 1. HDFS, a distributed file system that is good at safely managing a large
> quantity of very large files.  Generally speaking, HDFS is a write-once
> file system.  You can't modify the middle of a file after it is written.
> You also can't append to the end of a file without a special version of
> Hadoop.  Also, you can't tail a file directly as it is being written.  As
> such, it would be hard to use it directly to create a real-time workflow
> (see the sketch after this list).
>
> 2. MapReduce is a distributed computing framework.  It is used to process
> those large files held on HDFS.  Because of the design of MapReduce, jobs
> usually take at least 10 seconds and typically much longer.  This also
> means you're looking at batch processing large quantities of data over
> some non-real-time period.
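>
> To illustrate the write-once constraint from point 1, here is a minimal
> sketch using the Hadoop FileSystem Java API (the path /tmp/events.log and
> the class name are made up for illustration, and a default Configuration
> is assumed):
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FSDataOutputStream;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>
>   public class WriteOnceSketch {
>     public static void main(String[] args) throws Exception {
>       Configuration conf = new Configuration();   // picks up core-site.xml
>       FileSystem fs = FileSystem.get(conf);
>       Path path = new Path("/tmp/events.log");    // hypothetical path
>
>       // Write the file in one pass and close it.  After close() the
>       // contents are fixed; there is no API to rewrite the middle of it.
>       FSDataOutputStream out = fs.create(path);
>       out.writeBytes("event-1\n");
>       out.close();
>
>       // fs.append(path) is only usable on versions/configurations that
>       // support append; on others it simply fails, which is the
>       // limitation described above.
>     }
>   }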
>
> HBase is a separate sub-project from the Hadoop project proper.  It is
> built specifically to handle real-time loads: you can insert a row and get
> it back immediately.
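>
> As a rough sketch of that put-then-get flow (the table name "mytable",
> the family "cf", and the column "col" are assumed for illustration, using
> the HBase Java client API):
>
>   import org.apache.hadoop.hbase.HBaseConfiguration;
>   import org.apache.hadoop.hbase.client.Get;
>   import org.apache.hadoop.hbase.client.HTable;
>   import org.apache.hadoop.hbase.client.Put;
>   import org.apache.hadoop.hbase.client.Result;
>   import org.apache.hadoop.hbase.util.Bytes;
>
>   public class PutThenGetSketch {
>     public static void main(String[] args) throws Exception {
>       HTable table = new HTable(HBaseConfiguration.create(), "mytable");
>
>       // Insert a single row ...
>       Put put = new Put(Bytes.toBytes("row-1"));
>       put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"),
>               Bytes.toBytes("value-1"));
>       table.put(put);
>
>       // ... and read it straight back -- no batch job in between.
>       Result result = table.get(new Get(Bytes.toBytes("row-1")));
>       System.out.println(Bytes.toString(
>           result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))));
>
>       table.close();
>     }
>   }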
>
>  > I was thinking we can replace the DB with Hadoop...I do not see any
>  > issue?
>
> HBase can replace many of the functions of existing databases, but it
> should be used primarily when you need the massive scale it can provide.
> Compared to traditional RDBMSs (MySQL, PostgreSQL, etc.), you have to give
> up things like transactions and SQL.  The schema design is very different,
> and generally your application must be built with this in mind.  You
> should probably spend some time with the HBase book
> (http://hbase.apache.org/book.html) and look at your current applications
> to determine what kinds of things you would need to do.  Many people
> actually use HBase in parallel with a traditional RDBMS, leveraging the
> strengths of each.
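>
> For a feel of how different the schema model is, here is a hedged sketch
> of defining a table with a single column family (the names "user_events"
> and "d" are made up for illustration):
>
>   import org.apache.hadoop.hbase.HBaseConfiguration;
>   import org.apache.hadoop.hbase.HColumnDescriptor;
>   import org.apache.hadoop.hbase.HTableDescriptor;
>   import org.apache.hadoop.hbase.client.HBaseAdmin;
>
>   public class CreateTableSketch {
>     public static void main(String[] args) throws Exception {
>       HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
>
>       // One table, one column family.  Columns are created per-row at
>       // write time, and the row key design (e.g. userId + timestamp)
>       // does the work that indexes and joins would do in an RDBMS.
>       HTableDescriptor desc = new HTableDescriptor("user_events");
>       desc.addFamily(new HColumnDescriptor("d"));
>       admin.createTable(desc);
>     }
>   }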
>
> Good luck!
>