HBase, mail # dev - Hadoop real time


Re: Hadoop real time
Ted Dunning 2011-09-04, 06:31
There are additional offshoots of Hadoop that can specifically address
real-time needs, such as Spark, S4 and HStreaming.

Most real-time-ish applications, however, come with a 100% uptime requirement.
Put simply, a system that is down and is going to take tens to hundreds
of minutes to come back is going to miss a lot of real-time windows.
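
As a rough illustration of that point (a sketch, not a measurement): if a
real-time window is taken to be 2 seconds, an assumed figure borrowed from the
1-2 second definition in the quoted message, the number of windows missed
during an outage is simple arithmetic. The function name missed_windows is
made up for this example.

```python
def missed_windows(downtime_minutes: float, window_seconds: float) -> int:
    """Count the whole response windows that elapse during an outage."""
    return int(downtime_minutes * 60 // window_seconds)


# Assumed outage lengths: the low and high ends of "tens to hundreds of minutes".
for minutes in (10, 100):
    print(f"{minutes} min down -> {missed_windows(minutes, 2.0)} missed 2-second windows")
```

Even at the short end, a 10-minute outage spans hundreds of such windows.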

As such, you may need to investigate derivatives of Hadoop that explicitly
support high availability.

On Sat, Sep 3, 2011 at 11:38 PM, Jacques <[EMAIL PROTECTED]> wrote:

> It is hard to reply to an article that you don't actually reference but I'll
> do my best.  Also, you don't define real-time so I'll consider it as being
> something that would come back within 1-2 seconds (e.g. an end user on a web
> site is waiting for the info).
>
> > Can you please tell me why Hadoop is said not to be used for Real time
> > processing of data?
>
> There are two different parts to the core Hadoop project.  By themselves,
> both are focused more on batch processing than on real-time workflows.
> 1. HDFS is a distributed file system that is good at safely managing a large
> quantity of very large files.  Generally speaking, HDFS is a write-once
> file system.  You can't modify the middle of a file after it is written.
> You also can't append to the end of a file without a special version of
> Hadoop.  Also, you can't tail a file directly as it is being written.  As
> such, it would be hard to use it directly to create a real-time workflow.
>
> 2. MapReduce is a distributed computing framework.  It is used to process
> those large files held on HDFS.  Because of the design of MapReduce, jobs
> usually take at least 10 seconds and typically much longer. This would also
> mean you're looking at batch processing large quantities of data in some
> non-real-time period.
>
> HBase is a separate subproject, distinct from the Hadoop project proper.  It
> is built specifically to handle real-time loads.  You can insert a row and
> get it back immediately.
>
> > I was thinking we can replace the DB with Hadoop...I do not see any issue?
>
> HBase can replace many of the functions of existing databases but should be
> used primarily when you need the massive scale it can provide.  With HBase
> you have to give up things like transactions and SQL compared to a
> traditional RDBMS (MySQL, PostgreSQL, etc.).  The schema design is very
> different and generally your application must be built with this in mind.
> You should probably spend some time with the HBase book
> (http://hbase.apache.org/book.html) and look at your current applications
> to determine what kinds of things you would need to do.  Many people
> actually use HBase in parallel with a traditional RDBMS, leveraging the
> strengths of each.
>
> Good luck!
>
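
The contrast drawn in the quoted message between a write-once file and a row
store can be modeled in a few lines. This is a toy in-memory sketch in Python,
not the HDFS or HBase API; the class and method names (WriteOnceFile,
RowStore, put, get) are hypothetical and chosen only for illustration.

```python
class WriteOnceFile:
    """Toy model of a write-once file: written sequentially, then sealed."""

    def __init__(self) -> None:
        self._chunks: list[bytes] = []
        self._closed = False

    def write(self, data: bytes) -> None:
        # Once sealed, the file accepts neither appends nor in-place edits.
        if self._closed:
            raise OSError("file is sealed: cannot append or modify")
        self._chunks.append(data)

    def close(self) -> None:
        self._closed = True

    def read(self) -> bytes:
        # Models "you can't tail a file as it is being written".
        if not self._closed:
            raise OSError("file is still being written: cannot read yet")
        return b"".join(self._chunks)


class RowStore:
    """Toy model of a row store: a row is readable as soon as it is put."""

    def __init__(self) -> None:
        self._rows: dict[str, dict[str, bytes]] = {}

    def put(self, row_key: str, column: str, value: bytes) -> None:
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key: str) -> dict[str, bytes]:
        # Return a copy so callers cannot mutate stored rows by accident.
        return dict(self._rows.get(row_key, {}))


log = WriteOnceFile()
log.write(b"event-1\n")
log.close()
print(log.read())            # data is visible only after the file is closed

store = RowStore()
store.put("user42", "info:name", b"alice")
print(store.get("user42"))   # the row can be read back right after the put
```

The batch side only exposes data once a file is complete, while the row store
serves each record as soon as it is written, which is the property a
1-2 second response target depends on.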