Re: Hadoop real time
Jacques 2011-09-04, 04:38
It is hard to reply to an article that you don't actually reference, but I'll
do my best. Also, you don't define real-time, so I'll take it to mean
something that comes back within 1-2 seconds (e.g. an end user on a web
site is waiting for the info).
>>Can you please tell me why Hadoop is said not to be used for Real time
processing of data?
There are two different parts to the core Hadoop project. By themselves, both
are geared toward batch processing rather than real-time processing:
1. HDFS is a distributed file system that is good at safely managing a large
quantity of very large files. Generally speaking, HDFS is a write-once
file system. You can't modify the middle of a file after it is written.
You also can't append to the end of a file without a special version of
Hadoop. Also, you can't tail a file directly as it is being written. As
such, it would be hard to use it directly to create a real-time workflow.
2. MapReduce is a distributed computing framework. It is used to process
those large files held on HDFS. Because of the design of MapReduce, jobs
usually take at least 10 seconds and typically much longer. This would also
mean you're looking at batch processing large quantities of data in some
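
To make the MapReduce model in item 2 concrete, here is a minimal single-process sketch in Python of the map, shuffle, and reduce phases, using word count as the example. It is not the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration. On a real cluster each phase also involves scheduling the job, starting tasks on worker machines, and reading and writing data on disk, which is roughly where the fixed per-job overhead comes from.

```python
from collections import defaultdict


def map_phase(line):
    # Emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Collapse all values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    # Run the three phases in order over the whole input (one batch).
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hadoop is batch", "hbase is real time"]))
```

Note that the result is only available once every input line has gone through all three phases, which is the batch behaviour described above.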
HBase is a separate sub-project from the Hadoop project proper. It is
built specifically to handle real-time loads. You can insert a row and get
it back immediately.
>I was thinking we can replace the DB with Hadoop...I do not see any
HBase can replace many of the functions of existing databases but should be
used primarily when you need the massive scale it can provide. With HBase you
have to give up things like transactions and SQL when compared to
traditional RDBMSs (MySQL, PostgreSQL, etc.). The schema design is very
different and generally your application must be built with this in mind.
You should probably spend some time with the HBase book (
http://hbase.apache.org/book.html) and look at your current applications
to determine what kinds of things you would need to do. Many people
actually use HBase in parallel with a traditional RDBMS, leveraging the
strengths of each.
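
As a rough illustration of why the schema design differs, here is a toy in-memory model in Python of a table whose rows are kept sorted by row key, which is the access pattern HBase is built around. This is not the HBase client API: ToyTable, event_key, and the "info:page" column name are hypothetical names chosen for the sketch. The point is that without SQL, the questions you want to ask have to be designed into the row key up front.

```python
import bisect


class ToyTable:
    """In-memory stand-in for a table sorted by row key (not the HBase API)."""

    def __init__(self):
        self._keys = []  # row keys kept in sorted order
        self._rows = {}  # row key -> {column: value}

    def put(self, row_key, columns):
        # Insert or update a row; it is readable as soon as put() returns.
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key].update(columns)

    def get(self, row_key):
        # Point lookup by exact row key; returns None if the row is absent.
        return self._rows.get(row_key)

    def scan_prefix(self, prefix):
        # Sorted keys make "all rows starting with prefix" one contiguous range.
        start = bisect.bisect_left(self._keys, prefix)
        result = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            result.append((key, self._rows[key]))
        return result


def event_key(user_id, timestamp):
    # Composite key: user first, then zero-padded time so that
    # string order matches time order within one user.
    return f"{user_id}#{timestamp:010d}"


table = ToyTable()
table.put(event_key("alice", 1315111080), {"info:page": "/home"})
table.put(event_key("bob", 1315111090), {"info:page": "/cart"})
table.put(event_key("alice", 1315111100), {"info:page": "/search"})
```

With this key layout, table.scan_prefix("alice#") returns all of one user's events in time order, while a question like "all visits to /cart" has no cheap answer and would need a second table or a full scan. In an RDBMS you would add an index or write a different query instead.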