Questions about data distribution in HBase


William Kang 2010-03-27, 17:06
Tim Robertson 2010-03-27, 17:54
William Kang 2010-03-27, 18:51
Dan Washusen 2010-03-27, 22:38
William Kang 2010-03-28, 02:42
Tim Robertson 2010-03-28, 07:13

Re: Questions about data distribution in HBase
Hi William,

I think you are slightly confused about the usage and intention of HBase.
Let me first say that HBase is a *storage* system designed for low-latency,
random-access retrieval, built on top of HDFS for high availability. That
is, it's a storage system, not a processing system. It solves the "large
file problem" of HDFS, wherein access to an arbitrary slice of a file
requires scanning through everything preceding it, by giving random access
by record key.
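
As a minimal sketch of what that random access looks like in code (this is
an editorial illustration, not from the thread; it assumes the classic
pre-1.0 HTable client API, and the "videos" table with a "data:chunk"
column is a hypothetical schema):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RandomGetExample {
      public static void main(String[] args) throws Exception {
        // Open the (hypothetical) "videos" table.
        HTable table = new HTable(HBaseConfiguration.create(), "videos");
        // Fetch one row directly by its key -- no scan over preceding data.
        Get get = new Get(Bytes.toBytes("video-0007"));
        Result result = table.get(get);
        byte[] chunk = result.getValue(Bytes.toBytes("data"), Bytes.toBytes("chunk"));
        System.out.println("fetched " + (chunk == null ? 0 : chunk.length) + " bytes");
        table.close();
      }
    }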

For further details of HBase, I'll +1 the suggestion of reviewing the post
by Lars George.

You haven't contributed any details re: what kind of "processing" you wish
to accomplish over this video data. Based on your focus on low latency, I
will assume the m/r batch processing suggested earlier is not acceptable and
you require some kind of low-latency, immediate response solution. If this
is indeed the case, I suggest you look at using Katta (
http://katta.sourceforge.net/) for your low-latency processing. It says
"Distributed Lucene" but they actually mean "Distributed, Low-Latency
Aggregates." Perhaps an acceptable solution for you is random-access storage
of your video data in HBase combined with a custom Katta server for
processing of low-latency requests.
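
To make the storage half of that concrete, here is a hedged sketch (again
not from the thread; table and column names are hypothetical, and in
practice a large video would be chunked across several rows rather than
stored as one huge cell):

    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class StoreVideoExample {
      public static void main(String[] args) throws Exception {
        // Read the video file named on the command line into memory.
        byte[] video = Files.readAllBytes(Paths.get(args[0]));
        HTable table = new HTable(HBaseConfiguration.create(), "videos");
        Put put = new Put(Bytes.toBytes("video-0007"));
        // Very large cells strain a region server; splitting the value across
        // rows (e.g. video-0007-0000, -0001, ...) is the usual workaround.
        put.add(Bytes.toBytes("data"), Bytes.toBytes("chunk"), video);
        table.put(put);
        table.close();
      }
    }

The Katta side would sit next to this as a separate serving layer; HBase
itself only covers the random-access storage.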

Any further details you can provide about your project will help The List
give better direction and advice.

Cheers,
-Nick

On Sat, Mar 27, 2010 at 7:42 PM, William Kang <[EMAIL PROTECTED]> wrote:

> Hi Dan,
> Thanks for your reply.
> But I still have some questions about your answers:
> 1. What difference does it make whether I use the HMaster or any other
> machine, since you mention "If you run the program from a single machine
> (don't use the HMaster) then yes, it would have to transfer the data to
> that machine using the network." Is there a way to run the program on
> multiple machines without using M/R?
> 2. Still, what about the latency if we use M/R in HBase?
> Thanks.
>
>
> William
>
> On Sat, Mar 27, 2010 at 6:38 PM, Dan Washusen <[EMAIL PROTECTED]> wrote:
>
> > Hi William,
> > I've put a few comments inline...
> >
> > On 28 March 2010 04:06, William Kang <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi,
> > > I am quite confused about the distribution of data in an HBase system.
> > > For instance, if I store 10 videos in the cells of 10 HTable rows, I
> > > assume that these 10 videos will be stored on different data nodes
> > > (region servers) in HBase.
> >
> > The distribution of the data would depend on the size of the videos.
> > Assuming the videos are 10MB each, all of them will be contained
> > within a single region and served by a single region server.  Once a
> > region contains more than 256MB of data (the default) the region is
> > split in two.  The two regions will then (probably) be served by two
> > region servers, etc...
> >
> > You may also be getting the terminology a little mixed.  I'd suggest
> > having a read of the excellent HBase Architecture 101 article that
> > Lars George wrote:
> > http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
> >
> > > Now, if I wrote a program that does some processing on these 10 videos
> > > in parallel, what's going to happen?
> > > Since I only deployed the program in a jar to the master server in
> > > HBase, will all videos in the HBase system have to be transferred to
> > > the master server to get processed?
> >
> > If you run the program from a single machine (don't use the HMaster)
> > then yes, it would have to transfer the data to that machine using the
> > network.
> >
> > > 1. Or do I have another option to control where the computation
> > > happens, so I do not have to transfer the data over the network and
> > > can use the region server's CPU to do the processing?
> > > 2. Or should I deploy the program jar to each region server so the
> > > region server can use its local CPU on the local data? Will the HBase
> > > system do that automatically?
> > > 3. Or do I need to plug M/R into HBase in order to use the local data and
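
(Editorial sketch, not part of the thread: the M/R route asked about in
questions 1-3 above is normally wired up with TableMapReduceUtil, which
creates one map task per region and lets Hadoop schedule each task on or
near the region server holding that region, so the data is read locally.
The table/column names are hypothetical and the code assumes the classic
org.apache.hadoop.hbase.mapreduce API.)

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ProcessVideosJob {

      // One map task per region; Hadoop tries to run it on the node serving
      // that region, so the video bytes stay local to the region server.
      static class VideoMapper extends TableMapper<Text, LongWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
          byte[] chunk = value.getValue(Bytes.toBytes("data"), Bytes.toBytes("chunk"));
          // ... per-video processing would go here ...
          context.write(new Text(row.get()), new LongWritable(chunk == null ? 0 : chunk.length));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(HBaseConfiguration.create(), "process-videos");
        job.setJarByClass(ProcessVideosJob.class);
        Scan scan = new Scan();
        scan.setCaching(1); // rows carry large values, so fetch them one at a time
        TableMapReduceUtil.initTableMapperJob(
            "videos", scan, VideoMapper.class, Text.class, LongWritable.class, job);
        job.setNumReduceTasks(0); // map-only; results go straight to the output path
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }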

Karthik K 2010-03-29, 23:25
William Kang 2010-03-30, 00:14
Andrew Purtell 2010-03-30, 00:33