Re: Questions about data distribution in HBase
Nick Dimiduk 2010-03-29, 21:27
I think you are slightly confused about the usage and intention of HBase.
Let me first say that HBase is a *storage* system designed for low-latency,
random-access retrieval, built on top of HDFS for high availability. That
is, it's a storage system, not a processing system. It solves the "large
file problem" of HDFS, wherein access to an arbitrary slice of a file requires
scanning through every segment preceding that slice, by giving random access
by record key.
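To make the contrast concrete, here is a toy sketch in Python. It is not how HBase or HDFS is implemented; the record names and both functions are hypothetical stand-ins. It only compares walking a flat sequence of records with looking a record up by key in a sorted collection.

```python
# Hypothetical stand-in data: (row_key, value) pairs kept sorted by key.
# The names and sizes are made up for illustration.
records = [(f"video-{i:04d}", f"payload-{i}") for i in range(1000)]
keys = [key for key, _ in records]


def sequential_read(target_key):
    """Flat-file style: walk every record that precedes the target."""
    steps = 0
    for key, value in records:
        steps += 1
        if key == target_key:
            return value, steps
    return None, steps


def keyed_read(target_key):
    """Keyed style: binary-search the sorted keys for the target."""
    steps = 0
    lo, hi = 0, len(keys)
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if keys[mid] < target_key:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(keys) and keys[lo] == target_key:
        return records[lo][1], steps
    return None, steps


if __name__ == "__main__":
    print(sequential_read("video-0900"))
    print(keyed_read("video-0900"))
```

Both calls return the same payload; the difference is the number of records touched on the way there, which is the point of having access by record key.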
For further details of HBase, I'll +1 the suggestion of reviewing the post
by Lars George.
You haven't contributed any details re: what kind of "processing" you wish
to accomplish over this video data. Based on your focus on low latency, I
will assume the m/r batch processing suggested earlier is not acceptable and
you require some kind of low-latency, immediate response solution. If this
is indeed the case, I suggest you look at using Katta (
http://katta.sourceforge.net/) for your low-latency processing. It says
"Distributed Lucene" but they actually mean "Distributed, Low-Latency
Aggregates." Perhaps an acceptable solution for you is random-access storage
of your video data in HBase combined with a custom Katta server for
processing of low-latency requests.
Any further details you can provide about your project will aid in the
direction and advice The List can provide.
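P.S. On the region sizes Dan describes in the quoted reply below: a rough back-of-the-envelope model of his numbers, sketched in Python. The function name and the "split exactly in half" rule are simplifying assumptions for illustration; only the 256MB default threshold and the 10 x 10MB example come from his message, and real split behaviour depends on more than total data size.

```python
def estimate_regions(total_mb, split_threshold_mb=256):
    """Very simplified model: any region larger than the threshold is
    split into two equal halves, repeatedly, until none exceeds it."""
    regions = [total_mb]
    while any(size > split_threshold_mb for size in regions):
        next_regions = []
        for size in regions:
            if size > split_threshold_mb:
                # Assumption: a split produces two equal halves.
                next_regions.extend([size / 2, size / 2])
            else:
                next_regions.append(size)
        regions = next_regions
    return len(regions)


if __name__ == "__main__":
    # Ten 10MB videos stay well under the threshold.
    print(estimate_regions(10 * 10))
    # A larger data set under the same simplified model.
    print(estimate_regions(1000))
```

Under this model, ten 10MB videos total 100MB and stay in one region, so they would be served by a single region server, as Dan says.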
On Sat, Mar 27, 2010 at 7:42 PM, William Kang <[EMAIL PROTECTED]> wrote:
> Hi Dan,
> Thanks for your reply.
> But I still have some questions about your answers:
> 1. What difference does it make whether I use the HMaster or any other
> machine? You mention "If you run the program from a single machine (don't
> use the HMaster) then yes, it would have to transfer the data to that
> machine using the network." Is there a way to run the program on multiple
> machines without using M/R?
> 2. Still, what about the latency if we use M/R in HBase?
> On Sat, Mar 27, 2010 at 6:38 PM, Dan Washusen <[EMAIL PROTECTED]> wrote:
> > Hi William,
> > I've put a few comments inline...
> > On 28 March 2010 04:06, William Kang <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi,
> > > I am quite confused about the distribution of data in an HBase system.
> > > For instance, if I store 10 videos in the cells of 10 HTable rows, I
> > > assume these 10 videos will be stored in different data nodes
> > > (regionservers) in HBase.
> > The distribution of the data would depend on the size of the videos.
> > Assuming the videos are 10MB each then all videos will be contained
> > within a single region and served by a single region server. Once a
> > region contains more than 256MB of data (default) the region is split
> > in two. The two regions will then (probably) be served by two region
> > servers, etc...
> > You may also be getting the terminologies a little mixed. I'd suggest
> > having a read of the excellent HBase Architecture 101 article that
> > Lars George wrote:
> > http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
> > > Now, if I wrote a program that does some processing on these 10 videos
> > > in parallel, what's going to happen?
> > > Since I only deployed the program in a jar to the master server in
> > > will all videos in the HBase system have to be transferred into the
> > > server to get processed?
> > If you run the program from a single machine (don't use the HMaster)
> > then yes, it would have to transfer the data to that machine using the
> > network.
> > > 1. Or do I have another option to assign where the computing should happen
> > > so I do not have to transfer the data over the network and use the
> > > server's cpu to calculate the process?
> > > 2. Or should I deploy the program jar to each region server so the
> > > server can use local cpu on the local data? Will HBase system do that
> > > automatically?
> > > 3. Or do I need to plug M/R into HBase in order to use the local data and