-
Questions about data distribution in HBase
William Kang 2010-03-27, 17:06
Hi,
I am quite confused about how data is distributed in an HBase system. For instance, if I store 10 videos in the cells of 10 HTable rows, I assume that these 10 videos will be stored on different data nodes (regionservers) in HBase. Now, if I write a program that processes these 10 videos in parallel, what is going to happen? Since I only deployed the program as a jar to the master server in HBase, will all the videos in the HBase system have to be transferred to the master server to be processed?
1. Or do I have another option to assign where the computing should happen, so that I do not have to transfer the data over the network and can use the region server's CPU for the processing?
2. Or should I deploy the program jar to each region server so that the region server can use its local CPU on the local data? Will HBase do that automatically?
3. Or do I need to plug M/R into HBase in order to use local data and parallelize the processing?
Many thanks.
William
-
Re: Questions about data distribution in HBase
Tim Robertson 2010-03-27, 17:54
I would consider option 3) if it were me (I am not an expert). It is common to use HBase tables as the input format for MapReduce jobs. I don't think it is as easy as assuming that the 10 videos will go over 10 machines when storing, but certainly as the volume grows the data will distribute, and by using MR the processing will try to run as close to the data as possible.
Cheers, Tim
-
Re: Questions about data distribution in HBase
William Kang 2010-03-27, 18:51
Hi Tim,
The problem is that M/R is a batch system with high latency. The reason I am using HBase is its low latency. If I plug M/R into HBase, the advantage of HBase could be compromised. I have not used M/R for a while, so I am not sure whether the new release has improved much in terms of speed. What would be ideal is a low-latency system that could also handle some parallel jobs. Do you have any suggestions? Thanks for your replies.
William
-
Re: Questions about data distribution in HBase
Dan Washusen 2010-03-27, 22:38
Hi William,
I've put a few comments inline...
On 28 March 2010 04:06, William Kang <[EMAIL PROTECTED]> wrote:
> Hi,
> I am quite confused about the distributions of data in a HBase system.
> For instance, if I store 10 videos in 10 HTable rows' cell, I assume that
> these 10 videos will be stored in different data nodes (regionservers) in
> HBase.
The distribution of the data would depend on the size of the videos. Assuming the videos are 10MB each, then all videos will be contained within a single region and served by a single region server. Once a region contains more than 256MB of data (default) the region is split in two. The two regions will then (probably) be served by two region servers, etc...
You may also be getting the terminologies a little mixed. I'd suggest having a read of the excellent HBase Architecture 101 article that Lars George wrote:
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
> Now, if I wrote a program that do some processes for these 10 videos
> parallel, what' going to happen?
> Since I only deployed the program in a jar to the master server in HBase,
> will all videos in the HBase system have to be transfered into the master
> server to get processed?
If you run the program from a single machine (don't use the HMaster) then yes, it would have to transfer the data to that machine using the network.
> 1. Or do I have another option to assign where the computing should happen
> so I do not have to transfer the data over the network and use the region
> server's cpu to calculate the process?
> 2. Or should I deploy the program jar to each region server so the region
> server can use local cpu on the local data? Will HBase system do that
> automatically?
> 3. Or I need plug M/R into HBase in order to use the local data and
> parallelization in processes?
> Many thanks.
HBase uses HDFS to store files. The data that a region server is serving does not necessarily reside on the same machine as the region server. As a result, options 1 and 2 don't really make sense...
As Tim Robertson suggests, you are left with option 3 to consider...
I hope that helps a little. I'd really strongly recommend that you have a read of the HBase Architecture 101 article...
Cheers, Dan
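The split arithmetic described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not how HBase implements splitting: rows are assumed to arrive in key order, a region is assumed to split into two halves by row count as soon as it exceeds the threshold, and the function name `simulate_regions` is made up for illustration. The 256 MB default is the figure quoted in the message above.

```python
def simulate_regions(row_sizes_mb, split_threshold_mb=256):
    """Toy model: rows arrive in key order and land in the last region;
    a region splits in half once its total size exceeds the threshold."""
    regions = [[]]  # each region is a list of row sizes in MB
    for size in row_sizes_mb:
        regions[-1].append(size)
        last = regions[-1]
        # A single row cannot be split, so only split regions with 2+ rows.
        if sum(last) > split_threshold_mb and len(last) > 1:
            mid = len(last) // 2
            regions[-1:] = [last[:mid], last[mid:]]
    return regions


if __name__ == "__main__":
    # Ten 10 MB videos total 100 MB, which stays below the threshold.
    print(len(simulate_regions([10] * 10)))
    # Ten 75 MB clips cross the threshold several times.
    print(len(simulate_regions([75] * 10)))
```

Under this model the ten 10 MB videos stay in one region, while larger clips spread over several regions. Which region servers end up hosting those regions is a separate assignment decision that the sketch does not model.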
-
Re: Questions about data distribution in HBase
William Kang 2010-03-28, 02:42
Hi Dan,
Thanks for your reply. But I still have some questions about your answers:
1. What difference does it make whether I use the HMaster or any other machine, since you mention "If you run the program from a single machine (don't use the HMaster) then yes, it would have to transfer the data to that machine using the network."? Is there a way to run the program on multiple machines without using M/R?
2. Still, what about the latency if we use M/R with HBase?
Thanks.
William
-
Re: Questions about data distribution in HBase
Tim Robertson 2010-03-28, 07:13
Hi William,
My thoughts: You could put your processing code on all machines (except the master), and write something that load balances incoming requests across the machines to select a node. It sounds like you want to process on demand, so you would need to load balance requests onto the machines, and one of those machines would then, through the HBase API, collect the video (hence transferring data across the cluster), process it and serve it. I would think using a separate cluster other than the actual HBase/Hadoop machines would be best if you have reasonable traffic (e.g. 3/5 zookeeper machines, 1+ Master, 3+ slaves, and N video processing / request serving machines).
MapReduce will have similar latency to what you have observed before (e.g. even a small number of items to process is going to take 10-20 secs minimum, but probably more like minutes if it involves complex processing).
Is it possible, for your needs, to MR all videos and store the processed result back into HBase, so the data is preprocessed and ready to serve in real time? Storing the same content in multiple formats seems quite a common approach with HBase, as storage is "cheap". E.g. after a video is stored, it is picked up during a periodic MR job that processes it and stores it in a different column in the same row, and only then is it made available for real time serving. With timestamps, you would only process videos changed since the last run. This would provide parallel processing of the videos in a simple manner, but would mean there is a delay between the first storage and the availability.
If your processing is not generic, and depends on the actual request coming in, then this model would not be suitable, and you would be looking at load balancing the processing across machines based on incoming requests, as above.
If you describe how large the videos are, how many there are, the saving rate, what you do when you process them (is it generic? how long does it take? do you need to store the processed output to save future processing?), how people request them (is it 1 or many videos at a time?), and what the expectations of the video clients are (can they request many videos, and receive notification when they are available?) etc., it will be easier for the list subscribers to offer more concrete advice on deployment options.
Cheers, Tim
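The preprocess-and-store-back idea can be sketched as follows. This is a minimal in-memory illustration, not HBase or MapReduce code: the table is a plain dict, the column names `raw` and `processed` are hypothetical, and `process_changed_rows` stands in for one run of the periodic job.

```python
def process_changed_rows(table, last_run_ts, transform):
    """Process only rows whose raw value changed after last_run_ts and
    store the result in a second column of the same row."""
    processed = 0
    for row_key, row in table.items():
        raw_value, raw_ts = row["raw"]
        if raw_ts > last_run_ts:
            # Keep the source timestamp so a reader can tell which raw
            # version the processed column was derived from.
            row["processed"] = (transform(raw_value), raw_ts)
            processed += 1
    return processed


if __name__ == "__main__":
    # Hypothetical rows: column name -> (value, timestamp).
    table = {
        "video-1": {"raw": (b"old clip", 5)},
        "video-2": {"raw": (b"new clip", 15)},
    }
    changed = process_changed_rows(table, last_run_ts=10, transform=bytes.upper)
    print(changed, sorted(table["video-2"]))
```

Serving code would then read only the `processed` column, so the batch latency is paid once per stored video rather than once per request.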
-
Re: Questions about data distribution in HBase
Nick Dimiduk 2010-03-29, 21:27
Hi William,
I think you are slightly confused about the usage and intention of HBase. Let me first say that HBase is a *storage* system designed for low latency, random access retrieval - built on top of HDFS for high availability. That is, it's a storage system, not a processing system. It solves the "large file problems" of HDFS, wherein access to an arbitrary slice of a file requires scanning through every segment preceding that segment, by giving random access by record key.
For further details of HBase, I'll +1 the suggestion of reviewing the post by Lars George.
You haven't contributed any details re: what kind of "processing" you wish to accomplish over this video data. Based on your focus on low latency, I will assume the m/r batch processing suggested earlier is not acceptable and you require some kind of low-latency, immediate response solution. If this is indeed the case, I suggest you look at using Katta (http://katta.sourceforge.net/) for your low-latency processing. It says "Distributed Lucene" but they actually mean "Distributed, Low-Latency Aggregates." Perhaps an acceptable solution for you is random-access storage of your video data in HBase combined with a custom Katta server for processing of low-latency requests.
Any further details you can provide about your project will aid in the direction and advice The List can provide.
Cheers, -Nick
-
Re: Questions about data distribution in HBase
Karthik K 2010-03-29, 23:25
William -
If you are processing video files (depending on how big they are), a better prospect might be to store the video files in HDFS only and exploit Hadoop RPC (see Avro) for a custom protocol to process them. Katta, suggested by Nick above, is a great example of that (a custom protocol on top of Avro / Hadoop RPC).
To get a hint about the locality of the files on HDFS, you can use the following in DistributedFileSystem:
BlockLocation[] DistributedFileSystem#getBlockLocations(String src, long start, long length);
and use it as a guiding factor for locality in your protocol.
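One way block locations could guide such a protocol is to send the work to the host that holds the most bytes of the file. The sketch below is only an illustration: it does not call the Hadoop API, the `(length, hosts)` tuples are an assumed stand-in for what a block-location lookup returns, and the function name `best_host` and the host names are hypothetical.

```python
from collections import Counter


def best_host(blocks):
    """Given (length_in_bytes, [hosts holding a replica]) per block,
    return the host storing the most bytes of the file, or None."""
    bytes_per_host = Counter()
    for length, hosts in blocks:
        for host in hosts:
            bytes_per_host[host] += length
    if not bytes_per_host:
        return None
    # Sort first so that ties are broken alphabetically and the
    # result is deterministic.
    return max(sorted(bytes_per_host), key=bytes_per_host.__getitem__)


if __name__ == "__main__":
    # Hypothetical replica placement for a three-block file.
    blocks = [
        (64, ["node-a", "node-b"]),
        (64, ["node-b", "node-c"]),
        (32, ["node-c"]),
    ]
    print(best_host(blocks))
```

A custom RPC layer could use a choice like this to route each clip's processing request, falling back to any available node when the preferred host is busy.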
-
Re: Questions about data distribution in HBase
William Kang 2010-03-30, 00:14
Hi,
Thanks a lot for your detailed suggestions. To answer Tim's question, let me elaborate a little on the case I am working on.
What I need is a low-latency system that can perform some video processing on the fly. For this reason, M/R probably won't do the job. The reason I chose Hadoop is its parallelization. I am trying to use multiple machines to process the video at the same time. Each video clip should be around 50M to 100M. A whole video has already been sliced into around 10 video clips. These clips should be stored in an HBase table for fast retrieval. But to process them on the fly for a real application, I need these 10 video clips to be processed at the same time, where they are stored.
To satisfy this purpose, I need to implement "local awareness", that is to say, my program which processes video clips should run on the machine which stores the video clips. So, my question can be rephrased as:
1. Does HBase provide local awareness of where the data is stored?
2. If yes to question 1, is there any current framework I can use to distribute my processes with HBase?
If the answer to question 2 is no, I think I will have to write some custom RPC interfaces in my program.
The reason I need local awareness and want to run the processes on the local data node is that I want to avoid transporting data over the network and want to use multiple CPUs. The reason I need HBase instead of Hadoop M/R or HDFS with RPC is that latency is quite important for this on-the-fly processing. If necessary, I can give a more detailed description of my case. Thanks a lot.
William
-
Re: Questions about data distribution in HBase
Andrew Purtell 2010-03-30, 00:33
This use case is an ideal one for coprocessors. Alas, the coprocessor feature is not finished yet. More inline.
> From: William Kang
> Subject: Re: Questions about data distribution in HBase
>
> What I need is a low latency system can perform some videos
> processes on the fly. For this reason, a M/R probably won't
> do the job. The reason I chose hadoop is because its
> parallelization.
Unless I somehow misunderstand, Hadoop parallelization == M/R. That is, the only parallel scheduling for user tasks on the Hadoop platform is MapReduce.
> I am trying to use the multiple machines to make the video
> process at the same time. Each video
> clip should be around 50M to 100M. A whole video has been
> sliced into around 10 video clips already. These clips should
> be stored in HBase's table for fast retrieval. But to make a
> process on the fly for real application, I need these 10 video
> clips to be processed at the same time where they are
> stored.
>
> To satisfy this purpose, I need to implement "local
> awareness", that is to say, my program which process video
> clips should be run on the machine which store the video
> clips. So, my question can be rephrased into:
> 1. Dose HBase provide local awareness of where the data is
> stored?
You know the row key, so you can find via the master the region server currently hosting the region which contains the key. Over time, after major compaction, regionservers bring the HDFS blocks backing a region local.
> 2. If yes to question 1, is there any current framework I
> can use to distribute my processes with hbase?
Coprocessors. HBASE-2000, HBASE-2001
http://issues.apache.org/jira/browse/HBASE-2000
Alas, unfinished.
> If no the question 2, I think I will have to make some
> custom rpc interfaces in my program.
It might be easier to help work on HBASE-2001.
> The reason I need local awareness and run the processes at
> local data node is that I want to avoid transporting data
> over network and use multiple cpus.
You can transcode at put time or at get time with a coprocessor.
- Andy
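The row-key lookup described above can be pictured as a search over sorted region start keys. This is a conceptual sketch only, not the HBase client API: the region table is a hand-written list, and the row keys, server names and the function name `locate` are all hypothetical.

```python
import bisect


def locate(row_key, regions):
    """regions: list of (start_key, server) sorted by start_key, where
    each region covers keys from its start_key up to the next start_key.
    Returns the server hosting the region that contains row_key."""
    start_keys = [start for start, _ in regions]
    # The rightmost region whose start key is <= row_key contains the key.
    index = bisect.bisect_right(start_keys, row_key) - 1
    return regions[index][1]


if __name__ == "__main__":
    # Hypothetical layout: the first region starts at the empty key.
    regions = [("", "rs1"), ("clip-04", "rs2"), ("clip-08", "rs3")]
    for key in ("clip-01", "clip-04", "clip-09"):
        print(key, "->", locate(key, regions))
```

Once the hosting server is known, a request for that clip could be directed there. Whether the underlying HDFS blocks are actually local to that server depends on compaction history, as the message above notes.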