-
hbase table as a queue.
Jack Levin 2011-07-16, 16:38
Hello, we are thinking about using an HBase table as a simple queue that will dispatch work for a MapReduce job, as well as support real-time fetching of data to present to the end user. In simple terms, suppose you had a data source table and a queue table. The queue table has a smaller set of Rows whose Values point into the Perma-Set table, which has a large collection of Rows (so Queue{Row, Value} -> Perma-Set{Row, Value}, or Q-Value -> P-Row). Our goal is to look up which Rows to retrieve from the Perma-Set table by looking through the Queue. Once the lookup into the Queue is done, the Row must be deleted from the Queue so that the same Perma-Set lookup is not done twice. We expect many concurrent lookups to happen, so I assume the first thing we need is for the client that does the work to acquire a lock on the Queue Row, process the work, then remove the Queue Row.
Has anyone done something similar before? Any gotchas we should be aware of?
Thanks.
-Jack
-
Re: hbase table as a queue.
Jack Levin 2011-07-16, 16:47
One thing I need to point out is that we do not need the Queue items to be worked on in order, so there is no traditional head and tail of the Queue. The Queue table is simply a set of work orders that can be fetched randomly, or by applying a scan for a particular set of Rows that can even come from the middle of the table.
-Jack
-
Re: hbase table as a queue.
Stack 2011-07-16, 20:24
I learned Friday that our fellas on the frontend are using an HBase table to do simple queuing. They insert stuff to be processed by distributed processes, and when the processes are done with the work, they remove the processed element from the HBase table. They are queuing, processing, and removing millions of items a day. Elements were added on the end of the queue (FIFO).
The issue to avoid was that over time, especially if it had been a while between major compactions, latency was going up. Turns out the table had been splitting when the queue backed up. A scan for new stuff to process then had to first traverse regions that had nought in them (the key was time-based and the tail of the table had moved on past these first regions). This traversal, especially with no major compaction and so lots of deletes to process, was taking time to get to the first row.
To fix, we rid the table of its empty regions and made it so the table would no longer split, so there is only ever one region in it. This should make it so we don't end up with empty regions to skip through before we get to the first element in the table (we still need major compaction running on a somewhat regular basis to temper latencies). Will report back to the list if we find otherwise.
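The latency effect described above can be reproduced in a toy model. The sketch below is not HBase code: it treats the table as a sorted list of (row key, deleted?) pairs and counts how many cells a forward scan touches before it reaches the first live row, before and after a simulated major compaction. The function names and the row counts are made up for illustration.

```python
def cells_scanned_to_first_live(cells):
    """Count the cells a forward scan touches up to and including the first live row.

    `cells` is a list of (row_key, is_deleted) tuples sorted by row key.
    """
    for touched, (_, is_deleted) in enumerate(cells, start=1):
        if not is_deleted:
            return touched
    return len(cells)  # nothing live: the scan walked the whole table


def major_compact(cells):
    """Drop delete-marked cells, as a major compaction eventually would."""
    return [cell for cell in cells if not cell[1]]


# Time-based keys: the oldest 10,000 items were processed and deleted,
# and 5 new items are waiting at the tail of the table.
queue = [(f"{ts:010d}", True) for ts in range(10_000)]
queue += [(f"{ts:010d}", False) for ts in range(10_000, 10_005)]

before = cells_scanned_to_first_live(queue)
after = cells_scanned_to_first_live(major_compact(queue))
print(before, after)
```

The point of the model is only that the cost of finding the first row grows with the number of deletes not yet compacted away, which is why regular major compactions matter for this kind of table.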
Do not use locks. Doesn't scale. Maybe update a cell when task is taken out for processing. If too much time elapses since last update, maybe give it out again?
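One way to read the "update a cell, give it out again if too much time elapses" suggestion is as a lease on each task. The sketch below is a minimal in-memory illustration of that idea and makes several assumptions: `QueueTable`, `check_and_put_claim`, `try_claim`, and the `claimed_at` cell are hypothetical names, and the dict stands in for the table. In a real deployment the compare-and-set step would have to be a single atomic server-side operation on the row.

```python
class QueueTable:
    """In-memory stand-in for a queue table. This is NOT the HBase client API."""

    def __init__(self):
        # row key -> {"value": pointer into the big table, "claimed_at": float or None}
        self.rows = {}

    def put(self, row, value):
        self.rows[row] = {"value": value, "claimed_at": None}

    def check_and_put_claim(self, row, expected, new):
        """Set claimed_at to `new` only if it still equals `expected`.

        Models an atomic compare-and-set on a single cell.
        """
        cell = self.rows.get(row)
        if cell is None or cell["claimed_at"] != expected:
            return False
        cell["claimed_at"] = new
        return True

    def delete(self, row):
        self.rows.pop(row, None)


def try_claim(table, row, now, lease_seconds):
    """Claim a task if it is unclaimed or its previous claim has gone stale."""
    cell = table.rows.get(row)
    if cell is None:
        return False  # already processed and removed
    seen = cell["claimed_at"]
    if seen is not None and now - seen < lease_seconds:
        return False  # another worker holds a fresh claim
    # Only one of several racing workers can win this compare-and-set.
    return table.check_and_put_claim(row, seen, now)


table = QueueTable()
table.put("task-1", "perma-row-42")
print(try_claim(table, "task-1", now=100.0, lease_seconds=30))
```

A worker that wins the claim does the work and then deletes the row. A worker that crashes simply never deletes it, and the task becomes claimable again once the lease expires, so no lock is ever held.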
St.Ack
-
Re: hbase table as a queue.
Ted Dunning 2011-07-17, 00:52
Up to a pretty high transaction rate, you can simply use ZooKeeper, especially if you check out a block of tasks at once.
With blocks of 100-1000, you should be able to handle a million events per second with very simple ZK data structures.
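The arithmetic behind the block-checkout point can be spelled out. The helper below is a hypothetical illustration, not ZooKeeper code: it only computes how many checkout operations per second the coordinator sees when workers take tasks in blocks rather than one at a time.

```python
def checkouts_per_second(events_per_second, block_size):
    """Coordination operations needed when tasks are checked out in blocks."""
    # Ceiling division: a partial block still costs one checkout.
    return -(-events_per_second // block_size)


for block in (100, 1000):
    print(block, checkouts_per_second(1_000_000, block))
```

At a million events per second, blocks of 100 mean 10,000 checkouts per second and blocks of 1000 mean 1,000, so the coordination service handles two to three orders of magnitude fewer operations than there are events.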
On Sat, Jul 16, 2011 at 1:24 PM, Stack <[EMAIL PROTECTED]> wrote:
> Do not use locks. Doesn't scale. Maybe update a cell when task is
> taken out for processing. If too much time elapses since last update,
> maybe give it out again?
-
Re: hbase table as a queue.
Stack 2011-07-17, 01:10
Yes. I should have mentioned this. Thanks Ted.
-
Re: hbase table as a queue.
Daniel Einspanjer 2011-07-19, 15:26
We use a queue table like this too and ran into the same problem. How did you configure it such that it never splits?
-Daniel
-
RE: hbase table as a queue.
Michael Segel 2011-07-19, 16:00
I'm not sure how they are doing this, but just a quick thought...
You can increase the max file size (to 1-2 GB, for example) and then run compactions on a regular basis to clean up rows deleted from the queue. This will stop the table from splitting.
The assumption is that your MAX_FILESIZE is much larger than your anticipated queue size.
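A rough way to sanity-check that assumption is to compare the configured limit against an estimate of the queue's on-disk footprint. The sketch below is back-of-the-envelope only: the function name, the row counts, the average row size, and the factor for deleted-but-not-yet-compacted rows are all made-up inputs, not measured values.

```python
def headroom(max_filesize_bytes, peak_rows, avg_row_bytes, uncompacted_factor=2.0):
    """Ratio of the split threshold to the estimated peak queue footprint.

    `uncompacted_factor` is a guess at how much extra space deleted rows
    occupy until a compaction cleans them up.
    """
    estimated_bytes = peak_rows * avg_row_bytes * uncompacted_factor
    return max_filesize_bytes / estimated_bytes


# Hypothetical numbers: a 2 GB limit, one million queued rows of ~200 bytes each.
ratio = headroom(2 * 1024**3, peak_rows=1_000_000, avg_row_bytes=200)
print(round(ratio, 1))
```

A ratio comfortably above 1 suggests the table should stay under the split threshold. A ratio near or below 1 suggests the limit is too small for the anticipated queue size, or that compactions need to run more often.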
HTH
-Mike
-
Re: hbase table as a queue.
Stack 2011-07-19, 16:05
Set region size very large (in trunk you can actually disable splitting).
St.Ack
-
Re: hbase table as a queue.
Daniel Einspanjer 2011-07-19, 16:26
Cool. Filed a task for us to work on that: https://bugzilla.mozilla.org/show_bug.cgi?id=672527
-
Re: hbase table as a queue.
Gary Helmling 2011-07-19, 18:27
All excellent points here in terms of tuning! For the higher-level question about using a table as a queue, I just wanted to add in a link to the Lily guys' rowlog library, since it does exactly that: http://www.lilyproject.org/lily/about/playground/hbaserowlog.html