-
Re: Slow full-table scans
Lars H 2012-08-23, 01:38

Your puts are much faster because in the old case you're doing a Put per column, rather than per row.
That's the first thing I changed in your sample code (but since this was about scan performance I did not mention that).

I'm still interested in tracking this down if it is an actual performance problem.

-- Lars

Gurjeet Singh <[EMAIL PROTECTED]> wrote:

>Okay, I just ran extensive tests with my minimal test case and you are
>correct, the old and the new version do the scans in about the same
>amount of time (although puts are MUCH faster in the packed scheme).
>
>I guess my test case is too minimal. I will try to make a better
>testcase since in my production code, there is still a 500x
>difference.
>
>Gurjeet
>
>On Tue, Aug 21, 2012 at 10:00 PM, J Mohamed Zahoor <[EMAIL PROTECTED]> wrote:
>> Try a quick TestDFSIO to see if things are okay.
>>
>> ./zahoor
>>
>> On Wed, Aug 22, 2012 at 6:26 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>>
>>> It's possible that there is a bad or slower disk on Gurjeet's machine. I
>>> think details of iostat and cpu would clear things up.
>>>
>>> On Tue, Aug 21, 2012 at 4:33 PM, lars hofhansl <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>> > I get roughly the same (~1.8s) - 100 rows, 200.000 columns, segment size
>>> > 100
>>> >
>>> > ________________________________
>>> > From: Gurjeet Singh <[EMAIL PROTECTED]>
>>> > To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
>>> > Sent: Tuesday, August 21, 2012 11:31 AM
>>> > Subject: Re: Slow full-table scans
>>> >
>>> > How does that compare with the newScanTable on your build ?
>>> >
>>> > Gurjeet
>>> >
>>> > On Tue, Aug 21, 2012 at 11:18 AM, lars hofhansl <[EMAIL PROTECTED]>
>>> > wrote:
>>> > > Hmm... So I tried in HBase (current trunk).
>>> > > I created 100 rows with 200.000 columns each (using your oldMakeTable).
>>> > > The creation took a bit, but scanning finished in 1.8s. (HBase in pseudo
>>> > > distributed mode - with your oldScanTable).
>>> > >
>>> > > -- Lars
>>> > >
>>> > > ----- Original Message -----
>>> > > From: lars hofhansl <[EMAIL PROTECTED]>
>>> > > To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
>>> > > Cc:
>>> > > Sent: Monday, August 20, 2012 7:50 PM
>>> > > Subject: Re: Slow full-table scans
>>> > >
>>> > > Thanks Gurjeet,
>>> > >
>>> > > I'll (hopefully) have a look tomorrow.
>>> > >
>>> > > -- Lars
>>> > >
>>> > > ----- Original Message -----
>>> > > From: Gurjeet Singh <[EMAIL PROTECTED]>
>>> > > To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
>>> > > Cc:
>>> > > Sent: Monday, August 20, 2012 7:42 PM
>>> > > Subject: Re: Slow full-table scans
>>> > >
>>> > > Hi Lars,
>>> > >
>>> > > Here is a testcase:
>>> > >
>>> > > https://gist.github.com/3410948
>>> > >
>>> > > Benchmarking code:
>>> > >
>>> > > https://gist.github.com/3410952
>>> > >
>>> > > Try running it with numRows = 100, numCols = 200000, segmentSize = 1000
>>> > >
>>> > > Gurjeet
>>> > >
>>> > > On Thu, Aug 16, 2012 at 11:40 AM, Gurjeet Singh <[EMAIL PROTECTED]>
>>> > > wrote:
>>> > >> Sure - I can create a minimal testcase and send it along.
>>> > >>
>>> > >> Gurjeet
>>> > >>
>>> > >> On Thu, Aug 16, 2012 at 11:36 AM, lars hofhansl <[EMAIL PROTECTED]>
>>> > >> wrote:
>>> > >>> That's interesting.
>>> > >>> Could you share your old and new schema. I would like to track down
>>> > >>> the performance problems you saw.
>>> > >>> (If you had a demo program that populates your rows with 200.000
>>> > >>> columns in a way where you saw the performance issues, that'd be even
>>> > >>> better, but not necessary).
>>> > >>>
>>> > >>> -- Lars
>>> > >>>
>>> > >>> ________________________________
>>> > >>> From: Gurjeet Singh <[EMAIL PROTECTED]>
>>> > >>> To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
>>> > >>> Sent: Thursday, August 16, 2012 11:26 AM
>>> > >>> Subject: Re: Slow full-table scans
>>> > >>>
>>> > >>> Sorry for the delay guys.
>>> > >>>
>>> > >>> Here are a few results:
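To illustrate the point above about one Put per column versus one Put per row, here is a minimal sketch in plain Java. It does not use the HBase client at all: `Cell`, `opsPerColumn`, and `groupByRow` are hypothetical names invented for this illustration, and the model only counts write operations rather than measuring anything.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model that counts write operations; not a real HBase client. */
final class PutGrouping {
    /** Hypothetical stand-in for one cell: row key, column qualifier, value. */
    static final class Cell {
        final String row;
        final String column;
        final String value;

        Cell(String row, String column, String value) {
            this.row = row;
            this.column = column;
            this.value = value;
        }
    }

    /** Old scheme: every cell is sent as its own write operation. */
    static int opsPerColumn(List<Cell> cells) {
        return cells.size();
    }

    /** Per-row scheme: cells sharing a row key are collected into one operation. */
    static Map<String, List<Cell>> groupByRow(List<Cell> cells) {
        Map<String, List<Cell>> byRow = new LinkedHashMap<>();
        for (Cell c : cells) {
            byRow.computeIfAbsent(c.row, k -> new ArrayList<>()).add(c);
        }
        return byRow;
    }

    public static void main(String[] args) {
        List<Cell> cells = new ArrayList<>();
        for (int r = 0; r < 100; r++) {
            // Scaled down from the 200,000 columns in the thread to keep the demo small.
            for (int c = 0; c < 2000; c++) {
                cells.add(new Cell("row" + r, "col" + c, "v"));
            }
        }
        System.out.println("per-column operations: " + opsPerColumn(cells));
        System.out.println("per-row operations:    " + groupByRow(cells).size());
    }
}
```

With the scaled-down figures the two counts are 200000 and 100. At the thread's full size the per-column count would be 20,000,000 operations against the same 100 per-row operations, which is consistent with the much faster puts reported for the packed scheme.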
-
Re: Slow full-table scans
Gurjeet Singh 2012-08-23, 02:01

Lars,

Can you send me the modified ingestion code ? I am trying to track down the problem as well and will keep you posted.

Thanks for your help!
Gurjeet
-
Re: Slow full-table scans
lars hofhansl 2012-08-24, 17:27

Sent offline.
-
Slow full-table scans
Gurjeet Singh 2012-08-12, 06:04

Hi,

I am trying to read all the data out of an HBase table using a scan and it is extremely slow.

Here are some characteristics of the data:

1. The total table size is tiny (~200MB)
2. The table has ~100 rows and ~200,000 columns in a SINGLE family. Thus the size of each cell is ~10 bytes and the size of each row is ~2MB
3. Currently scanning the whole table takes ~400s (both in a distributed setting with 12 nodes or so and on a single node), thus 5sec/row
4. The row keys are unique 8 byte crypto hashes of sequential numbers
5. The scanner is set to fetch a FULL row at a time (scan.setBatch) and is set to fetch 100MB of data at a time (scan.setCaching)
6. Changing the caching size seems to have no effect on the total scan time at all
7. The column family is set up to keep a single version of the cells, no compression, and no block cache.

Am I missing something ? Is there a way to optimize this ?

I guess a general question I have is whether HBase is a good datastore for storing many medium sized (~50GB), dense datasets with lots of columns when a lot of the queries require full table scans ?

Thanks!
Gurjeet
-
Re: Slow full-table scans
lars hofhansl 2012-08-12, 22:24

Do you really have to retrieve all 200.000 columns each time?
Scan.setBatch(...) makes no difference?! (note that batching is different and separate from caching).

Also note that the scanner contract is to return sorted KVs, so a single scan cannot be parallelized across RegionServers (well not entirely true, it could be farmed off in parallel and then be presented to the client in the right order - but HBase is not doing that). That is why one vs 12 RSs makes no difference in this scenario.

In the 12 node case you'll see low CPU on all but one RS, and each RS will get its turn.

In your case this is scanning 20.000.000 KVs serially in 400s, that's 50000 KVs/s, which - depending on hardware - is not too bad for HBase (but not great either).

If you only ever expect to run a single query like this on top of your cluster (i.e. your concern is latency not throughput) you can do multiple RPCs in parallel, each for a sub portion of your key range. Together with batching you can start using values before everything is streamed back from the server.

-- Lars
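The throughput figures quoted above can be re-derived from the numbers in the original post. The following is a small worked calculation in plain Java; the class and method names are made up for this sketch, and the inputs are the rounded values from the thread (100 rows, 200,000 columns, 400 s, 200 MB), so the results are only as precise as those estimates.

```java
/** Worked arithmetic for the scan figures quoted in the thread. */
final class ScanThroughput {
    /** Total number of cells (KeyValues) in a dense table. */
    static long totalCells(long rows, long columnsPerRow) {
        return rows * columnsPerRow;
    }

    /** Generic rate: amount processed per second. */
    static double perSecond(double amount, double seconds) {
        return amount / seconds;
    }

    public static void main(String[] args) {
        long rows = 100;           // ~100 rows
        long columns = 200_000;    // ~200,000 columns per row
        double seconds = 400.0;    // ~400 s for a full scan
        double tableMb = 200.0;    // ~200 MB table size

        long cells = totalCells(rows, columns);
        System.out.println("cells scanned: " + cells);
        System.out.println("KVs per second: " + perSecond(cells, seconds));
        System.out.println("MB per second: " + perSecond(tableMb, seconds));
        System.out.println("seconds per row: " + seconds / rows);
    }
}
```

With these rounded inputs the scan covers 20,000,000 cells at 50,000 KVs/s, which is only about 0.5 MB/s of payload and roughly 4 s per row (the post's "5sec/row" is presumably a rounder estimate). The low byte rate suggests the cost is dominated by per-cell overhead rather than raw data volume, though that is an inference from the arithmetic and not something measured here.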
-
Re: Slow full-table scans
Gurjeet Singh 2012-08-12, 22:51

Hi Lars,

Yes, I need to retrieve all the values for a row at a time. That said, I did experiment with different batch sizes and that made no difference whatsoever. (Caching, on the other hand, did make some difference: ~2-3% faster for a larger cache.)

I see your point about scanners returning sorted KVs. In my application, I simply don't care whether the results are sorted or not and I know the key range in advance. This is a great suggestion. Let me try replacing a single scan with a list of GETs or a bunch of SCANs with different start/stop rows.

Thanks!
Gurjeet
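One way to pick the start/stop rows for several independent scans, given that the row keys are 8-byte hashes and therefore spread roughly evenly over the key space, is to cut that space into equal slices. The sketch below does this in plain Java with no HBase dependency. `KeyRangeSplitter`, `toKey`, and `startKeys` are hypothetical names, and the even split assumes big-endian keys compared as unsigned bytes, which matches how the thread describes the keys but is still an assumption about the actual table.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

/** Splits the space of 8-byte row keys into equal contiguous ranges. */
final class KeyRangeSplitter {
    private static final int KEY_BYTES = 8;
    private static final BigInteger KEY_SPACE = BigInteger.ONE.shiftLeft(8 * KEY_BYTES);

    /** Encodes a value in [0, 2^64) as a big-endian 8-byte key. */
    static byte[] toKey(BigInteger value) {
        byte[] raw = value.toByteArray(); // may carry a sign byte or be shorter than 8 bytes
        byte[] key = new byte[KEY_BYTES];
        int n = Math.min(raw.length, KEY_BYTES);
        System.arraycopy(raw, raw.length - n, key, KEY_BYTES - n, n);
        return key;
    }

    /**
     * Returns the start keys of {@code parts} ranges covering the whole key space.
     * Range i runs from start key i (inclusive) to start key i+1 (exclusive);
     * the last range has no upper bound.
     */
    static List<byte[]> startKeys(int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be >= 1");
        }
        List<byte[]> starts = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            BigInteger start = KEY_SPACE
                    .multiply(BigInteger.valueOf(i))
                    .divide(BigInteger.valueOf(parts));
            starts.add(toKey(start));
        }
        return starts;
    }

    public static void main(String[] args) {
        for (byte[] key : startKeys(4)) {
            StringBuilder hex = new StringBuilder();
            for (byte b : key) {
                hex.append(String.format("%02x", b));
            }
            System.out.println(hex);
        }
    }
}
```

Each consecutive pair of start keys would then serve as the start row and stop row of one scan. Splitting on the table's actual region boundaries, as suggested in the next reply, is likely the better choice on a real cluster because it keeps each scan on a single region.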
-
Re: Slow full-table scans
lars hofhansl 2012-08-12, 23:00

You can use HTable.{getStartEndKeys|getEndKeys|getStartKeys} to get the current region demarcations for your table.
If you wanted to group threads by RegionServer (which you should) you get that information via HTable.getRegionLocation{s}

-- Lars
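The grouping step suggested above can be sketched independently of the client library. In the plain-Java example below, `Region` is a hypothetical stand-in holding a key range and the name of the server hosting it, and `byServer` is an invented helper; in a real client the list would be filled from the HTable calls named in the message, whose exact return types depend on the HBase version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Groups region key ranges by the server that hosts them. */
final class RegionGrouping {
    /** Hypothetical stand-in for one region: its key range and hosting server. */
    static final class Region {
        final String startKey;
        final String endKey;
        final String server;

        Region(String startKey, String endKey, String server) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.server = server;
        }
    }

    /** Returns a map from server name to the regions located on that server. */
    static Map<String, List<Region>> byServer(List<Region> regions) {
        Map<String, List<Region>> grouped = new TreeMap<>();
        for (Region r : regions) {
            grouped.computeIfAbsent(r.server, s -> new ArrayList<>()).add(r);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Region> regions = new ArrayList<>();
        regions.add(new Region("", "40", "rs1"));
        regions.add(new Region("40", "80", "rs2"));
        regions.add(new Region("80", "c0", "rs1"));
        regions.add(new Region("c0", "", "rs3"));

        for (Map.Entry<String, List<Region>> e : byServer(regions).entrySet()) {
            System.out.println(e.getKey() + " hosts " + e.getValue().size() + " region(s)");
        }
    }
}
```

A client could then start one worker per map entry, so that each RegionServer is scanned by its own thread instead of several threads competing for the same server.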
-
Re: Slow full-table scans
Gurjeet Singh 2012-08-13, 05:10

Thanks Lars!

One final question : is it advisable to issue multiple threads against a single HTable instance, like so:

HTable table = ...
for (i = 0; i < 10; i++) {
  new ScanThread(table, startRow, endRow, rowProcessor).start();
}

....

// Extends Thread so that start() above is valid; the constructor
// that stores table, startRow, endRow and rowProcessor is omitted.
class ScanThread extends Thread {
  public void run() {
    Scan scan = new Scan();
    scan.setStartRow(startRow);
    scan.setStopRow(endRow);  // the Scan method for the end of the range is setStopRow
    ResultScanner scanner = table.getScanner(scan);
    for (Result result : scanner) {
      rowProcessor.process(result);
    }
  }
}
-
Re: Slow full-table scans
Stack 2012-08-13, 07:27

On Mon, Aug 13, 2012 at 6:10 AM, Gurjeet Singh <[EMAIL PROTECTED]> wrote:
> Thanks Lars!
>
> One final question : is it advisable to issue multiple threads
> against a single HTable instance, like so:
>
> HTable table = ...
> for (i = 0; i < 10; i++) {
>   new ScanThread(table, startRow, endRow, rowProcessor).start();
> }
>

Make an HTable per thread. See the class comment:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html

St.Ack
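The "one handle per thread" advice can be sketched as a general pattern without the HBase client. In the plain-Java example below, `FakeTable` is a hypothetical stand-in for a client handle that should not be shared between threads, and `scanInParallel` is an invented helper; the only point is that each worker creates and closes its own handle inside its task instead of receiving a shared one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Runs range scans in parallel, creating one handle per worker task. */
final class PerThreadHandle {
    /** Hypothetical stand-in for a client handle that is not safe to share. */
    static final class FakeTable implements AutoCloseable {
        static final AtomicInteger CREATED = new AtomicInteger();

        FakeTable() {
            CREATED.incrementAndGet();
        }

        /** Pretends to scan rows [from, to) and returns how many rows it saw. */
        int scan(int from, int to) {
            return to - from;
        }

        @Override
        public void close() {
            // A real handle would release its resources here.
        }
    }

    /** Splits totalRows across workers; each task opens its own FakeTable. */
    static int scanInParallel(int totalRows, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int from = totalRows * w / workers;
                final int to = totalRows * (w + 1) / workers;
                results.add(pool.submit(() -> {
                    try (FakeTable table = new FakeTable()) { // never shared across tasks
                        return table.scan(from, to);
                    }
                }));
            }
            int rows = 0;
            for (Future<Integer> f : results) {
                rows += f.get();
            }
            return rows;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel scan failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("rows scanned: " + scanInParallel(100, 4));
        System.out.println("handles created: " + FakeTable.CREATED.get());
    }
}
```

Creating the handle inside the task keeps its lifetime tied to the worker that uses it. Whether creating a handle per task is cheap enough in a real deployment depends on the client library and is not something this sketch can show.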
-
Re: Slow full-table scans
Gurjeet Singh 2012-08-13, 07:51

Thanks a lot!
-
Re: Slow full-table scans
Gurjeet Singh 2012-08-13, 22:12

Okay, I just ran this experiment. It did speed things up, but only by 4%. This all still seems awfully slow to me - does someone have another suggestion ?

Thanks in advance!
Gurjeet
-
Re: Slow full-table scans
lars hofhansl 2012-08-14, 00:30

Only 4% in the 12 node cluster case? I'd guess you're not using more cores than before (i.e. the parallelizing on the client is bad), or you're IO bound (which is unlikely).
Are all your regionservers busy in terms of CPU?

-- Lars
-
Re: Slow full-table scansGurjeet Singh 2012-08-14, 01:10
I am beginning to think that this is a configuration issue on my
cluster. Do the following configuration files seem sane ? hbase-env.sh https://gist.github.com/3345338 hbase-site.xml https://gist.github.com/3345356 Gurjeet On Mon, Aug 13, 2012 at 5:30 PM, lars hofhansl <[EMAIL PROTECTED]> wrote: > Only 4% in the 12 node cluster case? I'd guess you're using not more cores then before (i.e. the parallelizing on the client is bad), or you're IO bound (which is unlikely). > Are all your regionserver busy in terms of CPU? > > > -- Lars > > > > ----- Original Message ----- > From: Gurjeet Singh <[EMAIL PROTECTED]> > To: [EMAIL PROTECTED] > Cc: > Sent: Monday, August 13, 2012 3:12 PM > Subject: Re: Slow full-table scans > > Okay, I just ran this experiment. It did speed things up, but only by > 4%. This all still seems awfully slow to me - does someone have > another suggestion ? > > Thanks in advance! > Gurjeet > > On Mon, Aug 13, 2012 at 12:51 AM, Gurjeet Singh <[EMAIL PROTECTED]> wrote: >> Thanks a lot! >> >> On Mon, Aug 13, 2012 at 12:27 AM, Stack <[EMAIL PROTECTED]> wrote: >>> On Mon, Aug 13, 2012 at 6:10 AM, Gurjeet Singh <[EMAIL PROTECTED]> wrote: >>>> Thanks Lars! >>>> >>>> One final question : is it advisable to issue multiple threads >>>> against a single HTable instance, like so: >>>> >>>> HTable table = ... >>>> for (i = 0; i < 10; i++) { >>>> new ScanThread(table, startRow, endRow, rowProcessor).start(); >>>> } >>>> >>> >>> Make an HTable per thread. See the class comment: >>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html >>> >>> St.Ack > +

Re: Slow full-table scans
Stack 2012-08-15, 22:13

On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh <[EMAIL PROTECTED]> wrote:
> I am beginning to think that this is a configuration issue on my
> cluster. Do the following configuration files seem sane ?
>
> hbase-env.sh https://gist.github.com/3345338

Nothing wrong w/ this (Remove the -ea, you don't want asserts in production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).

> hbase-site.xml https://gist.github.com/3345356

This is all defaults effectively. I don't see any of the configs recommended by the performance section of the reference guide and/or those suggested by the GBIF blog.

You don't answer LarsH's query about where you see the 4% difference.

How many regions in your table? What does the HBase Master UI look like when this scan is running?

St.Ack

Re: Slow full-table scans
lars hofhansl 2012-08-16, 00:16

Yeah... It looks OK.
Maybe 2G of heap is a bit low when dealing with 200,000-column rows.

If you can, I'd like to know how busy your regionservers are during these operations. That would be an indication of whether the parallelization is good or not.

-- Lars

Re: Slow full-table scans
Gurjeet Singh 2012-08-16, 18:26

Sorry for the delay guys.

Here are a few results:

1. Regions in the table = 11
2. The region servers don't appear to be very busy with the query, ~5% CPU (but with parallelization, they are all busy)

Finally, I changed the format of my data, such that each cell in HBase contains a chunk of a row instead of the single value it had. So, stuffing each HBase cell with 500 columns of a row gave me a performance boost of 1000x. It seems that the underlying issue was IO overhead per byte of actual data stored.
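
A minimal sketch of the packing idea described above, for readers who want to see its shape. It is not the code from the thread: the value type (double), the class name SegmentPacking, and the method names are assumptions made for illustration, and the HBase Put/Get calls are left out so that the block runs on plain Java. Each returned byte[] would be stored as the value of one cell in place of 500 single-value cells.

```java
import java.nio.ByteBuffer;

public class SegmentPacking {

    // Pack `length` values starting at `start` into one byte[] (one cell value).
    // Assumption: values are doubles; the thread does not say what type was stored.
    public static byte[] packSegment(double[] values, int start, int length) {
        ByteBuffer buf = ByteBuffer.allocate(length * Double.BYTES);
        for (int i = start; i < start + length; i++) {
            buf.putDouble(values[i]);
        }
        return buf.array();
    }

    // Reverse of packSegment: decode one cell value back into its doubles.
    public static double[] unpackSegment(byte[] cell) {
        ByteBuffer buf = ByteBuffer.wrap(cell);
        double[] out = new double[cell.length / Double.BYTES];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getDouble();
        }
        return out;
    }

    // Split one logical row into segments of at most `segmentSize` values each.
    public static byte[][] packRow(double[] row, int segmentSize) {
        int segments = (row.length + segmentSize - 1) / segmentSize; // ceiling division
        byte[][] cells = new byte[segments][];
        for (int s = 0; s < segments; s++) {
            int start = s * segmentSize;
            int length = Math.min(segmentSize, row.length - start);
            cells[s] = packSegment(row, start, length);
        }
        return cells;
    }

    public static void main(String[] args) {
        double[] row = new double[200_000];
        for (int i = 0; i < row.length; i++) {
            row[i] = i * 0.5;
        }
        byte[][] cells = packRow(row, 500);
        System.out.println(cells.length + " cells instead of " + row.length);
    }
}
```

The trade-off of this layout is that a single value can no longer be read or updated on its own; the whole segment has to be fetched, decoded, and rewritten.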

Re: Slow full-table scans
lars hofhansl 2012-08-16, 18:36

That's interesting.
Could you share your old and new schema? I would like to track down the performance problems you saw.
(If you had a demo program that populates your rows with 200,000 columns in a way where you saw the performance issues, that'd be even better, but not necessary.)

-- Lars

Re: Slow full-table scans
Gurjeet Singh 2012-08-16, 18:40

Sure - I can create a minimal testcase and send it along.

Gurjeet

Re: Slow full-table scans
Gurjeet Singh 2012-08-21, 02:42

Hi Lars,

Here is a testcase:

https://gist.github.com/3410948

Benchmarking code:

https://gist.github.com/3410952

Try running it with numRows = 100, numCols = 200000, segmentSize = 1000

Gurjeet
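
For a sense of scale, the suggested parameters can be turned into cell counts with a few lines of arithmetic. The sketch below is an illustration, not part of the linked gists. The 20-byte fixed framing per cell follows the HBase KeyValue layout as I understand it (key length, value length, row length, family length, timestamp, key type); the row key, family, and qualifier lengths are hypothetical values picked for the example. Whether this overhead explains the slowdown seen in production is not settled in the thread.

```java
public class CellCountEstimate {

    // Fixed framing per KeyValue: key length (4) + value length (4) + row length (2)
    // + family length (1) + timestamp (8) + key type (1) = 20 bytes.
    static final int FIXED_OVERHEAD_BYTES = 4 + 4 + 2 + 1 + 8 + 1;

    // Number of cells needed when each cell holds `segmentSize` column values.
    public static long cellCount(long numRows, long numCols, long segmentSize) {
        long cellsPerRow = (numCols + segmentSize - 1) / segmentSize; // ceiling division
        return numRows * cellsPerRow;
    }

    // Bytes spent on key framing alone (everything except the stored values).
    public static long framingBytes(long cells, int rowLen, int familyLen, int qualifierLen) {
        return cells * (FIXED_OVERHEAD_BYTES + rowLen + familyLen + qualifierLen);
    }

    public static void main(String[] args) {
        long numRows = 100, numCols = 200_000, segmentSize = 1_000;

        // Hypothetical key sizes in bytes; adjust to the real schema.
        int rowLen = 8, familyLen = 1, qualifierLen = 4;

        long oldCells = cellCount(numRows, numCols, 1);           // one value per cell
        long newCells = cellCount(numRows, numCols, segmentSize); // packed segments

        System.out.println("one value per cell: " + oldCells + " cells, "
                + framingBytes(oldCells, rowLen, familyLen, qualifierLen) + " framing bytes");
        System.out.println("packed segments:    " + newCells + " cells, "
                + framingBytes(newCells, rowLen, familyLen, qualifierLen) + " framing bytes");
    }
}
```

With these parameters the old layout stores 20,000,000 cells and the packed layout stores 20,000, so the per-cell framing is paid a thousand times less often.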

Re: Slow full-table scans
lars hofhansl 2012-08-21, 02:50

Thanks Gurjeet,

I'll (hopefully) have a look tomorrow.

-- Lars

Re: Slow full-table scans
lars hofhansl 2012-08-21, 18:18

Hmm... So I tried in HBase (current trunk).
I created 100 rows with 200,000 columns each (using your oldMakeTable). The creation took a bit, but scanning finished in 1.8s. (HBase in pseudo-distributed mode, with your oldScanTable.)

-- Lars

Re: Slow full-table scans
Gurjeet Singh 2012-08-21, 18:31

How does that compare with the newScanTable on your build?

Gurjeet

Re: Slow full-table scans
lars hofhansl 2012-08-21, 23:33

I get roughly the same (~1.8s) - 100 rows, 200,000 columns, segment size 100

Re: Slow full-table scans
Mohit Anchlia 2012-08-22, 00:56

It's possible that there is a bad or slower disk on Gurjeet's machine. I think details of iostat and CPU would clear things up.

Re: Slow full-table scans
J Mohamed Zahoor 2012-08-22, 05:00

Try a quick TestDFSIO to see if things are okay.

./zahoor

On Wed, Aug 22, 2012 at 6:26 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> It's possible that there is a bad or slower disk on Gurjeet's machine. I
> think details of iostat and cpu would clear things up.

Re: Slow full-table scans
Gurjeet Singh 2012-08-22, 16:42

Okay, I just ran extensive tests with my minimal test case and you are
correct: the old and the new versions do the scans in about the same
amount of time (although puts are MUCH faster in the packed scheme).

I guess my test case is too minimal. I will try to make a better
test case, since in my production code there is still a 500x
difference.

Gurjeet

On Tue, Aug 21, 2012 at 10:00 PM, J Mohamed Zahoor <[EMAIL PROTECTED]> wrote:
> Try a quick TestDFSIO to see if things are okay.
>
> ./zahoor
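A rough illustration of the "packed scheme" referred to in this message, where each cell holds a chunk of a row instead of a single value. This is only a sketch under assumptions: the class `PackedRowSketch`, its `pack` and `unpack` helpers, and the 8-byte `long` value encoding are hypothetical and are not taken from the gists posted in this thread; only the idea of a segment size comes from the discussion.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class PackedRowSketch {

    /** Packs one logical row into cells holding at most segmentSize values each. */
    static List<byte[]> pack(long[] rowValues, int segmentSize) {
        List<byte[]> cells = new ArrayList<>();
        for (int start = 0; start < rowValues.length; start += segmentSize) {
            int end = Math.min(start + segmentSize, rowValues.length);
            // Assumption: every value is encoded as a fixed-width 8-byte long.
            ByteBuffer buf = ByteBuffer.allocate((end - start) * Long.BYTES);
            for (int i = start; i < end; i++) {
                buf.putLong(rowValues[i]);
            }
            cells.add(buf.array());
        }
        return cells;
    }

    /** Reverses pack(): concatenates the values of all cells in order. */
    static long[] unpack(List<byte[]> cells) {
        int total = 0;
        for (byte[] cell : cells) {
            total += cell.length / Long.BYTES;
        }
        long[] out = new long[total];
        int pos = 0;
        for (byte[] cell : cells) {
            ByteBuffer buf = ByteBuffer.wrap(cell);
            while (buf.hasRemaining()) {
                out[pos++] = buf.getLong();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[] row = new long[200_000];
        for (int i = 0; i < row.length; i++) {
            row[i] = i;
        }
        List<byte[]> cells = pack(row, 1000);
        System.out.println("cells per row: " + cells.size()); // 200
    }
}
```

With numCols = 200000 and segmentSize = 1000, the values suggested for the benchmark in this thread, each row goes from 200,000 cells to 200, so far fewer cells have to be written per row. Whether that also speeds up scans is exactly what the thread leaves open.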

Re: Slow full-table scans
Mohammad Tariq 2012-08-12, 22:49

Hello experts,

Would it be feasible to create a separate thread for each region? I
mean we can determine the start and end key of each region and issue a
scan for each region in parallel.

Regards,
Mohammad Tariq

On Mon, Aug 13, 2012 at 3:54 AM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> Do you really have to retrieve all 200.000 each time?
> Scan.setBatch(...) makes no difference?! (note that batching is different
> and separate from caching).
>
> Also note that the scanner contract is to return sorted KVs, so a single
> scan cannot be parallelized across RegionServers (well not entirely true,
> it could be farmed off in parallel and then be presented to the client in
> the right order - but HBase is not doing that). That is why one vs 12 RSs
> makes no difference in this scenario.
>
> In the 12 node case you'll see low CPU on all but one RS, and each RS will
> get its turn.
>
> In your case this is scanning 20.000.000 KVs serially in 400s, that's
> 50000 KVs/s, which - depending on hardware - is not too bad for HBase (but
> not great either).
>
> If you only ever expect to run a single query like this on top your
> cluster (i.e. your concern is latency not throughput) you can do multiple
> RPCs in parallel for a sub portion of your key range. Together with
> batching can start using value before all is streamed back from the server.
>
> -- Lars
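The per-region idea in this message could be sketched roughly as below. This is not HBase client code: a sorted in-memory `TreeMap` stands in for the table, and `ParallelRangeScanSketch`, `Range`, `scanRange` and `scanAll` are hypothetical names used only for illustration. What it tries to show is one task per key range, with the results stitched together in range order so the caller still sees sorted rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelRangeScanSketch {

    /** Half-open key range [start, stop); an empty string means "unbounded". */
    static final class Range {
        final String start;
        final String stop;

        Range(String start, String stop) {
            this.start = start;
            this.stop = stop;
        }
    }

    /** Stand-in for a scan of one region: returns the row keys in the range, in order. */
    static List<String> scanRange(NavigableMap<String, byte[]> table, Range r) {
        NavigableMap<String, byte[]> view = table;
        if (!r.start.isEmpty()) {
            view = view.tailMap(r.start, true);
        }
        if (!r.stop.isEmpty()) {
            view = view.headMap(r.stop, false);
        }
        return new ArrayList<>(view.keySet());
    }

    /** Runs one task per range and concatenates the results in range order. */
    static List<String> scanAll(NavigableMap<String, byte[]> table, List<Range> ranges, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (Range r : ranges) {
                futures.add(pool.submit(() -> scanRange(table, r)));
            }
            List<String> rows = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                rows.addAll(f.get()); // waiting in submission order keeps rows sorted
            }
            return rows;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("scan interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("range scan failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, byte[]> table = new TreeMap<>();
        for (int i = 0; i < 100; i++) {
            table.put(String.format("row-%03d", i), new byte[0]);
        }
        List<Range> ranges = Arrays.asList(
                new Range("", "row-025"),
                new Range("row-025", "row-050"),
                new Range("row-050", "row-075"),
                new Range("row-075", ""));
        System.out.println(scanAll(table, ranges, 4).size()); // 100
    }
}
```

Against a real cluster each task would presumably open its own scanner bounded by a region's start and end key; that part is left out here because it depends on the HBase version in use.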

Re: Slow full-table scans
Gurjeet Singh 2012-08-12, 22:52

Hi Mohammad,

This is a great idea. Is there an API call to determine the start/end
key for each region?

Thanks,
Gurjeet

On Sun, Aug 12, 2012 at 3:49 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
> Hello experts,
>
> Would it be feasible to create a separate thread for each region??I
> mean we can determine start and end key of each region and issue a scan for
> each region in parallel.
>
> Regards,
> Mohammad Tariq

Re: Slow full-table scans
Mohammad Tariq 2012-08-12, 23:00

The getStartKey and getEndKey methods provided by the HRegionInfo class
can be used for that purpose.
Also, please make sure that no HTable instance is left open once you are
done with reads.

Regards,
Mohammad Tariq

On Mon, Aug 13, 2012 at 4:22 AM, Gurjeet Singh <[EMAIL PROTECTED]> wrote:
> Hi Mohammad,
>
> This is a great idea. Is there a API call to determine the start/end
> key for each region ?
>
> Thanks,
> Gurjeet
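To make the start/end key idea a little more concrete, here is a small sketch of how region boundaries map to scan ranges. It assumes the usual HBase convention that the first region has an empty start key and the last region has an empty end key. The class `RegionRangesSketch` and the method `rangesFromSplits` are hypothetical, and the split keys are hard-coded here; on a real cluster the boundaries would come from HRegionInfo as described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RegionRangesSketch {

    /** Empty key, used for "start of table" and "end of table". */
    static final byte[] EMPTY = new byte[0];

    /**
     * Turns sorted split keys into {startKey, endKey} pairs.
     * n split keys describe n + 1 contiguous regions.
     */
    static List<byte[][]> rangesFromSplits(List<byte[]> sortedSplitKeys) {
        List<byte[][]> ranges = new ArrayList<>();
        byte[] start = EMPTY;
        for (byte[] split : sortedSplitKeys) {
            ranges.add(new byte[][] {start, split});
            start = split; // the next region starts where this one ends
        }
        ranges.add(new byte[][] {start, EMPTY});
        return ranges;
    }

    public static void main(String[] args) {
        // Hypothetical split points for illustration only.
        List<byte[]> splits = Arrays.asList(
                new byte[] {0x40}, new byte[] {(byte) 0x80}, new byte[] {(byte) 0xC0});
        for (byte[][] range : rangesFromSplits(splits)) {
            System.out.println(Arrays.toString(range[0]) + " -> " + Arrays.toString(range[1]));
        }
    }
}
```

Each pair can then serve as the bounds of one scan, with the end key treated as exclusive so that adjacent ranges do not overlap.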

Re: Slow full-table scans
Jacques 2012-08-12, 23:13

I think the first question is where is the time spent. Does your analysis
show that all the time spent is on the regionservers or is a portion of the
bottleneck on the client side?

Jacques

On Sun, Aug 12, 2012 at 4:00 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
> Methods getStartKey and getEndKey provided by HRegionInfo class can used
> for that purpose.
> Also, please make sure, any HTable instance is not left opened once you are
> are done with reads.
> Regards,
> Mohammad Tariq

Re: Slow full-table scans
Gurjeet Singh 2012-08-13, 04:41

It seems like the client code just sits idle, waiting for data from
the regionservers.

Gurjeet

On Sun, Aug 12, 2012 at 4:13 PM, Jacques <[EMAIL PROTECTED]> wrote:
> I think the first question is where is the time spent. Does your analysis
> show that all the time spent is on the regionservers or is a portion of the
> bottleneck on the client side?
>
> Jacques

Re: Slow full-table scans
Mohammad Tariq 2012-08-12, 23:34

Also, give it a shot using HTablePools and see if it makes any significant
difference.

Regards,
Mohammad Tariq

On Mon, Aug 13, 2012 at 4:43 AM, Jacques <[EMAIL PROTECTED]> wrote:
> I think the first question is where is the time spent. Does your analysis
> show that all the time spent is on the regionservers or is a portion of the
> bottleneck on the client side?
>
> Jacques

Re: Slow full-table scans
Jacques 2012-08-12, 22:59

HTable.getRegionLocations()

I didn't realize the KeyValue serializations/deserialization happened on a
separate thread in the hbase client code.

J

On Sun, Aug 12, 2012 at 3:52 PM, Gurjeet Singh <[EMAIL PROTECTED]> wrote:
> Hi Mohammad,
>
> This is a great idea. Is there a API call to determine the start/end
> key for each region ?
>
> Thanks,
> Gurjeet
Re: Slow full-table scans
Stack 2012-08-12, 08:17
On Sun, Aug 12, 2012 at 7:04 AM, Gurjeet Singh <[EMAIL PROTECTED]> wrote:
> Am I missing something ? Is there a way to optimize this ?
>

You've checked out the perf section of the refguide?

http://hbase.apache.org/book.html#performance

And have you read the postings by the GBIF lads starting with this one:
http://gbif.blogspot.ie/2012/02/performance-evaluation-of-hbase.html

The boys have done a few blog postings on what they did to get HBase scans working fast enough for their needs. It's good reading because they tell it like a detective story: figuring out where the frictions were, how they measured them, and how they then undid them, one by one.

> I guess a general question I have is whether HBase is good datastore
> for storing many medium sized (~50GB), dense datasets with lots of
> columns when a lot of the queries require full table scans ?
>

Yes.

St.Ack
Re: Slow full-table scans
Gurjeet Singh 2012-08-12, 12:32
Thanks for the reply Stack. My comments are inline.
> You've checked out the perf section of the refguide?
>
> http://hbase.apache.org/book.html#performance

Yes. HBase has 8GB RAM both on my cluster as well as my dev machine. Both configurations are backed by SSDs and HBase options are set to

HBASE_OPTS="-ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"

The data that I am dealing with is static. The table never changes after the first load.

Even some of my GET requests are taking up to a full 60 seconds when the row sizes reach ~10MB. In general, taking 5 seconds to fetch a single row (~1MB) seems extremely high to me.

Thanks again for your help.
Re: Slow full-table scans
Ted Yu 2012-08-12, 14:11
Gurjeet:
Can you tell us which HBase version you are using ?

Thanks

On Sun, Aug 12, 2012 at 5:32 AM, Gurjeet Singh <[EMAIL PROTECTED]> wrote:

> Thanks for the reply Stack. My comments are inline.
>
> > You've checked out the perf section of the refguide?
> >
> > http://hbase.apache.org/book.html#performance
>
> Yes. HBase has 8GB RAM both on my cluster as well as my dev machine.
> Both configurations are backed by SSDs and Hbase options are set to
>
> HBASE_OPTS="-ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
>
> The data that I am dealing with is static. The table never changes
> after the first load.
>
> Even some of my GET requests are taking up to a full 60 seconds when
> the row sizes reach ~10MB. In general, taking 5 seconds to fetch a
> single row (~1MB) seems a extremely high to me.
>
> Thanks again for your help.
Re: Slow full-table scans
Gurjeet Singh 2012-08-12, 14:23
Hi Ted,
Yes, I am using the Cloudera distribution 3.

Gurjeet

Sent from my iPad

On Aug 12, 2012, at 7:11 AM, Ted Yu <[EMAIL PROTECTED]> wrote:

> Gurjeet:
> Can you tell us which HBase version you are using ?
>
> Thanks
>
> On Sun, Aug 12, 2012 at 5:32 AM, Gurjeet Singh <[EMAIL PROTECTED]> wrote:
>
>> Thanks for the reply Stack. My comments are inline.
>>
>>> You've checked out the perf section of the refguide?
>>>
>>> http://hbase.apache.org/book.html#performance
>>
>> Yes. HBase has 8GB RAM both on my cluster as well as my dev machine.
>> Both configurations are backed by SSDs and Hbase options are set to
>>
>> HBASE_OPTS="-ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
>>
>> The data that I am dealing with is static. The table never changes
>> after the first load.
>>
>> Even some of my GET requests are taking up to a full 60 seconds when
>> the row sizes reach ~10MB. In general, taking 5 seconds to fetch a
>> single row (~1MB) seems a extremely high to me.
>>
>> Thanks again for your help.
Re: Slow full-table scans
Jacques 2012-08-12, 21:05
Something to consider is that HBase stores and retrieves the row key (8 bytes in your case) + timestamp (8 bytes) + column qualifier (?) for every single value. The schemaless nature of HBase generally means that this data has to be stored for each row (certain kinds of newer block level compression can make this less). So depending on your column qualifiers, you're going to be looking at potentially a huge amount of overhead when you're dealing with 200,000 cells in a single row. I also wonder whether you're dealing with a large amount of overhead simply on the serialization/deserialization/instantiation side if you're pulling back that many values.

I'm not sure how many people are using that many cells in a single row and trying to read or write them all at once.

Others may have more thoughts.

Jacques

On Sun, Aug 12, 2012 at 7:23 AM, Gurjeet Singh <[EMAIL PROTECTED]> wrote:

> Hi Ted,
>
> Yes, I am using the cloudera distribution 3.
>
> Gurjeet
>
> Sent from my iPad
>
> On Aug 12, 2012, at 7:11 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
> > Gurjeet:
> > Can you tell us which HBase version you are using ?
> >
> > Thanks
> >
> > On Sun, Aug 12, 2012 at 5:32 AM, Gurjeet Singh <[EMAIL PROTECTED]> wrote:
> >
> >> Thanks for the reply Stack. My comments are inline.
> >>
> >>> You've checked out the perf section of the refguide?
> >>>
> >>> http://hbase.apache.org/book.html#performance
> >>
> >> Yes. HBase has 8GB RAM both on my cluster as well as my dev machine.
> >> Both configurations are backed by SSDs and Hbase options are set to
> >>
> >> HBASE_OPTS="-ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
> >>
> >> The data that I am dealing with is static. The table never changes
> >> after the first load.
> >>
> >> Even some of my GET requests are taking up to a full 60 seconds when
> >> the row sizes reach ~10MB. In general, taking 5 seconds to fetch a
> >> single row (~1MB) seems a extremely high to me.
> >>
> >> Thanks again for your help.
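To put a rough number on the per-cell overhead described above, here is a back-of-the-envelope sketch. It assumes the classic HBase KeyValue layout as I understand it (4-byte key length, 4-byte value length, 2-byte row length, row, 1-byte family length, family, qualifier, 8-byte timestamp, 1-byte type, value). The 8-byte row key and ~10-byte value come from the thread; the 1-byte family name and 4-byte qualifier are assumptions for illustration only, since the thread does not give them, and `keyvalue_size` is a hypothetical helper, not an HBase API.

```python
def keyvalue_size(row_len, family_len, qualifier_len, value_len):
    """Approximate serialized size in bytes of one KeyValue."""
    key_len = (
        2 + row_len         # row length prefix + row key
        + 1 + family_len    # family length prefix + family name
        + qualifier_len     # column qualifier
        + 8                 # timestamp
        + 1                 # key type (Put, Delete, ...)
    )
    return 4 + 4 + key_len + value_len  # key-length int + value-length int


ROW_LEN, VALUE_LEN = 8, 10          # from the thread
FAMILY_LEN, QUALIFIER_LEN = 1, 4    # assumed for illustration
CELLS_PER_ROW = 200_000             # from the thread

cell_bytes = keyvalue_size(ROW_LEN, FAMILY_LEN, QUALIFIER_LEN, VALUE_LEN)
overhead_factor = cell_bytes / VALUE_LEN
row_bytes = CELLS_PER_ROW * cell_bytes

print(cell_bytes, overhead_factor, row_bytes)
```

Under these assumptions each 10-byte value costs about 43 bytes, so the raw data grows by a factor of roughly four before any compression or block encoding; longer qualifiers or family names push that higher.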
Re: Slow full-table scans
Gurjeet Singh 2012-08-12, 22:46
Hi Jacques,
I did consider that. So, this increases the on-disk size of my data by 3-4x (= 600-800MB). That still does not explain why reading one row (~4MB with overhead) takes 5 sec.

About serialization/deserialization on the client side: it happens on a different thread out of a buffer, and most of the time that thread is just idling.

Gurjeet

On Sun, Aug 12, 2012 at 2:05 PM, Jacques <[EMAIL PROTECTED]> wrote:
> Something to consider is that HBase stores and retrieves the row key (8
> bytes in your case) + timestamp (8 bytes) + column qualifier (?) for every
> single value. The schemaless nature of HBase generally means that this
> data has to be stored for each row (certain kinds of newer block level
> compression can make this less). So depending on your column qualifiers,
> you're going to be looking at potentially a huge amount of overhead when
> you're dealing with 200,000 cells in a single row. I also wonder whether
> you're dealing with a large amount of overhead simply on the
> serialization/deserialization/instantiation side if you're pulling back
> that many values.
>
> I'm not sure how many people are using that many cells in a single row and
> trying to read or write them all at once.
>
> Other's may have more thoughts.
>
> Jacques
>
> On Sun, Aug 12, 2012 at 7:23 AM, Gurjeet Singh <[EMAIL PROTECTED]> wrote:
>
>> Hi Ted,
>>
>> Yes, I am using the cloudera distribution 3.
>>
>> Gurjeet
>>
>> Sent from my iPad
>>
>> On Aug 12, 2012, at 7:11 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
>>
>> > Gurjeet:
>> > Can you tell us which HBase version you are using ?
>> >
>> > Thanks
>> >
>> > On Sun, Aug 12, 2012 at 5:32 AM, Gurjeet Singh <[EMAIL PROTECTED]> wrote:
>> >
>> >> Thanks for the reply Stack. My comments are inline.
>> >>
>> >>> You've checked out the perf section of the refguide?
>> >>>
>> >>> http://hbase.apache.org/book.html#performance
>> >>
>> >> Yes. HBase has 8GB RAM both on my cluster as well as my dev machine.
>> >> Both configurations are backed by SSDs and Hbase options are set to
>> >>
>> >> HBASE_OPTS="-ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
>> >>
>> >> The data that I am dealing with is static. The table never changes
>> >> after the first load.
>> >>
>> >> Even some of my GET requests are taking up to a full 60 seconds when
>> >> the row sizes reach ~10MB. In general, taking 5 seconds to fetch a
>> >> single row (~1MB) seems a extremely high to me.
>> >>
>> >> Thanks again for your help.
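The rates implied by the figures quoted in this thread can be worked out with a few lines of arithmetic. All inputs below are the approximate numbers reported by the posters (about 100 rows, about 200,000 columns per row, about 400 s per full scan, about 200 MB of raw data, and a 3-4x on-disk overhead estimate); nothing here is measured, and the variable names are just labels for this sketch.

```python
ROWS = 100                  # ~100 rows
COLUMNS_PER_ROW = 200_000   # ~200,000 columns per row
SCAN_SECONDS = 400          # ~400 s for a full-table scan
TABLE_MB = 200              # ~200 MB of raw data
OVERHEAD_FACTORS = (3, 4)   # on-disk overhead estimate from the thread

cells = ROWS * COLUMNS_PER_ROW
cells_per_second = cells / SCAN_SECONDS
seconds_per_row = SCAN_SECONDS / ROWS
payload_mb_per_second = TABLE_MB / SCAN_SECONDS

# Same scan time, but counting the estimated key/timestamp overhead.
on_disk_mb = [TABLE_MB * factor for factor in OVERHEAD_FACTORS]
on_disk_mb_per_second = [mb / SCAN_SECONDS for mb in on_disk_mb]

print(cells, cells_per_second, seconds_per_row)
print(payload_mb_per_second, on_disk_mb, on_disk_mb_per_second)
```

Even counting the overhead, the scan moves only about 1.5 to 2 MB/s, which is why per-cell overhead alone does not account for the slowness; the 50,000 KVs/s figure matches the number Lars derives elsewhere in the thread.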