-
Optimizing table scans (Amit Sela, 2012-09-12, 12:57)
Hi all,

I'm trying to find the sweet spot for the cache size and batch size Scan() parameters.

I'm scanning one table using HTable.getScanner() and iterating over the ResultScanner retrieved.

I did some testing and got the following results for scanning 1000000 rows:

Cache    Batch          Total execution time (sec)
10000    -1 (default)   112
10000    5000           110
10000    10000          110
10000    20000          110

Cache    Batch          Total execution time (sec)
1000     -1 (default)   116
10000    -1 (default)   110
20000    -1 (default)   115

Cache    Batch          Total execution time (sec)
5000     10             26
20000    10             25
50000    10             26
5000     5              15
20000    5              14
50000    5              14
1000     1              6
5000     1              5
10000    1              4
20000    1              4
50000    1              4

I don't understand why a lower batch size gives such an improvement?

Thanks,
Amit.
-
Re: Optimizing table scans (Michael Segel, 2012-09-12, 13:04)
How much memory do you have?
What's the size of the underlying row?
What does your network look like? 1GBe or 10GBe?

There's more to it, and I think that you'll find that YMMV on what is an optimum scan size...

HTH

-Mike
-
Re: Optimizing table scans (Amit Sela, 2012-09-12, 13:55)
I allocate 10GB per RegionServer.
An average row size is ~200 Bytes.
The network is 1GB.

It would be great if anyone could elaborate on the difference between Cache and Batch parameters.

Thanks.
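As a rough illustration of how the two parameters differ, here is a minimal sketch in plain Java (no HBase dependency). It assumes the usual reading of the Scan API: setBatch() caps the number of column values in each Result, and setCaching() sets how many Results the client fetches per round trip to the RegionServer. The class name, the method names and the figure of 10 columns per row are made up for the example, and the model ignores the extra calls for opening and closing the scanner as well as any size-based limits.

```java
// A back-of-the-envelope model, not HBase code: it only counts Results and round trips.
class ScanCostModel {

    /** Results produced per row: a row with `columns` values is cut into chunks of at most `batch`. */
    static long resultsPerRow(int columns, int batch) {
        if (batch <= 0) {
            return 1; // batch unset (-1): the whole row comes back as one Result
        }
        return (columns + batch - 1) / batch; // ceil(columns / batch)
    }

    /** Total number of Results the client iterates over for the whole scan. */
    static long totalResults(long rows, int columns, int batch) {
        return rows * resultsPerRow(columns, batch);
    }

    /** Approximate round trips: caching counts Results per fetch, not rows. */
    static long approxRpcs(long rows, int columns, int batch, int caching) {
        long results = totalResults(rows, columns, batch);
        return (results + caching - 1) / caching; // ceil(results / caching)
    }

    public static void main(String[] args) {
        long rows = 1_000_000L;
        int columns = 10; // hypothetical: the thread does not say how many columns a row has
        int[][] settings = {{10000, -1}, {10000, 5}, {10000, 1}}; // {caching, batch}
        for (int[] s : settings) {
            System.out.printf("caching=%d batch=%d results=%d rpcs~%d%n",
                    s[0], s[1],
                    totalResults(rows, columns, s[1]),
                    approxRpcs(rows, columns, s[1], s[0]));
        }
    }
}
```

Under this model a smaller batch produces more Results and more round trips for the same amount of data, so on its own it would not explain a faster scan.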
-
Re: Optimizing table scans (Doug Meil, 2012-09-12, 14:37)
Hi there,

See this for info on the block cache in the RegionServer:

http://hbase.apache.org/book.html
9.6.4. Block Cache

... and see this for "batching" on the scan parameter:

http://hbase.apache.org/book.html#perf.reading
11.8.1. Scan Caching
-
Re: Optimizing table scans (Amit Sela, 2012-09-15, 09:11)
So just to get it straight: is the scan with setBatch(1) so much faster because it returns only the value for the first column?
-
RE: Optimizing table scans (Anoop Sam John, 2012-09-17, 04:36)
> The reason the scan with setBatch(1) is much much faster is because it
> returns only the value for the first column?

When you set batching=1, it returns all the column values of the rows, but one column value at a time. FYI

-Anoop-
-
Re: Optimizing table scans (Alex Baranau, 2012-09-17, 17:10)
> An average row size is ~200 Bytes.
How many columns do you have? I assume that every time you run the test you fetch data that is not yet in the RegionServers' block cache (i.e. you are doing a "true test"), right?

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr