-
speeding up rowcount (Rita, 2011-10-09, 14:50)
Hi,
I have been doing a rowcount via MapReduce, and it is taking about 4-5 hours to count 500 million rows in a table. I was wondering if there are any MapReduce tunings I can do so it will go much faster.

I have a 10-node cluster, each node with 8 CPUs and 64 GB of memory. Any tuning advice would be much appreciated.

--
--- Get your facts first, then you can distort them as you please.--
-
Re: speeding up rowcount (Tom Goren, 2011-10-09, 15:07)
lol - I just ran a rowcount via MapReduce, and it took 6 hours for 7.5 million rows...
-
Re: speeding up rowcount (Ted Yu, 2011-10-09, 15:09)
I guess your hbase.hregion.max.filesize is quite high.
If possible, lower its value so that you have smaller regions.
-
Re: speeding up rowcount (Himanshu Vashishtha, 2011-10-09, 15:19)
Since a MapReduce job is a separate process, try a high Scan cache value.
http://hbase.apache.org/book.html#perf.hbase.client.caching

Himanshu
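
A rough model of why the Scan cache value matters for a count: each scanner call returns up to `caching` rows, so a full scan needs about rows / caching round trips. The sketch below is plain Java, not HBase code. The class and method names are made up for illustration, and it ignores the extra calls needed to open and close a scanner on each region.

```java
// Back-of-the-envelope model of scanner round trips for a full-table scan.
// Illustrative only: this is not HBase code and the names are hypothetical.
class ScanRoundTrips {

    /** Round trips needed to fetch `rows` rows at `caching` rows per call (ceiling division). */
    static long roundTrips(long rows, int caching) {
        if (rows < 0 || caching <= 0) {
            throw new IllegalArgumentException("rows must be >= 0 and caching must be > 0");
        }
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 500_000_000L; // table size from the original question
        int[] cachingValues = {1, 100, 500, 1000};
        for (int caching : cachingValues) {
            System.out.printf("caching=%d -> %d round trips%n", caching, roundTrips(rows, caching));
        }
    }
}
```

Under this model, going from a caching value of 1 to 500 cuts the number of round trips by a factor of 500. The real speedup depends on network latency and row size, which the model does not capture.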
-
Re: speeding up rowcount (Rita, 2011-10-09, 15:30)
Thanks for the responses.
Where do I set the high Scan cache values?
-
Re: speeding up rowcount (Ted Yu, 2011-10-09, 15:44)
Excellent question.
There seems to be a bug in RowCounter.

In TableInputFormat:

    if (conf.get(SCAN_CACHEDROWS) != null) {
      scan.setCaching(Integer.parseInt(conf.get(SCAN_CACHEDROWS)));
    }

But I don't see SCAN_CACHEDROWS in either TableMapReduceUtil or RowCounter.

Mind filing a bug?
-
Re: speeding up rowcount (Himanshu Vashishtha, 2011-10-09, 16:26)
Since RowCounter uses FirstKeyOnlyFilter, can we have a default Scan cache value of 500 or so?

Himanshu
-
Re: speeding up rowcount (Ted Yu, 2011-10-09, 16:29)
That is fine.
We should also allow users to override the cache value.
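
The "default of 500, but let users override it" idea can be sketched as follows. This is only an illustration, not the actual RowCounter patch: `java.util.Properties` stands in for the Hadoop `Configuration` object, the class and method names are hypothetical, and the key string is assumed to be the value behind `TableInputFormat.SCAN_CACHEDROWS`.

```java
import java.util.Properties;

// Sketch of "default caching unless the user overrides it".
// Properties is a stand-in for Hadoop's Configuration; names are hypothetical.
class ScanCachingDefault {

    // Assumed to match TableInputFormat.SCAN_CACHEDROWS; verify against your HBase version.
    static final String SCAN_CACHEDROWS = "hbase.mapreduce.scan.cachedrows";

    // Default suggested in the thread for a FirstKeyOnlyFilter scan.
    static final int DEFAULT_CACHING = 500;

    /** Returns the user-supplied caching value if present, otherwise the default. */
    static int resolveCaching(Properties conf) {
        String value = conf.getProperty(SCAN_CACHEDROWS);
        if (value == null) {
            return DEFAULT_CACHING;
        }
        int caching = Integer.parseInt(value.trim());
        if (caching <= 0) {
            throw new IllegalArgumentException("caching must be positive, got " + caching);
        }
        return caching;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println("no override: " + resolveCaching(conf));
        conf.setProperty(SCAN_CACHEDROWS, "2000");
        System.out.println("with override: " + resolveCaching(conf));
    }
}
```

In a real job the resolved value would be passed to `scan.setCaching(...)`, as in the TableInputFormat snippet quoted earlier in the thread.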
-
Re: speeding up rowcount (Ryan Rawson, 2011-10-10, 00:01)
Are you sure the job is running on the cluster and not running in single-node mode? This happens a lot...
-
Re: speeding up rowcount (lars hofhansl, 2011-10-10, 00:44)
Be aware that the contract for a scan is to return all rows sorted by rowkey, hence it cannot scan regions in parallel by default. I have not played much with HBase and MapReduce, but if order is not important you can split the scan into multiple scans.
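
Splitting one ordered scan into several independent range scans can be modelled without HBase. In the sketch below a `TreeMap` stands in for a table sorted by rowkey, the split points play the role of region boundaries, and each range is counted on its own thread. All names are hypothetical, and the split points are assumed to be sorted in ascending order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Model of "split the scan into multiple scans": count disjoint key ranges in parallel.
// A TreeMap stands in for a rowkey-sorted table; this is not HBase code.
class SplitRowCount {

    /** Counts keys in [start, end); a null bound means the range is open on that side. */
    static long countRange(NavigableMap<String, String> table, String start, String end) {
        NavigableMap<String, String> view;
        if (start == null && end == null) {
            view = table;
        } else if (start == null) {
            view = table.headMap(end, false);
        } else if (end == null) {
            view = table.tailMap(start, true);
        } else {
            view = table.subMap(start, true, end, false);
        }
        long count = 0;
        for (String ignored : view.keySet()) {
            count++; // iterate like a scan would, instead of asking for size()
        }
        return count;
    }

    /** Counts all rows by scanning each range between consecutive split points on its own thread. */
    static long parallelCount(NavigableMap<String, String> table, List<String> sortedSplits, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<String> ends = new ArrayList<>(sortedSplits);
            ends.add(null); // last range is open-ended
            List<Future<Long>> parts = new ArrayList<>();
            String start = null; // first range is open at the start
            for (String end : ends) {
                final String rangeStart = start;
                final String rangeEnd = end;
                Callable<Long> task = () -> countRange(table, rangeStart, rangeEnd);
                parts.add(pool.submit(task));
                start = end;
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // order of completion does not matter for a sum
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException("range count failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        for (int i = 0; i < 1000; i++) {
            table.put(String.format("row-%04d", i), "v");
        }
        List<String> splits = new ArrayList<>();
        splits.add("row-0250");
        splits.add("row-0500");
        splits.add("row-0750");
        System.out.println("total rows: " + parallelCount(table, splits, 4));
    }
}
```

Because a count only needs the sum of the partial results, the ordering guarantee of a single scan is not needed and the ranges can be processed in any order.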
-
Re: speeding up rowcount (Himanshu Vashishtha, 2011-10-10, 01:05)
MapReduce support in HBase inherently provides parallelism such that each Region is given to one mapper.

Himanshu
-
Re: speeding up rowcount (Rita, 2011-10-29, 14:29)
Opened https://issues.apache.org/jira/browse/HBASE-4702
Please edit to your liking.
-
Re: speeding up rowcount (Ted Yu, 2011-10-29, 14:46)
Thanks, Rita, for logging the JIRA.
Do you want to provide a patch?
-
Re: speeding up rowcount (Rita, 2011-10-29, 16:56)
Ha. You are overestimating my Java, Ted. I am no programmer, just an ignorant consumer of great technologies.
-
Re: speeding up rowcount (Ted Yu, 2011-10-29, 21:32)
Please take a look at my first patch for 4702.
Thanks