-
Hive+HBase performance is much poorer than Hive+HDFS
Weihua JIANG 2011-10-12, 02:47
Hi all,

I ran some performance tests on Hive+HBase. The table is a normal 2D table with about 160M rows (each row has 7 small columns) and 32 regions. There is only one column family, and all regions were major compacted to one store file before the test.

On a cluster with 11 task trackers (each with 4 map slots and 1 reduce slot; these servers also act as region servers), a simple Hive query
select count(*) from table where column3='Y';
needs ~1700 seconds to finish.

But after using a CTAS statement to create an internal table (stored as a sequence file), the same statement needs only 43 seconds to finish.

So Hive+HBase is 40X slower than Hive+HDFS.

Hive+HBase does have fewer map tasks (32 vs 223), but since there are only 44 map slots available, I don't think that is the main cause.

I studied the source code of the HBase scan implementation. It seems to me that, in my case, the scan reads the HFile in much the same way as a sequence file is read (sequential reading of each key/value pair). So, in theory, the performance should be quite similar.

Can anyone explain the 40X slowdown?

Thanks
Weihua
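The numbers in this message can be checked with a little arithmetic. The sketch below is only illustrative: the helper names `slowdown` and `map_waves` are made up for this note, and it assumes every map slot is free and that tasks run in evenly sized waves, which a real cluster will not guarantee.

```python
import math


def slowdown(slow_seconds: float, fast_seconds: float) -> float:
    """Ratio between the slower and the faster run time."""
    return slow_seconds / fast_seconds


def map_waves(num_tasks: int, num_slots: int) -> int:
    """Waves of map tasks needed if every slot is free (simplifying assumption)."""
    return math.ceil(num_tasks / num_slots)


# Figures taken from the message: 11 task trackers with 4 map slots each.
MAP_SLOTS = 11 * 4

print(f"slowdown: {slowdown(1700, 43):.1f}x")                      # 39.5x
print(f"HBase-backed table: {map_waves(32, MAP_SLOTS)} wave(s)")   # 1
print(f"sequence file table: {map_waves(223, MAP_SLOTS)} wave(s)") # 6
```

With 32 map tasks, 12 of the 44 slots sit idle. On its own that can account for only a small part of a ~40X gap, which is consistent with the poster's reasoning.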
-
Re: Hive+HBase performance is much poorer than Hive+HDFS
Jean-Daniel Cryans 2011-10-12, 02:53
This is one big factor and you didn't mention configuring it:
http://hbase.apache.org/book.html#perf.hbase.client.caching

J-D
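To see why scanner caching matters so much for a full-table scan, here is a rough count of client-to-region-server round trips. It is a sketch under stated assumptions, not a measurement. It assumes that an unconfigured client fetches 1 row per call (the default in HBase releases of that period, as far as I know), and it ignores the extra calls for opening and closing scanners. The function name and the caching values 500 and 1000 are illustrative.

```python
import math


def scanner_round_trips(rows: int, caching: int) -> int:
    """Fetch calls needed to stream `rows` rows when each call returns up to `caching` rows."""
    if caching < 1:
        raise ValueError("caching must be at least 1")
    return math.ceil(rows / caching)


ROWS = 160_000_000  # table size from the original message

for caching in (1, 500, 1000):
    trips = scanner_round_trips(ROWS, caching)
    print(f"caching={caching:>4}: {trips:,} round trips")
# caching=   1: 160,000,000 round trips
# caching= 500: 320,000 round trips
# caching=1000: 160,000 round trips
```

Larger values trade client and region server memory for fewer round trips, so the right setting depends on row size.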
-
Re: Hive+HBase performance is much poorer than Hive+HDFS
Weihua JIANG 2011-10-12, 03:04
Since I am using Hive to run the query, I don't know how to set it.
Can you tell me how to do so?

Thanks
Weihua
-
Re: Hive+HBase performance is much poorer than Hive+HDFS
Jean-Daniel Cryans 2011-10-12, 03:12
Your Hive client needs to see an hbase-site.xml in its classpath, so you can set the config there. Also see this in general:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#classpath

J-D
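For reference, an hbase-site.xml entry is a `<property>` element with a `<name>` and a `<value>` inside a `<configuration>` root. The sketch below only generates that fragment as text. The helper name `build_site_config` is invented for this note, and the value 1000 is an illustrative assumption rather than a recommendation. Where the file has to live so that the Hive client picks it up depends on the installation.

```python
import xml.etree.ElementTree as ET


def build_site_config(properties: dict[str, str]) -> str:
    """Render name/value pairs in the Hadoop-style <configuration> layout."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


# Scanner caching property with an illustrative value (assumption).
xml_text = build_site_config({"hbase.client.scanner.caching": "1000"})
print(xml_text)
```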
-
Re: Hive+HBase performance is much poorer than Hive+HDFS
Akash Ashok 2011-10-12, 03:17
Hi,
To set this parameter you could use "set hbase.client.scanner.caching=500;" before the execution of your Hive query.

Cheers,
Akash
-
Re: Hive+HBase performance is much poorer than Hive+HDFS
Weihua JIANG 2011-10-13, 08:53
After setting this parameter to 1000, I get this result: hive/hbase is 4X slower than hive/hdfs.

How large a slowdown should I expect for hive/hbase vs hive/hdfs?

Thanks
Weihua
-
Re: Hive+HBase performance is much poorer than Hive+HDFS
Jean-Daniel Cryans 2011-10-13, 17:25
Your question is more basic than that: it's actually how much slower it is to read sequentially from HBase than from HDFS. I'm not sure anyone has quantified that, and there are probably a bunch of factors that can influence it, but at least you should try to get the same level of distribution, e.g. since you have fewer regions than mapper slots, force split that table once or twice to get more of them. The difference here is due to the fact that regions can get up to 256MB by default before splitting, whereas in HDFS the default block size is 64MB.

Then maybe your HBase schema isn't efficient (fat keys), but I wouldn't be able to tell just from what you wrote.

In any case, since you have to go through an additional layer, it will definitely be slower to use HBase than directly reading the files.

J-D
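The "split once or twice" advice can be made concrete with a small calculation. This is a sketch under simplifying assumptions: each round of forced splits cuts every region in two, and the job gets one map task per region (the original message reports 32 map tasks for 32 regions). The function names are hypothetical.

```python
def regions_after_splits(regions: int, rounds: int) -> int:
    """Region count if every region is split in two, `rounds` times (assumption)."""
    return regions * 2 ** rounds


def rounds_to_fill_slots(regions: int, slots: int) -> int:
    """Smallest number of split rounds giving at least one region per map slot."""
    if regions < 1:
        raise ValueError("need at least one region")
    rounds = 0
    while regions_after_splits(regions, rounds) < slots:
        rounds += 1
    return rounds


REGIONS, MAP_SLOTS = 32, 44  # figures from the thread

print(rounds_to_fill_slots(REGIONS, MAP_SLOTS))  # 1
print(regions_after_splits(REGIONS, 1))          # 64
print(regions_after_splits(REGIONS, 2))          # 128
```

Under these assumptions, one round gives 64 regions, enough to keep all 44 slots busy. A second round gives 128, which moves the task count closer to the 223 tasks of the sequence file run.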
-
Re: Hive+HBase performance is much poorer than Hive+HDFS
Todd Lipcon 2011-10-13, 21:24
Most of the benchmarks I've seen are about what you're seeing: 4-5x overhead reading from HBase vs straight DFS files. Makes sense as we have a whole extra layer involved, plus locking overhead, etc. We can probably do some more optimization and get down to a 2x difference, but we'll never be as fast as churning through raw files with no locks and no extra copies.

-Todd

Todd Lipcon
Software Engineer, Cloudera