-
Loading data, hbase slower than Hive?
Austin Chungath 2013-01-17, 16:44
Hi,
Problem: Hive took 6 mins to load a data set; HBase took 1 hr 14 mins.
It's a 20 GB data set, approx. 230 million records. The data is in HDFS as a single text file. The cluster is 11 nodes, 8 cores.

I loaded this into Hive, partitioned by date, bucketed into 32 and sorted. Time taken was 6 mins.

I loaded the same data into HBase, on the same cluster, by writing a MapReduce job. It took 1 hr 14 mins. The cluster wasn't running anything else. Assuming that the code I wrote is good enough, what is it that makes HBase slower than Hive at loading the data?

Thanks,
Austin
-
Re: Loading data, hbase slower than Hive?
Michael Segel 2013-01-17, 16:48
The writes take longer in HBase. Just how much longer may depend on how well you tuned HBase.

Now, having said that... suppose you want to find a single record in either HBase or Hive. Which do you think will be faster? ;-)
-
Re: Loading data, hbase slower than Hive?
Anoop John 2013-01-17, 17:00
In the case of Hive, data insertion means placing the file under the table path in HDFS. HBase needs to read the data and convert it into its own format (HFiles); the MR job does this work. So it is clear that HBase will be slower. :)
As Michael said, the read operation...

-Anoop-
-
Re: Loading data, hbase slower than Hive?
ramkrishna vasudevan 2013-01-17, 17:09
Hive is more for batch processing, and HBase is more for real-time data.

Regards
Ram
-
Re: Loading data, hbase slower than Hive?
Mohammad Tariq 2013-01-17, 17:46
Just to add to what all the heavyweights have said above, your MR job may not be as efficient as the MR job corresponding to your Hive query. You can improve performance by setting the mapred config parameters wisely and by tuning your MR job.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com
-
Re: Loading data, hbase slower than Hive?
praveenesh kumar 2013-01-18, 17:57
Hey,
Can someone share some pointers on best practices for bulk imports in HBase?
That would be really helpful.

Regards,
Praveenesh
-
Re: Loading data, hbase slower than Hive?
Doug Meil 2013-01-18, 18:00
Hi there,

See this section of the HBase RefGuide for information about bulk loading:
http://hbase.apache.org/book.html#arch.bulk.load
-
Re: Loading data, hbase slower than Hive?
Asaf Mesika 2013-01-19, 19:50
Start by telling us your row key design.
Check whether your table regions are pre-split.
I managed to get 25 MB/sec write throughput in HBase using 1 region server. If your data is evenly spread, you can get around 7 times that in a 10-region-server environment. That should mean 1 GB takes about 4 sec.
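The estimate above can be checked with simple arithmetic. The sketch below is only an illustration, not a benchmark: it takes the 25 MB/sec single-server figure and the 7x scaling factor from the post, assumes 1 GB = 1024 MB, and compares the result with the 20 GB in 1 hr 14 mins reported in the original question. The class and method names are made up for this example.

```java
public class LoadThroughput {
    // Seconds needed to write `megabytes` at a sustained rate of `mbPerSec`.
    public static double secondsFor(double megabytes, double mbPerSec) {
        return megabytes / mbPerSec;
    }

    // Sustained rate in MB/sec for a load of `megabytes` that took `seconds`.
    public static double rateFor(double megabytes, double seconds) {
        return megabytes / seconds;
    }

    public static void main(String[] args) {
        double perServer = 25.0;          // MB/sec on one region server (from the post)
        double cluster = perServer * 7;   // "around 7 times that" with ~10 region servers
        double oneGb = 1024.0;            // assumption: 1 GB = 1024 MB
        double dataset = 20 * 1024.0;     // the 20 GB data set from the question
        double observedSecs = 74 * 60;    // 1 hr 14 mins

        System.out.printf("cluster rate: %.0f MB/sec%n", cluster);
        System.out.printf("1 GB at cluster rate: %.1f sec%n", secondsFor(oneGb, cluster));
        System.out.printf("20 GB at cluster rate: %.1f min%n", secondsFor(dataset, cluster) / 60);
        System.out.printf("observed rate: %.1f MB/sec%n", rateFor(dataset, observedSecs));
    }
}
```

With these inputs the cluster rate works out to 175 MB/sec, which is roughly 6 seconds per GB and about 2 minutes for 20 GB, while the reported load ran at under 5 MB/sec. A gap that large suggests the job was limited by something other than raw write capacity, such as all writes landing on a single region.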
-
Re: Loading data, hbase slower than Hive?
Mohammad Tariq 2013-01-19, 21:12
Hello Austin,

I am sorry for the late response.

Asaf has made a very valid point. Rowkey design is crucial, especially if the data is going to be sequential (a time-series kind of thing). You may end up with a hotspotting problem. Use pre-split tables or hash the keys to avoid that. It'll also allow you to fetch the results faster.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com
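One way to picture the "hash the keys" and "pre-split" advice is a salted row key. The sketch below is plain Java with no HBase dependency; the bucket count, the key format, and the class and method names are all assumptions chosen for illustration, not anything prescribed in this thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SaltedKeys {
    // Hypothetical number of salt buckets, typically chosen near the number of regions.
    static final int BUCKETS = 16;

    // Prefix the natural key with a two-digit bucket derived from a hash of the key,
    // so that consecutive keys (e.g. timestamps) fall into different key ranges.
    public static String salt(String naturalKey) {
        int h = 0;
        for (byte b : naturalKey.getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + (b & 0xff);
        }
        int bucket = Math.floorMod(h, BUCKETS);
        return String.format("%02d-%s", bucket, naturalKey);
    }

    // Boundaries for a table pre-split into BUCKETS regions: "01-", "02-", ..., "15-".
    public static List<String> splitPoints() {
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < BUCKETS; i++) {
            splits.add(String.format("%02d-", i));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Three consecutive timestamp-like keys end up under three different prefixes.
        for (String k : new String[] {"20130117164400", "20130117164401", "20130117164402"}) {
            System.out.println(salt(k));
        }
        System.out.println(splitPoints());
    }
}
```

The split points would be passed to the table at creation time so that each bucket starts in its own region. The trade-off is that a range scan over the natural key then needs one scan per bucket.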
-
Re: Loading data, hbase slower than Hive?
Doug Meil 2013-01-20, 15:13
Hi there-

On top of what everybody else said, for more info on rowkey design and pre-splitting see http://hbase.apache.org/book.html#schema (as well as other threads in this dist-list on that topic).
-
Re: Loading data, hbase slower than Hive?
Vikas Jadhav 2013-01-20, 18:04
In my opinion, HBase needs to store more metadata than Hive (for each value it separately stores the row key, column family, column name, and value), so the stored data may be larger than the original HDFS file.

I also wondered about this. If anyone has got better results for HBase than Hive, let us know.

Thank you

Thanks and Regards,
Vikas Jadhav
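The per-cell overhead described above can be estimated roughly. The sketch below assumes the classic uncompressed KeyValue layout (length fields, row, family, qualifier, timestamp, type, value) and a hypothetical record shape; real tables with compression or block encoding will differ, and none of the sizes come from this thread.

```java
public class CellOverhead {
    // Fixed bytes per cell in the assumed KeyValue layout:
    // key length (4) + value length (4) + row length (2) + family length (1)
    // + timestamp (8) + type (1).
    static final int FIXED = 4 + 4 + 2 + 1 + 8 + 1;

    // Approximate stored size of one cell, before any compression or encoding.
    public static long cellBytes(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return FIXED + rowLen + familyLen + qualifierLen + valueLen;
    }

    public static void main(String[] args) {
        // Hypothetical record: 16-byte row key, 1-byte family name, 5 columns with
        // 4-byte qualifiers and 10-byte values.
        int columns = 5;
        long stored = columns * cellBytes(16, 1, 4, 10);
        long raw = columns * 10L + (columns - 1); // delimited text: values plus separators
        System.out.println("raw text bytes per record: " + raw);
        System.out.println("uncompressed cell bytes:   " + stored);
    }
}
```

Because the row key, family, and qualifier are repeated in every cell, short keys and short column names keep this overhead down.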
-
Re: Loading data, hbase slower than Hive?
Austin Chungath 2013-01-21, 05:45
Thank you Tariq.
I will let you know how things went after I implement these suggestions.

Regards,
Austin
-
RE: Loading data, hbase slower than Hive?Anoop Sam John 2013-01-21, 05:54
Austin,
You are using HFileOutputFormat or TableOutputFormat? -Anoop- ________________________________________ From: Austin Chungath [[EMAIL PROTECTED]] Sent: Monday, January 21, 2013 11:15 AM To: [EMAIL PROTECTED] Subject: Re: Loading data, hbase slower than Hive? Thank you Tariq. I will let you know how things went after I implement these suggestions. Regards, Austin On Sun, Jan 20, 2013 at 2:42 AM, Mohammad Tariq <[EMAIL PROTECTED]> wrote: > Hello Austin, > > I am sorry for the late response. > > Asaf has made a very valid point. Rowkwey design is very crucial. > Specially if the data is gonna be sequential(timeseries kinda thing). > You may end up with hotspotting problem. Use pre-splitted tables > or hash the keys to avoid that. It'll also allow you to fetch the results > faster. > > Warm Regards, > Tariq > https://mtariq.jux.com/ > cloudfront.blogspot.com > > > On Sun, Jan 20, 2013 at 1:20 AM, Asaf Mesika <[EMAIL PROTECTED]> > wrote: > > > Start by telling us your row key design. > > Check for pre splitting your table regions. > > I managed to get to 25mb/sec write throughput in Hbase using 1 region > > server. If your data is evenly spread you can get around 7 times that in > a > > 10 regions server environment. Should mean that 1 gig should take 4 sec. > > > > > > On Friday, January 18, 2013, praveenesh kumar wrote: > > > > > Hey, > > > Can someone throw some pointers on what would be the best practice for > > bulk > > > imports in hbase ? > > > That would be really helpful. > > > > > > Regards, > > > Praveenesh > > > > > > On Thu, Jan 17, 2013 at 11:16 PM, Mohammad Tariq <[EMAIL PROTECTED] > > <javascript:;>> > > > wrote: > > > > > > > Just to add to whatever all the heavyweights have said above, your MR > > job > > > > may not be as efficient as the MR job corresponding to your Hive > query. > > > You > > > > can enhance the performance by setting the mapred config parameters > > > wisely > > > > and by tuning your MR job. 
> > > > > > > > Warm Regards, > > > > Tariq > > > > https://mtariq.jux.com/ > > > > cloudfront.blogspot.com > > > > > > > > > > > > On Thu, Jan 17, 2013 at 10:39 PM, ramkrishna vasudevan < > > > > [EMAIL PROTECTED] <javascript:;>> wrote: > > > > > > > > > Hive is more for batch and HBase is for more of real time data. > > > > > > > > > > Regards > > > > > Ram > > > > > > > > > > On Thu, Jan 17, 2013 at 10:30 PM, Anoop John < > [EMAIL PROTECTED] > > <javascript:;> > > > > > > > > > wrote: > > > > > > > > > > > In case of Hive data insertion means placing the file under table > > > path > > > > in > > > > > > HDFS. HBase need to read the data and convert it into its > format. > > > > > (HFiles) > > > > > > MR is doing this work.. So this makes it clear that HBase will > be > > > > > slower. > > > > > > :) As Michael said the read operation... > > > > > > > > > > > > > > > > > > > > > > > > -Anoop- > > > > > > > > > > > > On Thu, Jan 17, 2013 at 10:14 PM, Austin Chungath < > > > [EMAIL PROTECTED] <javascript:;> > > > > > > >wrote: > > > > > > > > > > > > > Hi, > > > > > > > Problem: hive took 6 mins to load a data set, hbase took 1 hr > 14 > > > > mins. > > > > > > > It's a 20 gb data set approx 230 million records. The data is > in > > > > hdfs, > > > > > > > single text file. The cluster is 11 nodes, 8 cores. > > > > > > > > > > > > > > I loaded this in hive, partitioned by date and bucketed into 32 > > and > > > > > > sorted. > > > > > > > Time taken is 6 mins. > > > > > > > > > > > > > > I loaded the same data into hbase, in the same cluster by > > writing a > > > > map > > > > > > > reduce code. It took 1hr 14 mins. The cluster wasn't running > > > anything > > > > > > else > > > > > > > and assuming that the code that i wrote is good enough, what is > > it > > > > that > > > > > > > makes hbase slower than hive in loading the data? > > > > > > > > > > > > > > Thanks, > > > > > > > Austin > > > > > > > > > > > > > > > > > > > > > > > > > +
-
Re: Loading data, hbase slower than Hive? Austin Chungath 2013-01-21, 06:16
Anoop,
I am using HFileOutputFormat. All the job does is split each row on the delimiter and write the fields to their respective columns.
Is there any preprocessing, or other steps, that I should do before this?
As suggested, I will look into the solutions above and let you know what the problem was. I might have to rethink the row key design.
Regards,
Austin.
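One way the row key rethink could go, following the earlier advice in this thread to hash keys when the data is sequential, is to prefix each key with a small hash bucket. The sketch below is only an illustration: the function name, the `date|record_id` key layout, and the choice of 32 buckets (mirroring the 32 Hive buckets) are assumptions, not details of Austin's job.

```python
import hashlib

# Assumption: 32 buckets, chosen only to mirror the 32 Hive buckets mentioned above.
NUM_BUCKETS = 32


def salted_row_key(date, record_id, num_buckets=NUM_BUCKETS):
    """Build a row key whose two-digit prefix spreads date-ordered rows over buckets.

    The date|record_id layout is hypothetical; substitute the real natural key.
    """
    natural_key = "{}|{}".format(date, record_id).encode("utf-8")
    # md5 serves only as a stable, well-mixed hash here, not for security.
    bucket = hashlib.md5(natural_key).digest()[0] % num_buckets
    return "{:02d}|".format(bucket).encode("utf-8") + natural_key


if __name__ == "__main__":
    for rid in ("a1", "a2", "a3"):
        print(salted_row_key("2013-01-17", rid))
```

The trade-off is on the read side: a scan over one date range has to be issued once per bucket, because rows for the same date no longer sit next to each other.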
-
Re: Loading data, hbase slower than Hive? Mohammad Tariq 2013-01-21, 06:31
Apart from this, a few additional tweaks can improve put performance, such as creating pre-split tables and using put(List<Put> puts) instead of individual put calls.
Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com
-
RE: Loading data, hbase slower than Hive? Anoop Sam John 2013-01-21, 06:36
@Mohammad
As he is using HFileOutputFormat, no put call is made on HTable. In this case the MR job creates the HFiles directly, without going through the normal HBase write path. The HFiles are then loaded into the table regions using the HRS API.
The number of reducers will equal the number of table regions. So Austin, check that the table is properly pre-split.
-Anoop-
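To see why the pre-split matters for the reducer count, here is a minimal sketch in plain Python, with no HBase calls. It assumes row keys that begin with a two-digit bucket prefix such as `05|...`, which is a hypothetical layout and not something stated in this thread; `split_keys` and `region_index` are illustrative names, not HBase APIs.

```python
from bisect import bisect_right


def split_keys(num_regions):
    """Return the num_regions - 1 boundary keys for two-digit-prefixed row keys."""
    return ["{:02d}|".format(i).encode("utf-8") for i in range(1, num_regions)]


def region_index(row_key, splits):
    """Index of the region, and so of the reducer, whose key range holds row_key.

    Region i covers keys from splits[i - 1] (inclusive) up to splits[i] (exclusive).
    """
    return bisect_right(splits, row_key)


if __name__ == "__main__":
    splits = split_keys(32)
    print(region_index(b"05|2013-01-17|a1", splits))  # pre-split into 32 regions
    print(region_index(b"05|2013-01-17|a1", []))      # no pre-split: one region
```

With an empty split list every key maps to region 0, which corresponds to a table with a single region: the whole 20 GB would then pass through one reducer, however many nodes the cluster has.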
-
Re: Loading data, hbase slower than Hive? Mohammad Tariq 2013-01-21, 06:39
Thank you so much for pointing out the mistake sir.
Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com
|