|
Weishung Chung
2011-01-10, 16:58
Weishung Chung
2011-01-10, 17:06
Jean-Daniel Cryans
2011-01-10, 17:38
Weishung Chung
2011-01-10, 18:10
Jonathan Gray
2011-01-10, 18:39
Jean-Daniel Cryans
2011-01-10, 18:44
Weishung Chung
2011-01-10, 18:58
Jean-Daniel Cryans
2011-01-10, 19:03
Weishung Chung
2011-01-10, 19:12
Jean-Daniel Cryans
2011-01-10, 19:53
Weishung Chung
2011-01-10, 20:45
Alex Baranau
2011-01-11, 15:51
Otis Gospodnetic
2011-01-16, 10:17
Weishung Chung
2011-01-18, 16:31
Jim X
2011-01-30, 19:59
tsuna
2011-01-14, 10:21
Stack
2011-01-14, 18:01
Sean Bigdatafun
2011-01-15, 00:06
tsuna
2011-01-15, 06:51
Sean Bigdatafun
2011-01-31, 18:48
Ryan Rawson
2011-02-01, 01:04
Jim X
2011-02-01, 01:13
Sean Bigdatafun
2011-02-01, 01:16
Ryan Rawson
2011-02-01, 01:19
tsuna
2011-02-01, 08:03
|
-
HTable.put(List<Put> puts) perform batch insert?
Weishung Chung 2011-01-10, 16:58
Does the HTable.put(List<Put> puts) method perform a batch insert with a single RPC call? I am going to insert a lot of values into a column family and would like to increase the write speed.

Thank you.
-
Re: HTable.put(List<Put> puts) perform batch insert?
Weishung Chung 2011-01-10, 17:06
What is the difference between the above put method and the following capability of the HBaseHUT package?
https://github.com/sematext/HBaseHUT
-
Re: HTable.put(List<Put> puts) perform batch insert?
Jean-Daniel Cryans 2011-01-10, 17:38
HBaseHUT is used to solve the Get+Put problem, so if that's your problem as well then do look into it.

To answer your first question, that method will group Puts by region server, meaning that it will do anywhere between 1 and n RPCs, where n is the number of region servers (RS), and that's done in parallel.

J-D
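To make the grouping J-D describes more concrete, here is a minimal sketch in plain Java. It is a model under stated assumptions, not HBase client code: the class PutGrouping, the region map, the server names, and the row keys are all hypothetical, and a row is simply assigned to the region with the greatest start key that is less than or equal to the row key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative model only: this is NOT the HBase client API.
class PutGrouping {
    // Hypothetical region map: region start key -> hosting region server.
    // Assumes the first region starts at the empty key, so floorEntry never returns null.
    static String serverFor(TreeMap<String, String> regionStartToServer, String rowKey) {
        return regionStartToServer.floorEntry(rowKey).getValue();
    }

    // Group row keys by region server; each group would be sent as one RPC.
    static Map<String, List<String>> groupByServer(
            TreeMap<String, String> regionStartToServer, List<String> rowKeys) {
        Map<String, List<String>> batches = new LinkedHashMap<>();
        for (String row : rowKeys) {
            String server = serverFor(regionStartToServer, row);
            batches.computeIfAbsent(server, s -> new ArrayList<>()).add(row);
        }
        return batches;
    }

    public static void main(String[] args) {
        TreeMap<String, String> regions = new TreeMap<>();
        regions.put("", "rs1");  // first region starts at the empty key
        regions.put("g", "rs2");
        regions.put("p", "rs1"); // one server can host several regions
        Map<String, List<String>> batches =
            groupByServer(regions, Arrays.asList("apple", "kiwi", "zebra", "mango"));
        // Four edits end up in two batches, one per region server.
        System.out.println(batches);
    }
}
```

In this model the number of batches can never exceed the number of distinct servers, which matches the "between 1 and n" bound above; the parallel dispatch of the batches is left out.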
-
Re: HTable.put(List<Put> puts) perform batch insert?
Weishung Chung 2011-01-10, 18:10
Thank you :)
Could I use org.apache.hadoop.hbase.io.BatchUpdate? Would it be faster than put(List<Put>)?
Also, would you recommend the use of MapReduce to accomplish the same thing?
-
RE: HTable.put(List<Put> puts) perform batch insert?
Jonathan Gray 2011-01-10, 18:39
BatchUpdate is the old, deprecated version of Put. You are using the best APIs.
-
Re: HTable.put(List<Put> puts) perform batch insert?
Jean-Daniel Cryans 2011-01-10, 18:44
BatchUpdate is deprecated and gone after 0.20. Also, the name was misleading because it was batching edits on multiple columns but not rows.

If I'm guessing correctly, you want to do an initial import of your data? The brute force way is to write an MR job, but I would first recommend that you look into using the bulk uploader tools such as http://hbase.apache.org/docs/r0.89.20100924/bulk-loads.html

J-D
-
Re: HTable.put(List<Put> puts) perform batch insert?
Weishung Chung 2011-01-10, 18:58
Jonathan, awesome, best of breed APIs!
Jean, I would like to insert lotsa new rows with many columns in a particular column family programmatically in batch, just like the JDBC addBatch method.
Thanks again.
-
Re: HTable.put(List<Put> puts) perform batch insert?
Jean-Daniel Cryans 2011-01-10, 19:03
lotsa rows? That's 1k or 1B? Inside an OLTP system or OLAP?

J-D
-
Re: HTable.put(List<Put> puts) perform batch insert?
Weishung Chung 2011-01-10, 19:12
Multiple batches of 10k new/updated rows at any time to different tables by different clients simultaneously. I want these multiple batches of insertions to be done super fast. At the same time, I would like to be able to scale up to 100k rows at a time (the goal). Right now, I am building a cluster of 6 to 7 nodes.
-
Re: HTable.put(List<Put> puts) perform batch insert?
Jean-Daniel Cryans 2011-01-10, 19:53
Depending on the level of super fastness you need, it may or may not be fast enough. Better to test it, as usual.

J-D
-
Re: HTable.put(List<Put> puts) perform batch insert?
Weishung Chung 2011-01-10, 20:45
OK, I will test it, thanks again :)
-
Re: HTable.put(List<Put> puts) perform batch insert?
Alex Baranau 2011-01-11, 15:51
Re HBaseHUT, J-D was correct: you will gain speed with it if you need Get & Put operations to perform your updates.

Don't forget to play with the writeToWAL and writeBuffer (with autoFlush=false) attributes!

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
-
Re: HTable.put(List<Put> puts) perform batch insert?
Otis Gospodnetic 2011-01-16, 10:17
Hi,

Re HBaseHUT - Alex didn't mention it, but he did a really nice and clear writeup of it in this post:
http://blog.sematext.com/2010/12/16/deferring-processing-updates-to-increase-hbase-write-performance/

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
-
Re: HTable.put(List<Put> puts) perform batch insert?
Weishung Chung 2011-01-18, 16:31
Thank you, I will look into these packages :)
-
Re: HTable.put(List<Put> puts) perform batch insert?
Jim X 2011-01-30, 19:59
Which one did you finally use for batch processing, like a JDBC batch?
-
Re: HTable.put(List<Put> puts) perform batch insert?
tsuna 2011-01-14, 10:21
On Mon, Jan 10, 2011 at 11:12 AM, Weishung Chung <[EMAIL PROTECTED]> wrote:
> Multiple batches of 10k *new/updated* rows at any time to different tables
> by different clients simultaneously. I want these multiple batches of
> insertions to be done super fast. At the same time, I would like to be able
> to scale up to 100k rows at a time (the goal). Now, I am building a cluster
> of size 6 to 7 nodes.

If you're writing a multi-threaded client and you're going to have many clients like this writing to HBase continuously, I recommend writing your application with asynchbase (http://github.com/stumbleupon/asynchbase) instead. It's an alternate HBase client library I wrote, and in my application it significantly increased write throughput. It can easily push 150k updates per second to a 20-node cluster, and at that point it's the local machine that's CPU bound, not the HBase cluster (the local machine is a very slow VM, so it doesn't have a lot of horsepower). This client is especially good for throughput-oriented workloads and was written to be thread-safe from the ground up (unlike HTable).

--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com
-
Re: HTable.put(List<Put> puts) perform batch insert?
Stack 2011-01-14, 18:01
It would be interesting to hear your experience with the asynchronous HBase client (it is used extensively at SU, where a few of us HBase committers work).

Stack
-
Re: HTable.put(List<Put> puts) perform batch insert?
Sean Bigdatafun 2011-01-15, 00:06
But how can the client know which k-v belongs to which RS? Does it need to scan the .META. table? (If so, it's an expensive op.) On the RegionServer side, is it like processing multiple requests in a batch per RPC?

Can you guide us to dig into it a bit more?

Thanks,
Sean

On Mon, Jan 10, 2011 at 9:38 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:
> To answer your first question, that method will group Puts by region
> server meaning that it will do anywhere between 1-n where n is the
> number of RS, and that's done in parallel.

--
--Sean
-
Re: HTable.put(List<Put> puts) perform batch insert?
tsuna 2011-01-15, 06:51
On Fri, Jan 14, 2011 at 4:06 PM, Sean Bigdatafun <[EMAIL PROTECTED]> wrote:
> But how can the client understand which k-v belongs to an individual RS?
> Does it need to scan the .META. table? (if so, it's an expensive op). On the
> RegionServer side, is it like processing multiple requests in a batch per
> RPC?

The client has to figure out which region each edit has to go to. The client maintains a local cache of the META table, so when you frequently use the same working set of regions (which is common for most applications), the lookups are essentially free.

The worst case is a client that does random writes to all the regions in a huge table. In this case, the client will end up discovering the location of all the regions of that table and keep this in its in-memory cache. But regions move around, are split, etc. This does cause extra META lookups, but the latency for a META lookup is typically very small (even though the penalty incurred by the client compared to cache hits in its local META cache is huge, comparatively speaking). Note that right now neither HTable nor asynchbase proactively evicts unused entries from the local META cache to save memory. I don't think anyone is running HBase at a scale where this optimization would be useful.

If you have a write-heavy application, you're always going to get significantly higher throughput when you send your edits to the server in batches. The downside to this is that when your client application dies, you lose all the edits in the uncommitted batch. Unlike HTable, asynchbase puts an upper bound on the amount of time an edit is allowed to remain in the client's buffer, which helps limit data loss when a client crashes (OpenTSDB sets this to 1s by default, so when it dies, you know you lost at most 1s worth of datapoints).

--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com
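The client-side META cache described above can be sketched roughly as follows. This is a plain-Java model, not the HTable or asynchbase implementation: RegionLocationCache, Region, the metaLookup function (standing in for a remote META lookup), and the invalidate hook for moved or split regions are all hypothetical names introduced for illustration.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Illustrative model only: NOT the HTable or asynchbase implementation.
class RegionLocationCache {
    // A region covers [startKey, endKey); an empty endKey marks the last region.
    static final class Region {
        final String startKey, endKey, server;
        Region(String startKey, String endKey, String server) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.server = server;
        }
        boolean contains(String row) {
            return row.compareTo(startKey) >= 0
                && (endKey.isEmpty() || row.compareTo(endKey) < 0);
        }
    }

    private final TreeMap<String, Region> cache = new TreeMap<>();
    private final Function<String, Region> metaLookup; // stands in for a remote META lookup
    int metaLookups = 0;                               // counts the expensive lookups

    RegionLocationCache(Function<String, Region> metaLookup) {
        this.metaLookup = metaLookup;
    }

    // Return the region for a row, hitting META only on a cache miss.
    Region locate(String row) {
        Map.Entry<String, Region> e = cache.floorEntry(row);
        if (e != null && e.getValue().contains(row)) {
            return e.getValue(); // cache hit: no remote call
        }
        metaLookups++;
        Region region = metaLookup.apply(row);
        cache.put(region.startKey, region);
        return region;
    }

    // Assumed hook: drop a stale entry after a region has moved or split.
    void invalidate(String row) {
        Map.Entry<String, Region> e = cache.floorEntry(row);
        if (e != null && e.getValue().contains(row)) {
            cache.remove(e.getKey());
        }
    }
}
```

With a stable working set of regions, only the first edit to each region pays for a lookup; every later edit to the same key range is answered from the in-memory map.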
-
Re: HTable.put(List<Put> puts) perform batch insert?
Sean Bigdatafun 2011-01-31, 18:48
On Fri, Jan 14, 2011 at 10:51 PM, tsuna <[EMAIL PROTECTED]> wrote:
> If you have a write-heavy application, you're always going to get
> significantly higher throughput when you send your edits in batch to
> the server. The downside to this is that when your client application
> dies, you lose all the edits in the un-committed batch. Unlike
> HTable, asynchbase puts an upper bound on the amount of time an edit
> is allowed to remain in the client's buffer, which helps limit
> data-loss when a client crashes (OpenTSDB sets this to 1s by default,
> so when it dies, you know you lost at most 1s worth of datapoints).

    htable.setWriteBufferSize(1024 * 1024 * 10); // 10 MB
    htable.setAutoFlush(false);
    for (int i = 0; i < N; i++) {
        list.add(putitem[i]);
    }
    htable.put(list);

For the above pseudo code (using put(List) to commit updates in HBase), can I get a "batch transaction" success notification? I.e., how can I know that all the items have been successfully committed? It seems that I can't get such information; everything is best-effort. If I knew that some commits had failed, I could do an application-level retry.

setAutoFlush(true) does not seem to give us any more reliable operation either.

--
--Sean
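One way to reason about the question above is with a small model of a client-side write buffer. The sketch below is plain Java and does not use the HBase API: WriteBufferModel, the send predicate (standing in for the RPC to a region server), and the choice to keep failed edits in the buffer for a later retry are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustrative model only: NOT the HBase HTable implementation.
class WriteBufferModel {
    private final List<String> buffer = new ArrayList<>();
    private final long maxBufferedBytes;
    private final Predicate<String> send; // stands in for the RPC; false means the edit failed
    private long bufferedBytes = 0;
    final List<String> committed = new ArrayList<>();

    WriteBufferModel(long maxBufferedBytes, Predicate<String> send) {
        this.maxBufferedBytes = maxBufferedBytes;
        this.send = send;
    }

    // Buffer an edit; flush automatically once the buffer exceeds the size limit.
    void put(String edit) {
        buffer.add(edit);
        bufferedBytes += edit.length();
        if (bufferedBytes > maxBufferedBytes) {
            flush();
        }
    }

    // Try to send every buffered edit. Edits that fail stay in the buffer
    // so the caller can inspect them and retry. Returns true only if the
    // buffer is empty afterwards, i.e. everything was committed.
    boolean flush() {
        List<String> failed = new ArrayList<>();
        for (String edit : buffer) {
            if (send.test(edit)) {
                committed.add(edit);
            } else {
                failed.add(edit);
            }
        }
        buffer.clear();
        buffer.addAll(failed);
        bufferedBytes = 0;
        for (String edit : failed) {
            bufferedBytes += edit.length();
        }
        return buffer.isEmpty();
    }

    // Copy of the edits that have not been committed yet.
    List<String> pending() {
        return new ArrayList<>(buffer);
    }
}
```

In this model nothing is committed until the buffer fills up or flush() is called explicitly, so a success signal for the whole batch is only meaningful after an explicit flush, and pending() is what an application-level retry would work from.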
-
Re: HTable.put(List<Put> puts) perform batch insert? Ryan Rawson 2011-02-01, 01:04
When you are using the buffer, you also need to flush it:

htable.flushCommits();

If the call succeeds, the edits were persisted. If you get an exception at any point, the unfinished edits are left in the write buffer, and htable.getWriteBuffer() returns them.

-ryan
Re: HTable.put(List<Put> puts) perform batch insert? Jim X 2011-02-01, 01:13
Does HTable.getWriteBuffer() do a rollback?

Jim
Re: HTable.put(List<Put> puts) perform batch insert? Sean Bigdatafun 2011-02-01, 01:16
On Mon, Jan 31, 2011 at 5:13 PM, Jim X <[EMAIL PROTECTED]> wrote:
> Does Htable.getWriteBuffer() do a roll back?

I guess not. It only lets you see what has not been successfully committed to the server after you catch the exception. Correct me if I am wrong.

Sean
Re: HTable.put(List<Put> puts) perform batch insert? Ryan Rawson 2011-02-01, 01:19
It just retrieves the current state of the buffer. The buffer is mutated to remove successful edits as they occur; when an exception is thrown, the edits that were determined to be successful have also been removed. So if you catch an exception, you can inspect this buffer and know that these puts need to be sent again.

It is also possible to simply call flushCommits() again, which adds further retries beyond the base number that the client already does. I've done that on large MapReduce import jobs, since cluster churn should eventually settle down, but restarting a 12-hour data import job sucks.

-ryan
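The behaviour described above (successful edits leave the buffer, failed ones stay, and flushCommits() can simply be called again) can be sketched with a self-contained model. This is an illustration under stated assumptions, not HBase code: WriteBufferSketch only borrows the method names from the thread, the "server" is a hypothetical predicate that says whether an edit was persisted, and the exception type is a stand-in for whatever the real client throws.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

/** Toy model of a client-side write buffer; a stand-in, not HBase's HTable. */
final class WriteBufferSketch {
    private final List<String> buffer = new ArrayList<>();
    private final Predicate<String> server; // hypothetical: true when an edit is persisted

    WriteBufferSketch(Predicate<String> server) {
        this.server = server;
    }

    void put(List<String> edits) {
        buffer.addAll(edits);
    }

    /** Edits that are not yet known to be persisted. */
    List<String> getWriteBuffer() {
        return buffer;
    }

    /** Sends buffered edits; successful ones leave the buffer, then throws if any remain. */
    void flushCommits() {
        for (Iterator<String> it = buffer.iterator(); it.hasNext(); ) {
            if (server.test(it.next())) {
                it.remove();
            }
        }
        if (!buffer.isEmpty()) {
            throw new IllegalStateException(buffer.size() + " edits not persisted");
        }
    }

    /** Application-level retry on top of whatever retries the client does itself. */
    static boolean flushWithRetries(WriteBufferSketch table, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                table.flushCommits();
                return true;
            } catch (IllegalStateException e) {
                // Failed edits are still in the buffer; loop and flush again.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] sends = {0};
        // Simulated cluster churn: the first two send attempts fail.
        WriteBufferSketch table = new WriteBufferSketch(edit -> ++sends[0] > 2);
        table.put(Arrays.asList("row1", "row2", "row3"));
        System.out.println(flushWithRetries(table, 5) + " " + table.getWriteBuffer());
    }
}
```

A real import job would probably sleep between attempts so the cluster has time to settle; the model omits that to stay deterministic.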
Re: HTable.put(List<Put> puts) perform batch insert? tsuna 2011-02-01, 08:03
On Mon, Jan 31, 2011 at 5:19 PM, Ryan Rawson <[EMAIL PROTECTED]> wrote:
> It just retrieves the current state of the buffer. The buffer is
> mutated to remove successful edits as they occur, during an exception
> the ones that were determined to be successful were also removed.

htable.getWriteBuffer() isn't thread-safe (correct me if I'm wrong), so be careful with it.

FWIW, asynchbase works differently: you get a callback for each and every edit. You can specify two callbacks: one for the success case and one to handle failures. Also, it's thread-safe :)

The other cool thing in asynchbase is that it puts an upper bound on the amount of time data can be buffered in the client. After that time has elapsed, the client flushes the writes to HBase. This improves liveness in user-facing applications by preventing edits from sticking too long in the unflushed buffer of some client, while still allowing for higher throughput through batching of edits.

--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com
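The two ideas in this message, a callback per edit and an upper bound on how long an edit may sit in the client buffer, can be modelled without any HBase dependency. The sketch below is not the asynchbase API: TimedBufferSketch, put, and tick are hypothetical names, java.util.concurrent.CompletableFuture stands in for the library's own callback mechanism, and time is passed in explicitly so the example stays deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Predicate;

/** Toy model of per-edit callbacks plus a time-bounded buffer; not the asynchbase API. */
final class TimedBufferSketch {
    private static final class Pending {
        final String edit;
        final CompletableFuture<String> result = new CompletableFuture<>();

        Pending(String edit) {
            this.edit = edit;
        }
    }

    private final List<Pending> buffer = new ArrayList<>();
    private final long maxBufferMillis;
    private final Predicate<String> server; // hypothetical: true when the edit is persisted
    private long oldestMillis;

    TimedBufferSketch(long maxBufferMillis, Predicate<String> server) {
        this.maxBufferMillis = maxBufferMillis;
        this.server = server;
    }

    /** Buffers an edit and returns a future completed when that edit succeeds or fails. */
    synchronized CompletableFuture<String> put(String edit, long nowMillis) {
        if (buffer.isEmpty()) {
            oldestMillis = nowMillis; // remember when the current batch started
        }
        Pending pending = new Pending(edit);
        buffer.add(pending);
        return pending.result;
    }

    /** Called periodically; flushes once the oldest edit has waited maxBufferMillis. */
    synchronized void tick(long nowMillis) {
        if (buffer.isEmpty() || nowMillis - oldestMillis < maxBufferMillis) {
            return;
        }
        for (Pending pending : buffer) {
            if (server.test(pending.edit)) {
                pending.result.complete(pending.edit);
            } else {
                pending.result.completeExceptionally(
                    new RuntimeException("failed: " + pending.edit));
            }
        }
        buffer.clear();
    }

    public static void main(String[] args) {
        TimedBufferSketch client = new TimedBufferSketch(1000, edit -> !edit.equals("bad"));
        client.put("good", 0).thenAccept(e -> System.out.println("stored " + e));
        client.put("bad", 10).exceptionally(t -> {
            System.out.println(t.getMessage());
            return null;
        });
        client.tick(500);  // too early: nothing is sent
        client.tick(1000); // bound reached: both callbacks fire
    }
}
```

The design point is that the caller learns the fate of each edit individually, and that no edit waits longer than the configured bound, so a crash loses at most that much buffered data.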