-
Unexpected Data insertion time and Data size explosion
kranthi reddy 2011-12-04, 14:19
Hi all,

I am a newbie to HBase and Hadoop. I have set up a cluster of 4 machines and am trying to insert data. 3 of the machines are tasktrackers, with 4 map tasks each.

My data consists of about 1.3 billion rows with 4 columns each (100GB txt file). The column structure is "rowID, word1, word2, word3". My DFS replication in Hadoop and HBase is set to 3 each. I have put only one column family and 3 qualifiers, one for each field (word*).

I am using the SampleUploader present in the HBase distribution. To complete 40% of the insertion, it has taken around 21 hrs and it's still running. I have 12 map tasks running. I would like to know whether this insertion time is along expected lines. When I used Lucene, I was able to insert the entire data in about 8 hours.

Also, there seems to be a huge explosion of data size here. With a replication factor of 3 for HBase, I was expecting the inserted table size to be around 350-400GB for the 100GB txt file I have (300GB for replicating the data 3 times and 50+ GB for additional storage information). But even at 40% completion of data insertion, the space occupied is around 550GB (looks like it might take around 1.2TB for a 100GB file). I have used a String for the rowID, instead of a Long. Will that account for such a rapid increase in data storage?

Regards,
Kranthi
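As a rough check of the figures in this post, the arithmetic can be written out as below. This is only a sketch: every input is a number quoted by the poster, and the projection assumes that time and disk usage keep growing linearly, which the thread does not confirm. Under that assumption the load works out to roughly 6,900 rows per second across the cluster, about 52 hours in total, and close to 1.4TB on disk.

```python
# All inputs are the figures quoted in the post; nothing is measured here.
input_gb = 100          # size of the source text file
replication = 3         # DFS replication factor
overhead_gb = 50        # poster's allowance for extra storage information
rows_total = 1.3e9      # rows to insert
fraction_done = 0.40    # progress reported so far
hours_elapsed = 21      # time taken to reach that progress
used_gb_so_far = 550    # disk space occupied at that point

# What the poster expected: replicated input plus a fixed allowance.
expected_gb = input_gb * replication + overhead_gb

# Linear extrapolation from the 40% mark (an assumption, not a measurement).
projected_gb = used_gb_so_far / fraction_done
projected_hours = hours_elapsed / fraction_done
rows_per_second = rows_total * fraction_done / (hours_elapsed * 3600)

print(f"expected size   : {expected_gb} GB")
print(f"projected size  : {projected_gb:.0f} GB")
print(f"projected time  : {projected_hours:.1f} h")
print(f"cluster ingest  : {rows_per_second:.0f} rows/s")
```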
-
Re: Unexpected Data insertion time and Data size explosion
yuzhihong@... 2011-12-04, 19:51
May I ask whether you pre-split your table before loading?
-
Re: Unexpected Data insertion time and Data size explosion
kranthi reddy 2011-12-05, 05:23
No, I split the table on the fly. I have done this because converting my table into HBase format (rowID, family, qualifier, value) would result in the input file being around 300GB. Hence, I decided to do the splitting and generate this format on the fly.

Will this affect the performance so heavily?

--
Kranthi Reddy. B
http://www.setusoftware.com/setu/index.htm
-
Re: Unexpected Data insertion time and Data size explosion
Ulrich Staudinger 2011-12-05, 07:56
Hi there,

While I cannot give you any concrete advice on your particular storage problem, I can share some experiences with you regarding performance.

I also bulk import data regularly, around 4GB every day in about 150 files with somewhere between 10'000 and 30'000 lines each.

My first approach was to read every line and put it separately, which resulted in a load time of about an hour. My next approach was to read an entire file, add each individual put to a list and then store the entire list at once. This worked fast in the beginning, but after about 20 files the server ran into compactions and couldn't cope with the load, and finally the master crashed, leaving the regionserver and ZooKeeper running. In HBase's defense, I have to say that I did this on a standalone installation without Hadoop underneath, so the test may not be entirely fair.

Next, I switched to a proper Hadoop layer with HBase on top. I now also put around 100 - 1000 lines (or puts) at once, in a bulk commit, and have insert times of around 0.5ms per row, which is very decent. My entire import now takes only 7 minutes.

I think you must find a balance between the performance of your servers, how quick they are with compactions, and the amount of data you put at once. I have definitely found single puts to result in low performance.

Best regards,
Ulrich
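The batching approach described in this post can be sketched without any HBase dependency. The example below is a minimal illustration, not the poster's code: `BatchingWriter`, the tuple shape of a put and the `send` callback are all hypothetical names, and `send` stands in for whatever call actually ships a list of puts to the server. It only shows how grouping puts reduces the number of round trips.

```python
from typing import Callable, List, Tuple

# Hypothetical shape of a put: (row key, qualifier, value).
Put = Tuple[str, str, str]


class BatchingWriter:
    """Collects puts and hands them to `send` in groups of `batch_size`."""

    def __init__(self, send: Callable[[List[Put]], None], batch_size: int = 1000):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._send = send
        self._batch_size = batch_size
        self._pending: List[Put] = []

    def put(self, item: Put) -> None:
        # Buffer the put; ship the whole group once it is full.
        self._pending.append(item)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        # Ship whatever is buffered; a no-op when nothing is pending.
        if self._pending:
            self._send(self._pending)
            self._pending = []


def count_round_trips(n_puts: int, batch_size: int) -> int:
    """Counts how many times `send` is invoked for `n_puts` puts."""
    calls: List[int] = []
    writer = BatchingWriter(lambda batch: calls.append(len(batch)), batch_size)
    for i in range(n_puts):
        writer.put((str(i), "c", "Anarchism"))
    writer.flush()  # ship the final, possibly partial, group
    return len(calls)


if __name__ == "__main__":
    print(count_round_trips(30000, 1))     # one call per put
    print(count_round_trips(30000, 1000))  # one call per 1000 puts
```

The final `flush()` matters: without it, the last partial group would never be sent.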
-
Re: Unexpected Data insertion time and Data size explosion
kranthi reddy 2011-12-05, 09:10
Doesn't the configuration setting "hbase.hregion.memstore.flush.size" do the bulk insert? I was of the opinion that HBase would flush all the puts to disk when its memstore is filled, and that this size is controlled by a property defined in hbase-default.xml. Is my understanding wrong here?

--
Kranthi Reddy. B
http://www.setusoftware.com/setu/index.htm
-
Re: Unexpected Data insertion time and Data size explosion
Ulrich Staudinger 2011-12-05, 15:13
The point I refer to is not so much about when HBase's server side flushes, but when the client side flushes. If you put every value immediately, it will result in an RPC call every time. If you collect the data on the client side and flush (on the client side) manually, it will result in one RPC call with a hundred or a thousand small puts inside, instead of hundreds or thousands of individual put RPC calls.

Another issue is that I am not so sure what happens if you collect hundreds of thousands of small puts, which might possibly be bigger than the memstore, and flush then. I guess the HBase client will hang.
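One way to keep a client-side batch from growing without bound, as worried about above, is to flush on accumulated bytes rather than on a count of puts. The sketch below illustrates that idea only; `SizeBoundedBuffer`, `send` and `batch_sizes` are hypothetical names, nothing here calls the HBase client, and the limit would have to be chosen by experiment.

```python
from typing import Callable, List


class SizeBoundedBuffer:
    """Buffers payloads and flushes once `max_bytes` have accumulated."""

    def __init__(self, send: Callable[[List[bytes]], None], max_bytes: int):
        if max_bytes < 1:
            raise ValueError("max_bytes must be at least 1")
        self._send = send
        self._max_bytes = max_bytes
        self._pending: List[bytes] = []
        self._pending_bytes = 0

    def add(self, payload: bytes) -> None:
        # Track the buffered size and flush as soon as the limit is reached.
        self._pending.append(payload)
        self._pending_bytes += len(payload)
        if self._pending_bytes >= self._max_bytes:
            self.flush()

    def flush(self) -> None:
        # Ship whatever is buffered; a no-op when nothing is pending.
        if self._pending:
            self._send(self._pending)
            self._pending = []
            self._pending_bytes = 0


def batch_sizes(payload_len: int, count: int, max_bytes: int) -> List[int]:
    """Returns the number of payloads in each flushed batch."""
    sizes: List[int] = []
    buf = SizeBoundedBuffer(lambda batch: sizes.append(len(batch)), max_bytes)
    for _ in range(count):
        buf.add(b"x" * payload_len)
    buf.flush()  # ship the final, possibly partial, batch
    return sizes
```

With 30-byte payloads and a 100-byte limit, ten payloads leave the buffer as batches of 4, 4 and 2, so no single call ever carries much more than the limit.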
-
Re: Unexpected Data insertion time and Data size explosion
kranthi reddy 2011-12-05, 16:33
Ok. But can someone explain why the data size is exploding the way I have mentioned earlier?

I have tried to insert sample data of around 12GB. The space occupied by the HBase table is around 130GB. All my columns, including the ROWID, are strings. I have even tried converting my ROWID to long, but that seems to occupy more space, i.e. around 150GB.

Sample rows

0-<>-f-<>-c-<>-Anarchism
0-<>-f-<>-e1-<>-Routledge Encyclopedia of Philosophy
0-<>-f-<>-e2-<>-anarchy
1-<>-f-<>-c-<>-Anarchism
1-<>-f-<>-e1-<>-anarchy
1-<>-f-<>-e2-<>-state (polity)
2-<>-f-<>-c-<>-Anarchism
2-<>-f-<>-e1-<>-anarchy
2-<>-f-<>-e2-<>-political philosophy
3-<>-f-<>-c-<>-Anarchism
3-<>-f-<>-e1-<>-The Globe and Mail
3-<>-f-<>-e2-<>-anarchy
4-<>-f-<>-c-<>-Anarchism
4-<>-f-<>-e1-<>-anarchy
4-<>-f-<>-e2-<>-stateless society

Is there a way I can know the number of bytes occupied by each key:value for each cell?

--
Kranthi Reddy. B
http://www.setusoftware.com/setu/index.htm
-
Re: Unexpected Data insertion time and Data size explosion
kranthi reddy 2011-12-05, 17:26
1) Does having a dfs.replication factor of "3" in general result in a table data size of 3x + y (where x is the size of the file in the local file system and y is some additional space for meta information storage)?

2) Does HBase pre-allocate space for all the cell versions when the cell is created for the first time? Unfortunately, I am just unable to wrap my head around the problem of such an exponential increase in data size. Unless this is what is happening (which I doubt), I just don't get how such exponential growth of table data is possible.

3) Or is it the case that my KEY is larger than the VALUE, hence resulting in such a large size increase?

Similar to the sample rows in my previous mail, I have around 300 million entries and the ROWID increases linearly.

--
Kranthi Reddy. B
http://www.setusoftware.com/setu/index.htm
-
Re: Unexpected Data insertion time and Data size explosion
Doug Meil 2011-12-05, 17:42
Hi there-

Have you looked at this?

http://hbase.apache.org/book.html#keyvalue
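To make the per-cell cost concrete, the size of a single cell can be estimated from the KeyValue layout described in the book section linked above: a 4-byte key length, a 4-byte value length, a 2-byte row length, the row, a 1-byte family length, the family, the qualifier, an 8-byte timestamp, a 1-byte key type, and the value. The sketch below assumes that layout and ignores everything around it (file and block overhead, the write-ahead log, replication, compression), so it gives a lower bound per cell rather than the on-disk total. `keyvalue_size` is a hypothetical helper, not an HBase API.

```python
import struct

# Fixed bytes per cell under the assumed layout:
# key length (4) + value length (4) + row length (2)
# + family length (1) + timestamp (8) + key type (1)
FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1


def keyvalue_size(row: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Estimated serialized size of one cell, in bytes."""
    return FIXED_OVERHEAD + len(row) + len(family) + len(qualifier) + len(value)


# Cells taken from the sample rows posted earlier in the thread.
samples = [
    (b"0", b"f", b"c", b"Anarchism"),
    (b"0", b"f", b"e1", b"Routledge Encyclopedia of Philosophy"),
    (b"0", b"f", b"e2", b"anarchy"),
]
for row, family, qualifier, value in samples:
    size = keyvalue_size(row, family, qualifier, value)
    print(f"{qualifier.decode():>2}: value {len(value):>2} bytes, cell {size} bytes")

# The same cell keyed by a short decimal string versus an 8-byte big-endian long.
string_key = keyvalue_size(b"0", b"f", b"c", b"Anarchism")
long_key = keyvalue_size(struct.pack(">q", 0), b"f", b"c", b"Anarchism")
print(f"string row key: {string_key} bytes, long row key: {long_key} bytes")
```

Because the row key, family and qualifier are repeated in every cell, short values pay a large fixed cost. The last comparison also suggests why switching the row ID to a long did not help: an 8-byte long is bigger than a short decimal string such as "0".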
-
Re: Unexpected Data insertion time and Data size explosion
kranthi reddy 2011-12-19, 05:54
Hi all,

I have been able to understand clearly why my storage is occupying such huge space.

I have an issue with the insertion time. I currently have 0.1 billion records (in HBase format; in future it would run into a few billion) and am inserting them using 12 map tasks running on a 4-machine Hadoop cluster.

The time taken is approximately 3 hours, which on calculation leads to around 750 rows inserted per map task per second. Is this good, or can it be improved?

0.1 billion -> 100000000 / (180 min * 60 sec * 12 map tasks) = 750 (approx).

I have tried using the batch() function, but there is no improvement in the insertion time.

I have attached the code that I am using to insert. Can someone please check whether what I am doing is the fastest and best way to insert data?

Regards,
Kranthi

--
Kranthi Reddy. B
http://www.setusoftware.com/setu/index.htm
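The throughput estimate in this post can be reproduced directly. The sketch below uses only the figures given in the post (100,000,000 records, 180 minutes, 12 map tasks) and assumes the work is spread evenly over the tasks. Under those assumptions the rate comes out slightly above the rounded 750 quoted, at about 772 rows per second per task, or roughly 1.3ms per row per task.

```python
# Figures quoted in the post; an even spread across map tasks is assumed.
records = 100_000_000
minutes = 180
map_tasks = 12

seconds = minutes * 60
rows_per_second_total = records / seconds
rows_per_second_per_task = rows_per_second_total / map_tasks
ms_per_row_per_task = 1000 / rows_per_second_per_task

print(f"cluster : {rows_per_second_total:.0f} rows/s")
print(f"per task: {rows_per_second_per_task:.0f} rows/s")
print(f"per row : {ms_per_row_per_task:.3f} ms per task")
```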