Upender K. Nimbekar
2012-12-17, 15:34
Ted Yu
2012-12-17, 17:45
Upender K. Nimbekar
2012-12-17, 19:11
Ted Yu
2012-12-18, 00:52
Upender K. Nimbekar
2012-12-18, 02:30
Ted Yu
2012-12-18, 03:28
Nick Dimiduk
2012-12-18, 17:31
Upender K. Nimbekar
2012-12-18, 19:06
Jean-Daniel Cryans
2012-12-18, 19:17
Nick Dimiduk
2012-12-18, 19:20
lars hofhansl
2012-12-19, 07:07
lars hofhansl
2012-12-19, 07:10
-
HBase Map/Reduce Data Ingest Performance
Upender K. Nimbekar 2012-12-17, 15:34
Hi All,
I have a question about improving Map/Reduce job performance while ingesting a huge amount of data into HBase using HFileOutputFormat. Here is what we are using:

1) *Cloudera hadoop-0.20.2-cdh3u*
2) *hbase-0.90.4-cdh3u2*

I've used 2 different strategies, as described below.

*Strategy#1:* Pre-split the table with 10 regions per region server, then kick off the Hadoop job with HFileOutputFormat.configureIncrementalLoad. This mechanism creates reduce tasks equal to the number of regions * 10. We used the "hash" of each record as the key of the map output. With this, each mapper finished in an acceptable amount of time, but the reduce tasks took forever to finish. We found that first the copy/shuffle phase took a considerable amount of time, and then the sort phase took forever to finish.
We tried to address this issue by constructing the key as "fixedhash1"_"hash2", where "fixedhash1" is fixed for all the records of a given mapper. The idea was to reduce shuffling/copying from each mapper. But even this solution didn't save us any time, and the reduce step still took a significant amount of time to finish. I played with adjusting the number of pre-split regions in both directions, but to no avail.
This led us to move to Strategy#2, where we got rid of the reduce step.

*QUESTION:* Is there anything I could have done better in this strategy to make the reduce step finish faster? Do I need to produce row keys differently than "hash1"_"hash2" of the text? Is this a known issue with CDH3 or HBase 0.90? Please help me troubleshoot.

*Strategy#2:* Pre-split the table with 10 regions per region server, then kick off the Hadoop job with HFileOutputFormat.configureIncrementalLoad, but set the number of reducers to 0. In this strategy (the current one), I pre-sorted all the mapper input using a TreeSet before writing it to the output. With the number of reducers = 0, the mappers write directly to HFiles. This was cool because the map/reduce job (no reduce phase, actually) finished very fast and we noticed the HFiles got written very quickly. Then I used the *LoadIncrementalHFiles.doBulkLoad()* API to move the HFiles into HBase. I called this method in the driver class on successful completion of the job. This is working much better than Strategy#1 in terms of performance. But the doBulkLoad() call in the driver sometimes takes longer if there is a huge amount of data.

*QUESTION:* Is there any way to make doBulkLoad() run faster? Can I call this API from the mapper directly, instead of waiting for the whole job to finish first? I've used the HBase "completebulkload" utility, but it has two issues. First, I do not see any performance improvement with it. Second, it needs to be run separately from the Hadoop job driver class, and we wanted to integrate both pieces. So we used *LoadIncrementalHFiles.doBulkLoad()*.

Also, we used the HBase RegionSplitter to pre-split the regions. But the HBase 0.90 version doesn't have the option to pass ALGORITHM. Is that something we need to worry about?

Please help point me in the right direction to address this problem.

Thanks
Upen
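For readers following along, the pre-split and pre-sort idea behind Strategy#2 can be sketched in plain Java. This is only an illustration under stated assumptions, not the code used in the job above: the class `PreSplitSketch`, its methods, and the two-hex-digit key prefix are hypothetical names and choices made for the example, and no HBase classes are involved. It relies on the fact that HBase orders row keys byte by byte as unsigned values, so anything written into an HFile has to arrive in that order.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class PreSplitSketch {

    /** Orders row keys byte by byte, treating each byte as unsigned. */
    static final Comparator<byte[]> UNSIGNED_BYTES = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length; // a shorter key sorts before its extensions
    };

    /** Evenly spaced split keys over a two-hex-digit prefix ("00".."ff"). */
    static List<byte[]> hexSplitPoints(int regions) {
        List<byte[]> splits = new ArrayList<>();
        for (int i = 1; i < regions; i++) {
            int boundary = i * 256 / regions;
            splits.add(String.format("%02x", boundary)
                    .getBytes(StandardCharsets.US_ASCII));
        }
        return splits;
    }

    /** Index of the region whose key range contains rowKey. */
    static int regionFor(byte[] rowKey, List<byte[]> splits) {
        int region = 0;
        for (byte[] split : splits) {
            if (UNSIGNED_BYTES.compare(rowKey, split) >= 0) {
                region++;
            }
        }
        return region;
    }

    public static void main(String[] args) {
        // Four regions need three split keys: "40", "80", "c0".
        List<byte[]> splits = hexSplitPoints(4);

        // A TreeSet with the unsigned comparator keeps keys in row-key order.
        TreeSet<byte[]> sorted = new TreeSet<>(UNSIGNED_BYTES);
        for (String k : new String[] {"c3_row", "07_row", "81_row", "3f_row"}) {
            sorted.add(k.getBytes(StandardCharsets.US_ASCII));
        }

        for (byte[] key : sorted) {
            System.out.println(new String(key, StandardCharsets.US_ASCII)
                    + " -> region " + regionFor(key, splits));
        }
    }
}
```

With four regions, the sample keys come out in the order 07_row, 3f_row, 81_row, c3_row and land in regions 0, 0, 2 and 3. Real hashed keys would spread more evenly; the point is only that sorted output and region boundaries are both defined by the same byte ordering.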
-
Re: HBase Map/Reduce Data Ingest Performance
Ted Yu 2012-12-17, 17:45
Thanks for sharing your experiences.
Have you considered upgrading to HBase 0.92 or 0.94? There have been several bug fixes and enhancements to the LoadIncrementalHFiles.doBulkLoad() API in newer HBase releases.

Cheers
-
Re: HBase Map/Reduce Data Ingest Performance
Upender K. Nimbekar 2012-12-17, 19:11
Sure, I can try that. Just curious: out of these 2 strategies, which one do you think is better? Do you have any experience of trying one or the other?

Thanks
Upen
-
Re: HBase Map/Reduce Data Ingest Performance
Ted Yu 2012-12-18, 00:52
I think the second approach is better.

Cheers
-
Re: HBase Map/Reduce Data Ingest Performance
Upender K. Nimbekar 2012-12-18, 02:30
Thanks! I'm calling doBulkLoad() from the mapper's cleanup() method, but I'm running into permission issues when the hbase user tries to import the HFiles into HBase. I'm not sure if there is a way to change the target HDFS file permissions via HFileOutputFormat.
-
Re: HBase Map/Reduce Data Ingest Performance
Ted Yu 2012-12-18, 03:28
Experts from Cloudera would be more familiar with security in hadoop-0.20.2-cdh3u.
If you can show us the exception (using pastebin, e.g.), that would help find the root cause.

Cheers
-
Re: HBase Map/Reduce Data Ingest Performance
Nick Dimiduk 2012-12-18, 17:31
Dumb question: what are the filesystem permissions of your generated HFiles? Can the HBase process read them? Maybe a simple chmod or chown will get you the rest of the way there.
-
Re: HBase Map/Reduce Data Ingest Performance
Upender K. Nimbekar 2012-12-18, 19:06
I would like to request that you maintain respect for people asking questions on this forum. Let's not start the thread in the wrong direction.
I wish it was a dumb question. I did chmod 777 prior to calling doBulkLoad(). The chmod call succeeded, but the doBulkLoad() call still threw an exception. However, it does work if I do the chmod and doBulkLoad() from the Hadoop driver after the job is finished.
BTW, the hbase user needs WRITE permission, and NOT just read, because it creates some _tmp directories.

Upen
-
Re: HBase Map/Reduce Data Ingest Performance
Jean-Daniel Cryans 2012-12-18, 19:17
I don't think Nick was being disrespectful. Usually when people prefix
a question with "Dumb question", it means that they think their own
question is dumb but feel like asking it anyway, in case something
basic wasn't covered.

J-D

On Tue, Dec 18, 2012 at 11:06 AM, Upender K. Nimbekar <[EMAIL PROTECTED]> wrote:
> I would like to request that you maintain respect for people asking questions
> on this forum. Let's not start the thread in the wrong direction.
> I wish it was a dumb question. I did chmod 777 prior to calling bulkLoad.
> The call succeeded, but the bulkLoad call still threw an exception. However,
> it does work if I do chmod and bulkLoad() from the Hadoop driver after the
> job is finished.
> BTW, the HBase user needs WRITE permission and NOT read, because it created
> some _tmp directories.
>
> Upen
>
> On Tue, Dec 18, 2012 at 12:31 PM, Nick Dimiduk <[EMAIL PROTECTED]> wrote:
>
>> Dumb question: what are the filesystem permissions of your generated HFiles?
>> Can the HBase process read them? Maybe a simple chmod or chown will get you
>> the rest of the way there.
>>
>> On Mon, Dec 17, 2012 at 6:30 PM, Upender K. Nimbekar <[EMAIL PROTECTED]> wrote:
>>
>> > Thanks! I'm calling doBulkLoad() from the mapper's cleanup() method, but
>> > I'm running into permission issues when the hbase user tries to import
>> > the HFiles into HBase. Not sure if there is a way to change the target
>> > HDFS file permissions via HFileOutputFormat.
>> >
>> > On Mon, Dec 17, 2012 at 7:52 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>> >
>> > > I think the second approach is better.
>> > >
>> > > Cheers
>> > >
>> > > On Mon, Dec 17, 2012 at 11:11 AM, Upender K. Nimbekar <[EMAIL PROTECTED]> wrote:
>> > >
>> > > > Sure. I can try that. Just curious, out of these 2 strategies, which
>> > > > one do you think is better? Do you have any experience of trying one
>> > > > or the other?
-
Re: HBase Map/Reduce Data Ingest Performance
Nick Dimiduk 2012-12-18, 19:20
Please forgive my poor choice of words; I meant no disrespect.

-n

On Tue, Dec 18, 2012 at 11:06 AM, Upender K. Nimbekar <[EMAIL PROTECTED]> wrote:
> I would like to request that you maintain respect for people asking questions
> on this forum. Let's not start the thread in the wrong direction.
> I wish it was a dumb question. I did chmod 777 prior to calling bulkLoad.
> The call succeeded, but the bulkLoad call still threw an exception. However,
> it does work if I do chmod and bulkLoad() from the Hadoop driver after the
> job is finished.
> BTW, the HBase user needs WRITE permission and NOT read, because it created
> some _tmp directories.
>
> Upen
>
> On Tue, Dec 18, 2012 at 12:31 PM, Nick Dimiduk <[EMAIL PROTECTED]> wrote:
>
> > Dumb question: what are the filesystem permissions of your generated HFiles?
> > Can the HBase process read them? Maybe a simple chmod or chown will get you
> > the rest of the way there.
> >
> > On Mon, Dec 17, 2012 at 6:30 PM, Upender K. Nimbekar <[EMAIL PROTECTED]> wrote:
> >
> > > Thanks! I'm calling doBulkLoad() from the mapper's cleanup() method, but
> > > I'm running into permission issues when the hbase user tries to import
> > > the HFiles into HBase. Not sure if there is a way to change the target
> > > HDFS file permissions via HFileOutputFormat.
> > >
> > > On Mon, Dec 17, 2012 at 7:52 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > >
> > > > I think the second approach is better.
> > > >
> > > > Cheers
> > > >
> > > > On Mon, Dec 17, 2012 at 11:11 AM, Upender K. Nimbekar <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Sure. I can try that. Just curious, out of these 2 strategies, which
> > > > > one do you think is better? Do you have any experience of trying one
> > > > > or the other?
> > > > >
> > > > > Thanks
> > > > > Upen
> > > > >
> > > > > On Mon, Dec 17, 2012 at 12:45 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > Thanks for sharing your experiences.
> > > > > >
> > > > > > Have you considered upgrading to HBase 0.92 or 0.94?
-
Re: HBase Map/Reduce Data Ingest Performance
lars hofhansl 2012-12-19, 07:07
Hi Upender,

I think you misinterpreted what Nick was saying. Personally, if I start
something with "Dumb question", what I mean is "please forgive me if you
had already thought about this, just making sure in case you missed it".
I think Nick meant it the same way.
We're pretty friendly folks here (mostly ;-) ).

-- Lars

________________________________
From: Upender K. Nimbekar <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Tuesday, December 18, 2012 11:06 AM
Subject: Re: HBase Map/Reduce Data Ingest Performance

I would like to request that you maintain respect for people asking questions
on this forum. Let's not start the thread in the wrong direction.
I wish it was a dumb question. I did chmod 777 prior to calling bulkLoad.
The call succeeded, but the bulkLoad call still threw an exception. However,
it does work if I do chmod and bulkLoad() from the Hadoop driver after the
job is finished.
BTW, the HBase user needs WRITE permission and NOT read, because it created
some _tmp directories.

Upen

On Tue, Dec 18, 2012 at 12:31 PM, Nick Dimiduk <[EMAIL PROTECTED]> wrote:

> Dumb question: what are the filesystem permissions of your generated HFiles?
> Can the HBase process read them? Maybe a simple chmod or chown will get you
> the rest of the way there.
>
> On Mon, Dec 17, 2012 at 6:30 PM, Upender K. Nimbekar <[EMAIL PROTECTED]> wrote:
>
> > Thanks! I'm calling doBulkLoad() from the mapper's cleanup() method, but
> > I'm running into permission issues when the hbase user tries to import
> > the HFiles into HBase. Not sure if there is a way to change the target
> > HDFS file permissions via HFileOutputFormat.
> >
> > On Mon, Dec 17, 2012 at 7:52 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> >
> > > I think the second approach is better.
> > >
> > > Cheers
> > >
> > > On Mon, Dec 17, 2012 at 11:11 AM, Upender K. Nimbekar <[EMAIL PROTECTED]> wrote:
> > >
> > > > Sure. I can try that. Just curious, out of these 2 strategies, which
> > > > one do you think is better?
-
Re: HBase Map/Reduce Data Ingest Performance
lars hofhansl 2012-12-19, 07:10
Now of course I see that both Nick and J-D already replied saying something similar.
Apologies for repeating.

Anyway, please keep asking questions. That is how we all learn.

________________________________
From: lars hofhansl <[EMAIL PROTECTED]>
To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
Sent: Tuesday, December 18, 2012 11:07 PM
Subject: Re: HBase Map/Reduce Data Ingest Performance

Hi Upender,

I think you misinterpreted what Nick was saying. Personally, if I start
something with "Dumb question", what I mean is "please forgive me if you
had already thought about this, just making sure in case you missed it".
I think Nick meant it the same way.
We're pretty friendly folks here (mostly ;-) ).

-- Lars

________________________________
From: Upender K. Nimbekar <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Tuesday, December 18, 2012 11:06 AM
Subject: Re: HBase Map/Reduce Data Ingest Performance

I would like to request that you maintain respect for people asking questions
on this forum. Let's not start the thread in the wrong direction.
I wish it was a dumb question. I did chmod 777 prior to calling bulkLoad.
The call succeeded, but the bulkLoad call still threw an exception. However,
it does work if I do chmod and bulkLoad() from the Hadoop driver after the
job is finished.
BTW, the HBase user needs WRITE permission and NOT read, because it created
some _tmp directories.

Upen

On Tue, Dec 18, 2012 at 12:31 PM, Nick Dimiduk <[EMAIL PROTECTED]> wrote:

> Dumb question: what are the filesystem permissions of your generated HFiles?
> Can the HBase process read them? Maybe a simple chmod or chown will get you
> the rest of the way there.
>
> On Mon, Dec 17, 2012 at 6:30 PM, Upender K. Nimbekar <[EMAIL PROTECTED]> wrote:
>
> > Thanks! I'm calling doBulkLoad() from the mapper's cleanup() method, but
> > I'm running into permission issues when the hbase user tries to import
> > the HFiles into HBase. Not sure if there is a way to change the target
> > HDFS file permissions via HFileOutputFormat.