|
Hao Gong
2009-08-03, 07:59
Konstantin Boudnik
2009-08-03, 17:02
Hao Gong
2009-08-04, 01:49
Konstantin Boudnik
2009-08-04, 15:52
Konstantin Boudnik
2009-08-04, 16:06
Raghu Angadi
2009-08-04, 17:10
Hao Gong
2009-08-05, 01:42
Imran M Yousuf
2009-08-05, 02:02
Raghu Angadi
2009-08-05, 23:48
Raghu Angadi
2009-08-06, 00:42
Hao Gong
2009-08-07, 05:46
Jason Venner
2009-08-07, 13:36
|
-
why did I achieve such poor performance of HDFS
Hao Gong 2009-08-03, 07:59
Hi all,
I have been using HDFS as a distributed storage system for an experiment, and in my tests its performance is very poor. I ran two scenarios:

1) Medium-size file test: I PUT 200,000 medium-size files (20KB~20MB, random sizes) into HDFS and had 10 clients simultaneously GET 5000 random files each. The average client GET throughput is very poor (approximately less than 14000 KBps).
2) Large file test: I PUT 20,000 large files (250MB~750MB, random sizes) into HDFS and had 10 clients simultaneously GET 100 random files each. The average client GET throughput is also very poor (approximately less than 12500 KBps).

These results puzzle me. Why does HDFS perform so poorly? The throughput available to a client is far below the limit of the network bandwidth. Is there any parameter I need to change to get high performance from HDFS? (I used the default parameter values.)

My environment is as follows:
1) 30 commodity PCs as HDFS slaves (Core 2 E7200, 4 GB RAM, 1.5 TB HDD)
2) 10 commodity PCs as HDFS clients (Core 2 E7200, 4 GB RAM, 1.5 TB HDD)
3) 1 commodity PC as HDFS master (Core 2 E7200, 4 GB RAM, 1.5 TB HDD)
4) 1000M switch and links in a star network topology
5) Hadoop version 0.20.0, JRE version 1.6.0_11

If anybody is researching the performance of HDFS, please contact me. Thank you very much.

Best regards,
Hao Gong
Huawei Technologies Co., Ltd
***********************************************
This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!
***********************************************
-
Re: why did I achieve such poor performance of HDFS
Konstantin Boudnik 2009-08-03, 17:02
Hi Hao.
Thanks for the observation. I'll leave comments on this particular situation to someone who knows more about HDFS than I do, but I would like to ask you a couple of questions:
- Do you have that particular test in a completely separable form? I.e., is it automated, and can it be reused easily by someone else?
- Could you share this test with the rest of the community, through a JIRA or otherwise?

Thanks,
Konstantin (aka Cos)

--
With best regards,
Konstantin Boudnik (aka Cos)
Yahoo! Grid Computing
+1 (408) 349-4049
2CAC 8312 4870 D885 8616 6115 220F 6980 1F27 E622
Attention! Streams of consciousness are disallowed
-
re: why did I achieve such poor performance of HDFS
Hao Gong 2009-08-04, 01:49
Hi Konstantin,
Thank you for your response.
1. Yes. It is automated, and I think anyone can reuse it easily, because I didn't change the HDFS code or any parameters except "hadoop.tmp.dir" and "fs.default.name".
2. Yes. I can share our test with the community. How should I do that?

By the way, I have a small question about HDFS: is the HDFS client single-threaded or multi-threaded when it transfers the blocks of a file? For example, suppose file A is 256MB and is split into 4 blocks on 4 datanodes. When a client PUTs or GETs this file, is the operation sequential (one block after another) or simultaneous (the client GETs the 4 blocks from the 4 datanodes at the same time)? In the client source, I used "FSDataInputStream.read(long position, byte[] buffer, int offset, int length)" to GET the file.

Thanks very much.

Best regards,
Hao Gong
Huawei Technologies Co., Ltd
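To make the 256MB example concrete, here is a minimal sketch of how a file maps onto fixed-size blocks. It is plain Python and does not use the HDFS API. The 64 MB block size is an assumption (it is what "256MB in 4 blocks" implies), and `block_ranges` is a hypothetical helper name, not anything from Hadoop.

```python
# Assumption: 64 MB blocks, which is what "256MB in 4 blocks" implies.
MB = 1024 * 1024
ASSUMED_BLOCK_SIZE = 64 * MB


def block_ranges(file_size, block_size=ASSUMED_BLOCK_SIZE):
    """Return one (offset, length) pair per block of a file of file_size bytes."""
    ranges = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges


# The 256 MB file from the question: four 64 MB blocks.
for offset, length in block_ranges(256 * MB):
    print(f"block at offset {offset // MB} MB, length {length // MB} MB")
```

Each (offset, length) pair is a range that a positional read could ask for on its own. Whether those ranges are fetched one after another or at the same time is up to the caller.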
-
Re: why did I achieve such poor performance of HDFS
Konstantin Boudnik 2009-08-04, 15:52
Hi Hao.
One more question for you (I should have asked it in my first email, though): what is your network speed/throughput for such massive reads WITHOUT HDFS in place? While I agree that ~14000 KBps isn't that much at all, I was wondering what the speed of 5000 simultaneous reads from a native file system over the same network would be. Could such a test be conducted in your setup?

One more issue is that in your first test the files are smaller than the default HDFS block size (64MB, I think), which is likely to create significant overhead and affect performance.

1) To share your current test, you can simply create a new JIRA at https://issues.apache.org/jira/browse/ under 'test', or send it to me as an attachment and I'll take care of the JIRA stuff. But I'd love to see the result of the other test I mentioned above, if possible.
2) DFSClient does provide an API for random reads from a file, and this API is thread safe. However, my uneducated guess is that it is the responsibility of the client (your program) to 'rebuild' the file from the randomly read blocks in the correct order. It is like pretty much any other filesystem out there: YOU have to know the sequence of the pieces of your file in order to reconstruct them from many concurrent reads.

Hope it helps,
Konstantin
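The idea of concurrent positional reads reassembled by offset can be sketched on an ordinary local file. This is only an analogy, not the DFSClient API: it uses Python's `os.pread` (POSIX only) in place of an HDFS positional read, and `read_range`, `parallel_read` and the chunk size are hypothetical names and values chosen for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_range(fd, offset, length):
    """Positional read of up to `length` bytes starting at `offset`.

    os.pread does not move a shared file position, so several threads
    can safely use the same descriptor at once.
    """
    chunks = []
    remaining = length
    while remaining > 0:
        data = os.pread(fd, remaining, offset)
        if not data:  # reached end of file
            break
        chunks.append(data)
        offset += len(data)
        remaining -= len(data)
    return b"".join(chunks)


def parallel_read(path, chunk_size, workers=4):
    """Read a file as fixed-size ranges in parallel, reassembled in offset order."""
    size = os.path.getsize(path)
    offsets = range(0, size, chunk_size)
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # map() returns results in input order, not completion order,
            # so the caller's knowledge of the offsets fixes the sequence.
            parts = pool.map(lambda off: read_range(fd, off, chunk_size), offsets)
            return b"".join(parts)
    finally:
        os.close(fd)
```

The reassembly works only because the caller keeps track of which offset each piece came from, which is the point made in item 2 above.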
-
Re: why did I achieve such poor performance of HDFS
Konstantin Boudnik 2009-08-04, 16:06
And here's some reading you may find useful:
http://www.facebook.com/note.php?note_id=53035052002&ref=mf
-
Re: why did I achieve such poor performance of HDFS
Raghu Angadi 2009-08-04, 17:10
Most of the time such simple reads (especially of large files) are I/O bound: either network or disk. One way to debug your issue is to read 100 large files from a single client and observe what bandwidth you get. If it is low, you could do various things to find the bottleneck (stack trace of the client, iostat, netstat).

Raghu.
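A single-client bandwidth check of the kind suggested here could look roughly like the sketch below. It times a sequential read of any file-like object and reports MB/s. It knows nothing about HDFS: `measure_read_throughput` is a hypothetical helper, the 1 MB buffer is an arbitrary choice, and the stream must be supplied by the caller.

```python
import time


def measure_read_throughput(stream, buffer_size=1 << 20):
    """Read `stream` to the end and return (bytes_read, seconds, mb_per_second).

    MB here means 10**6 bytes. `stream` is any binary file-like object.
    """
    total = 0
    start = time.perf_counter()
    while True:
        data = stream.read(buffer_size)
        if not data:
            break
        total += len(data)
    elapsed = time.perf_counter() - start
    mb_per_s = (total / 1e6) / elapsed if elapsed > 0 else float("inf")
    return total, elapsed, mb_per_s
```

For a local baseline one might call it as `measure_read_throughput(open(path, "rb"))` with `path` pointing at a large test file. Repeated runs on the same file may be served from the OS page cache, so the first cold read is the more meaningful number.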
-
re: why did I achieve such poor performance of HDFS
Hao Gong 2009-08-05, 01:42
Hi Konstantin and Raghu,
1. There may be a misunderstanding. We didn't trigger 5000 simultaneous reads; we start only 10 clients, and each client performs 5000 random reads sequentially (one by one) in a single thread, so any given file is read by at most 10 clients simultaneously. Our network speed is 1000Mb/s, and the throughput we measured between any two nodes is approximately 90MB/s. We also tested disk I/O with "iostat", and the result is approximately 80MB/s; our disks are Seagate 7200.11 series 1.5TB.
2. I will upload our test to JIRA as soon as possible.

Thanks.

Best regards,
Hao Gong
Huawei Technologies Co., Ltd
-
Re: why did I achieve such poor performance of HDFS
Imran M Yousuf 2009-08-05, 02:02
2009/8/5 Hao Gong <[EMAIL PROTECTED]>:
> And our network speed is 1000Mb/s, we test the throughput between any two
> nodes is approximately 90MB/s. We also test the disk I/O by "iostat", and

Hi Hao,
So it's actually 720Mb/s; that's about 72% of the network speed and more than the I/O speed. I would have been quite impressed TBH :). Looking forward to checking out the tests.

- Imran

Imran M Yousuf
Entrepreneur & Software Engineer
Smart IT Engineering
Dhaka, Bangladesh
Email: [EMAIL PROTECTED]
Blog: http://imyousuf-tech.blogs.smartitengineering.com/
Mobile: +880-1711402557
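The unit conversion behind the 720Mb/s figure can be written out as below, alongside the other numbers quoted in this thread. The sketch assumes 1 byte = 8 bits, 1 MB = 1000 KB and a nominal 1000 Mb/s link; the function names are illustrative only.

```python
# Assumptions: 1 byte = 8 bits, 1 MB = 1000 KB, nominal link speed 1000 Mb/s.
LINK_MBIT_PER_S = 1000.0


def mbyte_to_mbit(mbyte_per_s):
    """Convert megabytes per second to megabits per second."""
    return mbyte_per_s * 8.0


def link_fraction(mbyte_per_s, link_mbit_per_s=LINK_MBIT_PER_S):
    """Fraction of the nominal link speed that a given throughput uses."""
    return mbyte_to_mbit(mbyte_per_s) / link_mbit_per_s


# Figures quoted in the thread, all expressed in MB/s.
figures = {
    "node-to-node network test": 90.0,
    "disk test (iostat)": 80.0,
    "HDFS GET, medium files (14000 KBps)": 14000 / 1000.0,
    "HDFS GET, large files (12500 KBps)": 12500 / 1000.0,
}

for name, mb_per_s in figures.items():
    print(f"{name}: {mbyte_to_mbit(mb_per_s):.0f} Mb/s, "
          f"{link_fraction(mb_per_s):.1%} of the link")
```

Under these assumptions the node-to-node test comes out at 720 Mb/s (72% of the link), while the two reported HDFS GET figures are about 112 Mb/s and 100 Mb/s, roughly a tenth of the nominal link speed. The HDFS figures were reported as "approximately less than", so they are upper bounds.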
-
Re: why did I achieve such poor performance of HDFSRaghu Angadi 2009-08-05, 23:48
It is still simpler if you test with just one client first.

Raghu.

Hao Gong wrote:
> Hi Konstantin and Raghu
>
> 1. There may be a misunderstanding. We didn't trigger 5000 simultaneous
> reads; we only start 10 clients, and every client triggers 5000 random reads
> sequentially (one by one) in a single thread, so a given file is read by at
> most 10 clients simultaneously.
> Our network speed is 1000Mb/s; the throughput we measured between any two
> nodes is approximately 90MB/s. We also tested the disk I/O with "iostat", and
> the result is approximately 80MB/s. Our disks are Seagate 7200.11 series
> 1.5TB.
> 2. I will upload our test to JIRA as soon as possible.
> Thanks.
>
> Best regards,
> Hao Gong
> Huawei Technologies Co., Ltd
>
> -----Original Message-----
> From: Konstantin Boudnik [mailto:[EMAIL PROTECTED]]
> Sent: 2009-08-05 0:07
> To: [EMAIL PROTECTED]
> Subject: Re: why did I achieve such poor performance of HDFS
>
> And here's some reading you may find useful:
> http://www.facebook.com/note.php?note_id=53035052002&ref=mf
>
> On 8/4/09 8:52 AM, Konstantin Boudnik wrote:
>> Hi Hao.
>>
>> One more question for you - I should've asked it in my first email, though...
>> What is your network speed/throughput on such massive reads WITHOUT HDFS in
>> place? While I agree that ~14Kbps isn't that much at all, I was wondering what
>> would be the speed of 5000 simultaneous reads from a native file system over
>> the same network?
>>
>> Could such a test be conducted in your setup?
>>
>> One more issue here is that in your first test the size of a file is smaller
>> than the default HDFS block size (64MB, I think), and that is likely to create
>> significant overhead and affect the performance.
>>
>> 1) To share your current test you can simply create a new JIRA under
>> https://issues.apache.org/jira/browse/ under 'test', or simply send it to me as
>> an attachment and I'll take care of the JIRA stuff. But I'd love to see the
>> result of the other test I've mentioned above if possible.
>>
>> 2) DFSClient does provide an API for random reads from a file, and this API is
>> thread safe. However, my uneducated guess would be that it is likely to be the
>> responsibility of the client (you) to 'rebuild' the file from randomly
>> read blocks in the correct order. It is like pretty much any other filesystem
>> out there: YOU have to know the sequence of the pieces of your file in order to
>> reconstruct them from many concurrent reads.
>>
>> Hope it helps,
>> Konstantin
>>
>> On 8/3/09 6:49 PM, Hao Gong wrote:
>>> Hi Konstantin,
>>>
>>> Thank you for your response.
>>> 1. Yes. It is automated and can be reused easily by anyone, I think,
>>> because I didn't change the HDFS code or parameters except for
>>> "hadoop.tmp.dir" and "fs.default.name".
>>> 2. Yes. I can share our test with the community. How do I do that?
>>>
>>> By the way, I have a small question about HDFS.
>>> 1. Is the HDFS client single-threaded or multi-threaded when it transmits the
>>> blocks of a certain file? For example, if file A is 256MB and is divided
>>> into 4 blocks on 4 datanodes, when the client PUTs or GETs this file,
>>> is the operation sequential (one block at a time) or simultaneous (the client
>>> GETs the 4 blocks from 4 datanodes at the same time)?
>>> In client source, I used "FSDataInputStream.read(long position,
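On Konstantin's quoted point that the caller has to put concurrently read pieces back in the right order: here is a minimal sketch of that idea. It is not HDFS code. It uses a local file and java.nio's positional FileChannel.read as a stand-in for the positional read mentioned in the thread, and the class name ParallelRangeRead, the chunk size and the thread count are all made up for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: local file + java.nio, not the HDFS client.
final class ParallelRangeRead {

    /** Reads the file as fixed-size ranges in parallel and returns the bytes in file order. */
    static byte[] readAll(Path file, int chunkSize, int threads) throws Exception {
        if (chunkSize <= 0 || threads <= 0) {
            throw new IllegalArgumentException("chunkSize and threads must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            if (size > Integer.MAX_VALUE) {
                throw new IllegalArgumentException("file too large for this sketch");
            }
            byte[] out = new byte[(int) size];
            List<Future<?>> pending = new ArrayList<>();
            for (long off = 0; off < size; off += chunkSize) {
                final long start = off;
                final int len = (int) Math.min(chunkSize, size - off);
                pending.add(pool.submit(() -> {
                    // Each task writes into its own slice of the output array,
                    // so the offset alone decides where the piece ends up.
                    ByteBuffer buf = ByteBuffer.wrap(out, (int) start, len);
                    long pos = start;
                    while (buf.hasRemaining()) {
                        int n = ch.read(buf, pos); // positional read, does not move a shared cursor
                        if (n < 0) throw new IOException("unexpected end of file");
                        pos += n;
                    }
                    return null;
                }));
            }
            for (Future<?> f : pending) f.get(); // wait for all ranges, surface any failure
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("range-read", ".bin");
        byte[] data = new byte[1 << 20];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        Files.write(tmp, data);
        byte[] got = readAll(tmp, 64 * 1024, 4);
        System.out.println("reassembled correctly: " + Arrays.equals(data, got));
        Files.delete(tmp);
    }
}
```

The only bookkeeping needed is the offset of each range: the reads can finish in any order because each one lands at a fixed position in the output. Whether parallel ranges actually help against HDFS is a separate question that only a measurement on the cluster can answer.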
-
Re: why did I achieve such poor performance of HDFSRaghu Angadi 2009-08-06, 00:42
Hao Gong wrote:
> Hi Konstantin and Raghu
>
> 1. There may be a misunderstanding. We didn't trigger 5000 simultaneous
> reads; we only start 10 clients, and every client triggers 5000 random reads
> sequentially (one by one) in a single thread, so a given file is read by at
> most 10 clients simultaneously.
> Our network speed is 1000Mb/s; the throughput we measured between any two
> nodes is approximately 90MB/s. We also tested the disk I/O with "iostat", and
> the result is approximately 80MB/s. Our disks are Seagate 7200.11 series
> 1.5TB.

I meant you could use netstat and iostat _while_ reading from HDFS, so that you get a better idea of where the bottleneck is. (Ideally you need to run these on the datanode as well as on the client.)

What does the client do with the data it reads from HDFS?

Raghu.

> 2. I will upload our test to JIRA as soon as possible.
> Thanks.
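To make the "what does the client do with the data" question concrete, here is a rough sketch of a client loop that reads a stream to the end, throws the bytes away, and reports throughput, so that client-side disk writes cannot be the bottleneck. It is written against plain java.io.InputStream (which, as far as I know, FSDataInputStream extends); the class name ReadThroughput, the 64KB buffer and the in-memory test stream are assumptions for illustration, not anything from Hao's test code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration: measures how fast a stream can be drained.
final class ReadThroughput {

    /** Reads the stream to the end, discarding the data, and returns the byte count. */
    static long drain(InputStream in, int bufferSize) throws IOException {
        if (bufferSize <= 0) throw new IllegalArgumentException("bufferSize must be positive");
        byte[] buf = new byte[bufferSize];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n; // count only; nothing is written to local disk
        }
        return total;
    }

    /** Converts a byte count and elapsed time into KB/s (1 KB = 1024 bytes). */
    static double kbPerSecond(long bytes, long elapsedNanos) {
        if (elapsedNanos <= 0) return Double.NaN;
        return (bytes / 1024.0) / (elapsedNanos / 1e9);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a stream opened on an HDFS file.
        InputStream in = new ByteArrayInputStream(new byte[8 * 1024 * 1024]);
        long start = System.nanoTime();
        long bytes = drain(in, 64 * 1024);
        long elapsed = System.nanoTime() - start;
        System.out.printf("read %d bytes at %.0f KB/s%n", bytes, kbPerSecond(bytes, elapsed));
    }
}
```

Running iostat and netstat on both the client and a datanode while a loop like this is going would show whether the disks, the network, or neither is saturated.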
-
re: why did I achieve such poor performance of HDFSHao Gong 2009-08-07, 05:46
Hi Raghu,
I ran the experiment again yesterday, this time using iostat and NMON to measure disk and network I/O throughput. The topology is the same as in the former test.

1. Large file test. I PUT 10,000 large files (250MB~750MB, random sizes) into HDFS and had 1 client GET 5 random files sequentially. When the client reads from HDFS it only records the file size; it does not write the file to local disk, so the client disk I/O is approximately 0KB/s. According to NMON, the client network throughput averages 32743KB/s, with a peak of 79079KB/s. The datanode network throughput averages 22254KB/s, with a peak of 35056KB/s, and the datanode disk I/O averages 24324KB/s.
2. Then I had 10 clients GET random files. The client network throughput declined: the average is only 16343KB/s, with a peak of 38920KB/s. The datanode network throughput averages 34216KB/s, and the datanode disk I/O averages 35323KB/s, but the I/O utilization rate is far from 100%.

I am confused about where the bottleneck is. Maybe I need to upload my test code to JIRA so that someone can help me. Thank you.

Best regards,
Hao Gong
Huawei Technologies Co., Ltd
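For a sense of scale, the reported client averages can be set against the roughly 90MB/s node-to-node throughput measured earlier in the thread. The small sketch below does only that arithmetic. The class name LinkUtilization is made up, and it assumes 1 MB = 1024 KB, which the thread does not state.

```java
// Hypothetical illustration: compares measured KB/s figures with a link speed in MB/s.
final class LinkUtilization {

    /** Fraction of the link used, assuming 1 MB = 1024 KB (an assumption, not from the thread). */
    static double fractionOfLink(double measuredKBps, double linkMBps) {
        return measuredKBps / (linkMBps * 1024.0);
    }

    public static void main(String[] args) {
        double linkMBps = 90.0;       // measured node-to-node throughput from the thread
        double oneClient = 32743.0;   // average client throughput, 1 client
        double tenClients = 16343.0;  // average per-client throughput, 10 clients
        System.out.printf("1 client:   %.1f%% of link%n", 100 * fractionOfLink(oneClient, linkMBps));
        System.out.printf("10 clients: %.1f%% of link%n", 100 * fractionOfLink(tenClients, linkMBps));
    }
}
```

Under those assumptions the single client averages roughly 36% of what the link was measured to carry, and each of the 10 clients roughly 18%, which is consistent with the observation that neither network nor disk looks saturated.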
-
Re: why did I achieve such poor performance of HDFSJason Venner 2009-08-07, 13:36
Just out of simple curiosity, what is the I/O speed you get when you simply use command-line tools to read the data blocks on your datanodes?

Also, is there significant garbage collection going on in your datanodes or your client tasks? Is there anything else doing I/O on your machines? By chance, are your datanodes and clients not on the same switch, or is the switch rate-limiting the network traffic?

Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals