HDFS, mail # user - why did I achieve such poor performance of HDFS

Re: why did I achieve such poor performance of HDFS
Konstantin Boudnik 2009-08-04, 15:52
Hi Hao.
One more question for you - I should've asked it in my first email, though...
What is your network speed/throughput on such massive reads WITHOUT HDFS in
place? While I agree that ~14Kbps isn't much at all, I was wondering what
would be the speed of 5000 simultaneous reads from a native file system over
the same network?

Could such a test be conducted in your setup?
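
A rough sketch of what such a baseline test might look like, in case it helps. It reads a set of files through the ordinary file system API (for example from a network mount) with a configurable number of concurrent readers and reports the aggregate throughput. The class name `BaselineReadTest`, the helper names, and the choice of 10 readers are all my assumptions for illustration; nothing here touches HDFS.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class BaselineReadTest {

    /** Reads one file to the end and returns the number of bytes read. */
    static long drain(Path file) throws IOException {
        byte[] buffer = new byte[64 * 1024];
        long total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        }
        return total;
    }

    /** Reads all files using the given number of concurrent readers; returns total bytes read. */
    static long readAll(List<Path> files, int readers) {
        ExecutorService pool = Executors.newFixedThreadPool(readers);
        try {
            List<Future<Long>> pending = new ArrayList<>();
            for (Path file : files) {
                pending.add(pool.submit(() -> drain(file)));
            }
            long total = 0;
            for (Future<Long> f : pending) {
                total += f.get(); // blocks until that file has been read
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // File paths come from the command line; 10 readers is an arbitrary choice.
        List<Path> files = new ArrayList<>();
        for (String arg : args) {
            files.add(Paths.get(arg));
        }
        long start = System.nanoTime();
        long bytes = readAll(files, 10);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d bytes in %.3f s (%.1f KB/s)%n",
                bytes, seconds, bytes / 1024.0 / seconds);
    }
}
```

Running the same file set and the same reader count against both the native file system and HDFS would give two numbers that can be compared directly.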

One more issue here is that in your first test the size of each file is smaller
than the default HDFS block size (64MB, I think), which is likely to create
significant overhead and affect the performance.

1) To share your current test, you can simply create a new JIRA at
https://issues.apache.org/jira/browse/ under 'test', or send it to me as
an attachment and I'll take care of the JIRA stuff. But I'd love to see the
result of the other test I mentioned above, if possible.

2) DFSClient does provide an API for random reads from a file and this API is
thread safe. However, my uneducated guess would be that it is the
responsibility of the client (you) to 'rebuild' the file from the randomly
read blocks in the correct order. It is pretty much like any other filesystem
out there: YOU have to know the sequence of the pieces of your file in order to
reconstruct it from many concurrent reads.
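
To make the 'rebuild' part concrete, here is a minimal sketch under some assumptions. The `PositionalReader` interface is a hypothetical stand-in that mirrors the `read(long position, byte[] buffer, int offset, int length)` signature you quoted; with Hadoop on the classpath, an `FSDataInputStream` would play that role. The in-memory reader exists only so the sketch runs without a cluster. Each task reads one chunk and writes it at the chunk's own offset in the result array, so the order is preserved no matter which read finishes first.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelPositionalRead {

    /** Hypothetical stand-in mirroring the positional read signature quoted in this thread. */
    interface PositionalReader {
        int read(long position, byte[] buffer, int offset, int length) throws IOException;
    }

    /** In-memory reader, used only to make the sketch runnable without a cluster. */
    static PositionalReader inMemory(byte[] data) {
        return (position, buffer, offset, length) -> {
            if (position >= data.length) {
                return -1;
            }
            int n = (int) Math.min(length, data.length - position);
            System.arraycopy(data, (int) position, buffer, offset, n);
            return n;
        };
    }

    /** Loops until the chunk is complete, since one read may return fewer bytes than requested. */
    static void readFully(PositionalReader in, long position, byte[] buffer,
                          int offset, int length) throws IOException {
        int done = 0;
        while (done < length) {
            int n = in.read(position + done, buffer, offset + done, length - done);
            if (n < 0) {
                throw new IOException("unexpected end of file at " + (position + done));
            }
            done += n;
        }
    }

    /** Reads the file in fixed-size chunks on several threads (files under 2GB only). */
    static byte[] readInParallel(PositionalReader in, int fileLength, int chunkSize, int threads) {
        byte[] result = new byte[fileLength];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int start = 0; start < fileLength; start += chunkSize) {
                final int chunkStart = start;
                final int chunkLength = Math.min(chunkSize, fileLength - start);
                // The chunk's file offset doubles as its offset in the result array.
                pending.add(pool.submit(() -> {
                    readFully(in, chunkStart, result, chunkStart, chunkLength);
                    return null;
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for every chunk before handing back the array
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[1000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        byte[] copy = readInParallel(inMemory(data), data.length, 64, 4);
        System.out.println(Arrays.equals(data, copy));
    }
}
```

Whether this actually beats a plain sequential read on a real cluster is something only a measurement can tell; the sketch shows the bookkeeping, not a performance claim.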

Hope it helps,

On 8/3/09 6:49 PM, Hao Gong wrote:
> Hi Konstantin,
>    Thank you for your response.
>    1. Yes. It is automated and can be reused easily by anyone, I think.
> Because I didn't change the HDFS code and parameter except for the parameter
> of "hadoop.tmp.dir" and "fs.default.name".
>    2. Yes. I can share our test with the community. How to do it now?
>    By the way, I have a little question about HDFS.
>    1. Is the HDFS client single-threaded or multi-threaded when it transmits
> the blocks of a certain file? For example, suppose file A is 256MB and is
> divided into 4 blocks on 4 datanodes. When the client PUTs or GETs this file,
> is the operation sequential (one block after another) or simultaneous (the
> client GETs the 4 blocks from the 4 datanodes at the same time)?
>    In client source, I used "FSDataInputStream.read(long position, byte[]
> buffer, int offset, int length)" to GET the file.
>    Thanks very much.
> Best regards,
> Hao Gong
> Huawei Technologies Co., Ltd
> -----Original Message-----
> From: Konstantin Boudnik [mailto:[EMAIL PROTECTED]]
> Sent: August 4, 2009 1:02
> To: [EMAIL PROTECTED]
> Subject: Re: why did I achieve such poor performance of HDFS
> Hi Hao.
> Thanks for the observation. While I'll leave a chance to comment on the
> particular situation to someone knowing more about HDFS than me, I would
> like to ask you a couple of questions:
>     - do you have that particular test in a completely separable form? I.e.
> is it automated and can it be reused easily by someone else?
>     - could you share this test with the rest of the community through a JIRA
> or else?
> Thanks,
>     Konstantin (aka Cos)
> On 8/3/09 12:59 AM, Hao Gong wrote:
>> Hi all,
>> I have used HDFS as a distributed storage system for an experiment. But in
>> my testing, I find that the performance of HDFS is very poor.
>> I made two scenarios. 1) Middle size file test: I PUT 200,000 middle
>> size files (20KB~20MB randomly) into HDFS, and trigger 10 clients to GET
>> random 5000 files simultaneously. But the average GET throughput of

With best regards,
Konstantin Boudnik (aka Cos)

        Yahoo! Grid Computing
        +1 (408) 349-4049

2CAC 8312 4870 D885 8616  6115 220F 6980 1F27 E622
Attention! Streams of consciousness are disallowed