Re: Hadoop throughput question
What is the value of the io.file.buffer.size property? Try tuning it up to
64k or 128k and see if this improves performance when reading
SequenceFiles.

-Michael
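
A rough way to see why a larger read buffer can help is to time chunked
reads of a local file at several buffer sizes. The sketch below is only an
analogy in Python: it does not touch Hadoop, SequenceFile, or the
io.file.buffer.size setting itself, and the function names and the sample
file are made up for illustration.

```python
import os
import tempfile
import time


def read_throughput_mb_per_sec(path, buffer_size):
    """Read the whole file in chunks; return (bytes_read, MB/sec)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 turns off Python's own buffering, so each read()
    # requests up to buffer_size bytes from the operating system.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(buffer_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid divide-by-zero
    return total, total / elapsed / 1e6


def make_sample_file(size_bytes):
    """Create a throwaway file of random bytes (hypothetical test data)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size_bytes))
    return path


if __name__ == "__main__":
    path = make_sample_file(8 * 1024 * 1024)
    try:
        # 4 KB is included as a small baseline next to the suggested sizes.
        for size in (4 * 1024, 64 * 1024, 128 * 1024):
            n, rate = read_throughput_mb_per_sec(path, size)
            print(f"{size // 1024:>4} KB buffer: {n} bytes at {rate:.0f} MB/sec")
    finally:
        os.remove(path)
```

On a small, cached local file the numbers mostly reflect per-call overhead,
so treat them as a direction, not a prediction for a NAS-backed cluster. If
larger buffers make no visible difference on the real storage, the buffer
size is probably not the bottleneck.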

On Jan 3, 2013, at 7:00 PM, Artem Ervits <[EMAIL PROTECTED]> wrote:

I will certainly follow up on that; thank you for the information.

Further investigation showed that counting SequenceFile records runs at
about 26 MB/sec. If I simply read bytes from the same file on the same
cluster, the speed is 70 MB/sec. Is there a configuration for optimizing
SequenceFile processing?

Thank you.

*From:* John Lilley [mailto:[EMAIL PROTECTED]]
*Sent:* Thursday, January 03, 2013 6:09 PM
*To:* [EMAIL PROTECTED]
*Subject:* RE: Hadoop throughput question

Unless the Hadoop processing and the OneFS storage are co-located,
MapReduce can’t schedule tasks so as to take advantage of data locality.
You would basically be doing a distributed computation against a separate
NAS, so throughput would be limited by the performance properties of the
Isilon NAS and the network switch architecture.  Still, 26 MB/sec in
aggregate is far worse than what I’d expect Isilon to deliver, even over a
single 1 Gb connection.

john
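
To put the 26 MB/sec figure next to the network limit, here is a
back-of-the-envelope calculation in Python. It assumes the link is the
1GigE mentioned in the original question, uses decimal units, and ignores
protocol overhead; the function names are made up for illustration.

```python
def link_ceiling_mb_per_sec(link_gbit_per_sec):
    """Theoretical ceiling of a link in MB/sec (decimal units, no overhead)."""
    return link_gbit_per_sec * 1000 / 8


def utilization(observed_mb_per_sec, link_gbit_per_sec):
    """Fraction of the theoretical link ceiling that the observed rate uses."""
    return observed_mb_per_sec / link_ceiling_mb_per_sec(link_gbit_per_sec)


if __name__ == "__main__":
    ceiling = link_ceiling_mb_per_sec(1)
    print(f"1 Gbit/s ceiling: {ceiling:.0f} MB/sec")
    # 26 MB/sec (record counting) and 70 MB/sec (raw read) are from the thread.
    for observed in (26, 70):
        print(f"{observed} MB/sec uses {utilization(observed, 1):.0%} of one link")
```

Real-world throughput on gigabit Ethernet lands somewhat below the
theoretical ceiling, so these percentages are slightly pessimistic.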

*From:* Artem Ervits [mailto:[EMAIL PROTECTED]]
*Sent:* Thursday, January 03, 2013 4:02 PM
*To:* [EMAIL PROTECTED]
*Subject:* RE: Hadoop throughput question

Hadoop is using OneFS, not HDFS, in our configuration. The Isilon NAS and
the Hadoop nodes are in the same datacenter, but I cannot tell how they are
placed across racks.

*From:* John Lilley [mailto:[EMAIL PROTECTED]]
*Sent:* Thursday, January 03, 2013 5:15 PM
*To:* [EMAIL PROTECTED]
*Subject:* RE: Hadoop throughput question

Let’s suppose you are doing a read-intensive job like, for example,
counting records.  This will be disk-bandwidth limited.  On a 4-node
cluster with 2 local SATA disks on each node you should easily read
400 MB/sec in aggregate.  When you are running the Hadoop cluster, is the
Hadoop processing co-located with the Isilon nodes?  Is Hadoop configured
to use OneFS or HDFS?

John
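
The 400 MB/sec estimate can be unpacked with simple arithmetic. The Python
sketch below works backwards from the figures in the message (4 nodes, 2
local disks each) to the per-disk rate they imply. That per-disk rate is
derived here, not stated in the thread, and the function names are made up
for illustration.

```python
def aggregate_read_mb_per_sec(nodes, disks_per_node, per_disk_mb_per_sec):
    """Aggregate sequential read rate if every local disk streams in parallel."""
    return nodes * disks_per_node * per_disk_mb_per_sec


def implied_per_disk_mb_per_sec(aggregate_mb_per_sec, nodes, disks_per_node):
    """Per-disk rate implied by an aggregate figure."""
    return aggregate_mb_per_sec / (nodes * disks_per_node)


if __name__ == "__main__":
    per_disk = implied_per_disk_mb_per_sec(400, nodes=4, disks_per_node=2)
    print(f"Implied per-disk rate: {per_disk:.0f} MB/sec")
    # Halving the node count halves the estimate when reads are local.
    print(f"2-node estimate: {aggregate_read_mb_per_sec(2, 2, per_disk):.0f} MB/sec")
```

This scaling only holds when each node reads from its own local disks. With
storage on a separate NAS, adding nodes does not add disk bandwidth.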

*From:* Artem Ervits [mailto:[EMAIL PROTECTED]]
*Sent:* Thursday, January 03, 2013 3:00 PM
*To:* [EMAIL PROTECTED]
*Subject:* Hadoop throughput question

Hello all,

I’d like to pick the community brain on average throughput speeds for a
moderately specced 4-node Hadoop cluster with 1GigE networking. Is it
reasonable to expect constant average speeds of 150-200 MB/sec on such a
setup? Forgive me if the question is loaded but we’re running a Hadoop
cluster with HDFS served via EMC Isilon storage. We’re getting about
30 MB/sec with our machines and we do not see a difference in job speed
between a 2-node cluster and a 4-node cluster.

Thank you.

--------------------

This electronic message is intended to be for the use only of the
named recipient, and may contain information that is confidential or
privileged.  If you are not the intended recipient, you are hereby
notified that any disclosure, copying, distribution or use of the
contents of this message is strictly prohibited.  If you have received
this message in error or are not the named recipient, please notify us
immediately by contacting the sender at the electronic mail address
noted above, and delete and destroy all copies of this message.  Thank
you.
