Hadoop >> mail # user >> RE: Hadoop throughput question


John Lilley 2013-01-04, 02:03
Re: Hadoop throughput question
I'd also check the SequenceFile compression type
(io.seqfile.compression.type). By default this is set to RECORD, and if
needed, should ideally be set to BLOCK (or NONE if not desired).
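A minimal sketch of how the two settings named in this thread could be written out as Hadoop-style `<property>` entries. The helper name `render_properties` is hypothetical, and which configuration file or job setup the entries belong in depends on the installation, so treat the output as a fragment to adapt rather than a ready-made config.

```python
import xml.etree.ElementTree as ET


def render_properties(props):
    """Render a dict of property names/values as a <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Values taken from the discussion: BLOCK compression for SequenceFiles and
# a 128k read buffer (the buffer size is expressed in bytes here).
settings = {
    "io.seqfile.compression.type": "BLOCK",
    "io.file.buffer.size": 128 * 1024,
}

print(render_properties(settings))
```

The compression type applies when SequenceFiles are written, so existing files would need to be rewritten before a change shows up in read throughput.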

Also, how are you measuring the "raw" read of 70 MB/s? Is it by using
hadoop fs -cat? If so, what happens if you change it to hadoop fs -text?
Does the read rate slow down as well, or is the 70 MB/s sustained?
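One way to put both read paths on the same footing is to time them with the same harness. The sketch below is an assumption about how that could be done, not the method used in this thread: `measure_read_throughput` is a hypothetical helper that drains any binary stream in fixed-size chunks and reports MB/s, and the demo uses an in-memory buffer as a stand-in for real data.

```python
import io
import time


def measure_read_throughput(stream, buffer_size):
    """Read stream to EOF in buffer_size chunks; return (total_bytes, mb_per_sec)."""
    total = 0
    start = time.perf_counter()
    while True:
        chunk = stream.read(buffer_size)
        if not chunk:
            break
        total += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very small inputs.
    mb_per_sec = (total / 1e6) / elapsed if elapsed > 0 else float("inf")
    return total, mb_per_sec


if __name__ == "__main__":
    # In-memory stand-in for a real input; 8 MiB of zero bytes.
    data = io.BytesIO(b"\0" * (8 * 1024 * 1024))
    for size in (4 * 1024, 64 * 1024, 128 * 1024):
        data.seek(0)
        total, rate = measure_read_throughput(data, size)
        print(f"buffer={size:>7} bytes  read={total} bytes  {rate:.1f} MB/s")
```

Pointing the same function at `sys.stdin.buffer` would let the output of either command be piped through it, so both numbers come from one timer.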

-Michael
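
For context on the figures quoted further down the thread, here is a back-of-the-envelope sketch. It assumes "mb" in the quoted messages means megabytes, uses decimal units, and ignores protocol overhead, so the 1 GigE ceiling it prints is an upper bound rather than an achievable rate. The per-disk figure is derived from the 400 MB/sec aggregate estimate in the thread and is not a measured value.

```python
def gbit_to_mbyte_per_sec(gbit):
    """Convert a link speed in Gb/s to MB/s (decimal units, no overhead)."""
    return gbit * 1000 / 8


link_ceiling = gbit_to_mbyte_per_sec(1)  # single 1 GigE link

# Throughput figures reported in the thread, in MB/sec.
observed = {
    "SequenceFile record count": 26,
    "raw byte read": 70,
    "64k buffer": 36,
    "128k buffer": 39,
}

print(f"1 GigE ceiling: {link_ceiling:.0f} MB/s")
for label, rate in observed.items():
    print(f"{label}: {rate} MB/s is {rate / link_ceiling:.0%} of one link")

# The 400 MB/sec local-disk estimate for 4 nodes with 2 disks each
# implies this much per disk.
nodes, disks_per_node, aggregate = 4, 2, 400
per_disk = aggregate / (nodes * disks_per_node)
print(f"implied per-disk rate: {per_disk:.0f} MB/s")
```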

On Thu, Jan 3, 2013 at 8:03 PM, Artem Ervits <[EMAIL PROTECTED]> wrote:

>  Setting the property to 64k made the throughput jump to 36 MB/sec, and to
> 39 MB/sec for 128k.
>
> Thank you for the tip.
>
> *From:* Michael Katzenellenbogen [mailto:[EMAIL PROTECTED]]
> *Sent:* Thursday, January 03, 2013 7:28 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* Re: Hadoop throughput question
>
> What is the value of the io.file.buffer.size property? Try tuning it up to
> 64k or 128k and see if this improves performance when reading
> SequenceFiles.
>
> -Michael
>
>
> On Jan 3, 2013, at 7:00 PM, Artem Ervits <[EMAIL PROTECTED]> wrote:
>
>  I will follow up on that certainly, thank you for the information.
>
> So further investigation showed that counting SequenceFile records runs at
> about 26 MB/sec. If I simply read bytes on the same cluster and the same
> file, the speed is 70 MB/sec. Is there a configuration for optimizing
> SequenceFile processing?
>
> Thank you.
>
> *From:* John Lilley [mailto:[EMAIL PROTECTED]]
> *Sent:* Thursday, January 03, 2013 6:09 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* RE: Hadoop throughput question
>
> Unless the Hadoop processing and the OneFS storage are co-located,
> MapReduce can’t schedule tasks so as to take advantage of data locality.
> You would basically be doing a distributed computation against a separate
> NAS, so throughput would be limited by the performance properties of the
> Isilon NAS and the network switch architecture.  Still, 26 MB/sec in
> aggregate is far worse than what I’d expect Isilon to deliver, even over a
> single 1 GbE connection.
>
> john
>
> *From:* Artem Ervits [mailto:[EMAIL PROTECTED]]
> *Sent:* Thursday, January 03, 2013 4:02 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* RE: Hadoop throughput question
>
> Hadoop is using OneFS, not HDFS, in our configuration. The Isilon NAS and
> the Hadoop nodes are in the same datacenter, but I cannot tell how they are
> placed across racks.
>
> *From:* John Lilley [mailto:[EMAIL PROTECTED]]
> *Sent:* Thursday, January 03, 2013 5:15 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* RE: Hadoop throughput question
>
> Let’s suppose you are doing a read-intensive job like, for example,
> counting records.  This will be disk-bandwidth limited.  On a 4-node
> cluster with 2 local SATA disks on each node you should easily read
> 400 MB/sec in aggregate.  When you are running the Hadoop cluster, is the
> Hadoop processing co-located with the Isilon nodes?  Is Hadoop configured
> to use OneFS or HDFS?
>
> John
>
> *From:* Artem Ervits [mailto:[EMAIL PROTECTED]]
> *Sent:* Thursday, January 03, 2013 3:00 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* Hadoop throughput question
>
> Hello all,
>
> I’d like to pick the community brain on average throughput speeds for a
> moderately specced 4-node Hadoop cluster with 1GigE networking. Is it
> reasonable to expect constant average speeds of 150-200 MB/sec on such a
> setup? Forgive me if the question is loaded, but we’re running a Hadoop
> cluster with HDFS served via EMC Isilon storage. We’re getting about
> 30 MB/sec with our machines, and we do not see a difference in job speed
> between a 2-node cluster and a 4-node cluster.
>
> Thank you.
>
John Lilley 2013-01-04, 01:12