HDFS >> mail # user >> Hadoop throughput question


Artem Ervits 2013-01-03, 22:00
John Lilley 2013-01-03, 22:15
Artem Ervits 2013-01-03, 23:02
John Lilley 2013-01-03, 23:09
Michael Segel 2013-01-04, 00:11
Artem Ervits 2013-01-04, 00:00
Michael Katzenellenbogen 2013-01-04, 00:27
Artem Ervits 2013-01-04, 01:03
Re: Hadoop throughput question
If, from the same machine, you can read the raw bytes of the file at 70MB/s
but only get 26MB/s when reading it as a SequenceFile, I would presume
that the speed difference comes down to the read pattern as well as the
Isilon file system implementation.

For the 70MB/s, if you are doing something like "hadoop fs -cat <file> >
/dev/null" then it's probably doing individual read operations of 64KB or
128KB or whatever Isilon supports.  Then when you use the sequence file
format to read record by record, instead of reading 64KB, maybe it's reading
a 16KB record at a time, and each record requires an operation to be sent
to Isilon to retrieve the data.  Hence, I would presume the difference
comes down to your file system implementation.  Of course, if your record
reader is poorly written or doing a lot of processing for each record, you
might bottleneck on CPU.  Presuming you aren't bottlenecked on CPU, it would
seem to be the IO pattern and the file system implementation.
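
To make the read-size argument concrete, here is a rough back-of-the-envelope
model, not a measurement.  It assumes every read pays one fixed round trip to
the storage plus the transfer time for the bytes requested.  The 0.5 ms
latency and 160 MB/s streaming rate are made-up values, and the function and
variable names are mine; nothing here describes how Isilon actually behaves.

```python
def effective_throughput(read_size, latency_s, bandwidth):
    """Bytes/sec achieved when every read of `read_size` bytes pays one
    fixed round trip (`latency_s`) plus transfer time at `bandwidth`
    bytes/sec."""
    time_per_read = latency_s + read_size / bandwidth
    return read_size / time_per_read


# Hypothetical numbers, chosen only for illustration.
LATENCY_S = 0.0005   # assumed 0.5 ms per request
BANDWIDTH = 160e6    # assumed 160 MB/s streaming rate

for size_kb in (16, 64, 128):
    rate = effective_throughput(size_kb * 1024, LATENCY_S, BANDWIDTH)
    print(f"{size_kb:>3} KB reads: {rate / 1e6:.1f} MB/s")
```

With these assumed values the 16KB and 64KB cases land in the same ballpark
as the 26MB/s and 70MB/s you measured, which shows only that a per-request
cost is a plausible explanation, not that it is the actual one.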

If it's the IO pattern and file system implementation, you can try to see if
Isilon supports read-ahead at all.  As a contrived example, with MapRFS,
your user-level process may issue a 16KB read to the MapRFS library, and in
turn the MapRFS library can read ahead 128KB so that the next series of
16KB reads in your program are served out of the local cache on your
client, reducing the effects of network latency, etc.
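
The read-ahead idea can be sketched in a few lines.  This is a toy
illustration only: `CountingBackend` and `ReadAheadReader` are hypothetical
names, the "remote" file is an in-memory buffer, and it is not how MapRFS or
any other client library is actually implemented.

```python
import io


class CountingBackend:
    """Stands in for the remote file system; counts every read it serves."""

    def __init__(self, data):
        self._stream = io.BytesIO(data)
        self.requests = 0

    def read_at(self, offset, length):
        self.requests += 1
        self._stream.seek(offset)
        return self._stream.read(length)


class ReadAheadReader:
    """Serves small sequential reads from a larger prefetched buffer."""

    def __init__(self, backend, read_ahead):
        self._backend = backend
        self._read_ahead = read_ahead
        self._buf = b""
        self._buf_start = 0   # file offset of self._buf[0]
        self._pos = 0         # current file offset

    def read(self, length):
        out = bytearray()
        while length > 0:
            offset_in_buf = self._pos - self._buf_start
            if not 0 <= offset_in_buf < len(self._buf):
                # Cache miss: fetch at least read_ahead bytes in one request.
                fetch = max(length, self._read_ahead)
                self._buf = self._backend.read_at(self._pos, fetch)
                self._buf_start = self._pos
                offset_in_buf = 0
                if not self._buf:
                    break  # end of file
            chunk = self._buf[offset_in_buf:offset_in_buf + length]
            out += chunk
            self._pos += len(chunk)
            length -= len(chunk)
        return bytes(out)


KB = 1024
data = bytes(1024 * KB)  # 1 MB of zeros as a stand-in file

# 16KB reads sent straight to the backend, one request each.
direct = CountingBackend(data)
for offset in range(0, len(data), 16 * KB):
    direct.read_at(offset, 16 * KB)

# The same 16KB reads, served through a 128KB read-ahead buffer.
cached = CountingBackend(data)
reader = ReadAheadReader(cached, read_ahead=128 * KB)
while reader.read(16 * KB):
    pass

print("requests without read-ahead:", direct.requests)
print("requests with 128KB read-ahead:", cached.requests)
```

The program still issues 16KB reads in both cases; with read-ahead the
backend sees roughly one eighth as many requests, so far fewer round trips
are exposed to network latency.
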
On Thu, Jan 3, 2013 at 4:00 PM, Artem Ervits <[EMAIL PROTECTED]> wrote:

> I will certainly follow up on that; thank you for the information.
>
> Further investigation showed that counting SequenceFile records runs at
> about 26MB/sec. If I simply read bytes from the same file on the same
> cluster, the speed is 70MB/sec. Is there a configuration for optimizing
> SequenceFile processing?
>
> Thank you.
>
> *From:* John Lilley [mailto:[EMAIL PROTECTED]]
> *Sent:* Thursday, January 03, 2013 6:09 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* RE: Hadoop throughput question
>
> Unless the Hadoop processing and the OneFS storage are co-located,
> MapReduce can’t schedule tasks so as to take advantage of data locality.
> You would basically be doing a distributed computation against a separate
> NAS, so throughput would be limited by the performance properties of the
> Isilon NAS and the network switch architecture.  Still, 26MB/sec in
> aggregate is far worse than what I’d expect Isilon to deliver, even over a
> single 1GigE connection.
>
> john
>
> *From:* Artem Ervits [mailto:[EMAIL PROTECTED]]
> *Sent:* Thursday, January 03, 2013 4:02 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* RE: Hadoop throughput question
>
> Hadoop is using OneFS, not HDFS, in our configuration. The Isilon NAS and
> the Hadoop nodes are in the same datacenter, but I cannot tell how they
> are placed across racks.
>
> *From:* John Lilley [mailto:[EMAIL PROTECTED]]
> *Sent:* Thursday, January 03, 2013 5:15 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* RE: Hadoop throughput question
>
> Let’s suppose you are doing a read-intensive job like, for example,
> counting records.  This will be disk-bandwidth limited.  On a 4-node
> cluster with 2 local SATA disks on each node you should easily read
> 400MB/sec in aggregate.  When you are running the Hadoop cluster, is the
> Hadoop processing co-located with the Isilon nodes?  Is Hadoop configured
> to use OneFS or HDFS?
>
> John
>
> *From:* Artem Ervits [mailto:[EMAIL PROTECTED]]
> *Sent:* Thursday, January 03, 2013 3:00 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* Hadoop throughput question
>
> Hello all,
>
> I’d like to pick the community brain on average throughput speeds for a
> moderately specced 4-node Hadoop cluster with 1GigE networking. Is it
> reasonable to expect constant average speeds of 150-200MB/sec on such a
> setup? Forgive me if the question is loaded but we’re Hadoop cluster with
Michael Katzenellenbogen 2013-01-03, 22:08
Artem Ervits 2013-01-03, 22:46