Re: VOTE: HDFS-347 merge
Hi Bikas,

I completely agree with you in principle -- short circuit reads end up
ceding control of the data path from the DataNode to the user applications.
This has a few disadvantages, which you've mentioned and which have also
been brought up in the JIRA: particularly QoS, metrics, the flexibility to
change our data layout on disk in the future, etc.

However, the performance advantages of this approach are quite stark when
the data sets have been cached in the OS buffer cache. For example, using a
low-overhead client like Impala executing a simple table scan query, we've
seen a 2x or more improvement in overall response time using short-circuit
reads versus localhost TCP. The overhead comes primarily from the kernel
layers, not from our own code -- e.g. localhost TCP connections still perform
packet segmentation, enforce multiple buffer copies to and from kernel
space, incur several syscalls, etc. A better-implemented datanode, perhaps
doing transfers over domain sockets, might close the gap a bit, but based
on all of my benchmarks, it will still be ~50% slower than short-circuit
reads.

If you look at the history of HDFS-347, I actually asked Colin to implement
and experiment with a non-short-circuit path over domain sockets, under the
assumption that they may be more efficient than loopback TCP sockets. The
results weren't particularly encouraging, though it may still be enabled
for anyone who wants to experiment with optimizing it further. There are
also some improvements coming down the road in the Linux kernel (in
particular "TCP friends") which can eliminate some of the TCP stack
overhead for loopback connections, but unfortunately they're several years
off for those of us deploying on mainstream distros.

Most of the above is in reference to sequential throughput. Random IO
performance is even more drastically affected -- the benchmarks I posted on
HDFS-347 show a 3-4x improvement in some workloads when the data is in the
buffer cache. As the RAM capacities of our machines continue to increase,
and as solid state storage becomes more cost effective, more and more
random reads fall into this category where they're not bound by the
hardware, but rather bound by our software overhead.

Given all of the above, I think the performance benefits of short-circuit
reads outweigh the disadvantages. Given that it is entirely an
implementation optimization, and not an API, we can always re-evaluate in
future versions, if either someone figures out a way to get a
non-short-circuit implementation to comparable performance, or if the
kernel guys catch up and implement TCP friends and other features which
close the gap. Colin has also been careful to build into the API the
ability for the datanode to reject a short-circuit request based on a
version number, causing the client to seamlessly fall back. This would
allow us to change the underlying format on DNs to something which isn't
SCR-friendly, without causing any incompatibility in existing clients, etc.
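
As a rough illustration of that fallback, here is a toy sketch. It is not
the HDFS-347 code: every class, function, and constant name below is
hypothetical, and it assumes the fallback path is the ordinary read through
the datanode.

```python
from dataclasses import dataclass

# Hypothetical: newest on-disk block format this client can read directly.
CLIENT_MAX_LAYOUT_VERSION = 1


@dataclass
class ShortCircuitReply:
    granted: bool
    layout_version: int


class DataNode:
    """Toy stand-in for a datanode; not the real HDFS class."""

    def __init__(self, layout_version, blocks):
        self.layout_version = layout_version
        self.blocks = blocks  # block id -> bytes

    def request_short_circuit(self, client_max_version):
        # Reject the request when the client is too old to understand
        # the format this datanode stores on disk.
        granted = client_max_version >= self.layout_version
        return ShortCircuitReply(granted, self.layout_version)

    def read_local(self, block_id):
        # Stands in for the client reading the block file directly.
        return self.blocks[block_id]

    def read_remote(self, block_id):
        # Stands in for the normal read path served by the datanode.
        return self.blocks[block_id]


def read_block(dn, block_id, client_max_version=CLIENT_MAX_LAYOUT_VERSION):
    """Try a short-circuit read; fall back if the datanode rejects it."""
    reply = dn.request_short_circuit(client_max_version)
    if reply.granted:
        return "short-circuit", dn.read_local(block_id)
    return "datanode", dn.read_remote(block_id)
```

In this sketch the caller gets the same bytes either way; only the path
taken differs, which is the property that would let the on-disk format
change without breaking existing clients.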

Hope the above explains the motivation for the feature.


On Tue, Feb 26, 2013 at 1:47 PM, Bikas Saha <[EMAIL PROTECTED]> wrote:

> Hi,
> In my opinion, this feature of short circuit reads (HDFS-347 or HDFS-2246)
> is not a desirable feature for HDFS. We should be working towards removing
> this feature instead of enhancing it and making it popular.
> Maybe short-circuit reads were something that HBase needed for performance
> at a point in time when HDFS performance was slow. But after all the
> improvements that have been made, is it still unacceptably slow to read
> from HDFS? Is there more good engineering that we can do to close that
> gap? Close it for all HDFS users and not just the ones who use
> short-circuit reads?
> Which brings me to the question - Who is the target audience for this
> feature? From what I see, anyone who potentially wants to use it =>
> everyone. Now if everyone starts using short circuit reads what happens to
> the performance problem that we are trying to solve? Will performance

Todd Lipcon
Software Engineer, Cloudera