Re: HDFS read/write data throttling
Hey Steve,

My research project (Cake, published at SoCC '12) was trying to provide
SLAs for mixed workloads of latency-sensitive and throughput-bound
applications, e.g. HBase running alongside MR. This was challenging because
seeks are a real killer. Basically, we had to strongly limit MR I/O to keep
worst-case seek latency down, and did so by putting schedulers on the RPC
queues in HBase and HDFS to restrict queuing in the OS and disk where we
lacked preemption.

Regarding citations of note, most academics consider throughput-sharing to
be a solved problem. It's not dissimilar from normal time slicing: you try
to ensure fairness over some coarse timescale. I think cgroups [1] and
ioprio_set [2] essentially provide this.
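
If it helps to make that concrete, here's a rough Java sketch of driving the
blkio controller directly through the cgroup v1 filesystem. The group names,
weights, and mount point are illustrative assumptions, and the JVM needs
write access to /sys/fs/cgroup/blkio:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Proportional I/O sharing via the cgroup v1 blkio controller.
    public class BlkioWeightExample {
      private static final Path BLKIO_ROOT = Paths.get("/sys/fs/cgroup/blkio");

      // Set a relative weight for a group; a higher weight means a larger
      // share of disk time under the CFQ scheduler.
      static void setWeight(String group, int weight) throws IOException {
        Path dir = BLKIO_ROOT.resolve(group);
        Files.createDirectories(dir);
        Files.write(dir.resolve("blkio.weight"),
            Integer.toString(weight).getBytes(StandardCharsets.UTF_8));
      }

      // Move a process into the group so its I/O is accounted and scheduled
      // against that group's weight.
      static void addTask(String group, long pid) throws IOException {
        Files.write(BLKIO_ROOT.resolve(group).resolve("tasks"),
            Long.toString(pid).getBytes(StandardCharsets.UTF_8));
      }

      public static void main(String[] args) throws IOException {
        setWeight("hbase", 800);      // latency-sensitive: big share
        setWeight("mapreduce", 100);  // throughput-bound: small share
      }
    }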

Mixing throughput and latency though is difficult, and my conclusion is
that there isn't a really great solution for spinning disks besides
physical isolation. As we all know, you can get either IOPS or bandwidth,
but not both, and it's not a linear tradeoff between the two. If you're
interested in this though, I can dig up some related work from my Cake paper.
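
Back-of-envelope numbers make the non-linearity obvious. Assuming ~10 ms per
seek and ~100 MB/s sequential bandwidth (assumed figures, not measurements),
effective throughput for a seek-per-request workload is roughly
request_size / (seek_time + request_size / seq_bw):

    // Back-of-envelope model: throughput when every request pays a full seek.
    public class SeekTradeoff {
      public static void main(String[] args) {
        double seekSec = 0.010;         // ~10 ms average seek + rotation
        double seqBytesPerSec = 100e6;  // ~100 MB/s sequential bandwidth
        for (long req : new long[] {4 << 10, 64 << 10, 1 << 20, 16 << 20}) {
          double secPerReq = seekSec + req / seqBytesPerSec;
          System.out.printf("%8d bytes/request -> %6.1f MB/s, %5.0f IOPS%n",
              req, (req / secPerReq) / 1e6, 1.0 / secPerReq);
        }
      }
    }

With 4 KB requests you get ~100 IOPS but well under 1 MB/s; you have to issue
multi-megabyte requests before you see most of the sequential bandwidth.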

However, since it seems that we're more concerned with throughput-bound
apps, we might be okay just using cgroups and ioprio_set to do
time-slicing. I actually hacked up some code a while ago that passed a
client-provided priority byte to the DN, which then used it to set the I/O
priority of the handling DataXceiver accordingly. This isn't the most
outlandish idea; we've already put QoS fields in our RPC protocol, for
instance, and this would just be another byte. Short-circuit reads are outside
this paradigm, but then you can use cgroup controls instead.
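
Roughly, that hack had the shape below (an illustrative sketch, not the actual
patch: it assumes x86_64 syscall numbers, uses JNA for the native call, and
has the handler thread apply the byte to itself):

    import com.sun.jna.Library;
    import com.sun.jna.Native;

    // Map a client-provided priority byte onto the Linux I/O priority of the
    // current handler thread via ioprio_set(2).
    public class HandlerIoPriority {
      interface CLib extends Library {
        CLib INSTANCE = Native.load("c", CLib.class);
        long syscall(long number, Object... args);
      }

      private static final long SYS_GETTID = 186;      // x86_64
      private static final long SYS_IOPRIO_SET = 251;  // x86_64
      private static final int IOPRIO_WHO_PROCESS = 1;
      private static final int IOPRIO_CLASS_BE = 2;
      private static final int IOPRIO_CLASS_SHIFT = 13;

      // Clamp the byte to a best-effort level 0-7 and apply it to the calling
      // (i.e. DataXceiver handler) thread.
      static void applyPriorityByte(byte clientPriority) {
        int level = Math.min(Math.max(clientPriority, 0), 7);
        long tid = CLib.INSTANCE.syscall(SYS_GETTID);
        int ioprio = (IOPRIO_CLASS_BE << IOPRIO_CLASS_SHIFT) | level;
        CLib.INSTANCE.syscall(SYS_IOPRIO_SET, IOPRIO_WHO_PROCESS, tid, ioprio);
      }
    }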

My casual conversations with Googlers indicate that there isn't any special
Borg/Omega sauce either, just that they heavily prioritize DFS I/O over
non-DFS. Maybe that's another approach: if we can separate block management
in HDFS, MR tasks could just write their output to a raw HDFS block, thus
bringing a lot of I/O back into the fold of "datanode as I/O manager" for a node.

Overall, I strongly agree with you that it's important to first define what
our goals are regarding I/O QoS. The general case is a tarpit, so it'd be
good to carve off useful things that can be done now (like Lohit's
direction of per-stream/FS throughput throttling with trusted clients) and
then carefully grow the scope as we find more use cases we can confidently handle.
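
To make that direction concrete, a trusted client could wrap its output in
something like the sketch below (names are made up, not existing HDFS
classes; it's the same pacing idea the balancer already uses for block moves):

    import java.io.IOException;
    import java.io.OutputStream;

    // Client-side throughput throttling: delay writes so the average rate
    // stays under a configured bytes-per-second budget.
    public class ThrottledOutputStream extends OutputStream {
      private final OutputStream out;
      private final long bytesPerSec;
      private final long start = System.currentTimeMillis();
      private long bytesWritten = 0;

      public ThrottledOutputStream(OutputStream out, long bytesPerSec) {
        this.out = out;
        this.bytesPerSec = bytesPerSec;
      }

      @Override public void write(int b) throws IOException {
        write(new byte[] {(byte) b}, 0, 1);
      }

      @Override public void write(byte[] b, int off, int len) throws IOException {
        throttle(len);
        out.write(b, off, len);
      }

      // Sleep long enough that bytesWritten / elapsed stays <= bytesPerSec.
      private void throttle(int len) throws IOException {
        bytesWritten += len;
        long elapsedMs = System.currentTimeMillis() - start;
        long expectedMs = (bytesWritten * 1000L) / bytesPerSec;
        if (expectedMs > elapsedMs) {
          try {
            Thread.sleep(expectedMs - elapsedMs);
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("interrupted while throttling", e);
          }
        }
      }
    }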


[1] cgroups blkio controller
[2] ioprio_set http://man7.org/linux/man-pages/man2/ioprio_set.2.html
On Tue, Nov 12, 2013 at 1:38 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:

> I've looked at it a bit within the context of YARN.
> YARN containers are where this would be ideal, as then you'd be able to
> request IO capacity as well as CPU and RAM. For that to work, the
> throttling would have to be outside the App, as you are trying to limit
> code whether or not it wants to be limited, and because you probably (*)
> want to give it more bandwidth if the system is otherwise idle.
> Self-throttling doesn't pick up spare IO capacity.
>    1. you can use cgroups in YARN to throttle local disk IO through the
>    file:// URLs or the Java filesystem APIs, such as for MR temp data
>    2. you can't c-group throttle HDFS per YARN container, which would be
>    the ideal use case for it. The IO is taking place in the DN, and cgroups
>    only limit IO in the throttled process group.
>    3. implementing it in the DN would require a lot more complex code there
>    to prioritise work based on block ID (sole identifier that goes around
>    everywhere) or input source (local sockets for HBase IO vs TCP stack)
>    4. Once you go to a heterogeneous filesystem you need to think about IO
>    load per storage layer as well as per volume
>    5. There's also a generic RPC request throttle to prevent DoS against the
>    NN and other HDFS services. That would need to be server side, but once
>    implemented in the RPC code it would be universal.
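
(On your point 5: the simplest server-side form is probably just an admission
throttle in front of the handlers, something like the sketch below; the class
and method names are made up for illustration, not anything in Hadoop today.)

    import java.util.concurrent.Semaphore;
    import java.util.concurrent.TimeUnit;

    // Server-side admission throttle: bound the number of RPC calls being
    // serviced at once, rejecting callers once the server is saturated.
    public class RpcAdmissionThrottle {
      private final Semaphore permits;
      private final long timeoutMs;

      public RpcAdmissionThrottle(int maxConcurrentCalls, long timeoutMs) {
        this.permits = new Semaphore(maxConcurrentCalls, true);
        this.timeoutMs = timeoutMs;
      }

      // Returns false if no permit frees up in time; the RPC layer would then
      // send the client a retriable "server busy" response.
      public boolean tryAdmit() throws InterruptedException {
        return permits.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
      }

      // Called when the handler finishes the call.
      public void release() {
        permits.release();
      }
    }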