lohit 2013-11-11, 18:59
Adam Muise 2013-11-11, 19:27
lohit 2013-11-11, 19:47
Haosong Huang 2013-11-12, 02:16
Andrew Wang 2013-11-12, 03:44
lohit 2013-11-12, 04:24
I've looked at it a bit within the context of YARN.
YARN containers are where this would be ideal, as then you'd be able to
request IO capacity as well as CPU and RAM. For that to work, the
throttling would have to be outside the app, as you are trying to limit
code whether or not it wants to be limited, and because you probably (*)
want to give it more bandwidth if the system is otherwise idle.
Self-throttling doesn't pick up spare IO capacity.
1. you can use cgroups in YARN to throttle local disk IO through the
file:// URLs or the Java filesystem APIs, such as for MR temp data
2. you can't cgroup-throttle HDFS per YARN container, which would be
the ideal use case for it. The IO is taking place in the DN, and cgroups
only limit IO in the throttled process group.
3. implementing it in the DN would require a lot more complex code there
to prioritise work based on block ID (sole identifier that goes around
everywhere) or input source (local sockets for HBase IO vs TCP stack)
4. Once you go to a heterogeneous filesystem you need to think about IO
load per storage layer as well as/alongside per-volume
5. There's also generic RPC request throttling to prevent DoS against the
NN and other HDFS services. That would need to be server side, but once
implemented in the RPC code it would be universal.
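
To make point 5 concrete, here is a minimal sketch of a server-side request throttle as a token bucket. It is not an existing Hadoop or HDFS API: the class name, parameters and injectable clock are all hypothetical, and a real version would need per-caller buckets and thread safety.

```python
import time


class TokenBucket:
    """Hypothetical server-side request throttle; not an HDFS API."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = float(rate_per_sec)   # tokens added per second
        self.capacity = float(burst)      # maximum tokens that can accumulate
        self.tokens = float(burst)        # start with a full bucket
        self.clock = clock                # injectable so it can be tested
        self.last = clock()

    def try_acquire(self, cost=1.0):
        """Spend `cost` tokens and return True if available, else False."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill for the time that has passed, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A server would call try_acquire() once per incoming request and reject or queue the request when it returns False.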
You also need to define what load you are trying to throttle: pure
RPCs/second, read bandwidth, write bandwidth, seeks or IOPS. Once a file is
lined up for sequential reading, you'd almost want it to stream through the
next blocks until a high priority request came through, but operations like
a seek which would involve a disk head movement backwards would be
something to throttle (hence you need to be storage type aware, as SSD seeks
cost less). You also need to consider that although the cost of writes is
high, it's usually being done with the goal of preserving data, and you
don't want to impact durability.
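
As a rough illustration of the paragraph above, a cost model could charge seeks by storage type and never delay writes. Everything here is an assumption for the sake of the sketch: the function names, the MiB-based unit and especially the seek weights, which would have to be measured on real hardware.

```python
# Hypothetical relative seek costs; real numbers would have to be measured.
SEEK_COST = {"HDD": 10.0, "SSD": 1.0}
BYTES_PER_UNIT = 1024 * 1024  # charge one unit per MiB transferred


def io_cost(op, nbytes, storage, sequential):
    """Return the throttle units to charge; op is 'read' or 'write'."""
    if op not in ("read", "write"):
        raise ValueError(op)
    transfer = nbytes / BYTES_PER_UNIT
    # Sequential streaming pays no seek penalty; random access pays one seek.
    seek = 0.0 if sequential else SEEK_COST[storage]
    return transfer + seek


def should_throttle(op, cost, budget):
    """Decide whether to delay an operation given the remaining budget."""
    # Writes are never delayed in this sketch, so durability is not affected.
    if op == "write":
        return False
    return cost > budget
```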
(*) probably, because that's one of the issues that causes debates in other
datacentre platforms, such as Google Omega: do you want max cluster
utilisation or max determinism of workload?
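
The trade-off in (*) can be sketched as two allocation policies. This is only an illustration with hypothetical function names; it assumes every consumer has a positive share and that the shares sum to at most 1.

```python
def strict_allocation(demands, shares, capacity):
    """Deterministic: each consumer is capped at its share, even when idle
    capacity is left over."""
    return {n: min(demands[n], shares[n] * capacity) for n in demands}


def work_conserving_allocation(demands, shares, capacity):
    """Max utilisation: start from the strict caps, then hand spare capacity
    to consumers that still want more, in proportion to their shares."""
    alloc = strict_allocation(demands, shares, capacity)
    spare = capacity - sum(alloc.values())
    while spare > 1e-9:
        hungry = [n for n in demands if demands[n] - alloc[n] > 1e-9]
        if not hungry:
            break
        total_share = sum(shares[n] for n in hungry)
        given = 0.0
        for n in hungry:
            extra = min(demands[n] - alloc[n], spare * shares[n] / total_share)
            alloc[n] += extra
            given += extra
        spare -= given
    return alloc
```

With demands of 80 and 10 against equal shares of a capacity of 100, the strict policy gives the first consumer 50 and leaves 40 unused, while the work-conserving one gives it the full 80.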
If someone were to do IOP throttling in the 3.x+ timeline,
1. It needs clear use cases, YARN containers being #1 for me
2. We'd have to look at all the research done on this in the past to see
what works and what doesn't
Andrew, what citations of relevance do you have?
On 12 November 2013 04:24, lohit <[EMAIL PROTECTED]> wrote:
> 2013/11/11 Andrew Wang <[EMAIL PROTECTED]>
> > Hey Lohit,
> > This is an interesting topic, and something I actually worked on in grad
> > school before coming to Cloudera. It'd help if you could outline some of
> > your use cases and how per-FileSystem throttling would help. For what I
> > was doing, it made more sense to throttle on the DN side since you have a
> > better view over all the I/O happening on the system, and you have
> > knowledge of different volumes so you can set limits per-disk. This still
> > isn't 100% reliable though since normally a portion of each disk is used
> > for MR scratch space, which the DN doesn't have control over. I tried
> > playing with thread I/O priorities here, but didn't see much improvement.
> > Maybe the newer cgroups stuff can help out.
> Thanks. Yes, we also thought about having something on DataNode. This would
> also mean one could easily throttle clients that access from outside the
> cluster, for example distcp or hftp copies. Clients need not worry about
> throttle configs and each cluster can control how much throughput can
> be achieved. We do want to have something like this.
> > I'm sure per-FileSystem throttling will have some benefits (and probably
> > easier than some DN-side implementation) but again, it'd help to better
> > understand the problem you are trying to solve.
> One idea was flexibility for the client to override and have a value they
> can set. On a trusted cluster we could allow clients to go beyond default
Andrew Wang 2013-11-13, 06:27
Steve Loughran 2013-11-13, 10:54
Andrew Wang 2013-11-18, 18:25
Jay Vyas 2013-11-18, 18:46
Andrew Wang 2013-11-18, 21:25