-Scaling hdfs writes when dealing with large numbers of files
Juhani Connolly 2013-09-02, 05:00
Hey guys, trying to figure out an approach around what I personally feel
is a bit of an abuse of the system, would appreciate your input, and
perhaps can come up with a solution to share with others.
Short version, see below for detail:
Lot's of files being simulataneously written. Number of connections from
each data path is defined by path file no * sink no, with a lot of files
this will cause even increasing from 2->3 hdfs sinks to not scale even
though nothing(flume aggregators or HDFS datanodes) is fully loaded.
Experimenting with alternate configs, increasing data paths thus
reducing files per path, allowing 2 sinks per path has shown the number
of connections up are causing this, but what exactly might be causing
the bottleneck and is there a way to get rid of it?
We've lately been running into somewhat interesting issues with
performance seemingly capped by the number of hdfs connections that are
up. This is due to the very significant number of different files that
are being streamed simultaneously.
I have serious doubts about the viability of the approach and have
suggested storing files together and post-processing but at the moment
this appears not to be a possibility, so I'm looking for alternate
approaches to allow scaling of throughput.
Some basic stats: on each aggregator node we have roughly 2,500 files
being written to every hour, and a bit over 25,000,000 lines in that
period of time. Approx 10k events per second. There are multiple
aggregators writing to the same hdfs though each one writes to separate
files from one another.
Originally we were scaling individual aggregator nodes by increasing the
HDFS sink count, but this wasn't giving any increases beyond the second
node, despite the results from Mike Percy's
By increasing sinks we were also increasing the number of hdfs
connections for every single file, hitting some kind of
bottleneck(datanode transfer threads? Considering each has 4k and there
are a lot more datanodes than aggregators, there should be plenty
Afterwards, splitting incoming data into multiple paths(separate sources
and sinks) allowed each path to have two sinks and thus scales our
throughput by increasing the number of sinks without increasing the
connections(because each sink was only handling a fraction of the
filepaths). Now a single node has 5 avro sources set up with a file
channel and 2 sinks attached to each. With all of the transfer
threads(resulting from all the different files), each node has about
4000 threads running though at peak it can be double that.
So we have a rough idea of what is wrong(too many connections are
hitting some kind of bottleneck) but can't track any exact cause(neither
the aggregators nor datanodes are at full load, the only blocked waiting
threads in flume are those waiting on HDFS). I personally feel we just
need to reduce the number of files being simultaneously written since
HDFS isn't really made to deal with such small files, and batches are
not getting efficiently processed(waiting on dozens, possible hundreds
of small transfers before being able to commit). That being said, can
anyone provide specific insight into what may cause the bottleneck at
high connection numbers, and if there are ways around it(other than
reducing file counts and proportion of files to each sink which I'm
already pushing for)?