Re: Writing to HDFS from multiple HDFS agents (separate machines)
Hi Gary,
All the suggestions in this thread are good. Something else to consider is
that adding multiple HDFS sinks pulling from the same channel is a
recommended practice to maximize performance (the competing consumers pattern).
In that case, not only is it a good idea to put the data into
directories specific to the hostname of the Flume agent writing to
HDFS, you also need to do something like number the HDFS sink path (or
filePrefix) to indicate which HDFS sink wrote out the event, in order to
prevent name collisions.


# add hostname interceptor to your source as described above

# hdfs sinks... (note the property is hdfs.path, not just path)
agent.sinks.hdfs-1.hdfs.path = /some/path/%{host}/1/web-events
# … snip ...
agent.sinks.hdfs-2.hdfs.path = /some/path/%{host}/2/web-events
# … etc ...
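Putting it together, here is a minimal sketch of the full wiring, assuming a single agent; the agent, source, and channel names are placeholders I've made up, while the `host` interceptor and the `hdfs.filePrefix` property are standard Flume features. This variant uses a shared path plus a numbered filePrefix instead of numbered directories:

```properties
# Placeholder component names (web-src, ch1) -- adjust to your topology.
agent.sources = web-src
agent.channels = ch1
agent.sinks = hdfs-1 hdfs-2

# Host interceptor stamps each event with the agent's hostname,
# which the sinks can then reference as %{host}.
agent.sources.web-src.interceptors = host-int
agent.sources.web-src.interceptors.host-int.type = host
agent.sources.web-src.interceptors.host-int.useIP = false

# Both sinks drain the same channel (competing consumers).
# The distinct filePrefix values keep their output files from colliding.
agent.sinks.hdfs-1.type = hdfs
agent.sinks.hdfs-1.channel = ch1
agent.sinks.hdfs-1.hdfs.path = /some/path/%{host}/web-events
agent.sinks.hdfs-1.hdfs.filePrefix = sink-1

agent.sinks.hdfs-2.type = hdfs
agent.sinks.hdfs-2.channel = ch1
agent.sinks.hdfs-2.hdfs.path = /some/path/%{host}/web-events
agent.sinks.hdfs-2.hdfs.filePrefix = sink-2
```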

Hope that helps.


On Thu, Mar 14, 2013 at 3:34 PM, Gary Malouf <[EMAIL PROTECTED]> wrote:

> To be clear, I am referring to segregating data from different
> Flume sinks, as opposed to the original source of the event. Having said
> that, it sounds like your approach is the easiest.
> -Gary
> On Thu, Mar 14, 2013 at 5:54 PM, Gary Malouf <[EMAIL PROTECTED]> wrote:
>> Hi guys,
>> I'm new to flume (hdfs for that metter), using the version packaged with
>> CDH4 (1.3.0) and was wondering how others are maintaining different file
>> names being written to per HDFS sink.
>> My initial thought is to create a separate sub-directory in HDFS for each
>> sink - though I feel like the better way is to somehow prefix each file
>> with a unique sink id.  Are there any patterns that others are following
>> for this?
>> -Gary