-
Writing to HDFS from multiple HDFS agents (separate machines)
Gary Malouf 2013-03-14, 21:54
Hi guys,
I'm new to Flume (and HDFS, for that matter), using the version packaged with CDH4 (1.3.0), and was wondering how others keep the file names written by each HDFS sink distinct.
My initial thought is to create a separate sub-directory in hdfs for each sink - though I feel like the better way is to somehow prefix each file with a unique sink id. Are there any patterns that others are following for this?
-Gary
-
Re: Writing to HDFS from multiple HDFS agents (separate machines)
Mohammad Tariq 2013-03-14, 22:00
Hello sir,
One idea could be to create the subdirectories with the machines' hostnames, in case you are getting data from multiple sources. You can then easily find out which data belongs to which machine.
Warm Regards, Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com
-
RE: Writing to HDFS from multiple HDFS agents (separate machines)
Paul Chavez 2013-03-14, 22:31
You can use a Host Interceptor on the agents running an HDFS sink, and then use %{host} in the .hdfs.filePrefix property. This isn't really documented, but it works: the docs only mention using those tokens in the path property, but they seem to be OK for the filePrefix too.
Here's some excerpts of a test config I have that does just that:
#define the interceptor on the source
staging2.sources.httpSource_stg.interceptors = iHost
staging2.sources.httpSource_stg.interceptors.iHost.type = host
staging2.sources.httpSource_stg.interceptors.iHost.useIP = false
#use the header the interceptor added in the filePrefix
staging2.sinks.hdfs_FilterLogs.type = hdfs
staging2.sinks.hdfs_FilterLogs.channel = mc_FilterLogs
staging2.sinks.hdfs_FilterLogs.hdfs.path = /flume_stg/FilterLogsJSON/%Y%m%d
staging2.sinks.hdfs_FilterLogs.hdfs.filePrefix = %{host}
Hope that helps, Paul Chavez
-
Re: Writing to HDFS from multiple HDFS agents (separate machines)
Gary Malouf 2013-03-14, 22:34
To be clear, I am referring to the segregating of data from different flume sinks as opposed to the original source of the event. Having said that, it sounds like your approach is the easiest.
-Gary
-
Re: Writing to HDFS from multiple HDFS agents (separate machines)
Mike Percy 2013-03-15, 01:46
Hi Gary, All the suggestions in this thread are good. Something else to consider is that adding multiple HDFS sinks pulling from the same channel is a recommended practice to maximize performance (competing consumers pattern). In that case, not only would it be a good idea to put the data into directories that are specific to the hostname of the Flume agent writing to HDFS, you will also need to do something like number the HDFS sink path (or filePrefix) to indicate which HDFS sink wrote out the event, in order to prevent name collisions.
Example:
# add hostname interceptor to your source as described above
# hdfs sinks...
agent.sinks.hdfs-1.hdfs.path = /some/path/%{host}/1/web-events
# … snip ...
agent.sinks.hdfs-2.hdfs.path = /some/path/%{host}/2/web-events
# … etc ...
Hope that helps.
Regards, Mike
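A fuller sketch of the layout Mike describes, for readers who want to see the pieces together. The agent, source, channel, and sink names (agent, src1, ch1, hdfs-1, hdfs-2), the Avro source, and its port are assumptions for illustration, not taken from anyone's real setup. Here the two sinks share a directory and are kept apart by hdfs.filePrefix rather than by a numbered path segment; either approach should avoid name collisions.

```properties
# Hypothetical names throughout: agent, src1, ch1, hdfs-1, hdfs-2
agent.sources = src1
agent.channels = ch1
agent.sinks = hdfs-1 hdfs-2

# Example source (assumed to be Avro); the host interceptor stamps each
# event with the hostname of this HDFS-writing agent
agent.sources.src1.type = avro
agent.sources.src1.bind = 0.0.0.0
agent.sources.src1.port = 41414
agent.sources.src1.channels = ch1
agent.sources.src1.interceptors = iHost
agent.sources.src1.interceptors.iHost.type = host
agent.sources.src1.interceptors.iHost.useIP = false

agent.channels.ch1.type = memory

# Both sinks drain the same channel (competing consumers)
agent.sinks.hdfs-1.type = hdfs
agent.sinks.hdfs-1.channel = ch1
agent.sinks.hdfs-1.hdfs.path = /some/path/%{host}/web-events
agent.sinks.hdfs-1.hdfs.filePrefix = sink1

agent.sinks.hdfs-2.type = hdfs
agent.sinks.hdfs-2.channel = ch1
agent.sinks.hdfs-2.hdfs.path = /some/path/%{host}/web-events
agent.sinks.hdfs-2.hdfs.filePrefix = sink2
```

With this layout, the directory tells you which agent wrote a file and the prefix tells you which sink on that agent wrote it.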
-
Re: Writing to HDFS from multiple HDFS agents (separate machines)
Gary Malouf 2013-03-15, 02:30
Paul, I interpreted the host property as identifying the host that an event originates from, rather than the host of the sink that writes the event to HDFS. Is my understanding correct?
What happens if I am using the NettyAvroRpcClient to feed events from a different server, round-robin style, to two HDFS-writing agents? Should I then NOT set the host property on the client side and rely on the interceptor?
-
Re: Writing to HDFS from multiple HDFS agents (separate machines)
Gary Malouf 2013-03-15, 02:42
Thanks for the pointer Mike. Any thoughts on how you choose how many consumers per channel? I will eventually find the optimal number via perf testing, but it would be good to start with a nice default.
Thanks,
Gary
-
Re: Writing to HDFS from multiple HDFS agents (separate machines)
Paul Chavez 2013-03-15, 03:30
It just depends on what you want to do with the header. In the case I presented, the header is set by the agent running the HDFS sink, which seemed to align with your use case. If you need to know the originating host, just have the interceptor or the originating host set a different header. The %{} notation lets you specify an arbitrary header to swap in for the token, as long as it exists, of course.
-Paul
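One way to keep both pieces of information, sketched on the assumption that the Host Interceptor's hostHeader property is available in your Flume version. The header name writerHost is made up for illustration; the component names reuse Paul's excerpt above. The idea is that the HDFS-writing agent records itself under its own header, so a host header set on the client side is not overwritten.

```properties
# On each HDFS-writing agent
staging2.sources.httpSource_stg.interceptors = iHost
staging2.sources.httpSource_stg.interceptors.iHost.type = host
staging2.sources.httpSource_stg.interceptors.iHost.useIP = false
# Store this agent's hostname under "writerHost" (hypothetical name)
# instead of the default "host" header, leaving any client-set "host" alone
staging2.sources.httpSource_stg.interceptors.iHost.hostHeader = writerHost

# File names identify the agent that wrote them
staging2.sinks.hdfs_FilterLogs.hdfs.path = /flume_stg/FilterLogsJSON/%Y%m%d
staging2.sinks.hdfs_FilterLogs.hdfs.filePrefix = %{writerHost}
```

If the originating host matters as well, the client-set header could be referenced in hdfs.path in the same way, provided every event actually carries it.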
-
Re: Writing to HDFS from multiple HDFS agents (separate machines)
Mike Percy 2013-03-15, 20:43
In my experience, 3-5 HDFS sinks will give optimal performance, but it's dependent on whether you use memory channel or file channel, your overall throughput, batch sizes, and event sizes.
Regards, Mike
-
Re: Writing to HDFS from multiple HDFS agents (separate machines)
Seshu V 2013-03-15, 21:20
I could differentiate different sources using this config, which creates separate directories by hostname:
agent.sources.syslogsrc.interceptors = ts
agent.sources.syslogsrc.interceptors.ts.type = timestamp
agent.sinks.hdfsSink.hdfs.path = hdfs://<ip_addr>:<port>/flumetest/%{host}/%y-%m-%d
However, I have a related question. Two different products are sending their logs to one source, and I am collecting them via syslog. Is there a way to differentiate the logs of the two products coming from a single source in Flume? Ideally I would like a subdirectory at the sink like '/flumetest/%{host}/<product_name>/%y-%m-%d'. How can I do this?
Thanks,
- Seshu
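One possible approach, not confirmed by anyone in this thread: since %{} can substitute any header, a regex extractor interceptor could copy the product name out of the event body into a header of its own. The sketch below assumes each message body starts with the product name followed by a colon, which is a guess about the log format; the interceptor name prod and the header name product are hypothetical.

```properties
agent.sources.syslogsrc.interceptors = ts prod
agent.sources.syslogsrc.interceptors.ts.type = timestamp
# Capture the leading token of the event body into a "product" header
# (assumes bodies look like "<product_name>: rest of message")
agent.sources.syslogsrc.interceptors.prod.type = regex_extractor
agent.sources.syslogsrc.interceptors.prod.regex = ^([A-Za-z0-9_-]+):
agent.sources.syslogsrc.interceptors.prod.serializers = s1
agent.sources.syslogsrc.interceptors.prod.serializers.s1.name = product

agent.sinks.hdfsSink.hdfs.path = hdfs://<ip_addr>:<port>/flumetest/%{host}/%{product}/%y-%m-%d
```

Events whose body does not match the pattern would get no product header, so it is worth checking where those end up before relying on this.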