Re: Roll based on date
Hi David,

About every few seconds.

Thanks.

Martinus
On Thu, Oct 24, 2013 at 9:49 PM, David Sinclair <
[EMAIL PROTECTED]> wrote:

> How often are your events coming in?
>
>
> On Thu, Oct 24, 2013 at 2:21 AM, Martinus m <[EMAIL PROTECTED]> wrote:
>
>> Hi David,
>>
>> Thanks for the example. I set it up just like the above, but it only
>> generated files for the first 15 minutes. After waiting for more than an
>> hour, there were no new updates at all in the S3 bucket.
>>
>> Thanks.
>>
>> Martinus
>>
>>
>> On Wed, Oct 23, 2013 at 8:48 PM, David Sinclair <
>> [EMAIL PROTECTED]> wrote:
>>
>>> You can set all of the time/size-based rolling policies to zero and set
>>> an idle timeout on the sink. The example below uses a 15-minute timeout
>>> (hdfs.idleTimeout is specified in seconds):
>>>
>>> agent.sinks.sink.hdfs.fileSuffix = FlumeData.%Y-%m-%d
>>> agent.sinks.sink.hdfs.fileType = DataStream
>>> agent.sinks.sink.hdfs.rollInterval = 0
>>> agent.sinks.sink.hdfs.rollSize = 0
>>> agent.sinks.sink.hdfs.batchSize = 0
>>> agent.sinks.sink.hdfs.rollCount = 0
>>> agent.sinks.sink.hdfs.idleTimeout = 900
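>>>
>>> For completeness, the rest of the sink definition this snippet assumes
>>> would look something like the following - the agent, sink, and channel
>>> names here are just placeholders:
>>>
>>> agent.sinks.sink.type = hdfs
>>> agent.sinks.sink.channel = channel
>>> agent.sinks.sink.hdfs.path = (your desired path string)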
>>>
>>>
>>>
>>> On Tue, Oct 22, 2013 at 10:17 PM, Martinus m <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi David,
>>>>
>>>> The requirement is actually only to roll once per day.
>>>>
>>>> Hi Devin,
>>>>
>>>> Thanks for sharing your experience. I also tried setting the config as
>>>> follows:
>>>>
>>>> agent.sinks.sink.hdfs.fileSuffix = FlumeData.%Y-%m-%d
>>>> agent.sinks.sink.hdfs.fileType = DataStream
>>>> agent.sinks.sink.hdfs.rollInterval = 0
>>>> agent.sinks.sink.hdfs.rollSize = 0
>>>> agent.sinks.sink.hdfs.batchSize = 15000
>>>> agent.sinks.sink.hdfs.rollCount = 0
>>>>
>>>> But I didn't see anything in the S3 bucket, so I guess I need to change
>>>> the rollInterval to 86400. My understanding is that rollInterval = 86400
>>>> will roll the file after 24 hours, like you said, but it will not start
>>>> a new file when the date changes before a full 24-hour interval has
>>>> passed (unless we put a date pattern in fileSuffix as above).
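>>>>
>>>> Alternatively, I suppose putting the date escape in hdfs.path (which
>>>> does support escape sequences) would bucket files by day regardless of
>>>> the roll interval - a minimal sketch, with an illustrative bucket name:
>>>>
>>>> agent.sinks.sink.hdfs.path = s3n://my-bucket/flume/%Y-%m-%d
>>>> agent.sinks.sink.hdfs.useLocalTimeStamp = true
>>>> agent.sinks.sink.hdfs.rollInterval = 86400
>>>>
>>>> (useLocalTimeStamp is needed so the %Y-%m-%d escape resolves even when
>>>> events carry no timestamp header.)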
>>>>
>>>> Thanks to both of you.
>>>>
>>>> Best regards,
>>>>
>>>> Martinus
>>>>
>>>>
>>>> On Tue, Oct 22, 2013 at 11:16 PM, DSuiter RDX <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Martinus, you have to set all the other roll options to 0 explicitly
>>>>> in the configuration if you want files to roll on only one parameter;
>>>>> the sink rolls on whichever trigger it hits first. If you want it to
>>>>> roll once a day, you have to specifically disable all the other roll
>>>>> triggers - they all take default settings unless told otherwise. When I
>>>>> was experimenting, for example, it kept rolling every 30 seconds even
>>>>> though I had hdfs.rollSize set to 64MB (our test data is generated
>>>>> slowly). So I ended up with a pile of small (0.2KB - ~19KB) files in a
>>>>> bunch of directories sorted by timestamp in ten-minute intervals.
>>>>>
>>>>> So, maybe a conf like this:
>>>>>
>>>>> agent.sinks.sink.type = hdfs
>>>>> agent.sinks.sink.channel = channel
>>>>> agent.sinks.sink.hdfs.path = (desired path string, yours looks fine)
>>>>> agent.sinks.sink.hdfs.fileSuffix = .avro
>>>>> agent.sinks.sink.serializer = avro_event
>>>>> agent.sinks.sink.hdfs.fileType = DataStream
>>>>> agent.sinks.sink.hdfs.rollInterval = 86400
>>>>> agent.sinks.sink.hdfs.rollSize = 134217728
>>>>> agent.sinks.sink.hdfs.batchSize = 15000
>>>>> agent.sinks.sink.hdfs.rollCount = 0
>>>>>
>>>>> This one will roll the file in HDFS at 24-hour intervals or at 128MB
>>>>> file size, and will flush to HDFS every 15000 events. But if the
>>>>> hdfs.rollCount line were not set to "0" or some higher value (I
>>>>> probably could have set it to 15000 to match hdfs.batchSize for the
>>>>> same result), the file would roll as soon as the default of only 10
>>>>> events had been written to it.
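>>>>>
>>>>> If you did want the count-based roll to line up with the batch size,
>>>>> that pair of settings would look like this (values illustrative):
>>>>>
>>>>> agent.sinks.sink.hdfs.batchSize = 15000
>>>>> agent.sinks.sink.hdfs.rollCount = 15000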
>>>>>
>>>>> Are you using a 1-tier or 2-tier design for this? For syslog, we
>>>>> collect with a syslogTCP source, which receives events from a remote
>>>>> host. The events then go to an Avro sink to aggregate the small event
>>>>> entries into larger Avro files. Then, a