Re: Roll based on date
Hi David,

About every few seconds.

Thanks.

Martinus
On Thu, Oct 24, 2013 at 9:49 PM, David Sinclair <
[EMAIL PROTECTED]> wrote:

> How often are your events coming in?
>
>
> On Thu, Oct 24, 2013 at 2:21 AM, Martinus m <[EMAIL PROTECTED]> wrote:
>
>> Hi David,
>>
>> Thanks for the example. I have set it just like the above, but it only
>> generated files for the first 15 minutes. After waiting for more than an
>> hour, there were no updates at all in the S3 bucket.
>>
>> Thanks.
>>
>> Martinus
>>
>>
>> On Wed, Oct 23, 2013 at 8:48 PM, David Sinclair <
>> [EMAIL PROTECTED]> wrote:
>>
>>> You can set all of the time/size based rolling policies to zero and set
>>> an idle timeout on the sink. The config below uses a 15-minute timeout:
>>>
>>> agent.sinks.sink.hdfs.fileSuffix = FlumeData.%Y-%m-%d
>>> agent.sinks.sink.hdfs.fileType = DataStream
>>> agent.sinks.sink.hdfs.rollInterval = 0
>>> agent.sinks.sink.hdfs.rollSize = 0
>>> agent.sinks.sink.hdfs.batchSize = 0
>>> agent.sinks.sink.hdfs.rollCount = 0
>>> agent.sinks.sink.hdfs.idleTimeout = 900
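For reference, a minimal sketch of the companion approach: putting the date
escape in hdfs.path rather than only in fileSuffix, so a new file starts as
soon as the day changes, with the idle timeout closing out the previous day's
file. The agent, sink, and bucket names here are hypothetical, and
hdfs.useLocalTimeStamp assumes Flume 1.3+ (otherwise each event needs a
timestamp header, e.g. from a timestamp interceptor):

agent.sinks.sink.type = hdfs
# Hypothetical bucket; %Y-%m-%d resolves per event, so a new day means a new directory and file
agent.sinks.sink.hdfs.path = s3n://mybucket/flume/%Y-%m-%d
# Stamp each event with the agent's clock so the date escapes can resolve
agent.sinks.sink.hdfs.useLocalTimeStamp = true
# Zero out the size/time/count triggers; only the date in the path and the idle timeout cut files
agent.sinks.sink.hdfs.rollInterval = 0
agent.sinks.sink.hdfs.rollSize = 0
agent.sinks.sink.hdfs.rollCount = 0
agent.sinks.sink.hdfs.idleTimeout = 900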
>>>
>>>
>>>
>>> On Tue, Oct 22, 2013 at 10:17 PM, Martinus m <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi David,
>>>>
>>>> The requirement is only roll per day actually.
>>>>
>>>> Hi Devin,
>>>>
>>>> Thanks for sharing your experience. I also tried to set the config as
>>>> follows:
>>>>
>>>> agent.sinks.sink.hdfs.fileSuffix = FlumeData.%Y-%m-%d
>>>> agent.sinks.sink.hdfs.fileType = DataStream
>>>> agent.sinks.sink.hdfs.rollInterval = 0
>>>> agent.sinks.sink.hdfs.rollSize = 0
>>>> agent.sinks.sink.hdfs.batchSize = 15000
>>>> agent.sinks.sink.hdfs.rollCount = 0
>>>>
>>>> But I didn't see anything in the S3 bucket. So I guess I need to change
>>>> rollInterval to 86400. In my understanding, rollInterval = 86400 will
>>>> roll the file after 24 hours as you said, but it will not start a new
>>>> file when the day changes if 24 hours haven't yet elapsed (unless we put
>>>> the date into fileSuffix as above).
>>>>
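One caveat on the date escapes: %Y-%m-%d in fileSuffix or hdfs.path only
resolves if each event carries a timestamp header. A minimal sketch of a
timestamp interceptor that adds one at the source (the source name "source"
is an assumption):

agent.sources.source.interceptors = ts
# Adds a "timestamp" header (epoch millis) to every event so the date escapes can resolve
agent.sources.source.interceptors.ts.type = timestamp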
>>>> Thanks to both of you.
>>>>
>>>> Best regards,
>>>>
>>>> Martinus
>>>>
>>>>
>>>> On Tue, Oct 22, 2013 at 11:16 PM, DSuiter RDX <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Martinus, you have to set all the other roll options to 0 explicitly in
>>>>> the configuration if you want the sink to roll on only one parameter;
>>>>> otherwise it will roll on whichever trigger it meets first. If you want
>>>>> it to roll once a day, you have to specifically disable all the other
>>>>> roll triggers - they all take default settings unless told not to. When
>>>>> I was experimenting, for example, it kept rolling every 30 seconds even
>>>>> though I had hdfs.rollSize set to 64 MB (our test data is generated
>>>>> slowly). So I ended up with a pile of small (0.2 KB - ~19 KB) files in a
>>>>> bunch of directories sorted by timestamp in ten-minute intervals.
>>>>>
>>>>> So, maybe a conf like this:
>>>>>
>>>>> agent.sinks.sink.type = hdfs
>>>>> agent.sinks.sink.channel = channel
>>>>> agent.sinks.sink.hdfs.path = (desired path string, yours looks fine)
>>>>> agent.sinks.sink.hdfs.fileSuffix = .avro
>>>>> agent.sinks.sink.serializer = avro_event
>>>>> agent.sinks.sink.hdfs.fileType = DataStream
>>>>> agent.sinks.sink.hdfs.rollInterval = 86400
>>>>> agent.sinks.sink.hdfs.rollSize = 134217728
>>>>> agent.sinks.sink.hdfs.batchSize = 15000
>>>>> agent.sinks.sink.hdfs.rollCount = 0
>>>>>
>>>>> This one will roll the file in HDFS every 24 hours or at 128 MB of file
>>>>> size, whichever comes first, and will flush to the file every 15000
>>>>> events. But if the hdfs.rollCount line were not set to 0 or some higher
>>>>> value (I probably could have set it to 15000 to match hdfs.batchSize for
>>>>> the same result), the file would roll as soon as the default of only 10
>>>>> events had been written to it.
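To make the trigger precedence concrete, a minimal sketch that rolls on size
alone, with every other trigger explicitly zeroed so its default cannot fire
first (agent and sink names assumed as in the configs above):

# Defaults otherwise apply: rollInterval 30 s, rollSize 1024 bytes, rollCount 10 events
agent.sinks.sink.hdfs.rollInterval = 0
agent.sinks.sink.hdfs.rollCount = 0
# Roll only once the open file reaches 128 MB
agent.sinks.sink.hdfs.rollSize = 134217728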
>>>>>
>>>>> Are you using a 1-tier or 2-tier design for this? For syslogTCP, we
>>>>> collect from syslogTCP coming in from a remote host. It then goes to an
>>>>> Avro sink to aggregate the small event entries into larger Avro files.
>>>>> Then, a