Re: Roll based on date
Hi David,

Thanks for the example. I have set it up just like the above, but it only
generated files for the first 15 minutes. After waiting for more than an
hour, there are no new updates at all in the S3 bucket.

Thanks.

Martinus
On Wed, Oct 23, 2013 at 8:48 PM, David Sinclair <
[EMAIL PROTECTED]> wrote:

> You can set all of the time/size based rolling policies to zero and set an
> idle timeout on the sink. Below has a 15 minute timeout
>
> agent.sinks.sink.hdfs.fileSuffix = FlumeData.%Y-%m-%d
> agent.sinks.sink.hdfs.fileType = DataStream
> agent.sinks.sink.hdfs.rollInterval = 0
> agent.sinks.sink.hdfs.rollSize = 0
> agent.sinks.sink.hdfs.batchSize = 0
> agent.sinks.sink.hdfs.rollCount = 0
> agent.sinks.sink.hdfs.idleTimeout = 900
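>
> (A note for completeness: this assumes the sink itself is already
> declared. A minimal framing, with a placeholder S3 path, might be:
>
> agent.sinks.sink.type = hdfs
> agent.sinks.sink.channel = channel
> agent.sinks.sink.hdfs.path = s3n://your-bucket/flume/
>
> With the rolling policies zeroed out, the idle timeout alone closes the
> file once no events have arrived for 900 seconds.)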
>
>
>
> On Tue, Oct 22, 2013 at 10:17 PM, Martinus m <[EMAIL PROTECTED]> wrote:
>
>> Hi David,
>>
>> The requirement is actually only to roll once per day.
>>
>> Hi Devin,
>>
>> Thanks for sharing your experience. I also tried to set the config as
>> follows:
>>
>> agent.sinks.sink.hdfs.fileSuffix = FlumeData.%Y-%m-%d
>> agent.sinks.sink.hdfs.fileType = DataStream
>> agent.sinks.sink.hdfs.rollInterval = 0
>> agent.sinks.sink.hdfs.rollSize = 0
>> agent.sinks.sink.hdfs.batchSize = 15000
>> agent.sinks.sink.hdfs.rollCount = 0
>>
>> But I didn't see anything in the S3 bucket, so I guess I need to change
>> rollInterval to 86400. In my understanding, rollInterval = 86400 will
>> roll the file after 24 hours, as you said, but it will not start a new
>> file when the day changes if 24 hours haven't yet elapsed (unless we put
>> a date pattern in fileSuffix, as above).
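>>
>> (To illustrate, a rough sketch of what I mean - the bucket name here is
>> just a placeholder:
>>
>> agent.sinks.sink.hdfs.path = s3n://my-bucket/flume/%Y-%m-%d
>> agent.sinks.sink.hdfs.useLocalTimeStamp = true
>> agent.sinks.sink.hdfs.rollInterval = 86400
>>
>> so that a date change also moves writes to a new path, rather than
>> depending only on the 24-hour interval.)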
>>
>> Thanks to both of you.
>>
>> Best regards,
>>
>> Martinus
>>
>>
>> On Tue, Oct 22, 2013 at 11:16 PM, DSuiter RDX <[EMAIL PROTECTED]> wrote:
>>
>>> Martinus, you have to set all of the other roll options to 0 explicitly
>>> in the configuration if you want files to roll on only one parameter;
>>> the sink rolls on whichever trigger it can satisfy first. If you want it
>>> to roll once a day, you have to specifically disable all the other roll
>>> triggers - they all take default values unless told otherwise. When I
>>> was experimenting, for example, files kept rolling every 30 seconds even
>>> though I had hdfs.rollSize set to 64 MB (our test data is generated
>>> slowly), so I ended up with a pile of small (0.2 KB to ~19 KB) files in
>>> a bunch of directories sorted by timestamp into ten-minute intervals.
>>>
>>> So, maybe a conf like this:
>>>
>>> agent.sinks.sink.type = hdfs
>>> agent.sinks.sink.channel = channel
>>> agent.sinks.sink.hdfs.path = (desired path string, yours looks fine)
>>> agent.sinks.sink.hdfs.fileSuffix = .avro
>>> agent.sinks.sink.serializer = avro_event
>>> agent.sinks.sink.hdfs.fileType = DataStream
>>> agent.sinks.sink.hdfs.rollInterval = 86400
>>> agent.sinks.sink.hdfs.rollSize = 134217728
>>> agent.sinks.sink.hdfs.batchSize = 15000
>>> agent.sinks.sink.hdfs.rollCount = 0
>>>
>>> This one will roll the file in HDFS every 24 hours, or once it reaches
>>> 128 MB, and it flushes events to HDFS in batches of 15000. But if the
>>> hdfs.rollCount line were not set to 0 or some higher value (I probably
>>> could have set it to 15000 to match hdfs.batchSize for the same result),
>>> the file would roll as soon as the default of only 10 events had been
>>> written to it.
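>>>
>>> For reference, the relevant defaults from the Flume User Guide are:
>>>
>>> agent.sinks.sink.hdfs.rollInterval = 30
>>> agent.sinks.sink.hdfs.rollSize = 1024
>>> agent.sinks.sink.hdfs.rollCount = 10
>>>
>>> i.e. 30 seconds, 1024 bytes, and 10 events - which is why leaving any
>>> of them unset produces 30-second rolls and tiny files.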
>>>
>>> Are you using a 1-tier or 2-tier design for this? For syslog, we collect
>>> with a syslogTCP source from the remote hosts, and it then goes to an
>>> avro sink to aggregate the small event entries into larger avro files. A
>>> second tier collects that with an avro source and writes it out with an
>>> hdfs sink. So we get everything as individual events streamed into an
>>> avro container, and the container is put into HDFS every 24 hours or
>>> when it hits 128 MB. We were getting many small files because of the low
>>> velocity of our sample set, and we did not want to clutter up the
>>> FSImage. The avro serializer and DataStream file type are also
>>> necessary, because the default behavior of the HDFS sink is to write
>>> SequenceFiles, which would not give you a plain avro file.
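>>>
>>> A rough sketch of that two-tier wiring (host names, ports, and component
>>> names here are placeholders, not our actual config):
>>>
>>> # tier 1: syslogTCP in, avro out to the collector
>>> tier1.sources = syslog
>>> tier1.channels = ch1
>>> tier1.sinks = avroOut
>>> tier1.sources.syslog.type = syslogtcp
>>> tier1.sources.syslog.host = 0.0.0.0
>>> tier1.sources.syslog.port = 5140
>>> tier1.sources.syslog.channels = ch1
>>> tier1.channels.ch1.type = memory
>>> tier1.sinks.avroOut.type = avro
>>> tier1.sinks.avroOut.hostname = collector.example.com
>>> tier1.sinks.avroOut.port = 4545
>>> tier1.sinks.avroOut.channel = ch1
>>>
>>> # tier 2: avro in, hdfs out (with the roll settings above)
>>> tier2.sources = avroIn
>>> tier2.channels = ch1
>>> tier2.sinks = sink
>>> tier2.sources.avroIn.type = avro
>>> tier2.sources.avroIn.bind = 0.0.0.0
>>> tier2.sources.avroIn.port = 4545
>>> tier2.sources.avroIn.channels = ch1
>>> tier2.channels.ch1.type = memory
>>> tier2.sinks.sink.type = hdfs
>>> tier2.sinks.sink.channel = ch1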