Re: Roll based on date
Hi David,

The requirement is only roll per day actually.

Hi Devin,

Thanks for sharing your experience. I also tried to set the config as
follows:

agent.sinks.sink.hdfs.fileSuffix = FlumeData.%Y-%m-%d
agent.sinks.sink.hdfs.fileType = DataStream
agent.sinks.sink.hdfs.rollInterval = 0
agent.sinks.sink.hdfs.rollSize = 0
agent.sinks.sink.hdfs.batchSize = 15000
agent.sinks.sink.hdfs.rollCount = 0

But I didn't see anything in the S3 bucket - with all three roll triggers
set to 0, the sink never closes the file - so I guess I need to change
rollInterval to 86400. In my understanding, rollInterval = 86400 will roll
the file after 24 hours like you said, but it will not start a new file
when the day changes if the 24-hour interval has not yet elapsed (unless
we put the date pattern in fileSuffix as above).
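
For reference, a minimal sketch of that date-bucketing idea (the bucket
name and agent/sink names are illustrative; hdfs.useLocalTimeStamp stamps
events that lack a timestamp header):

agent.sinks.sink.hdfs.path = s3n://my-bucket/flume/%Y-%m-%d
agent.sinks.sink.hdfs.filePrefix = FlumeData
agent.sinks.sink.hdfs.useLocalTimeStamp = true
agent.sinks.sink.hdfs.rollInterval = 86400
agent.sinks.sink.hdfs.rollSize = 0
agent.sinks.sink.hdfs.rollCount = 0

With the date in hdfs.path rather than fileSuffix, events stamped with a
new day land in a new directory (and hence a new file) even if the 24-hour
rollInterval has not fired yet.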

Thanks to both of you.

Best regards,

Martinus
On Tue, Oct 22, 2013 at 11:16 PM, DSuiter RDX <[EMAIL PROTECTED]> wrote:

> Martinus, if you want the sink to roll on only one parameter, you have to
> set all the other roll options to 0 explicitly in the configuration; it
> will roll on the shortest trigger it can meet. If you want it to roll once
> a day, you will have to specifically disable all the other roll triggers -
> they all take default settings unless told otherwise. When I was
> experimenting, for example, it kept rolling every 30 seconds even though I
> had hdfs.rollSize set to 64 MB (our test data is generated slowly). So I
> ended up with a pile of small (0.2 KB - ~19 KB) files in a bunch of
> directories sorted by timestamp at ten-minute intervals.
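>
> For reference, these are the stock roll-trigger defaults in play there (a
> sketch of just those keys; values per the Flume HDFS sink documentation):
>
> agent.sinks.sink.hdfs.rollInterval = 30 (seconds)
> agent.sinks.sink.hdfs.rollSize = 1024 (bytes)
> agent.sinks.sink.hdfs.rollCount = 10 (events)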
>
> So, maybe a conf like this:
>
> agent.sinks.sink.type = hdfs
> agent.sinks.sink.channel = channel
> agent.sinks.sink.hdfs.path = (desired path string, yours looks fine)
> agent.sinks.sink.hdfs.fileSuffix = .avro
> agent.sinks.sink.serializer = avro_event
> agent.sinks.sink.hdfs.fileType = DataStream
> agent.sinks.sink.hdfs.rollInterval = 86400
> agent.sinks.sink.hdfs.rollSize = 134217728
> agent.sinks.sink.hdfs.batchSize = 15000
> agent.sinks.sink.hdfs.rollCount = 0
>
> This one will roll the file in HDFS at 24-hour intervals, or at a 128 MB
> file size, and will flush to HDFS every 15000 events (hdfs.batchSize
> controls the flush, not the roll). But if the hdfs.rollCount line were not
> set to "0" or some higher value (I probably could have set it to 15000 to
> match hdfs.batchSize for the same results) then the file would roll as
> soon as the default of only 10 events were written into it.
>
> Are you using a 1-tier or 2-tier design for this? For syslog, we collect
> with a syslogTCP source, which receives events from a remote host. It then
> goes to an avro sink to aggregate the small event entries into larger avro
> files. Then, a second tier collects that with an avro source, then an hdfs
> sink. So, we get them
> all as individual events streamed into an avro container, then the avro
> container is put into HDFS every 24 hours or if it hits 128 MB. We were
> getting many small files because of the lower velocity of our sample set,
> and we did not want to clutter up FSImage. The avro serializer and
> DataStream file type are also necessary, because the default behavior of
> the HDFS sink is to write SequenceFile format.
>
> Hope this helps you out.
>
> Sincerely,
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
>
>
> On Tue, Oct 22, 2013 at 10:07 AM, David Sinclair <
> [EMAIL PROTECTED]> wrote:
>
>> Do you need to roll based on size as well? Can you tell me the
>> requirements?
>>
>>
>>> On Tue, Oct 22, 2013 at 2:15 AM, Martinus m <[EMAIL PROTECTED]> wrote:
>>
>>> Hi David,
>>>
>>> Thanks for your answer. I already did that, using %Y-%m-%d. But since
>>> the sink still rolls based on size, it keeps generating two or more
>>> FlumeData.%Y-%m-%d files with different suffixes.
>>>
>>> Thanks.
>>>
>>> Martinus
>>>
>>>
>>> On Fri, Oct 18, 2013 at 10:35 PM, David Sinclair <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>> The SyslogTcpSource will put a header on the flume event named
>>>> timestamp, which you can then use in escape sequences like %Y-%m-%d in
>>>> the sink's path.