Re: Roll based on date
Martinus, if you want the sink to roll on only one parameter, you have to
set all the other roll options to 0 explicitly in the configuration. The
sink rolls on whichever trigger is met first. If you want it to roll once a
day, you have to specifically disable all the other roll triggers, because
they all take default settings unless told not to. When I was experimenting,
for example, files kept rolling every 30 seconds (the hdfs.rollInterval
default) even though I had hdfs.rollSize set to 64MB, because our test data
is generated slowly. So I ended up with a pile of small (0.2KB to ~19KB)
files in a bunch of directories sorted by timestamp in ten-minute intervals.

So, maybe a conf like this:

agent.sinks.sink.type = hdfs
agent.sinks.sink.channel = channel
agent.sinks.sink.hdfs.path = (desired path string, yours looks fine)
agent.sinks.sink.hdfs.fileSuffix = .avro
agent.sinks.sink.serializer = avro_event
agent.sinks.sink.hdfs.fileType = DataStream
agent.sinks.sink.hdfs.rollInterval = 86400
agent.sinks.sink.hdfs.rollSize = 134217728
agent.sinks.sink.hdfs.batchSize = 15000
agent.sinks.sink.hdfs.rollCount = 0

This one will roll the file in HDFS every 24 hours, or when the file reaches
128MB, whichever comes first. Note that hdfs.batchSize does not trigger a
roll: it is the number of events written to the file before it is flushed
to HDFS. If the hdfs.rollCount line were not set to "0" (or to some
suitably high value), the file would roll as soon as the default of only 10
events had been written to it.
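
If the goal is strictly one roll per day with no other trigger, a minimal
sketch would disable the size and count triggers as well. This assumes the
same agent and sink names as the conf above and that one large file per day
is acceptable; it is not something I have tested against your setup:

# Roll on time only: 86400 seconds = 24 hours
agent.sinks.sink.hdfs.rollInterval = 86400
# 0 disables the size-based trigger (default is 1024 bytes)
agent.sinks.sink.hdfs.rollSize = 0
# 0 disables the event-count trigger (default is 10 events)
agent.sinks.sink.hdfs.rollCount = 0

As far as I understand it, rollInterval is measured from when the file is
opened, not from midnight, so on its own it will not line files up with
calendar days. Keeping a date escape such as %Y-%m-%d in the path or file
prefix is what should separate events by the date in their timestamp header.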

Are you using a 1-tier or 2-tier design for this? In our setup, the first
tier collects with a syslogTCP source from the remote hosts and forwards the
events through an avro sink. A second tier receives them with an avro
source and writes them out through the hdfs sink. So we get all the
individual events streamed into one avro container file, and that file is
closed in HDFS every 24 hours or when it hits 128 MB. We were getting many
small files because of the lower velocity of our sample set, and we did not
want to clutter up the NameNode's FSImage. The avro_event serializer and
the DataStream file type are also necessary, because by default the HDFS
sink writes files in SequenceFile format.
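
A rough sketch of that 2-tier layout, in case it helps. The agent names
(tier1, tier2), component names, the host collector.example.com, and the
ports 5140 and 4545 are all placeholders I made up for illustration, and
the memory channels are only the simplest choice, not a recommendation:

# Tier 1: receive syslog over TCP, forward to the collector over avro
tier1.sources = syslog
tier1.channels = mem
tier1.sinks = toCollector
tier1.sources.syslog.type = syslogtcp
tier1.sources.syslog.host = 0.0.0.0
tier1.sources.syslog.port = 5140
tier1.sources.syslog.channels = mem
tier1.channels.mem.type = memory
tier1.sinks.toCollector.type = avro
tier1.sinks.toCollector.hostname = collector.example.com
tier1.sinks.toCollector.port = 4545
tier1.sinks.toCollector.channel = mem

# Tier 2: receive avro events from tier 1, write to HDFS
tier2.sources = fromTier1
tier2.channels = mem
tier2.sinks = hdfsSink
tier2.sources.fromTier1.type = avro
tier2.sources.fromTier1.bind = 0.0.0.0
tier2.sources.fromTier1.port = 4545
tier2.sources.fromTier1.channels = mem
tier2.channels.mem.type = memory
tier2.sinks.hdfsSink.type = hdfs
tier2.sinks.hdfsSink.channel = mem
# the remaining hdfs.* and serializer settings go here, as in the conf above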

Hope this helps you out.

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
On Tue, Oct 22, 2013 at 10:07 AM, David Sinclair <

> Do you need to roll based on size as well? Can you tell me the
> requirements?
> On Tue, Oct 22, 2013 at 2:15 AM, Martinus m <[EMAIL PROTECTED]> wrote:
>> Hi David,
>> Thanks for your answer. I already did that, but using %Y-%m-%d. But
>> since it still rolls based on size, it keeps generating two or more
>> FlumeData.%Y-%m-%d files with different postfixes.
>> Thanks.
>> Martinus
>> On Fri, Oct 18, 2013 at 10:35 PM, David Sinclair <
>> [EMAIL PROTECTED]> wrote:
>>> The SyslogTcpSource will put a header on the flume event named
>>> 'timestamp'. This timestamp will be from the syslog entry. You could then
>>> set the filePrefix in the sink to grab this out.
>>> For example
>>> tier1.sinks.hdfsSink.hdfs.filePrefix = FlumeData.%{timestamp}
>>> dave
>>> On Thu, Oct 17, 2013 at 10:23 PM, Martinus m <[EMAIL PROTECTED]> wrote:
>>>> Hi David,
>>>> It's syslogtcp.
>>>> Thanks.
>>>> Martinus
>>>> On Thu, Oct 17, 2013 at 9:17 PM, David Sinclair <
>>>> [EMAIL PROTECTED]> wrote:
>>>>> What type of source are you using?
>>>>> On Wed, Oct 16, 2013 at 9:56 PM, Martinus m <[EMAIL PROTECTED]> wrote:
>>>>>> Hi,
>>>>>> Is there any option in the HDFS sink to start rolling a new file
>>>>>> whenever the date in the log changes? For example, I have the logs below:
>>>>>> Oct 16 23:58:56 test-host : just test
>>>>>> Oct 16 23:59:51 test-host : test again
>>>>>> Oct 17 00:00:56 test-host : just test