Re: Problem setting the rollInterval for HDFS sink
DSuiter RDX 2013-10-24, 15:09
No problem! Glad I was able to help!
*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
On Thu, Oct 24, 2013 at 11:05 AM, Christopher Surage <[EMAIL PROTECTED]> wrote:
> First of all, thank you for your help - the typo was the problem. Second,
> the reason I was using DataStream as the file type for my HDFS sink is that
> when I had it as a SequenceFile, the sink was adding a lot of garbage data
> to the file when it copied to HDFS, which caused undesired behavior with my
> Hive table. When I changed to DataStream, it just put the plain text in the
> file. With regard to the channels, that is something I will definitely look
> at in order to fine-tune performance, now that this problem is solved. I
> have fumbled around with the memory channel, playing with the capacity and
> transactionCapacity attributes, and I have run into choking of the channel -
> I just have to read more about it. I don't know if you have seen these
> before, but I've been looking at them.
> Thanks for your help,
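For context on the two memory-channel attributes mentioned above, a minimal sketch (the values are illustrative, not a tuned recommendation; the `a1`/`c1` names follow the config quoted at the end of the thread):

```properties
# Memory channel sizing - illustrative values only
a1.channels.c1.type = memory
# capacity: maximum number of events the channel can hold at once
a1.channels.c1.capacity = 10000
# transactionCapacity: maximum events per put/take transaction;
# it must be at least as large as any source or sink batch size,
# otherwise the channel "chokes" with transaction-capacity errors
a1.channels.c1.transactionCapacity = 1000
```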
> On Thu, Oct 24, 2013 at 10:17 AM, DSuiter RDX <[EMAIL PROTECTED]> wrote:
>> I use a very similar setup, and I had a similar problem for a while. The
>> HDFS sink roll defaults are the tricky part - they are all quite small,
>> since they assume a high data velocity, and unless each roll trigger is
>> explicitly set to 0 (off), it stays active.
>> So, your HDFS batch size parameter might be the problem. Also, I notice
>> you need to capitalize the "S" in the hdfs.roll*S*ize parameter -
>> camelCase got me on transactionCapacity once :-) I'm not sure if this is
>> copypasta from your config, but a misspelled name keeps the param from
>> being respected, so in your case the sink would fall back to the default
>> and roll at 1024 bytes - roughly 10 lines of text.
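For reference, the roll triggers being described can all be set explicitly; a sketch with illustrative values (the `a1`/`k1` names follow the config quoted at the end of the thread):

```properties
# HDFS sink roll triggers - setting one to 0 turns that trigger off.
# Defaults are small: rollInterval=30 (sec), rollSize=1024 (bytes), rollCount=10 (events)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.rollInterval = 300      # roll a file every 5 minutes...
a1.sinks.k1.hdfs.rollSize = 134217728    # ...or at ~128 MB, whichever comes first
a1.sinks.k1.hdfs.rollCount = 0           # never roll on event count
a1.sinks.k1.hdfs.batchSize = 1000        # events written per flush to HDFS
```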
>> One question about your config, though - I notice you have hdfs.fileType
>> set to DataStream for Avro, but you do not have a serializer of
>> avro_event declared. In what format are your files being put into HDFS -
>> as Avro-contained streams, or as aggregated text bodies with newline
>> delimiters? I ask because this setup has left us needing to unwrap Avro
>> event files in MapReduce, which is tricky - if you are getting aggregated
>> text, I have some reconfiguring to do.
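The format difference being asked about comes down to the sink's serializer; a hedged sketch of the two options (note that the serializer key sits outside the hdfs. prefix):

```properties
a1.sinks.k1.hdfs.fileType = DataStream
# Default: raw event bodies written as newline-delimited text
a1.sinks.k1.serializer = text
# Alternative: wrap events in Avro container files
# (these must then be unwrapped, e.g. in MapReduce, to read the bodies)
#a1.sinks.k1.serializer = avro_event
```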
>> Other things to look out for: make sure the HDFS file being written to
>> doesn't close mid-stream - I have not seen that recover gracefully, and I
>> am getting OOMEs in my testbed right now from something like that. Also,
>> make sure the transaction capacity in your channels is high enough all
>> the way through the flow; my original setup kept choking on a small
>> transaction capacity from the first channel to the Avro sink.
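A sketch of what "high enough through the flow" means for the two-agent setup in this thread (agent and component names here are illustrative: a1 = the spooldir box, a2 = the HDFS box):

```properties
# Hop 1: spooldir source -> memory channel -> avro sink
a1.channels.c1.transactionCapacity = 1000
a1.sinks.k1.batch-size = 1000            # avro sink batch (note: hyphenated key)

# Hop 2: avro source -> memory channel -> hdfs sink
a2.channels.c1.transactionCapacity = 1000
a2.sinks.k1.hdfs.batchSize = 1000        # hdfs sink batch
# Each batch size must be <= the transactionCapacity of its channel.
```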
>> Good luck!
>> *Devin Suiter*
>> Jr. Data Solutions Software Engineer
>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>> Google Voice: 412-256-8556 | www.rdx.com
>> On Thu, Oct 24, 2013 at 9:44 AM, Christopher Surage <[EMAIL PROTECTED]> wrote:
>>> Hello, I am having an issue increasing the size of the files that get
>>> written into my HDFS. I have tried playing with the rollCount attribute
>>> of the HDFS sink, but it seems to cap at 10 lines of text per file, with
>>> many small files written to the HDFS directory. One can see why I need
>>> to change this.
>>> I have 2 boxes running:
>>> 1) uses a spooldir source to check for new log files copied to a
>>> specific dir, then sends the events through a memory channel to an avro
>>> sink pointing at the other box, which hosts the HDFS.
>>> 2) uses an avro source and sends events to the hdfs sink.
>>> # Name the components of the agent
>>> a1.sources = r1
>>> a1.sinks = k1
>>> a1.channels = c1
>>> ###############Describe/configure the source#################