Ensure your Flume machine has a local Hadoop config file that matches your cluster config. I.e. deploy your Hadoop configs there too. That should solve the issue.
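To illustrate why the local config matters, here is a rough Python sketch of the behaviour discussed in this thread. It is not Flume's actual code. It assumes that an HDFS client with no cluster config on its classpath falls back to the stock default of 3 for dfs.replication, and that the sink keeps rolling to a new file while the write pipeline has fewer replicas than that target, up to some cap. The function names and the max_rotations cap are hypothetical; 15 is chosen only to mirror the ~15 small files reported.

```python
HDFS_DEFAULT_REPLICATION = 3  # stock HDFS default when dfs.replication is unset


def desired_replication(client_conf):
    """Replication the client asks for, based on the config it can see locally."""
    return int(client_conf.get("dfs.replication", HDFS_DEFAULT_REPLICATION))


def is_under_replicated(pipeline_replicas, desired):
    """True when the write pipeline holds fewer replicas than the client wants."""
    return pipeline_replicas < desired


def simulate_rotations(client_conf, pipeline_replicas, max_rotations):
    """Count how many small files get rolled before the sink gives up.

    max_rotations is a hypothetical cap used for illustration,
    not a real Flume setting.
    """
    desired = desired_replication(client_conf)
    rotations = 0
    while is_under_replicated(pipeline_replicas, desired) and rotations < max_rotations:
        rotations += 1  # close the current file and open a new one
    return rotations


# No Hadoop config on the Flume machine: client wants 3, cluster provides 1.
missing_conf = simulate_rotations({}, pipeline_replicas=1, max_rotations=15)
# Cluster config deployed locally: client wants 1, cluster provides 1.
matching_conf = simulate_rotations({"dfs.replication": "1"}, pipeline_replicas=1, max_rotations=15)
print(missing_conf, matching_conf)
```

With the matching config in place the under-replication check never fires, so the sink rolls only on the size and interval settings you configured.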
On Aug 22, 2013, at 3:47 AM, Andrei <[EMAIL PROTECTED]> wrote:
> Hi Mike,
> it makes sense - the replication factor really is lower than recommended: we test Hadoop on 2 large machines, so replication is set to 1, but HDFS seems to ignore the config and still replicates blocks 3 times. I was confused about small files being generated before the normal large files, but if Flume has some counter for replication attempts, that explains it.
> On Thu, Aug 22, 2013 at 1:13 PM, Mike Percy <[EMAIL PROTECTED]> wrote:
>> Are you sure your HDFS cluster is configured properly? How big is the cluster?
>> Flume is complaining that your HDFS blocks are not replicated enough based on your configured replication factor, and it tries to get a sufficiently replicated pipeline by closing the current file and opening a new one to write to. Finally it gives up.
>> That code is still there on trunk...
>> On Aug 20, 2013, at 3:11 AM, Andrei <[EMAIL PROTECTED]> wrote:
>>> I have a Flume agent with a spool directory as the source and an HDFS sink. I have configured the sink to roll files only when they reach some (quite large) size (see full config below). However, when I restart Flume, it first generates ~15 small files (~500 bytes) and only after that starts writing a large file. In the Flume logs at the time the small files are generated I see the message "Block Under-replication detected. Rotating file".
>>> From the source code I've figured out several things:
>>> 1. This message is specific to Flume 1.3 and doesn't exist in the latest version.
>>> 2. It comes from the BucketWriter.shouldRotate() method, which in turn calls HDFSWriter.isUnderReplicated(); if that returns true, the above message is logged and the file is rotated.
>>> My questions are: why does this happen, and how do I fix it?
>>> Flume 1.3 CDH 4.3
>>> agent.sources = my-src
>>> agent.channels = my-ch
>>> agent.sinks = my-sink
>>> agent.sources.my-src.type = spooldir
>>> agent.sources.my-src.spoolDir = /flume/data
>>> agent.sources.my-src.channels = my-ch
>>> agent.sources.my-src.deletePolicy = immediate
>>> agent.sources.my-src.interceptors = tstamp-int
>>> agent.sources.my-src.interceptors.tstamp-int.type = timestamp
>>> agent.channels.my-ch.type = file
>>> agent.channels.my-ch.checkpointDir = /flume/checkpoint
>>> agent.channels.my-ch.dataDirs = /flume/channel-data
>>> agent.sinks.my-sink.type = hdfs
>>> agent.sinks.my-sink.hdfs.path = hdfs://my-hdfs:8020/logs
>>> agent.sinks.my-sink.hdfs.filePrefix = Log
>>> agent.sinks.my-sink.hdfs.batchSize = 10
>>> agent.sinks.my-sink.hdfs.rollInterval = 3600
>>> agent.sinks.my-sink.hdfs.rollCount = 0
>>> agent.sinks.my-sink.hdfs.rollSize = 134217728
>>> agent.sinks.my-sink.hdfs.fileType = DataStream
>>> agent.sinks.my-sink.channel = my-ch