Thanks Bhaskar for the details. Regarding this issue, I agree that HADOOP-4829 is needed to fix it. I was just remarking on this configuration, which appeared to never roll the files. Either way, HADOOP-4829 is required, or ".tmp" files are left behind. These files should be closed automatically by the name node later, but the ".tmp" extension will not be removed; that has to be done manually.
Nishant - In this case, I agree with Bhaskar's advice. Also, please make sure your files get closed using one of the roll criteria in the Flume configuration; otherwise your file will be closed and renamed only when Flume is shut down.
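As a rough sketch of what I mean (the sink name is taken from your config; the values are only illustrative assumptions, not tuned recommendations), setting any one roll* parameter to a non-zero value is enough to get files rolled and closed:
# roll after 300 seconds or 1000 events, whichever comes first
agent1.sinks.fileSink1.hdfs.rollInterval = 300
agent1.sinks.fileSink1.hdfs.rollCount = 1000
# 0 disables size-based rolling
agent1.sinks.fileSink1.hdfs.rollSize = 0
Here rollInterval is in seconds, rollCount is in events and rollSize is in bytes. Pick whichever criterion matches how you want the files split.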
On Friday, October 19, 2012 at 4:42 PM, Bhaskar V. Karambelkar wrote:
> There are 2 sides to this problem,
> The flume side issue was fixed in 1.2.0, FLUME-1163.
> But the solution in FLUME-1163 requires that the Hadoop libraries
> used with Flume support the 'fs.automatic.close' flag.
> Unfortunately, the Hadoop 1.x branch does not have that feature pulled in.
> So if you use Flume with Hadoop libs from the 1.x series, you'll still get
> leftover .tmp files and the exception mentioned.
> CDH3u5 does indeed pull in HADOOP-4829, so Flume + CDH3u5 works
> correctly and the files get properly closed at Flume shutdown.
> Do note that this is a Hadoop client-side problem, so on your Hadoop
> cluster you can use a 1.x version and connect to it using the Flume +
> CDH3u5 combo, and that should work as long as the CDH3u5 libs are able to
> talk to the Hadoop 1.x cluster. But usually Hadoop, even on the client side,
> is very picky about versions.
> Here's a mail thread, where I discussed this before.
> I can say with certainty, that if you use HDFS 1.x client side, the
> .tmp files will not be renamed at shutdown, and you will get
> exceptions as mentioned in the original mail.
> On Fri, Oct 19, 2012 at 4:37 PM, Hari Shreedharan
> <[EMAIL PROTECTED]> wrote:
> > Nishant,
> > a: if CDH4 was working for you, you could use it with hadoop-2.x or CDH3u5
> > with hadoop-1.x.
> > b: Looks like your rollSize/rollCount/rollInterval are all 0. Can you
> > increase rollCount to say 1000 or so? If you see here:
> > http://flume.apache.org/FlumeUserGuide.html#hdfs-sink, if you set the roll*
> > configuration params to 0, they would never roll the files. If files are not
> > rolled, they are not closed, and HDFS will show those as 0-sized files. Once
> > the roll happens, HDFS GUI will show you the real file size. You can use any
> > one of the three roll* config parameters to roll the files.
> > Thanks,
> > Hari
> > --
> > Hari Shreedharan
> > On Friday, October 19, 2012 at 1:29 PM, Nishant Neeraj wrote:
> > Thanks for the responses.
> > a: Got rid of all the CDH stuff. (basically, started on a fresh AWS
> > instance)
> > b: Installed from binary files.
> > It DID NOT work. Here is what I observed:
> > flume-ng version: Flume 1.2.0
> > Hadoop: 1.0.4
> > This is what my configuration is:
> > agent1.sinks.fileSink1.type = hdfs
> > agent1.sinks.fileSink1.channel = memChannel1
> > agent1.sinks.fileSink1.hdfs.path = hdfs://localhost:54310/flume/agg1/%y-%m-%d
> > agent1.sinks.fileSink1.hdfs.filePrefix = agg2
> > agent1.sinks.fileSink1.hdfs.rollInterval = 0
> > agent1.sinks.fileSink1.hdfs.rollSize = 0
> > agent1.sinks.fileSink1.hdfs.rollCount = 0
> > agent1.sinks.fileSink1.hdfs.fileType = DataStream
> > agent1.sinks.fileSink1.hdfs.writeFormat = Text
> > #agent1.sinks.fileSink1.hdfs.batchSize = 10
> > #1: startup error
> > -----------------------------------
> > With the new installation, I started seeing this exception when Flume starts (it
> > does not stop me from adding data to HDFS)