Flume >> mail # user >> HDFS Sink keeps .tmp files and closes with exception


Re: HDFS Sink keeps .tmp files and closes with exception
Thanks, Bhaskar, for the details. Regarding this issue, agreed: HADOOP-4829 is needed to fix it. I was just remarking on this configuration, which seemed to never roll the files. Either way, yes, HADOOP-4829 is required; otherwise ".tmp" files are left around. These files should be closed by the name node automatically later, but the ".tmp" extension will not be removed - that will need to be removed manually.
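For reference, here is a minimal sketch of that manual cleanup, using the standard Hadoop FileSystem client API. The directory below is just an example - point it at whatever your sink's hdfs.path resolves to, and run it only after the name node has actually closed the files:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TmpCleanup {
  public static void main(String[] args) throws Exception {
    // Example path only: one dated directory written by the HDFS sink.
    Path dir = new Path("hdfs://localhost:54310/flume/agg1/12-10-19");
    FileSystem fs = FileSystem.get(URI.create(dir.toString()), new Configuration());
    FileStatus[] entries = fs.listStatus(dir);
    if (entries != null) {
      for (FileStatus entry : entries) {
        String name = entry.getPath().getName();
        if (!entry.isDir() && name.endsWith(".tmp")) {
          // Strip the in-use suffix, e.g. agg2.1350685000000.tmp -> agg2.1350685000000
          Path closed = new Path(dir, name.substring(0, name.length() - 4));
          fs.rename(entry.getPath(), closed);
        }
      }
    }
    fs.close();
  }
}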

Nishant - In this case, I agree with Bhaskar's advice. Also, please make sure your files get closed via one of the roll criteria in the Flume configuration - otherwise your files will be closed and renamed only when Flume is shut down.
Thanks,
Hari

--
Hari Shreedharan
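
A minimal sketch of the roll criteria referred to above, using the hdfs sink property names from the Flume 1.x user guide (the threshold values are illustrative assumptions, not recommendations):

# Roll (and close) the current file when any one non-zero threshold is hit;
# a value of 0 disables that particular criterion.
agent1.sinks.fileSink1.hdfs.rollInterval = 300
agent1.sinks.fileSink1.hdfs.rollSize = 67108864
agent1.sinks.fileSink1.hdfs.rollCount = 1000

With any one of these set, the sink closes and renames the current ".tmp" file on its own instead of waiting for a Flume shutdown.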
On Friday, October 19, 2012 at 4:42 PM, Bhaskar V. Karambelkar wrote:

> Hari,
> There are two sides to this problem.
> The Flume-side issue was fixed in 1.2.0 by FLUME-1163.
> But the solution in FLUME-1163 requires that the Hadoop libraries
> used with Flume support the 'fs.automatic.shutdown' flag.
> Unfortunately, the hadoop 1.x branch does not have that feature pulled in:
> https://issues.apache.org/jira/browse/HADOOP-4829
>
> So if you use Flume with Hadoop libs from the 1.x series, you'll still get
> leftover .tmp files, and the exception mentioned.
>
> CDH3u5 does indeed pull in HADOOP-4829, so Flume + CDH3u5 works
> correctly and the files get properly closed at Flume shutdown.
>
> Do note that this is a Hadoop client-side problem, so your Hadoop
> cluster can run a 1.x version while you connect to it with the Flume +
> CDH3u5 combo; that should work as long as the CDH3u5 libs are able to
> talk to a Hadoop 1.x cluster. But Hadoop, even on the client side,
> is usually very picky about versions.
>
> Here's a mail thread, where I discussed this before.
> http://mail-archives.apache.org/mod_mbox/flume-user/201208.mbox/%3CCAJLbxRad_DUuz3NrPBQo4gtqAdq%2BCLBUvD2_q7-0%2B-L8Afax_A%40mail.gmail.com%3E
>
> I can say with certainty that if you use an HDFS 1.x client side, the
> .tmp files will not be renamed at shutdown, and you will get the
> exceptions mentioned in the original mail.
>
> thanks
> Bhaskar
>
>
>
> On Fri, Oct 19, 2012 at 4:37 PM, Hari Shreedharan
> <[EMAIL PROTECTED]> wrote:
> > Nishant,
> >
> > a: If CDH4 was working for you, you could use it with hadoop-2.x, or use
> > CDH3u5 with hadoop-1.x.
> > b: It looks like your rollSize/rollCount/rollInterval are all 0. Can you
> > increase rollCount to, say, 1000 or so? As the docs here note:
> > http://flume.apache.org/FlumeUserGuide.html#hdfs-sink, if you set the roll*
> > configuration params to 0, the files are never rolled. If files are not
> > rolled, they are not closed, and HDFS will show them as 0-sized files. Once
> > the roll happens, the HDFS GUI will show you the real file size. You can use
> > any one of the three roll* config parameters to roll the files.
> >
> >
> >
> > Thanks,
> > Hari
> >
> >
> > --
> > Hari Shreedharan
> >
> > On Friday, October 19, 2012 at 1:29 PM, Nishant Neeraj wrote:
> >
> > Thanks for the responses.
> >
> > a: Got rid of all the CDH stuff (basically, started on a fresh AWS
> > instance).
> > b: Installed from binary files.
> >
> > It DID NOT work. Here is what I observed:
> > flume-ng version: Flume 1.2.0
> > Hadoop: 1.0.4
> >
> > This is my configuration:
> >
> > agent1.sinks.fileSink1.type = hdfs
> > agent1.sinks.fileSink1.channel = memChannel1
> > agent1.sinks.fileSink1.hdfs.path = hdfs://localhost:54310/flume/agg1/%y-%m-%d
> > agent1.sinks.fileSink1.hdfs.filePrefix = agg2
> > agent1.sinks.fileSink1.hdfs.rollInterval = 0
> > agent1.sinks.fileSink1.hdfs.rollSize = 0
> > agent1.sinks.fileSink1.hdfs.rollCount = 0
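> > # NOTE: rollInterval, rollSize and rollCount are all 0 here, which disables
> > # rolling entirely, so the sink never closes files on its own (Hari's point b above).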
> > agent1.sinks.fileSink1.hdfs.fileType = DataStream
> > agent1.sinks.fileSink1.hdfs.writeFormat = Text
> > #agent1.sinks.fileSink1.hdfs.batchSize = 10
> >
> > #1: startup error
> > -----------------------------------
> > With the new installation, I see this exception on starting Flume (it
> > does not stop me from adding data to HDFS)