How to exclude .tmp files?
Paul Chavez 2012-12-27, 18:01
This is kind of a generic HDFS question, but it does relate to flume, so hopefully someone can provide feedback.
I have a flume configuration that sinks to HDFS using timestamp headers. I would like to set up a post-processor using Oozie to pull the data into Hive as it lands in HDFS, doing some cleaning and compression along the way.
However, I am running into an issue: if I inadvertently read a .tmp file, the flume agent that is writing to it stops sinking with an HDFS error.
The flume docs state "The file in use will have the name mangled to include ".tmp" at the end. Once the file is closed, this extension is removed. This allows excluding partially complete files in the directory." but I cannot figure out how to exclude files based on extension via either Pig or Hive.
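To make the rule concrete: the exclusion the docs describe amounts to a suffix check on the file name. Below is a minimal Python sketch, assuming the directory listing is already available as plain path strings. The function name and the sample paths are hypothetical, it does not talk to HDFS, and it does not show how to express the same filter in Pig or Hive.

```python
import posixpath

# Suffix that, per the flume docs quoted above, marks a file still being written.
IN_USE_SUFFIX = ".tmp"


def completed_files(paths, suffix=IN_USE_SUFFIX):
    """Return only the paths whose file name does not end with the in-use suffix."""
    return [p for p in paths if not posixpath.basename(p).endswith(suffix)]


# Hypothetical listing of one timestamped directory.
listing = [
    "/flume/events/2012-12-27/FlumeData.1356631200000",
    "/flume/events/2012-12-27/FlumeData.1356631200001.tmp",
]

print(completed_files(listing))
```

A check like this only decides which files are safe to hand to the downstream job; the open question is still where to apply it in the Oozie, Pig, or Hive step.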
In general I should not need to exclude anything, since I can reasonably assume the directory is no longer being written to. But if flume is delayed, or my initial app agent is late starting the data flow, the directory could still be receiving writes when the Oozie coordinator materializes a job.
It seems like this should be easy, but I'm not having any luck searching for a solution. Any insight or advice is appreciated.