Paul Chavez 2012-12-27, 18:01
Hari Shreedharan 2012-12-27, 19:18
Thanks for the update, Hari.
For now I'll just run manual workflow jobs when I'm sure the directories are done being written to. That will at least let me develop the post-processing while waiting for this feature.
From: Hari Shreedharan [mailto:[EMAIL PROTECTED]]
Sent: Thursday, December 27, 2012 11:18 AM
To: [EMAIL PROTECTED]
Subject: Re: How to exclude .tmp files?
We recently committed https://issues.apache.org/jira/browse/FLUME-1702 to trunk. It will be available in the next release of Flume. This should help in the Pig case; I'm not sure about Hive, though.
On Thursday, December 27, 2012, Paul Chavez wrote:
This is kind of a generic HDFS question, but it does relate to Flume, so hopefully someone can provide feedback.
I have a Flume configuration that sinks to HDFS using timestamp headers. I would like to set up a post-processor using Oozie to pull the data into Hive as it lands in HDFS, doing some cleaning and compression along the way.
However, I am running into an issue: if I inadvertently read a .tmp file, the Flume agent that is writing to it stops sinking with an HDFS error.
The flume docs state "The file in use will have the name mangled to include ".tmp" at the end. Once the file is closed, this extension is removed. This allows excluding partially complete files in the directory." but I cannot figure out how to exclude files based on extension via either Pig or Hive.
In general I should not need to exclude anything, since I can reasonably assume the directory is done being written to. However, if Flume is delayed, or my initial app agent is late starting the data flow, the directory could still be receiving writes when the Oozie coordinator materializes a job.
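To make the two checks described above concrete, here is a minimal sketch in plain Python. It is not a Pig, Hive, or Oozie feature. It assumes the job already has a listing of the file paths in the target directory (how that listing is obtained is left open), and the function names `split_by_state` and `directory_is_settled` and the example paths are hypothetical. The only detail taken from the Flume docs is the ".tmp" suffix on in-use files.

```python
from pathlib import PurePosixPath

# Suffix that the Flume docs quoted above say is appended to the file in use.
IN_USE_SUFFIX = ".tmp"


def split_by_state(paths, in_use_suffix=IN_USE_SUFFIX):
    """Split a directory listing into (completed, in_use) path lists.

    Only the file name is inspected; no file is opened or read.
    """
    completed, in_use = [], []
    for path in paths:
        name = PurePosixPath(path).name
        if name.endswith(in_use_suffix):
            in_use.append(path)
        else:
            completed.append(path)
    return completed, in_use


def directory_is_settled(paths, in_use_suffix=IN_USE_SUFFIX):
    """Return True when the listing contains no in-use files."""
    _, in_use = split_by_state(paths, in_use_suffix)
    return not in_use


if __name__ == "__main__":
    # Hypothetical listing, for illustration only.
    listing = [
        "/flume/events/2012-12-27/FlumeData.1356631200000",
        "/flume/events/2012-12-27/FlumeData.1356631200001.tmp",
    ]
    completed, in_use = split_by_state(listing)
    print("safe to read:", completed)
    print("still open:", in_use)
    print("directory settled:", directory_is_settled(listing))
```

A job could pass only the `completed` paths downstream, or skip the run entirely while `directory_is_settled` is false. The listing is a snapshot, so a new in-use file can still appear after it is taken; this narrows the window rather than closing it.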
It seems like this should be easy, but I'm not having any luck searching for a solution. Any insight or advice is appreciated.