Paul Chavez 2012-12-27, 18:01
We recently committed https://issues.apache.org/jira/browse/FLUME-1702 to
trunk. It will be available in the next release of Flume. It should
help in the Pig case, though I'm not sure about Hive.
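
Until that release is out, one possible workaround is to filter the
directory listing yourself before handing paths to the downstream job.
Below is a minimal Python sketch of that idea, not anything Flume or Oozie
ships. The function name, the constants, and the example paths are
hypothetical. It assumes in-progress files end in ".tmp", as the docs
quoted below describe. It also skips names starting with "_" or ".",
which Hadoop's FileInputFormat treats as hidden.

```python
import posixpath

# Assumed markers; adjust them to match your sink configuration.
IN_USE_SUFFIX = ".tmp"        # suffix the docs say is added to open files
HIDDEN_PREFIXES = ("_", ".")  # names FileInputFormat treats as hidden


def completed_files(paths, in_use_suffix=IN_USE_SUFFIX,
                    hidden_prefixes=HIDDEN_PREFIXES):
    """Return only the paths that look like closed, visible files."""
    done = []
    for path in paths:
        name = posixpath.basename(path)
        if name.endswith(in_use_suffix):
            continue  # still being written by the agent
        if name.startswith(hidden_prefixes):
            continue  # hidden by the usual Hadoop convention
        done.append(path)
    return done
```

The input list could come from whatever produces your directory listing,
and the filtered result could then be passed to the Pig or Hive step
instead of the whole directory. This only narrows the window: a file can
still be renamed between the listing and the read.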
On Thursday, December 27, 2012, Paul Chavez wrote:
> This is kind of a generic HDFS question, but it does relate to Flume, so
> hopefully someone can provide feedback.
> I have a Flume configuration that sinks to HDFS using timestamp headers. I
> would like to set up a post-processor using Oozie to pull the data into
> Hive as it lands in HDFS, doing some cleaning and compression along the way.
> However, I am running into an issue: if I inadvertently read a .tmp
> file, the Flume agent that is writing to it stops sinking with an HDFS error.
> The Flume docs state "The file in use will have the name mangled to
> include ".tmp" at the end. Once the file is closed, this extension is
> removed. This allows excluding partially complete files in the directory."
> but I cannot figure out how to exclude files based on extension via either
> Pig or Hive.
> In general I should not need to exclude anything, as I could reasonably
> assume the directory is no longer being written to. But if Flume is delayed,
> or my initial app agent is late starting the data flow, the directory could
> still be receiving writes when the Oozie coordinator materializes a job.
> It seems like this should be easy, but I'm not having any luck searching
> for a solution. Any insight or advice is appreciated.
> Thank you,
> Paul Chavez
Paul Chavez 2012-12-27, 21:17