Possible to include open .avro file in Map/Reduce job?
Terry Healy 2013-01-14, 19:22
I have a log collection application that writes .avro files within HDFS. Ideally I would like to include the current day's (open for append) file as one of the input files for a periodic M/R job.
I tried this, but the map task failed with the dreaded "Invalid Sync!" IOException. I guess I should have expected this, but is there a reasonable way around it? Can I catch the exception and just exit the map at that point?
All suggestions appreciated.
-Terry
Re: Possible to include open .avro file in Map/Reduce job?
Doug Cutting 2013-01-17, 21:36
Folks often move files once they're closed into a directory where they're processed to avoid issues with partially written data. Maybe you could start a new log file every hour rather than every day?
We could add an ignoreTruncation or ignoreCorruption option to DataFileReader that attempts to read files that might be truncated or corrupted.
And yes, you can probably just catch those exceptions and exit the map at that point.
Doug
On Mon, Jan 14, 2013 at 11:22 AM, Terry Healy <[EMAIL PROTECTED]> wrote:
> I have a log collection application that writes .avro files within HDFS.
> Ideally I would like to include the current days (open for append) file
> as one of the input files for a periodic M/R job.
>
> I tried this but the Map job exited in error with the dreaded "Invalid
> Sync!" IOException. I guess I should have expected this, but is there a
> reasonable way around it? Can I catch the exception and just exit the
> map at that point?
>
> All suggestions appreciated.
>
> -Terry
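A minimal sketch of the "catch the exception and stop" idea from this reply, for code that drives the read loop itself. It relies on two facts about Avro: DataFileStream implements java.util.Iterator, and AvroRuntimeException is a RuntimeException. Everything else is an assumption made for illustration: the class names TolerantIterator and TolerantIteratorDemo are hypothetical, and the truncated input is simulated with a plain in-memory iterator so the example needs no Avro dependency. It would not apply directly to an AvroMapper, where the framework owns the reader.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Wraps an iterator and treats a RuntimeException from the source as end of input. */
final class TolerantIterator<T> implements Iterator<T> {
    private final Iterator<T> source;
    private T buffered;                 // record fetched ahead by hasNext()
    private boolean hasBuffered;
    private RuntimeException failure;   // what the source threw, if anything

    TolerantIterator(Iterator<T> source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        if (hasBuffered) return true;
        if (failure != null) return false;
        try {
            // Fetch ahead so that a failure in either hasNext() or next()
            // of the source is caught in one place.
            if (!source.hasNext()) return false;
            buffered = source.next();
            hasBuffered = true;
            return true;
        } catch (RuntimeException e) {
            failure = e;                // remember it so the caller can log it
            return false;
        }
    }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        T result = buffered;
        buffered = null;
        hasBuffered = false;
        return result;
    }

    /** Returns the exception that ended iteration early, or null if none did. */
    RuntimeException failure() {
        return failure;
    }
}

final class TolerantIteratorDemo {
    /** Simulated stand-in for a reader over a file that is still being written. */
    static Iterator<String> truncatedSource() {
        final Iterator<String> good = Arrays.asList("rec-1", "rec-2").iterator();
        return new Iterator<String>() {
            @Override
            public boolean hasNext() {
                if (good.hasNext()) return true;
                // Mimics the wrapped IOException reported in this thread.
                throw new RuntimeException(new IOException("Invalid sync!"));
            }

            @Override
            public String next() {
                return good.next();
            }
        };
    }

    public static void main(String[] args) {
        TolerantIterator<String> records = new TolerantIterator<>(truncatedSource());
        while (records.hasNext()) {
            System.out.println(records.next());
        }
        if (records.failure() != null) {
            System.out.println("Stopped early: " + records.failure().getMessage());
        }
    }
}
```

Catching RuntimeException is deliberately broad here; real code would probably narrow it to AvroRuntimeException and log the file name alongside the failure, since silently dropping the tail of a file can hide genuine corruption.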
Re: Possible to include open .avro file in Map/Reduce job?
Terry Healy 2013-01-18, 14:51
Thanks Doug.
In this case I could truncate the logs earlier, but then I have to go back at some point and recombine the small files. For now, I can live with moving the files daily.
I was unable to find a way to trap the "Invalid sync!" exception:
org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
    at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
Since my mapper extends AvroMapper, and map throws exceptions, I don't know where to trap it. Another person suggested using low-level Avro functions for this. Perhaps I need to write an Avro file validator of some sort to be run before the Map/Reduce job? This seems nasty. But I had another M/R job fail with this error overnight, and even finding the offending file via the logs is quite a pain.
Any suggestions?
-Terry
On 01/17/2013 04:36 PM, Doug Cutting wrote:
> Folks often move files once they're closed into a directory where
> they're processed to avoid issues with partially written data. Maybe
> you could start a new log file every hour rather than every day?
>
> We could add an ignoreTruncation or ignoreCorruption option to
> DataFileReader that attempts to read files that might be truncated or
> corrupted.
>
> And yes, you can probably just catch those exceptions and exit the map
> at that point.
>
> Doug
>
> On Mon, Jan 14, 2013 at 11:22 AM, Terry Healy <[EMAIL PROTECTED]> wrote:
>> I have a log collection application that writes .avro files within HDFS.
>> Ideally I would like to include the current days (open for append) file
>> as one of the input files for a periodic M/R job.
>>
>> I tried this but the Map job exited in error with the dreaded "Invalid
>> Sync!" IOException. I guess I should have expected this, but is there a
>> reasonable way around it? Can I catch the exception and just exit the
>> map at that point?
>>
>> All suggestions appreciated.
>>
>> -Terry
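One possible shape for the pre-job validator mentioned in this message, sketched under assumptions: each candidate input is read to the end once, and the result records the file name, how many records were read, and the exception if one was thrown, so the offending file is named directly rather than dug out of task logs. The names PreflightCheck, Report, and PreflightCheckDemo are hypothetical, and the inputs are simulated with in-memory iterators. In real use each supplier would presumably open an Avro DataFileStream over the HDFS file and close it afterwards, which this sketch leaves out.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

final class PreflightCheck {
    /** Outcome of scanning one input. */
    static final class Report {
        final String name;
        final long records;             // records read before the end or the failure
        final RuntimeException error;   // null when the whole input was readable

        Report(String name, long records, RuntimeException error) {
            this.name = name;
            this.records = records;
            this.error = error;
        }

        boolean ok() {
            return error == null;
        }
    }

    /** Reads one input to the end and reports where, if anywhere, it failed. */
    static Report scan(String name, Supplier<Iterator<?>> opener) {
        long count = 0;
        try {
            Iterator<?> it = opener.get();
            while (it.hasNext()) {
                it.next();
                count++;
            }
            return new Report(name, count, null);
        } catch (RuntimeException e) {
            return new Report(name, count, e);
        }
    }

    /** Scans every input in order and returns one report per input. */
    static List<Report> scanAll(Map<String, Supplier<Iterator<?>>> inputs) {
        List<Report> reports = new ArrayList<>();
        for (Map.Entry<String, Supplier<Iterator<?>>> entry : inputs.entrySet()) {
            reports.add(scan(entry.getKey(), entry.getValue()));
        }
        return reports;
    }
}

final class PreflightCheckDemo {
    /** Simulated stand-in for a file that is still open for append. */
    static Iterator<?> truncated() {
        final Iterator<String> good = Arrays.asList("rec-1", "rec-2").iterator();
        return new Iterator<String>() {
            @Override
            public boolean hasNext() {
                if (good.hasNext()) return true;
                throw new RuntimeException(new IOException("Invalid sync!"));
            }

            @Override
            public String next() {
                return good.next();
            }
        };
    }

    public static void main(String[] args) {
        Map<String, Supplier<Iterator<?>>> inputs = new LinkedHashMap<>();
        inputs.put("closed.avro", () -> Arrays.asList(1, 2, 3).iterator());
        inputs.put("open.avro", PreflightCheckDemo::truncated);
        for (PreflightCheck.Report r : PreflightCheck.scanAll(inputs)) {
            System.out.println(r.name + ": " + r.records + " records, "
                    + (r.ok() ? "readable" : "failed: " + r.error.getMessage()));
        }
    }
}
```

A full scan doubles the read cost of every input, so it may only be worth running on files that are likely to be open, such as the current day's log.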