Avro, mail # user - Possible to include open .avro file in Map/Reduce job?


Terry Healy 2013-01-14, 19:22
Doug Cutting 2013-01-17, 21:36
Re: Possible to include open .avro file in Map/Reduce job?
Terry Healy 2013-01-18, 14:51
Thanks Doug.

In this case I could truncate the logs earlier, but then I have to go
back at some point and recombine the small files. For now, I can live
with moving the files daily.

I was unable to find a way to trap the "Invalid sync!" exception
(org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid
sync! at
org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)).

Since my mapper extends AvroMapper, and map throws exceptions, I don't
know where to trap it. Another person suggested using low-level Avro
functions for this. Perhaps I need to write an Avro file validator of
some sort to be run before the Map/Reduce job? This seems nasty. But I
had another M/R job failure for this error overnight, and even finding
the offending file via the logs is quite a pain.

Any suggestions?

-Terry
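
For reference, the "move files once they're closed" approach could look
roughly like the sketch below. It is only an illustration on a local
filesystem using java.nio.file; on HDFS the analogous step would go
through Hadoop's FileSystem API instead, which is not shown here. The
class name, method name, directory layout, and the idea of passing in
the name of the file still being written are all assumptions made for
the example, not something from the thread.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: hands closed log files over to the job's input directory. */
final class ClosedFileMover {

    /**
     * Moves every *.avro file in openDir, except the one still being written,
     * into readyDir. Returns the number of files moved.
     */
    static int moveClosedFiles(Path openDir, Path readyDir, String activeFileName)
            throws IOException {
        Files.createDirectories(readyDir);
        int moved = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(openDir, "*.avro")) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                if (name.equals(activeFileName)) {
                    continue; // still open for append, leave it alone
                }
                // Atomic move so a reader never sees a half-moved file.
                // Assumes both directories are on the same filesystem.
                Files.move(file, readyDir.resolve(name), StandardCopyOption.ATOMIC_MOVE);
                moved++;
            }
        }
        return moved;
    }
}
```

The M/R job would then take only the "ready" directory as input, so it
never sees the file that is open for append.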

On 01/17/2013 04:36 PM, Doug Cutting wrote:
> Folks often move files once they're closed into a directory where
> they're processed to avoid issues with partially written data.  Maybe
> you could start a new log file every hour rather than every day?
>
> We could add an ignoreTruncation or ignoreCorruption option to
> DataFileReader that attempts to read files that might be truncated or
> corrupted.
>
> And yes, you can probably just catch those exceptions and exit the map
> at that point.
>
> Doug
>
> On Mon, Jan 14, 2013 at 11:22 AM, Terry Healy <[EMAIL PROTECTED]> wrote:
>> I have a log collection application that writes .avro files within HDFS.
>> Ideally I would like to include the current day's (open for append) file
>> as one of the input files for a periodic M/R job.
>>
>> I tried this but the Map job exited in error with the dreaded "Invalid
>> Sync!" IOException. I guess I should have expected this, but is there a
>> reasonable way around it? Can I catch the exception and just exit the
>> map at that point?
>>
>> All suggestions appreciated.
>>
>> -Terry
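
The "catch the exception and stop at that point" idea from this thread
can be sketched without touching Avro internals. The stack trace above
shows the error surfacing as a RuntimeException from
DataFileStream.hasNext(), and DataFileStream is an Iterator, so a small
wrapper around any Iterator is enough to illustrate the pattern. This
is a minimal sketch, not an Avro feature: StopOnErrorIterator and
failure() are hypothetical names, and catching every RuntimeException
is broader than catching just AvroRuntimeException. It applies where
you open and iterate the file yourself (for example in a pre-flight
validator); inside an AvroMapper the framework owns the reader.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/**
 * Hypothetical wrapper: treats a RuntimeException thrown by the wrapped
 * iterator as end of input and remembers it for later inspection.
 */
final class StopOnErrorIterator<T> implements Iterator<T> {
    private final Iterator<T> source;
    private T buffered;
    private boolean hasBuffered;
    private boolean finished;
    private RuntimeException failure; // stays null if the source ended cleanly

    StopOnErrorIterator(Iterator<T> source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        if (hasBuffered) {
            return true;
        }
        if (finished) {
            return false;
        }
        try {
            // Fetch eagerly so a failure in either call ends iteration here.
            if (source.hasNext()) {
                buffered = source.next();
                hasBuffered = true;
                return true;
            }
        } catch (RuntimeException e) {
            failure = e; // e.g. a truncated or still-open file
        }
        finished = true;
        return false;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        hasBuffered = false;
        T value = buffered;
        buffered = null;
        return value;
    }

    /** The exception that cut iteration short, or null if there was none. */
    RuntimeException failure() {
        return failure;
    }
}
```

A caller would loop with hasNext()/next() as usual and afterwards check
failure() to log which file was cut short, which also addresses the
problem of finding the offending file in the job logs.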