Avro >> mail # user >> Possible to include open .avro file in Map/Reduce job?


Re: Possible to include open .avro file in Map/Reduce job?
Thanks Doug.

In this case I could truncate the logs earlier, but then I have to go
back at some point and recombine the small files. For now, I can live
with moving the files daily.

I was unable to find a way to trap the "Invalid sync!" error
(org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid
sync! at
org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)).

Since my mapper extends AvroMapper, and the exception is thrown while
the records are being read rather than inside my own map code, I don't
know where to trap it. Another person suggested using low-level Avro
functions for this. Perhaps I need to write an Avro file validator of
some sort to be run before the Map/Reduce job? This seems nasty. But I
had another M/R job failure from this error overnight, and even finding
the offending file via the logs is quite a pain.
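
A pre-flight validator of the kind described above could be sketched
roughly as follows. This is only an illustration under stated
assumptions: it does not use the Avro library, and instead walks the
object container file framing by hand (magic bytes, metadata map,
16-byte sync marker, then blocks of count, byte size, data, and sync
marker), so it should be checked against the Avro specification before
being relied on. The class name AvroSyncCheck and the method isIntact
are hypothetical names, not part of Avro. It reads from a plain
InputStream; for HDFS one would pass in the stream obtained from the
Hadoop FileSystem instead of a local FileInputStream.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, not part of Avro: checks container-file block framing only.
class AvroSyncCheck {
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};
    private static final int SYNC_SIZE = 16;

    /** Reads one zig-zag encoded variable-length long; EOFException if the stream ends. */
    static long readLong(DataInputStream in) throws IOException {
        long raw = 0;
        int shift = 0;
        while (true) {
            int b = in.readUnsignedByte();
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
            if (shift > 63) throw new IOException("varint too long");
        }
        return (raw >>> 1) ^ -(raw & 1);
    }

    /** Skips exactly n bytes; fails on a negative length or a short stream. */
    static void skipFully(DataInputStream in, long n) throws IOException {
        if (n < 0) throw new IOException("negative length");
        byte[] buf = new byte[8192];
        while (n > 0) {
            int want = (int) Math.min(buf.length, n);
            in.readFully(buf, 0, want);
            n -= want;
        }
    }

    /** True if the header parses and every block ends with the header's sync marker. */
    static boolean isIntact(InputStream raw) {
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(raw))) {
            byte[] magic = new byte[MAGIC.length];
            in.readFully(magic);
            if (!Arrays.equals(magic, MAGIC)) return false;

            // Skip the metadata map: blocks of key/value pairs, ended by a zero count.
            long count;
            while ((count = readLong(in)) != 0) {
                if (count < 0) {          // negative count is followed by a byte size
                    count = -count;
                    readLong(in);
                }
                for (long i = 0; i < count; i++) {
                    skipFully(in, readLong(in)); // key bytes
                    skipFully(in, readLong(in)); // value bytes
                }
            }

            byte[] sync = new byte[SYNC_SIZE];
            in.readFully(sync);
            byte[] seen = new byte[SYNC_SIZE];
            while (true) {
                in.mark(1);
                if (in.read() < 0) return true;  // clean end on a block boundary
                in.reset();
                readLong(in);                    // number of objects in the block
                skipFully(in, readLong(in));     // serialized block data
                in.readFully(seen);
                if (!Arrays.equals(seen, sync)) return false;
            }
        } catch (IOException e) {                // includes EOFException on truncation
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        for (String path : args) {
            try (InputStream in = new FileInputStream(path)) {
                System.out.println((isIntact(in) ? "OK      " : "BAD     ") + path);
            }
        }
    }
}
```

Run over the input directory before the job, something like this
would at least name the offending file up front instead of leaving it
buried in the task logs. With the Avro jar on the classpath, the more
usual route would be to iterate a DataFileReader over each file and
catch AvroRuntimeException, which also exercises record decoding
rather than just the block framing.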

Any suggestions?

-Terry

On 01/17/2013 04:36 PM, Doug Cutting wrote:
> Folks often move files once they're closed into a directory where
> they're processed to avoid issues with partially written data.  Maybe
> you could start a new log file every hour rather than every day?
>
> We could add an ignoreTruncation or ignoreCorruption option to
> DataFileReader that attempts to read files that might be truncated or
> corrupted.
>
> And yes, you can probably just catch those exceptions and exit the map
> at that point.
>
> Doug
>
> On Mon, Jan 14, 2013 at 11:22 AM, Terry Healy <[EMAIL PROTECTED]> wrote:
>> I have a log collection application that writes .avro files within HDFS.
>> Ideally I would like to include the current day's (open for append) file
>> as one of the input files for a periodic M/R job.
>>
>> I tried this but the Map job exited in error with the dreaded "Invalid
>> Sync!" IOException. I guess I should have expected this, but is there a
>> reasonable way around it? Can I catch the exception and just exit the
>> map at that point?
>>
>> All suggestions appreciated.
>>
>> -Terry