The best would be to get hold of a Flume developer. I am not entirely
sure of all the differences between sync/flush/hsync/hflush across the
different Hadoop versions. It might be that you are only flushing on the
client side. And even if the strategy were otherwise clean, creation+flush
is unlikely to be an atomic operation.
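To make the flush-vs-sync distinction concrete: HDFS's hflush() pushes the client buffers out so new readers can see the data, while hsync() additionally asks for an fsync to stable storage. Here is a rough local-filesystem analogy using only the JDK (not the Hadoop API itself, so treat it as a sketch of the semantics, not of HDFS internals):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FlushVsSync {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("flush-demo", ".dat");
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            ch.write(ByteBuffer.wrap("record-1\n".getBytes(StandardCharsets.UTF_8)));
            // A plain write() only hands the bytes to the OS page cache:
            // other processes can already read them, but a crash can lose
            // them. This is roughly the hflush() level of guarantee.
            // force(true) is the fsync-like step (roughly hsync()): it
            // blocks until data and metadata reach stable storage.
            ch.force(true);
        }
        System.out.println(Files.size(file));
        Files.delete(file);
    }
}
```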
It is worth testing the read of an empty sequence file (both truly empty
and header-only). That should be quite easy with a unit test. A solution
would indeed be to validate the behaviour of SequenceFileReader / InputFormat
on these edge cases. But nothing guarantees that you won't have a record
split between two HDFS blocks. In that situation, only the first block is
visible during the write, and thus only part of the record. It would be
normal for the reader to fail in that case. You could tweak MapReduce's
bad-record skipping, but that feels like hacking around a system whose
design is wrong from the beginning.
Anyway, one solution (seen in Flume, if I remember correctly) is a good
file-naming strategy. For example, all new files end in ".open", and the
suffix is removed only once they are finished. For processing, you then
target only the finished files.
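The naming strategy above can be sketched with plain JDK file operations (the suffix constant and helper names are mine, and a real HDFS version would use FileSystem.rename instead of Files.move):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class OpenSuffixDemo {
    static final String OPEN_SUFFIX = ".open";

    // Writers create new files with the ".open" suffix...
    static Path startFile(Path dir, String name) throws IOException {
        return Files.createFile(dir.resolve(name + OPEN_SUFFIX));
    }

    // ...and strip it with a rename once the file is complete.
    static Path finishFile(Path openFile) throws IOException {
        String finished = openFile.getFileName().toString()
                .replaceAll("\\.open$", "");
        return Files.move(openFile, openFile.resolveSibling(finished));
    }

    // Readers only pick up files WITHOUT the suffix, so half-written
    // files are never processed.
    static List<Path> finishedFiles(Path dir) throws IOException {
        try (Stream<Path> s = Files.list(dir)) {
            return s.filter(p -> !p.getFileName().toString().endsWith(OPEN_SUFFIX))
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("spool");
        Path a = startFile(dir, "part-0000");
        startFile(dir, "part-0001");      // still being written
        finishFile(a);                    // part-0000 is now visible
        System.out.println(finishedFiles(dir).size());
    }
}
```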
For Hive, you might need to adapt the strategy a bit, because Hive may not
be able to target only files with a specific name (you are the expert). A
simple move of the file from a temporary directory to the table directory
would have the same effect (from the point of view of HDFS, it is the same
operation: a metadata-only change).
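The staging-directory variant looks like this locally (directory and file names are illustrative; on HDFS the same pattern uses FileSystem.rename, which is a namenode metadata change, so readers see either the whole file or nothing):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicPublish {
    public static void main(String[] args) throws IOException {
        Path stagingDir = Files.createTempDirectory("staging");
        Path tableDir = Files.createTempDirectory("warehouse-table");

        // Write the complete file in the staging directory first.
        Path staged = stagingDir.resolve("part-0000");
        Files.write(staged, "k1\tv1\n".getBytes());

        // Publishing is a single rename into the table directory.
        // ATOMIC_MOVE makes the atomicity requirement explicit; on the
        // same filesystem this is just a rename, never a data copy.
        Files.move(staged, tableDir.resolve(staged.getFileName()),
                StandardCopyOption.ATOMIC_MOVE);

        System.out.println(Files.exists(tableDir.resolve("part-0000")));
    }
}
```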
On Wed, Jul 23, 2014 at 12:16 AM, Edward Capriolo <[EMAIL PROTECTED]>