Terry Healy 2013-01-03, 20:36
A sync marker delimits each block in the avro file. If you want to start
reading data from the middle of a 100GB file, DataFileReader will seek to
the middle and find the next sync marker. Each block can be individually
compressed, and by default the writer will not compress a block and
flush it to disk until the block has grown as large as the sync interval
in bytes. Alternatively, you can call sync() manually.
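To make the "seek to the middle and find the next sync marker" idea concrete, here is a minimal sketch in plain Java. It is not Avro's DataFileReader code; the class SyncScan, the helper nextBlockStart, and the 4-byte marker are made up for illustration (real Avro sync markers are 16 bytes).

```java
import java.util.Arrays;

public class SyncScan {
    // Hypothetical helper: returns the offset just past the first complete
    // occurrence of `marker` at or after `start`, or -1 if none is found.
    static int nextBlockStart(byte[] file, byte[] marker, int start) {
        for (int i = Math.max(start, 0); i + marker.length <= file.length; i++) {
            byte[] window = Arrays.copyOfRange(file, i, i + marker.length);
            if (Arrays.equals(window, marker)) {
                return i + marker.length; // the next block begins right after the marker
            }
        }
        return -1; // no complete marker: the last block was never terminated
    }

    public static void main(String[] args) {
        // Illustrative 4-byte marker; an assumption to keep the example short.
        byte[] marker = {(byte) 0xDE, (byte) 0xAD, (byte) 0xBE, (byte) 0xEF};
        // Layout: 3 data bytes, marker, 2 data bytes, marker, 2 unterminated data bytes.
        byte[] file = {
            1, 2, 3, (byte) 0xDE, (byte) 0xAD, (byte) 0xBE, (byte) 0xEF,
            4, 5, (byte) 0xDE, (byte) 0xAD, (byte) 0xBE, (byte) 0xEF,
            6, 7
        };
        System.out.println(nextBlockStart(file, marker, 1));  // prints 7
        System.out.println(nextBlockStart(file, marker, 8));  // prints 13
        System.out.println(nextBlockStart(file, marker, 12)); // prints -1
    }
}
```

The last call shows the failure case: a reader that starts inside a block with no marker after it has nothing to resynchronize on.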
If you have a 1000000 byte sync interval, you may not see any data reach
disk until that many bytes have been written (or sync() is called).
Your problem is likely that the first block in the file has not been
flushed to disk yet, so the file on disk is incomplete: it is missing a
trailing sync marker and looks corrupt to a reader.
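The buffering described above can be modelled with a toy writer. This is only a sketch, not Avro's DataFileWriter: ToyBlockWriter, bytesOnDisk, and the 20-byte record size are assumptions chosen so that 30,000 records stay under the 1000000-byte interval.

```java
import java.io.ByteArrayOutputStream;

public class BlockBufferDemo {
    // Toy stand-in for a block-buffering writer; NOT Avro's implementation.
    static class ToyBlockWriter {
        private final ByteArrayOutputStream disk = new ByteArrayOutputStream();  // pretend file
        private final ByteArrayOutputStream block = new ByteArrayOutputStream(); // in-memory block
        private final int syncInterval;
        private final byte[] marker;

        ToyBlockWriter(int syncInterval, byte[] marker) {
            this.syncInterval = syncInterval;
            this.marker = marker;
        }

        // Buffer the record; write the block out only once it reaches the interval.
        void append(byte[] record) {
            block.write(record, 0, record.length);
            if (block.size() >= syncInterval) {
                sync();
            }
        }

        // End the current block: write its bytes followed by the sync marker.
        void sync() {
            if (block.size() == 0) {
                return; // nothing buffered, nothing to terminate
            }
            disk.write(block.toByteArray(), 0, block.size());
            disk.write(marker, 0, marker.length);
            block.reset();
        }

        int bytesOnDisk() {
            return disk.size();
        }
    }

    public static void main(String[] args) {
        ToyBlockWriter writer = new ToyBlockWriter(1000000, new byte[16]);
        byte[] record = new byte[20]; // assumed record size, for illustration only
        for (int i = 0; i < 30000; i++) {
            writer.append(record);
        }
        System.out.println(writer.bytesOnDisk()); // prints 0: 600000 bytes are still buffered
        writer.sync();
        System.out.println(writer.bytesOnDisk()); // prints 600016: block plus 16-byte marker
    }
}
```

In Avro's Java API the corresponding calls on DataFileWriter are setSyncInterval(int), sync(), and flush(). For a long-running logger, calling flush() periodically (or lowering the sync interval) should make complete blocks visible to readers while the file is still open.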
On 1/3/13 12:36 PM, "Terry Healy" <[EMAIL PROTECTED]> wrote:
>I'm upgrading a logging program to append GenericRecords to a .avro file
>instead of text (.tsv). I have a working schema that is used to convert
>existing .tsv of the same format into .avro and that works fine.
>When I run a test writing 30,000 bogus records, it runs but when I try
>to use "avro-tools-1.7.3.jar tojson" on the output file, it reports:
>"AvroRuntimeException: java.io.IOException: Invalid sync!"
>The file is still open at this point since the logging program is
>running. Is this expected behavior because it is still open? (getmeta
>and getschema work fine).
>I'm not sure if it has any bearing, since I never really understood the
>function of the Avro sync interval; in this and the working programs
>it is set to 1000000.
>Any ideas appreciated.
Terry Healy 2013-01-09, 16:21