I'm a bit new to the Avro format and am trying to process a fairly large Avro file (300 columns, 200K rows) with Python 3.
However, it's slow, and I would like to try processing individual parts of the file with 5 processes.
I wonder if there is an easy way to seek within an Avro file to a position where it's safe to start reading (i.e. without landing mid-record and reading garbage), rather than looping through each record sequentially?
I believe Avro is a splittable format, since it can be processed in parallel via MapReduce/Spark, but I'm not sure whether the Python avro module supports jumping to a safe position within a file to start reading from.
Currently all I can do is process it row by row, which doesn't help with parallelisation:
from avro.datafile import DataFileReader
from avro.io import DatumReader

reader = DataFileReader(open("users.avro", "rb"), DatumReader())
i = 0
for user in reader:  # deserialises one record at a time, strictly sequentially
    i += 1
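To make the question concrete, this is the kind of fan-out I have in mind, sketched with only the standard library. `process_range` and `split` are hypothetical names of mine, and the worker here just counts the rows in its slice; the real version would need exactly the missing piece I'm asking about, namely a way to seek to a safe boundary inside the Avro file:

```python
from multiprocessing import Pool

TOTAL_ROWS = 200_000   # approximate row count of users.avro (assumption)
NUM_PROCS = 5

def split(total, parts):
    """Divide [0, total) into `parts` contiguous half-open ranges."""
    step = total // parts
    return [(i * step, total if i == parts - 1 else (i + 1) * step)
            for i in range(parts)]

def process_range(bounds):
    """Hypothetical worker: the real version would open users.avro, seek
    to a safe position at or after `start`, and deserialise records until
    `stop`. Here it just reports the size of its slice."""
    start, stop = bounds
    return stop - start

if __name__ == "__main__":
    with Pool(NUM_PROCS) as pool:
        counts = pool.map(process_range, split(TOTAL_ROWS, NUM_PROCS))
    print(sum(counts))  # the slices together cover every row exactly once
```

The splitting itself is trivial; the part I can't figure out is what `process_range` should do to position itself in the file without decoding everything before `start`.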
Or should I switch to C or Java to process bigger files, if Spark/MapReduce is not an option?