I am building a Java utility that reads large Avro files and does some processing. These files contain millions of records, and reading them with DataFileReader.next() can take minutes. Is there a way to read more than one record at a time?
Thanks, Yael
Which language are you using? Afaik, most language implementations of Avro only have an interface for reading one record at a time, but they do buffer the input file internally, so there shouldn't be a performance disadvantage to reading one record at a time.
If you have an example that is particularly slow, you could be a great help to the Avro community by getting out a profiler and finding the bottleneck :)
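To illustrate the buffering point above, here is a minimal sketch using only the JDK, not Avro itself. It reads fixed-size 8-byte "records" one at a time and counts how many read calls reach the underlying source, with and without an 8K buffer in between. The class name, the record size, and the in-memory "file" are assumptions made for the demonstration; Avro's actual reader internals may differ.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class: shows that a record-at-a-time API on top of a
// buffered stream touches the underlying source far less often than once per record.
class BufferedRecordReadDemo {
    static final int RECORD_BYTES = 8; // assumed fixed record size (one long)

    /** Wraps a stream and counts the read calls that reach it. */
    static final class CountingInputStream extends FilterInputStream {
        long reads;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            reads++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            reads++;
            return super.read(b, off, len);
        }
    }

    /** Reads recordCount records one by one; returns the number of reads on the source. */
    static long countSourceReads(int recordCount, boolean buffered) {
        byte[] file = new byte[recordCount * RECORD_BYTES]; // stand-in for a file on disk
        CountingInputStream source = new CountingInputStream(new ByteArrayInputStream(file));
        InputStream in = buffered ? new BufferedInputStream(source, 8192) : source;
        try (DataInputStream records = new DataInputStream(in)) {
            for (int i = 0; i < recordCount; i++) {
                records.readLong(); // one "record" per call
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return source.reads;
    }

    public static void main(String[] args) {
        int recordCount = 100_000;
        System.out.println("unbuffered source reads: " + countSourceReads(recordCount, false));
        System.out.println("buffered source reads:   " + countSourceReads(recordCount, true));
    }
}
```

With 100,000 records (800,000 bytes) and an 8,192-byte buffer, the buffered case should need only on the order of 800,000 / 8,192 ≈ 98 reads from the source, while the unbuffered case needs at least one per record. If reading is still slow with buffering in place, the time is probably going somewhere else (decoding, decompression, object allocation), which is where a profiler helps.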
On 14 May 2014, at 20:13, yael aharon <[EMAIL PROTECTED]> wrote:
I am using Java. I did play with the size of the reader's buffer, but I found that the default size of 8K gave me the best performance.
Thanks, Yael
On Fri, May 23, 2014 at 4:14 AM, Martin Kleppmann <[EMAIL PROTECTED]> wrote:
While I haven't benchmarked Java performance, I have looked closely at Ruby vs. C for reading large Avro files. With C, I have processed ~900 MB files with 25+M rows in ~42s, and I routinely process 270 MB / 7.5M-record files in about 15s on average. These numbers were observed on a 2012 MacBook Pro (exact specs elude me at the moment). Not scientific, but it may help give you a ballpark of what is possible.
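For comparison with your own runs, the figures above can be turned into rough throughput numbers. This is only arithmetic on the approximate values quoted in the message (the "~" and "25+M" are treated as exact here, which is an assumption), and the class and method names are made up for the sketch.

```java
import java.util.Locale;

// Hypothetical helper: converts the quoted file sizes, record counts and
// timings into MB/s and records/s so they can be compared with other runs.
class AvroThroughputBallpark {
    /** Returns amount per second; seconds must be positive. */
    static double perSecond(double amount, double seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        return amount / seconds;
    }

    static String report(String label, double megabytes, double records, double seconds) {
        return String.format(Locale.ROOT, "%s: %.1f MB/s, %.0f records/s",
                label, perSecond(megabytes, seconds), perSecond(records, seconds));
    }

    public static void main(String[] args) {
        // Values taken from the message above, rounded as quoted.
        System.out.println(report("900 MB, 25M rows, 42 s", 900, 25_000_000, 42));
        System.out.println(report("270 MB, 7.5M records, 15 s", 270, 7_500_000, 15));
    }
}
```

That works out to roughly 600,000 and 500,000 records per second, or about 18 to 21 MB/s, for the C reader on that machine. A Java run that takes minutes for a few million records is well below that ballpark, which suggests the slowdown is worth profiling.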
Thank you very much, Mike. I am looking at the Avro C API right now, and this is extremely helpful.
Lewis
On Sat, May 24, 2014 at 6:00 AM, Mike Stanley <[EMAIL PROTECTED]> wrote: