Re: HUG talk on PTD/Avro
Ken Krugler 2010-04-26, 20:12
Hi Doug,
On Apr 23, 2010, at 1:31pm, Doug Cutting wrote:

> Ken Krugler wrote:
>> 3. It would be great to get feedback on both the Avro Cascading
>> scheme (http://github.com/bixolabs/cascading.avro) and the content
>> we're currently saving in the Avro file.
>
> Overall it looks fine to me.
>
> What do you think of https://issues.apache.org/jira/browse/AVRO-513?
> Would that make your life much easier?

I read through it, but don't understand why "...explicitly detect sequences of matching data" is an issue. What's the definition of "matching data"? Is there a common use case for Avro where you need to detect duplicates?

> It might be more efficient, instead of reading Avro generic data and
> converting it to your desired representation, to subclass
> GenericDatumReader and override #readString(), #readBytes(),
> #readMap(), and #readArray(). Similarly for DatumWriter. But we'd
> then also need to permit one to configure AvroRecordReader to use a
> different DatumReader implementation. We might, e.g., add a
> DataRepresentationFactory interface:
>
> interface DataRepresentation<T> {
>   DatumReader<T> createDatumReader();
>   DatumWriter<T> createDatumWriter();
> }
>
> Then we could replace AvroJob#setInputSpecific() and
> #setInputGeneric() with
> #setInputRepresentation(Class<DataRepresentation> rep, Schema s).
> You could subclass GenericDatumReader & Writer and implement a
> DataRepresentation that returns these.
>
> Worth it?

I assume the performance win comes because there's only one conversion to/from the serialized & stored data, versus two. If so, then it would definitely be faster, but I don't know by how much.

It seems like the most likely bottleneck would be with strings, as these need conversion and can be both long and common. I'd either need to hook up a profiler to a typical read or write flow, or disable the string conversion and measure the speedup.

So no recommendation for now, until I get time to try that out.
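To make the "one conversion versus two" point concrete, here is a rough Java sketch of the factory idea from the quoted message. It is only an illustration under stated assumptions: SketchReader, SketchWriter, RawText, GenericRepresentation and DirectStringRepresentation are hypothetical names invented for this example, and the reader/writer stand-ins operate on plain byte arrays rather than on Avro's actual decoder and encoder types.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified stand-ins for Avro's DatumReader/DatumWriter.
// The real interfaces work against decoder/encoder objects, not byte arrays.
interface SketchReader<T> { T read(byte[] encoded); }
interface SketchWriter<T> { byte[] write(T datum); }

// Shape of the factory proposed in the quoted message, using the stand-ins.
interface DataRepresentation<T> {
    SketchReader<T> createDatumReader();
    SketchWriter<T> createDatumWriter();
}

// Assumed intermediate "generic" string type that just wraps raw UTF-8 bytes.
final class RawText {
    final byte[] utf8;
    RawText(byte[] utf8) { this.utf8 = utf8; }
    @Override public String toString() { return new String(utf8, StandardCharsets.UTF_8); }
}

// Generic path: the reader yields RawText, and the caller must convert again.
final class GenericRepresentation implements DataRepresentation<RawText> {
    public SketchReader<RawText> createDatumReader() { return bytes -> new RawText(bytes.clone()); }
    public SketchWriter<RawText> createDatumWriter() { return text -> text.utf8.clone(); }
}

// Custom path: the reader produces the application's own type in one step.
final class DirectStringRepresentation implements DataRepresentation<String> {
    public SketchReader<String> createDatumReader() {
        return bytes -> new String(bytes, StandardCharsets.UTF_8);
    }
    public SketchWriter<String> createDatumWriter() {
        return text -> text.getBytes(StandardCharsets.UTF_8);
    }
}

final class RepresentationDemo {
    public static void main(String[] args) {
        byte[] stored = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // Two steps: stored bytes -> RawText -> String.
        String viaGeneric = new GenericRepresentation().createDatumReader().read(stored).toString();
        // One step: stored bytes -> String.
        String direct = new DirectStringRepresentation().createDatumReader().read(stored);
        System.out.println(viaGeneric.equals(direct));
    }
}
```

Both paths yield the same string; the difference is the intermediate object and extra copy on the generic path. Whether that cost matters in practice is exactly what the profiling mentioned above would need to show.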
Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g