Re: HUG talk on PTD/Avro
Hi Doug,

On Apr 23, 2010, at 1:31pm, Doug Cutting wrote:

> Ken Krugler wrote:
>> 3. It would be great to get feedback on both the Avro Cascading  
>> scheme (http://github.com/bixolabs/cascading.avro) and the content  
>> we're currently saving in the Avro file.
> Overall it looks fine to me.
> What do you think of https://issues.apache.org/jira/browse/AVRO-513?  
> Would that make your life much easier?

I read through it, but don't understand why "...explicitly detect  
sequences of matching data" is an issue.

What's the definition of "matching data"? Is there a common use case  
for Avro where you need to detect duplicates?

> It might be more efficient, instead of reading Avro generic data and  
> converting it to your desired representation, to subclass  
> GenericDatumReader and override #readString(), #readBytes(),  
> #readMap(), and #readArray().  Similarly for DatumWriter.  But we'd  
> then also need to permit one to configure AvroRecordReader to use a  
> different DatumReader implementation.  We might, e.g., add a  
> DataRepresentation interface:
> interface DataRepresentation<T> {
>  DatumReader<T> createDatumReader();
>  DatumWriter<T> createDatumWriter();
> }

> Then we could replace AvroJob#setInputSpecific() and  
> #setInputGeneric() with  
> #setInputRepresentation(Class<DataRepresentation> rep, Schema s).  
> You could subclass GenericDatumReader & Writer and implement a  
> DataRepresentation that returns these.
> Worth it?

I assume the performance win comes because there's only one conversion  
to/from the serialized & stored data, versus two.
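
To make sure I'm reading the proposal right, here is a minimal sketch of the single-conversion idea. The DatumReader and DatumWriter types below are simplified stand-ins I made up for illustration (they take and return a raw byte[]), not the real Avro interfaces, and StringRepresentation and roundTrip are hypothetical names. Only the shape of DataRepresentation comes from the suggestion above.

```java
import java.nio.charset.StandardCharsets;

public class ConversionSketch {
    // Simplified stand-ins for illustration only; NOT the real Avro interfaces.
    interface DatumReader<T> { T read(byte[] encoded); }
    interface DatumWriter<T> { byte[] write(T datum); }

    // Same shape as the interface proposed in the quoted message.
    interface DataRepresentation<T> {
        DatumReader<T> createDatumReader();
        DatumWriter<T> createDatumWriter();
    }

    // Hypothetical representation that goes straight between stored bytes and
    // java.lang.String, with no intermediate generic object in between.
    static class StringRepresentation implements DataRepresentation<String> {
        public DatumReader<String> createDatumReader() {
            return encoded -> new String(encoded, StandardCharsets.UTF_8);
        }
        public DatumWriter<String> createDatumWriter() {
            return datum -> datum.getBytes(StandardCharsets.UTF_8);
        }
    }

    // Writes a value and reads it back: one conversion in each direction.
    static String roundTrip(String value) {
        DataRepresentation<String> rep = new StringRepresentation();
        byte[] stored = rep.createDatumWriter().write(value);
        return rep.createDatumReader().read(stored);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("http://example.com/"));
    }
}
```

In the current flow there would be an extra hop on each side (stored bytes to generic data, then generic data to the target representation), which is the step this layout skips.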

If so, then it would definitely be faster, but I don't know by how  
much. The most likely bottleneck is strings, since these need  
conversion and are often both long and common.

I'd either need to hook up a profiler to a typical read or write flow,  
or disable the string conversion and measure the speedup.
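
The second option could start as something like the rough timing harness below. It is only a sketch under stated assumptions: the class and method names are made up, the data is synthetic ASCII rather than real crawl content, and it times plain UTF-8 decoding in isolation instead of the actual Avro read path, so the numbers would be indicative at best.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class StringDecodeTiming {
    // Builds `count` synthetic values of `length` ASCII bytes each.
    static byte[][] sampleData(int count, int length) {
        byte[][] values = new byte[count][];
        for (int i = 0; i < count; i++) {
            values[i] = new byte[length];
            Arrays.fill(values[i], (byte) 'a');
        }
        return values;
    }

    // Decodes every value to a String; returns the total character count
    // so the work cannot be optimized away.
    static long decodeAll(byte[][] values) {
        long chars = 0;
        for (byte[] v : values) {
            chars += new String(v, StandardCharsets.UTF_8).length();
        }
        return chars;
    }

    // Baseline with string conversion "disabled": only touches the raw bytes.
    static long skipDecode(byte[][] values) {
        long bytes = 0;
        for (byte[] v : values) {
            bytes += v.length;
        }
        return bytes;
    }

    public static void main(String[] args) {
        byte[][] data = sampleData(100_000, 200);

        // Warm up both paths so the JIT has compiled them before timing.
        for (int i = 0; i < 5; i++) {
            decodeAll(data);
            skipDecode(data);
        }

        long t0 = System.nanoTime();
        long decoded = decodeAll(data);
        long t1 = System.nanoTime();
        long skipped = skipDecode(data);
        long t2 = System.nanoTime();

        System.out.println("decode: " + (t1 - t0) / 1_000_000.0 + " ms (" + decoded + " chars)");
        System.out.println("skip:   " + (t2 - t1) / 1_000_000.0 + " ms (" + skipped + " bytes)");
    }
}
```

A profiler on a real read or write flow would still be the more trustworthy measurement, since this ignores everything else the job does.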

So no recommendation for now, until I get time to try that out.


-- Ken

Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g