Avro >> mail # dev >> HUG talk on PTD/Avro


Re: HUG talk on PTD/Avro
Hi Doug,

On Apr 23, 2010, at 1:31pm, Doug Cutting wrote:

> Ken Krugler wrote:
>> 3. It would be great to get feedback on both the Avro Cascading  
>> scheme (http://github.com/bixolabs/cascading.avro) and the content  
>> we're currently saving in the Avro file.
>
> Overall it looks fine to me.
>
> What do you think of https://issues.apache.org/jira/browse/AVRO-513?  
> Would that make your life much easier?

I read through it, but don't understand why "...explicitly detect  
sequences of matching data" is an issue.

What's the definition of "matching data"? Is there a common use case  
for Avro where you need to detect duplicates?

> It might be more efficient, instead of reading Avro generic data and  
> converting it to your desired representation, to subclass  
> GenericDatumReader and override #readString(), #readBytes(),  
> #readMap(), and #readArray().  Similarly for DatumWriter.  But we'd  
> then also need to permit one to configure AvroRecordReader to use a  
> different DatumReader implementation.  We might, e.g., add a  
> DataRepresentationFactory interface:
>
> interface DataRepresentation<T> {
>  DatumReader<T> createDatumReader();
>  DatumWriter<T> createDatumWriter();
> }

> Then we could replace AvroJob#setInputSpecific() and  
> #setInputGeneric() with  
> #setInputRepresentation(Class<DataRepresentation> rep, Schema s).  
> You could subclass GenericDatumReader & Writer and implement a  
> DataRepresentation that returns these.
>
> Worth it?
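For concreteness, the proposed factory could be wired up roughly as below. This is a minimal self-contained sketch only: the `DatumReader`/`DatumWriter` interfaces here are simplified stand-ins for Avro's real `org.apache.avro.io` types, and `StringRepresentation` is an invented example class, not part of the proposal.

```java
// Simplified stand-ins for Avro's DatumReader/DatumWriter, so the
// sketch compiles on its own without the Avro jar.
interface DatumReader<T> { T read(String encoded); }
interface DatumWriter<T> { String write(T datum); }

// The factory interface from the proposal: one object that knows how to
// create a matched reader/writer pair for a given in-memory representation.
interface DataRepresentation<T> {
    DatumReader<T> createDatumReader();
    DatumWriter<T> createDatumWriter();
}

// Hypothetical representation that hands data back as plain
// java.lang.String, i.e. with no extra conversion step after reading.
class StringRepresentation implements DataRepresentation<String> {
    public DatumReader<String> createDatumReader() { return encoded -> encoded; }
    public DatumWriter<String> createDatumWriter() { return datum -> datum; }
}

public class RepresentationDemo {
    public static void main(String[] args) {
        DataRepresentation<String> rep = new StringRepresentation();
        // A job runner would look up the configured representation and use
        // whatever reader/writer pair it produces.
        String roundTrip = rep.createDatumReader()
                .read(rep.createDatumWriter().write("hello"));
        System.out.println(roundTrip); // prints "hello"
    }
}
```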

I assume the performance win comes because there's only one conversion  
to/from the serialized & stored data, versus two.

If so, then it would definitely be faster, but I don't know by how  
much. It seems like the most likely bottleneck would be with strings,  
as these need conversion and can be long/common.

I'd either need to hook up a profiler to a typical read or write flow,  
or disable the string conversion and measure the speedup.
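One quick way to size the string-conversion cost without a full profiler run would be a standalone micro-benchmark along these lines (a sketch only: the sample value and iteration count are made up, and the timing is machine-dependent):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical micro-benchmark: rough cost of decoding UTF-8 bytes into
// java.lang.String, the per-field conversion step discussed above.
public class StringDecodeCost {
    public static void main(String[] args) {
        byte[] utf8 = "a fairly typical crawled-page field value"
                .getBytes(StandardCharsets.UTF_8);
        int iterations = 1_000_000;

        long start = System.nanoTime();
        long totalChars = 0;
        for (int i = 0; i < iterations; i++) {
            // The conversion under test: raw bytes -> String.
            totalChars += new String(utf8, StandardCharsets.UTF_8).length();
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        // totalChars is accumulated so the JIT can't dead-code the loop away.
        System.out.println(totalChars + " chars decoded in " + elapsedMs + " ms");
    }
}
```

Comparing that figure against a run of the real read path would give a rough upper bound on the possible speedup from skipping the conversion.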

So no recommendation for now, until I get time to try that out.

Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g