Avro >> mail # user >> Specific/GenericDatumReader performance and resolving decoders


Re: Specific/GenericDatumReader performance and resolving decoders
I think this approach makes sense, reader=writer is common.  In addition to
record fields, unions are affected.

I have been thinking for a while about the fact that resolving records is
slower than not resolving them.  In theory, it could be just as fast, because
you can pre-compute the steps needed and bake them into the reading logic.
This seems like a reasonable way to avoid the cost in the case where the
schemas are equal.
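
As a rough illustration of "pre-compute the steps and bake them into the reading logic", here is a minimal sketch. It does not use Avro classes: the class name ResolutionPlanSketch, its methods, and the representation of records as Object arrays are all hypothetical stand-ins. The point is only that the name matching happens once per schema pair, and that equal schemas can skip the plan entirely.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these are Avro classes.
final class ResolutionPlanSketch {

    // For each writer field position, the reader position to store into,
    // or -1 when the reader schema has no such field and the value is skipped.
    static int[] precompute(List<String> writerFields, List<String> readerFields) {
        int[] plan = new int[writerFields.size()];
        for (int i = 0; i < plan.length; i++) {
            plan[i] = readerFields.indexOf(writerFields.get(i));
        }
        return plan;
    }

    // Applies a precomputed plan; no name lookups happen on the per-record path.
    static Object[] decode(Object[] encoded, int[] plan, int readerFieldCount) {
        Object[] record = new Object[readerFieldCount];
        for (int i = 0; i < plan.length; i++) {
            if (plan[i] >= 0) {
                record[plan[i]] = encoded[i];
            }
        }
        return record;
    }

    // Fast path when reader and writer schemas are equal: fields map one to one.
    static Object[] decodeIdentity(Object[] encoded) {
        return Arrays.copyOf(encoded, encoded.length);
    }
}
```

A real implementation would also have to handle defaults, promotions and unions; the sketch covers field reordering and skipping only.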

Please open a JIRA ticket and put your preliminary thoughts there.  It is a
good place to discuss the technical bits of the issue even before you have a
patch.

On 4/19/12 2:09 AM, "Irving, Dave" <[EMAIL PROTECTED]> wrote:

> Hi,
>  
> Recently I've been looking at the performance of Avro's
> SpecificDatumReaders/Writers. In our use cases, when deserializing, we find it
> quite usual for reader / writer schemas to be identical. Interestingly,
> GenericDatumReader bakes the use of ResolvingDecoders right into its core.
> So even if constructed with a single (reader/writer) schema, a
> ResolvingDecoder is still used.
> I experimented a little, and wrote a SpecificDatumReader which, instead of
> being hard-wired with a ResolvingDecoder, uses a DecodeStrategy, leaving the
> reader dealing only with Decoders directly.
> Details follow, but for 'same schema' decodes the performance difference is
> impressive. For the types of records I deal with, a decode with reader schema
> == writer schema using this approach is about 1.6x faster than a standard
> SpecificDatumReader decode.
>  
>  
> interface DecodeStrategy
> {
>   Decoder configureForRead(Decoder in) throws IOException;
>  
>   void readComplete() throws IOException;
>  
>   void decodeRecordFields(Object old, SpecificRecord record, Schema expected,
> Decoder in, SpecificDatumReader2 reader) throws IOException;
> }
>  
> The idea is that when we hit a record, instead of getting the field order from
> a ResolvingDecoder directly, we just let the decode strategy do it for us
> (calling back to the reader for each field, which allows recursion).
> E.g. when we know reader / writer schemas are identical, and we don't want
> validation, an IdentitySchemaDecodeStrategy#decodeRecordFields can just pull
> the fields directly from the provided record schema (calling back on the
> reader for each one):
>  
> ...
>  
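> void decodeRecordFields(......)
> {
>   List<Field> fields = expected.getFields();
>   for (int i = 0, len = fields.size(); i < len; ++i)
>   {
>     // Read each field in the order declared by the shared schema.
>     Field field = fields.get(i);
>     reader.readField(old, in, field, record);
>   }
> }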
>  
> ...
>  
> The resolving decoder impl of this strategy just does a 'readFieldOrder' like
> GenericDatumReader does today.
>  
> For each read (given a Decoder), the datum reader lets the decode strategy
> return the actual decoder to be used (via #configureForRead). This means
> that a resolving implementation can use this hook to configure the
> ResolvingDecoder and return it.
> The result is that the datum reader can work with same schema / validated
> schema / resolved schemas seamlessly without caring about the difference.
>  
> I thought I'd share the approach before working on a full patch: Is this an
> approach you'd be interested in taking back to core Avro? Or is it a little
> niche? :)
>  
> Cheers,
>  
> Dave
>  
>
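
The strategy idea in the quoted message can be sketched with stand-in types, to show how one reader loop can serve both the same-schema and the resolved case. Everything here is an assumption made for illustration: FieldOrderStrategy, IdentityStrategy, ResolvingStrategy and readRecord are hypothetical names, the "decoder" is just a queue of already-parsed values, and none of this is the DecodeStrategy interface or the Avro API itself.

```java
import java.util.Deque;

// Hypothetical stand-ins, not Avro types.
final class DecodeStrategySketch {

    interface FieldOrderStrategy {
        // Returns reader field positions in the order the values arrive on the wire.
        int[] fieldOrder(int fieldCount);
    }

    // Same-schema case: values arrive in declared field order.
    static final class IdentityStrategy implements FieldOrderStrategy {
        public int[] fieldOrder(int fieldCount) {
            int[] order = new int[fieldCount];
            for (int i = 0; i < fieldCount; i++) {
                order[i] = i;
            }
            return order;
        }
    }

    // Resolved case: the order was worked out elsewhere and is handed in.
    static final class ResolvingStrategy implements FieldOrderStrategy {
        private final int[] order;

        ResolvingStrategy(int[] order) {
            this.order = order.clone();
        }

        public int[] fieldOrder(int fieldCount) {
            return order.clone();
        }
    }

    // The reader loop is identical for both strategies; only the order differs.
    static Object[] readRecord(Deque<Object> in, int fieldCount, FieldOrderStrategy strategy) {
        Object[] record = new Object[fieldCount];
        for (int position : strategy.fieldOrder(fieldCount)) {
            record[position] = in.removeFirst();
        }
        return record;
    }
}
```

With IdentityStrategy the values land in the positions they were written in; with ResolvingStrategy the same loop places each incoming value at the position the strategy dictates.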