Re: Specific/GenericDatumReader performance and resolving decoders
Scott Carey 2012-04-19, 16:20
I think this approach makes sense, reader=writer is common. In addition to
record fields, unions are affected.
I have been thinking for a while about the fact that resolving records is
slower than not resolving them. In theory, it could be just as fast because you can
pre-compute the steps needed and bake that into the reading logic. This
seems like a reasonable way to avoid the cost for the case where schemas
Please open a JIRA ticket and put your preliminary thoughts there. It is a
good place to discuss the technical bits of the issue even before you have a
On 4/19/12 2:09 AM, "Irving, Dave" <[EMAIL PROTECTED]> wrote:
> Recently I've been looking at the performance of Avro's
> SpecificDatumReaders/Writers. In our use cases, when deserializing, we find it
> quite usual for reader / writer schemas to be identical. Interestingly,
> GenericDatumReader bakes in the use of ResolvingDecoders right into its core.
> So even if constructed with a single (reader/writer) schema, a
> ResolvingDecoder is still used.
> I experimented a little, and wrote a SpecificDatumReader which instead of
> being hard wired with a ResolvingDecoder, uses a DecodeStrategy leaving the
> reader only dealing with Decoders directly.
> Details follow, but for 'same schema' decodes the performance difference is
> impressive. For the types of records I deal with, a decode with reader schema
> == writer schema using this approach is about 1.6x faster than a standard
> SpecificDatumReader decode.
> interface DecodeStrategy {
>     Decoder configureForRead(Decoder in) throws IOException;
>     void readComplete() throws IOException;
>     void decodeRecordFields(Object old, SpecificRecord record, Schema expected,
>             Decoder in, SpecificDatumReader2 reader) throws IOException;
> }
> The idea is that when we hit a record, instead of getting field order from a
> ResolvingDecoder directly, we just let the decode strategy do it for us
> (calling back for each field to the reader allowing recursion).
> For example, when we know reader / writer schemas are identical, and we don't want
> validation, an IdentitySchemaDecodeStrategy#decodeRecordFields can just pull
> the fields direct from the provided record schema (calling back on the reader
> for each one):
> void decodeRecordFields(......) {
>     List<Field> fields = expected.getFields();
>     for (int i = 0, len = fields.size(); i < len; ++i) {
>         reader.readField(old, in, fields.get(i), record);
>     }
> }
> The resolving decoder impl of this strategy just does a 'readFieldOrder' like
> GenericDatumReader does today.
> For each read (given a Decoder), the datum reader lets the decode strategy
> return back the actual decoder to be used (via #configureForRead). This means
> that a resolving implementation can use this hook to configure the
> ResolvingDecoder and return this.
> The result is that the datum reader can work with same schema / validated
> schema / resolved schemas seamlessly without caring about the difference.
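The strategy split described above can be sketched without any Avro dependency. This is a toy model only: ToyDecoder, ToyReader, IdentityStrategy and WriterOrderStrategy are hypothetical names invented for illustration, the "schema" is just a list of field names, and the writer-order strategy is a much-simplified stand-in for real schema resolution (no defaults, no type promotion). It is meant to show the shape of the callback, not the proposed patch.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these types are hypothetical stand-ins, not Avro classes.
class DecodeStrategySketch {

    /** Stand-in for a Decoder: returns int values in the order they were written. */
    static final class ToyDecoder {
        private final int[] data;
        private int pos = 0;
        ToyDecoder(int... data) { this.data = data; }
        int readInt() { return data[pos++]; }
    }

    /** Chooses the order in which record fields are read, calling back into the reader. */
    interface DecodeStrategy {
        void decodeRecordFields(List<String> readerFields, ToyDecoder in,
                                Map<String, Integer> record, ToyReader reader);
    }

    /** Reader schema == writer schema: walk the reader's own field list. */
    static final class IdentityStrategy implements DecodeStrategy {
        @Override
        public void decodeRecordFields(List<String> readerFields, ToyDecoder in,
                                       Map<String, Integer> record, ToyReader reader) {
            for (int i = 0, len = readerFields.size(); i < len; ++i) {
                reader.readField(readerFields.get(i), in, record);
            }
        }
    }

    /** Simplified stand-in for resolution: follow writer order, skip unknown fields. */
    static final class WriterOrderStrategy implements DecodeStrategy {
        private final List<String> writerFields;
        WriterOrderStrategy(List<String> writerFields) { this.writerFields = writerFields; }

        @Override
        public void decodeRecordFields(List<String> readerFields, ToyDecoder in,
                                       Map<String, Integer> record, ToyReader reader) {
            for (String name : writerFields) {
                if (readerFields.contains(name)) {
                    reader.readField(name, in, record);
                } else {
                    in.readInt(); // consume and discard a value the reader does not want
                }
            }
        }
    }

    /** The reader does not decide field order itself; it delegates to the strategy. */
    static final class ToyReader {
        private final List<String> readerFields;
        private final DecodeStrategy strategy;

        ToyReader(List<String> readerFields, DecodeStrategy strategy) {
            this.readerFields = readerFields;
            this.strategy = strategy;
        }

        Map<String, Integer> read(ToyDecoder in) {
            Map<String, Integer> record = new LinkedHashMap<>();
            strategy.decodeRecordFields(readerFields, in, record, this);
            return record;
        }

        /** Callback used by strategies to read a single field value. */
        void readField(String name, ToyDecoder in, Map<String, Integer> record) {
            record.put(name, in.readInt());
        }
    }

    public static void main(String[] args) {
        List<String> readerFields = Arrays.asList("id", "age");

        // Same schema on both sides: no ordering lookup is needed.
        ToyReader same = new ToyReader(readerFields, new IdentityStrategy());
        System.out.println(same.read(new ToyDecoder(7, 42)));

        // Writer used a different order and an extra field.
        ToyReader resolved = new ToyReader(readerFields,
                new WriterOrderStrategy(Arrays.asList("age", "extra", "id")));
        System.out.println(resolved.read(new ToyDecoder(42, 99, 7)));
    }
}
```

In this sketch the reader code is identical in both cases; only the strategy passed to the constructor changes, which is the property the message above is arguing for.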
> I thought I'd share the approach before working on a full patch: Is this an
> approach you'd be interested in taking back to core Avro? Or is it a little
> niche? :)