I'm about to make all of this even more confusing.
For pair-wise resolution when the operation is deserialization, "reader" and
"writer" make sense. In a more general sense it is simply "from" and "to"
-- One might move from schema A to B without serialization at all,
transforming a data structure, or simply want a view of data in the form of
A as if it was in B. There aren't any clear naming winners and many sound
good for one use case but worse for others: 'source' and 'destination',
'source' and 'sink', 'original' and 'target', 'expected' and 'actual',
'reader' and 'writer', 'resolver' and 'resolvee', 'sender' and 'receiver'.
As part of AVRO-1124 I have recently met in person with a few folks who
needed enhancements to that ticket (the discussion and conclusion will be
added there shortly, prior to the next patch version).
The result is that two names are not enough, because expressing resolution
of _sets_ of schemas is more complicated than pairs.
When describing a set of schemas that represent some sort of data that may
have been persisted, six states are needed. The six states are the product
of two dimensions:
* The "read" dimension is binary, and represents whether a schema is ever
used for reading (is ever a "to", "reader", or "target").
* The "write" dimension has three states: Writer (an active "from" or
"source"), Written (persisted data, not actively written), and None (not
used for writing).
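
As a rough Java sketch of the two dimensions above (not the actual
AVRO-1124 patch), the six states are just the cross product of a two-valued
and a three-valued enumeration. The enum names follow the ones mentioned
further down in this message; the wrapper class SixStatesSketch and the
allStates() helper are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class SixStatesSketch {
    /** Whether a schema is ever used as a "to", "reader", or "target". */
    enum ReadState { READER, NONE }

    /** WRITER: active source; WRITTEN: persisted data only; NONE: not used for writing. */
    enum WriteState { WRITER, WRITTEN, NONE }

    /** Lists every (read, write) combination as a "READ/WRITE" label. */
    static List<String> allStates() {
        List<String> states = new ArrayList<>();
        for (ReadState r : ReadState.values()) {
            for (WriteState w : WriteState.values()) {
                states.add(r + "/" + w);
            }
        }
        return states;
    }

    public static void main(String[] args) {
        // Prints the 2 x 3 = 6 combinations, one per line.
        allStates().forEach(System.out::println);
    }
}
```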
The naming of these will be confusing, so as part of AVRO-1124 we'll have to
have names that are as clear as possible. Currently I have enumerations:
ReadState.READER and ReadState.NONE; WriteState.WRITER, WriteState.WRITTEN,
and WriteState.NONE. I am not a big fan of these names, and am open to
suggestions. A consistent approach in naming is important. For example,
I previously had WriteState.WRITTEN named WriteState.READABLE. That name
best conveys what the state is for, but is extremely
These six states are related by one schema resolution rule:
Schemas in state ReadState.READER must be able to read all schemas with
WriteState.WRITER or WriteState.WRITTEN.
and one rule for persisting data:
Data must not be persisted unless the corresponding schema is in state
Without going into the details, this allows for any schema evolution use
case over a set of schemas with both ephemeral data and persisted data.
Schemas can transition from one state to another, as long as the constraint
rules above are met at all times.
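
To make the resolution rule and the transition constraint concrete, here is
a minimal Java sketch, assuming schemas are identified by name and that some
pair-wise check canRead.test(reader, writer) is supplied from outside. The
class SchemaSetSketch, its State holder, and the transition() method are all
hypothetical names, not part of Avro or of the AVRO-1124 patch. Only the
resolution rule is modelled; the rule for persisting data is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiPredicate;

final class SchemaSetSketch {
    enum ReadState { READER, NONE }
    enum WriteState { WRITER, WRITTEN, NONE }

    /** One of the six states: a (read, write) pair. */
    static final class State {
        final ReadState read;
        final WriteState write;
        State(ReadState read, WriteState write) { this.read = read; this.write = write; }
    }

    private final Map<String, State> states = new LinkedHashMap<>();
    // Hypothetical pair-wise check, called as canRead.test(reader, writer).
    private final BiPredicate<String, String> canRead;

    SchemaSetSketch(BiPredicate<String, String> canRead) { this.canRead = canRead; }

    /** Resolution rule: every READER must read every WRITER or WRITTEN schema. */
    static boolean satisfiesRule(Map<String, State> set, BiPredicate<String, String> canRead) {
        for (Map.Entry<String, State> r : set.entrySet()) {
            if (r.getValue().read != ReadState.READER) continue;
            for (Map.Entry<String, State> w : set.entrySet()) {
                if (w.getValue().write == WriteState.NONE) continue;
                if (!canRead.test(r.getKey(), w.getKey())) return false;
            }
        }
        return true;
    }

    /** Adds a schema or moves it to a new state; refuses if the rule would break. */
    boolean transition(String schema, ReadState read, WriteState write) {
        Map<String, State> candidate = new LinkedHashMap<>(states);
        candidate.put(schema, new State(read, write));
        if (!satisfiesRule(candidate, canRead)) return false;
        states.put(schema, new State(read, write));
        return true;
    }

    public static void main(String[] args) {
        // Toy compatibility: a schema reads itself, and v2 can also read v1.
        SchemaSetSketch set = new SchemaSetSketch(
            (r, w) -> r.equals(w) || (r.equals("v2") && w.equals("v1")));
        System.out.println(set.transition("v1", ReadState.READER, WriteState.WRITER));
        // Rejected: v1 is still a reader and cannot read data written with v2.
        System.out.println(set.transition("v2", ReadState.READER, WriteState.WRITER));
        // Accepted: retire v1 as a reader first, keeping its persisted data readable.
        System.out.println(set.transition("v2", ReadState.READER, WriteState.NONE));
        System.out.println(set.transition("v1", ReadState.NONE, WriteState.WRITTEN));
        System.out.println(set.transition("v2", ReadState.READER, WriteState.WRITER));
    }
}
```

The point of validating a copy of the set before committing is that a
rejected transition leaves the existing states untouched, so the rule holds
at all times.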
"Reader" and "Writer" have been useful because they correlate with other
meaningful names well -- hypothetically:
boolean mySchema.canRead(Schema writer) and
boolean mySchema.canBeReadWith(Schema reader)
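
To illustrate how the two hypothetical methods mirror each other, here is a
toy Java sketch. ToySchema is invented for this example and is not Avro's
Schema class; it only models flat records and a simplified subset of the
record resolution rules (a reader field missing from the writer needs a
default, shared fields must have the same type name, extra writer fields are
ignored). Type promotion, unions, and nesting are deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ToySchema {
    // Field name -> type name, e.g. "age" -> "int".
    private final Map<String, String> fieldTypes = new LinkedHashMap<>();
    // Field name -> whether the field declares a default value.
    private final Map<String, Boolean> hasDefault = new LinkedHashMap<>();

    /** Adds a field and returns this schema so calls can be chained. */
    ToySchema field(String name, String type, boolean withDefault) {
        fieldTypes.put(name, type);
        hasDefault.put(name, withDefault);
        return this;
    }

    /** True if data written with 'writer' can be resolved into this schema. */
    boolean canRead(ToySchema writer) {
        for (Map.Entry<String, String> f : fieldTypes.entrySet()) {
            String writerType = writer.fieldTypes.get(f.getKey());
            if (writerType == null) {
                // Field absent from the writer: only readable with a default.
                if (!hasDefault.get(f.getKey())) return false;
            } else if (!writerType.equals(f.getValue())) {
                return false; // No type promotion in this toy model.
            }
        }
        return true;
    }

    /** The same question asked from the writer's side. */
    boolean canBeReadWith(ToySchema reader) {
        return reader.canRead(this);
    }

    public static void main(String[] args) {
        ToySchema v1 = new ToySchema().field("name", "string", false);
        ToySchema v2 = new ToySchema().field("name", "string", false)
                                      .field("age", "int", true);
        System.out.println(v2.canRead(v1));       // reader adds a defaulted field
        System.out.println(v1.canBeReadWith(v2)); // same check, writer's viewpoint
    }
}
```

Defining canBeReadWith purely in terms of canRead keeps the two names from
drifting apart: there is one resolution check, asked from either side.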
A naming scheme for describing schema resolution and evolution will need to
work across many use cases and be useful for describing relationships
between schemas. Describing only the pair-wise resolution is not enough.
On 6/8/13 12:44 AM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:
> Originally I used the term 'actual' for the schema of the data written and
> 'expected' for the schema that the reader of the data wished to see it as.
> Some found those terms confusing and suggested that 'writer' and 'reader' were
> more intuitive, so we started using those instead. That unfortunately seems
> not to have resolved the confusion entirely.
> Perhaps we should improve the documentation around this? Do you have any
> specific suggestions about how that might be done?
> On Jun 7, 2013 10:12 PM, "Gregory (Grisha) Trubetskoy" <[EMAIL PROTECTED]>
>> I'm curious how the "Reader" and "Writer" terminology came about, and, most
>> importantly, whether it's as confusing to the rest of you as it is to me?
>> As I understand it, the principal analogy here is from the RPC world - a
>> process A writes some Avro to process B, in which case A is the writer and B
>> is the reader.
>> And there is the possibility that the schema which B may be expecting isn't